Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "Sliding Range Join" execution plan #1880

Merged
merged 48 commits into from
Jul 28, 2023
Merged

Conversation

nicktobey
Copy link
Contributor

@nicktobey nicktobey commented Jul 18, 2023

(The original PR was accidentally merged. I fixed the history but there doesn't seem to be a way to "unmerge" the PR)

This is the draft implementation of the "Sliding Range Join" execution plan. This allows for more performant joins when the join condition checks that the column on one table is within a range specified by two columns on the other table.

@nicktobey nicktobey force-pushed the nicktobey/sliding-join branch from 2f401f5 to 3648249 Compare July 20, 2023 00:52
rel := &memo.SlidingRangeJoin{
JoinBase: join.Copy(),
}
// TODO: Remove the filter that was used to create the sliding range because it's no longer
Copy link
Contributor

@max-hoffman max-hoffman Jul 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can include a filter in the plan representation that you drop converting to the exec format. not ideal if you don't have to tho

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed for now. If a range join uses an index with multiple key expressions, removing the right set of filters is no longer simple.

We can attempt this if it ever proves to be a performance bottleneck, but I doubt it will.

@nicktobey nicktobey marked this pull request as ready for review July 25, 2023 01:03
nicktobey and others added 24 commits July 24, 2023 18:05
(Once this is complete, SlidingRange can choose whether to use an index or a sort.)
(Eventually, SlidingRange will be able to choose which option is best.)
Note that although this tracks whether the inequalities are open or closed, we don't do anything with that information. Turns out it doesn't matter for correctness: It's okay for the secondary row iter assumes the ranges are closed and return extra rows, because the filters still get checked in the parent.

We probably shouldn't depend on this though. I'll track the open/closeness of the ranges in the Node in a follow-up commit.
@nicktobey nicktobey force-pushed the nicktobey/sliding-join branch from 4821280 to 3a58b0b Compare July 25, 2023 01:05
…ed. Now that we can build RangeJoins over an index that matches multiple filters, the logic for detecting unneeded filters is no longer trivial.
@nicktobey nicktobey force-pushed the nicktobey/sliding-join branch from 3a58b0b to 3403ed9 Compare July 25, 2023 01:12
@nicktobey nicktobey requested a review from jycor July 25, 2023 01:17
Copy link
Contributor

@max-hoffman max-hoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM, great work. Added test suggestions for a bunch of queries users will try to break. I think rearranging the execution into its own package will have meaningful perf impact, though I'd want it for package separation regardless. I think it's still great to ship this and then follow up and grab a bunch more perf by refactoring the analyzer nodes that capture execution state (including CachedResults for hash joins). Adding the unit test framework for something small like between pattern matching would be a good upfront investment for the additions and refactors we'll have to do in the future.

if err != nil {
if errors.Is(err, io.EOF) {
// We've already imported every range into the priority queue.
s.pendingRow = nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd be surprised if this line was necessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's needed for the case where the RHS table is exhausted before the LHS table. We still need to accept every remaining LHS row, and maintain the heap, but we need to track the fact that there's no RHS rows to process.

}

// Every active row must match the accepted row.
return sql.RowsToRowIter(s.activeRanges.rows...), nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i really think there would be a noticeable perf difference, not creating 900k new arrays and iter objects

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the profile, even accounting for the difference between running locally and running on hosted, we spend less than a second creating and iterating over these objects.

Now, this might be responsible for extra time spent in the garbage collector. We'll see when we refactor. But the creation and iteration itself is not an issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I meant GC specifically, iirc the profile GC time more than doubled

// SlidingRange is a Node that wraps a table with min and max range columns. When used as a secondary provider in Join
// operations, it can efficiently compute the rows whose ranges bound the value from the other table. When the ranges
// don't overlap, the amortized complexity is O(1) for each result row.
type SlidingRange struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do you feel about RangeHeap?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me!

// satisfiesScalarRefs returns true if all GetFields in the expression
// are columns provided by |grp|
func satisfiesScalarRefs(e memo.ScalarExpr, grp *memo.ExprGroup) bool {
// |grp| provides all tables referenced in |e|
return e.Group().ScalarProps().Tables.Difference(grp.RelProps.OutputTables()).Len() == 0
}

func getColumnRefFromScalar(s memo.ScalarExpr) *memo.ColRef {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docstring pls

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

closedOnLowerBound, closedOnUpperBound bool
}

func getRangeFilters(filters []memo.ScalarExpr) (ranges []rangeFilter) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A useful way to document helper functions like this can be unit tests. They can help walk others through how to add additional cases or patches without having to step through a full query's debugger. select_hint_tests.go is maybe one example

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return ranges
}

func addSlidingRangeJoin(m *memo.Memo) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docstring pls. For exploration rules in particular, giving a generalization of (pattern matched) => (rearrangement) and/or a toy SQL example can be helpful

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if err != nil {
return nil, err
}
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

splitting the sort out into a separate node probably makes sense at some point, otherwise debugging/costing might be a bit tricker

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now if we need to the coster can easily tell whether the plan will use an IndexScan or a Sort based on whether the Index field is set, and the described plan is clearly different, showing either the IndexScan or the Sort.

But this is definitely something to keep in mind when refactoring this.

@@ -1357,4 +1358,192 @@ SELECT SUM(x) FROM xy WHERE x IN (
},
},
},
{
name: "primary key range join",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a couple edge cases customers might hit:

  • left joins
  • join relations being non-standard types (table alias, subquery alias, CTE, recursive CTE, table function, tables with mixed upper/lower casing)
  • one of the tables being empty
  • one of the tables having 1 row
  • the tables matching cartesian product of both tables
  • tables matching no rows

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other misc maybe lower priority:

  • new join op plays nicely with DISTINCT, projections, subquery expressions
  • new join op can be disabled with hints

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

@nicktobey nicktobey force-pushed the nicktobey/sliding-join branch from 333db24 to 57784d6 Compare July 25, 2023 23:29
@nicktobey nicktobey force-pushed the nicktobey/sliding-join branch from 56aff54 to 6405061 Compare July 26, 2023 05:51
…rangeHeapJoinIter`.

Nodes represent physical plans, not executions. It should be possible to spawn multiple executions from a physical plan. To that end, Nodes should not have state that changes during execution.
…angeHeaps. This guarentees correct behavior even when multiple tables have columns with the same name.
@nicktobey nicktobey force-pushed the nicktobey/sliding-join branch from 6e6181f to 1d2455f Compare July 28, 2023 17:17
@nicktobey nicktobey merged commit f28984e into main Jul 28, 2023
@nicktobey nicktobey deleted the nicktobey/sliding-join branch July 28, 2023 20:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants