-
-
Notifications
You must be signed in to change notification settings - Fork 213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add "Sliding Range Join" execution plan #1880
Conversation
2f401f5
to
3648249
Compare
sql/analyzer/indexed_joins.go
Outdated
rel := &memo.SlidingRangeJoin{ | ||
JoinBase: join.Copy(), | ||
} | ||
// TODO: Remove the filter that was used to create the sliding range because it's no longer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can include a filter in the plan representation that you drop converting to the exec format. not ideal if you don't have to tho
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed for now. If a range join uses an index with multiple key expressions, removing the right set of filters is no longer simple.
We can attempt this if it ever proves to be a performance bottleneck, but I doubt it will.
(Once this is complete, SlidingRange can choose whether to use an index or a sort.)
(Eventually, SlidingRange will be able to choose which option is best.)
Note that although this tracks whether the inequalities are open or closed, we don't do anything with that information. Turns out it doesn't matter for correctness: It's okay for the secondary row iter assumes the ranges are closed and return extra rows, because the filters still get checked in the parent. We probably shouldn't depend on this though. I'll track the open/closeness of the ranges in the Node in a follow-up commit.
4821280
to
3a58b0b
Compare
…returns all non-null rows.
…ed. Now that we can build RangeJoins over an index that matches multiple filters, the logic for detecting unneeded filters is no longer trivial.
3a58b0b
to
3403ed9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly LGTM, great work. Added test suggestions for a bunch of queries users will try to break. I think rearranging the execution into its own package will have meaningful perf impact, though I'd want it for package separation regardless. I think it's still great to ship this and then follow up and grab a bunch more perf by refactoring the analyzer nodes that capture execution state (including CachedResults for hash joins). Adding the unit test framework for something small like between pattern matching would be a good upfront investment for the additions and refactors we'll have to do in the future.
sql/plan/sliding_range.go
Outdated
if err != nil { | ||
if errors.Is(err, io.EOF) { | ||
// We've already imported every range into the priority queue. | ||
s.pendingRow = nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd be surprised if this line was necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's needed for the case where the RHS table is exhausted before the LHS table. We still need to accept every remaining LHS row, and maintain the heap, but we need to track the fact that there's no RHS rows to process.
sql/plan/sliding_range.go
Outdated
} | ||
|
||
// Every active row must match the accepted row. | ||
return sql.RowsToRowIter(s.activeRanges.rows...), nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i really think there would be a noticeable perf difference, not creating 900k new arrays and iter objects
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the profile, even accounting for the difference between running locally and running on hosted, we spend less than a second creating and iterating over these objects.
Now, this might be responsible for extra time spent in the garbage collector. We'll see when we refactor. But the creation and iteration itself is not an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I meant GC specifically, iirc the profile GC time more than doubled
sql/plan/sliding_range.go
Outdated
// SlidingRange is a Node that wraps a table with min and max range columns. When used as a secondary provider in Join | ||
// operations, it can efficiently compute the rows whose ranges bound the value from the other table. When the ranges | ||
// don't overlap, the amortized complexity is O(1) for each result row. | ||
type SlidingRange struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how do you feel about RangeHeap
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works for me!
// satisfiesScalarRefs returns true if all GetFields in the expression | ||
// are columns provided by |grp| | ||
func satisfiesScalarRefs(e memo.ScalarExpr, grp *memo.ExprGroup) bool { | ||
// |grp| provides all tables referenced in |e| | ||
return e.Group().ScalarProps().Tables.Difference(grp.RelProps.OutputTables()).Len() == 0 | ||
} | ||
|
||
func getColumnRefFromScalar(s memo.ScalarExpr) *memo.ColRef { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docstring pls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
closedOnLowerBound, closedOnUpperBound bool | ||
} | ||
|
||
func getRangeFilters(filters []memo.ScalarExpr) (ranges []rangeFilter) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A useful way to document helper functions like this can be unit tests. They can help walk others through how to add additional cases or patches without having to step through a full query's debugger. select_hint_tests.go
is maybe one example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
sql/analyzer/indexed_joins.go
Outdated
return ranges | ||
} | ||
|
||
func addSlidingRangeJoin(m *memo.Memo) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docstring pls. For exploration rules in particular, giving a generalization of (pattern matched) => (rearrangement) and/or a toy SQL example can be helpful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
if err != nil { | ||
return nil, err | ||
} | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
splitting the sort out into a separate node probably makes sense at some point, otherwise debugging/costing might be a bit tricker
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now if we need to the coster can easily tell whether the plan will use an IndexScan or a Sort based on whether the Index field is set, and the described plan is clearly different, showing either the IndexScan or the Sort.
But this is definitely something to keep in mind when refactoring this.
@@ -1357,4 +1358,192 @@ SELECT SUM(x) FROM xy WHERE x IN ( | |||
}, | |||
}, | |||
}, | |||
{ | |||
name: "primary key range join", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a couple edge cases customers might hit:
- left joins
- join relations being non-standard types (table alias, subquery alias, CTE, recursive CTE, table function, tables with mixed upper/lower casing)
- one of the tables being empty
- one of the tables having 1 row
- the tables matching cartesian product of both tables
- tables matching no rows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other misc maybe lower priority:
- new join op plays nicely with DISTINCT, projections, subquery expressions
- new join op can be disabled with hints
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
…Scalar` Also fix the typo that the examples uncovered.
333db24
to
57784d6
Compare
This means that RangeHeapJoin will always cost better then InnerJoin, but a lookup join with multiple key expressions can cost even better.
56aff54
to
6405061
Compare
…of putting Nulls after non-Nulls.
…rangeHeapJoinIter`. Nodes represent physical plans, not executions. It should be possible to spawn multiple executions from a physical plan. To that end, Nodes should not have state that changes during execution.
…angeHeaps. This guarentees correct behavior even when multiple tables have columns with the same name.
6e6181f
to
1d2455f
Compare
…server into nicktobey/sliding-join
(The original PR was accidentally merged. I fixed the history but there doesn't seem to be a way to "unmerge" the PR)
This is the draft implementation of the "Sliding Range Join" execution plan. This allows for more performant joins when the join condition checks that the column on one table is within a range specified by two columns on the other table.