Add "Sliding Range Join" execution plan #1880

nicktobey · 2023-07-18T18:08:41Z

(The original PR was accidentally merged. I fixed the history but there doesn't seem to be a way to "unmerge" the PR)

This is a draft implementation of the "Sliding Range Join" execution plan. It makes joins faster when the join condition checks that a column on one table falls within a range bounded by two columns on the other table.
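For readers new to the idea, the join shape described above can be sketched in plain Go. This is a minimal illustration under stated assumptions, not the code in this PR: `Range`, `Match`, `byMax`, and `rangeJoin` are hypothetical names, rows are reduced to integers, both inputs are sorted in memory, and all bounds are treated as closed. It assumes the approach the thread hints at: ranges are visited in order of their min column and kept in a priority queue ordered by their max column.

```go
package main

import (
	"container/heap"
	"fmt"
	"sort"
)

// Range is a hypothetical stand-in for a row with min and max columns.
type Range struct{ Min, Max int }

// byMax is a min-heap of ranges ordered by Max, so the range that expires
// first is always at the top.
type byMax []Range

func (h byMax) Len() int           { return len(h) }
func (h byMax) Less(i, j int) bool { return h[i].Max < h[j].Max }
func (h byMax) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *byMax) Push(x any)        { *h = append(*h, x.(Range)) }
func (h *byMax) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// Match pairs a value with one range that contains it.
type Match struct {
	Value int
	Range Range
}

// rangeJoin returns every (value, range) pair with Min <= value <= Max.
func rangeJoin(values []int, ranges []Range) []Match {
	// Work on sorted copies so the caller's slices are left untouched.
	vals := append([]int(nil), values...)
	sort.Ints(vals)
	rs := append([]Range(nil), ranges...)
	sort.Slice(rs, func(i, j int) bool { return rs[i].Min < rs[j].Min })

	active := &byMax{}
	next := 0
	var out []Match
	for _, v := range vals {
		// Import every range that starts at or before v.
		for next < len(rs) && rs[next].Min <= v {
			heap.Push(active, rs[next])
			next++
		}
		// Drop ranges that end before v; they cannot match later values either.
		for active.Len() > 0 && (*active)[0].Max < v {
			heap.Pop(active)
		}
		// Every range still in the heap contains v.
		for _, r := range *active {
			out = append(out, Match{v, r})
		}
	}
	return out
}

func main() {
	values := []int{5, 1, 12}
	ranges := []Range{{0, 6}, {4, 10}, {20, 30}}
	for _, m := range rangeJoin(values, ranges) {
		fmt.Println(m.Value, m.Range)
	}
}
```

Each range is pushed and popped at most once, so the heap work is bounded by the number of ranges rather than by the number of value/range comparisons a nested loop would make. Once the ranges run out, the first inner loop stops pushing while the remaining values are still checked against the heap.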

max-hoffman · 2023-07-20T15:29:24Z

sql/analyzer/indexed_joins.go

+						rel := &memo.SlidingRangeJoin{
+							JoinBase: join.Copy(),
+						}
+						// TODO: Remove the filter that was used to create the sliding range because it's no longer


you can include a filter in the plan representation that you drop when converting to the exec format. not ideal if you don't have to tho

Removed for now. If a range join uses an index with multiple key expressions, removing the right set of filters is no longer simple.

We can attempt this if it ever proves to be a performance bottleneck, but I doubt it will.

Note that although this tracks whether the inequalities are open or closed, we don't do anything with that information. It turns out this doesn't matter for correctness: it's okay for the secondary row iter to assume the ranges are closed and return extra rows, because the filters still get checked in the parent. We probably shouldn't depend on this though. I'll track whether the ranges are open or closed in the Node in a follow-up commit.
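The correctness argument in that note can be shown with a small sketch. It is illustrative only: `Range`, `candidates`, and `joinOpenUpper` are hypothetical names, and the half-open filter `Min <= v AND v < Max` is just one example of a condition with an open bound.

```go
package main

import "fmt"

// Range is a hypothetical row with min and max columns.
type Range struct{ Min, Max int }

// candidates plays the role of the secondary iterator: it treats every range
// as closed, so it may return ranges whose endpoint equals v.
func candidates(v int, ranges []Range) []Range {
	var out []Range
	for _, r := range ranges {
		if r.Min <= v && v <= r.Max {
			out = append(out, r)
		}
	}
	return out
}

// joinOpenUpper plays the role of the parent join: it re-checks the original
// filter (Min <= v AND v < Max) on every candidate it receives.
func joinOpenUpper(v int, ranges []Range) []Range {
	var out []Range
	for _, r := range candidates(v, ranges) {
		if r.Min <= v && v < r.Max {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	ranges := []Range{{0, 5}, {5, 10}}
	fmt.Println(candidates(5, ranges))    // superset: both ranges touch 5
	fmt.Println(joinOpenUpper(5, ranges)) // the parent filter drops {0, 5}
}
```

The closed-range candidates are always a superset of the rows the real filter accepts, so re-checking the filter in the parent can only remove rows, never miss one.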

max-hoffman

Mostly LGTM, great work. Added test suggestions for a bunch of queries users will try to break. I think rearranging the execution into its own package will have meaningful perf impact, though I'd want it for package separation regardless. I think it's still great to ship this and then follow up and grab a bunch more perf by refactoring the analyzer nodes that capture execution state (including CachedResults for hash joins). Adding the unit test framework for something small like between pattern matching would be a good upfront investment for the additions and refactors we'll have to do in the future.

max-hoffman · 2023-07-25T01:23:42Z

sql/plan/sliding_range.go

+		if err != nil {
+			if errors.Is(err, io.EOF) {
+				// We've already imported every range into the priority queue.
+				s.pendingRow = nil


i'd be surprised if this line was necessary

It's needed for the case where the RHS table is exhausted before the LHS table. We still need to accept every remaining LHS row and maintain the heap, but we need to track the fact that there are no RHS rows left to process.

max-hoffman · 2023-07-25T01:24:37Z

sql/plan/sliding_range.go

+	}
+
+	// Every active row must match the accepted row.
+	return sql.RowsToRowIter(s.activeRanges.rows...), nil


i really think there would be a noticeable perf difference, not creating 900k new arrays and iter objects

In the profile, even accounting for the difference between running locally and running on hosted, we spend less than a second creating and iterating over these objects.

Now, this might be responsible for extra time spent in the garbage collector. We'll see when we refactor. But the creation and iteration itself is not an issue.

Yeah I meant GC specifically, iirc the profile GC time more than doubled

max-hoffman · 2023-07-25T01:25:17Z

sql/plan/sliding_range.go

+// SlidingRange is a Node that wraps a table with min and max range columns. When used as a secondary provider in Join
+// operations, it can efficiently compute the rows whose ranges bound the value from the other table. When the ranges
+// don't overlap, the amortized complexity is O(1) for each result row.
+type SlidingRange struct {


how do you feel about RangeHeap?

Works for me!

max-hoffman · 2023-07-25T01:28:37Z

sql/analyzer/indexed_joins.go

 // satisfiesScalarRefs returns true if all GetFields in the expression
 // are columns provided by |grp|
 func satisfiesScalarRefs(e memo.ScalarExpr, grp *memo.ExprGroup) bool {
 	// |grp| provides all tables referenced in |e|
 	return e.Group().ScalarProps().Tables.Difference(grp.RelProps.OutputTables()).Len() == 0
 }

+func getColumnRefFromScalar(s memo.ScalarExpr) *memo.ColRef {


docstring pls

max-hoffman · 2023-07-25T01:33:25Z

sql/analyzer/indexed_joins.go

+	closedOnLowerBound, closedOnUpperBound bool
+}
+
+func getRangeFilters(filters []memo.ScalarExpr) (ranges []rangeFilter) {


A useful way to document helper functions like this can be unit tests. They can help walk others through how to add additional cases or patches without having to step through a full query's debugger. select_hint_tests.go is maybe one example
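As an illustration of that suggestion, a table of cases can document what a pattern-matching helper accepts and rejects. Everything below is a simplified stand-in rather than the code under review: `Cmp`, `RangeFilter`, `matchRanges`, and `failures` are hypothetical, and comparisons are plain strings instead of `memo.ScalarExpr` values. In the repository such a table would normally live in a `_test.go` file and use the `testing` package; a `main` function is used here only to keep the sketch self-contained.

```go
package main

import (
	"fmt"
	"reflect"
)

// Cmp is a hypothetical, simplified comparison: Left Op Right, for example
// {"v", ">=", "lo"} for v >= lo. The value column is assumed to be on the left.
type Cmp struct{ Left, Op, Right string }

// RangeFilter is a hypothetical result: Value lies between Min and Max.
type RangeFilter struct {
	Value, Min, Max          string
	ClosedLower, ClosedUpper bool
}

// matchRanges pairs each lower bound with each upper bound on the same value.
func matchRanges(filters []Cmp) []RangeFilter {
	var out []RangeFilter
	for _, lo := range filters {
		if lo.Op != ">" && lo.Op != ">=" {
			continue
		}
		for _, hi := range filters {
			if (hi.Op != "<" && hi.Op != "<=") || hi.Left != lo.Left {
				continue
			}
			out = append(out, RangeFilter{
				Value: lo.Left, Min: lo.Right, Max: hi.Right,
				ClosedLower: lo.Op == ">=", ClosedUpper: hi.Op == "<=",
			})
		}
	}
	return out
}

// cases doubles as documentation: each entry names a pattern and its result.
var cases = []struct {
	name string
	in   []Cmp
	want []RangeFilter
}{
	{"closed range", []Cmp{{"v", ">=", "lo"}, {"v", "<=", "hi"}},
		[]RangeFilter{{"v", "lo", "hi", true, true}}},
	{"half-open range", []Cmp{{"v", ">=", "lo"}, {"v", "<", "hi"}},
		[]RangeFilter{{"v", "lo", "hi", true, false}}},
	{"lower bound only", []Cmp{{"v", ">=", "lo"}}, nil},
	{"bounds on different columns", []Cmp{{"v", ">=", "lo"}, {"w", "<=", "hi"}}, nil},
}

// failures returns the names of the cases whose output differs from want.
func failures() []string {
	var failed []string
	for _, c := range cases {
		if got := matchRanges(c.in); !reflect.DeepEqual(got, c.want) {
			failed = append(failed, c.name)
		}
	}
	return failed
}

func main() {
	fmt.Println("failed cases:", failures())
}
```

Adding a new supported pattern then means adding one row to the table, which is easier to review than stepping through a full query in a debugger.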

max-hoffman · 2023-07-25T01:37:15Z

sql/analyzer/indexed_joins.go

+	return ranges
+}
+
+func addSlidingRangeJoin(m *memo.Memo) error {


docstring pls. For exploration rules in particular, giving a generalization of (pattern matched) => (rearrangement) and/or a toy SQL example can be helpful

max-hoffman · 2023-07-25T01:38:57Z

sql/memo/exec_builder.go

+		if err != nil {
+			return nil, err
+		}
+	} else {


splitting the sort out into a separate node probably makes sense at some point, otherwise debugging/costing might be a bit trickier

Right now, if we need to, the coster can easily tell whether the plan will use an IndexScan or a Sort based on whether the Index field is set. The described plan is also clearly different, showing either the IndexScan or the Sort.

But this is definitely something to keep in mind when refactoring this.

max-hoffman · 2023-07-25T01:42:20Z

enginetest/join_op_tests.go

@@ -1357,4 +1358,192 @@ SELECT SUM(x) FROM xy WHERE x IN (
 			},
 		},
 	},
+	{
+		name: "primary key range join",


a couple edge cases customers might hit:

left joins

join relations being non-standard types (table alias, subquery alias, CTE, recursive CTE, table function, tables with mixed upper/lower casing)

one of the tables being empty

one of the tables having 1 row

the join matching the full cartesian product of both tables

tables matching no rows

other misc maybe lower priority:

new join op plays nicely with DISTINCT, projections, subquery expressions

new join op can be disabled with hints

Outline getColumnRefFromScalar from addMergeJoins

02109a8

nicktobey force-pushed the nicktobey/sliding-join branch from 2f401f5 to 3648249 Compare July 20, 2023 00:52

max-hoffman reviewed Jul 20, 2023

nicktobey marked this pull request as ready for review July 25, 2023 01:03

nicktobey and others added 24 commits July 24, 2023 18:05

Create new join type JoinTypeSlidingRange and `JoinTypeOuterLeftSlidingRange`

54ef154

Add Between expression to Memoizer.

7c4ab43

Add Sliding Range Join expression to Memo.

e8763e6

Add coster for Sliding Range Join

3a14a7c

Generate Sliding Range Node.

72779ec

Generate RowIter for SlidingRange

10e094a

[ga-format-pr] Run ./format_repo.sh to fix formatting

8582b8e

Add basic tests for SlidingRangeJoin.

26507af

SlidingRangeJoin requires an index on the min column, and uses an IndexScan.

da1120f

SlidingRangeJoin requires an index on the value column, and uses an IndexScan.

1ecd744

Add TODO about removing filter condition when making SlidingRange.

bf05ce3

Add warning that IndexOfColName may not be safe.

8f02ddc

Add fields to plan.SlidingRange to capture the expressions used for sorting.

235d4e2

When there's no index for the sliding range, use nil instead.

67745f4

(Once this is complete, SlidingRange can choose whether to use an index or a sort.)

In SlidingRange, create a Sort node instead of an IndexScan.

0e04826

(Eventually, SlidingRange will be able to choose which option is best.)

Simplify logic for computing every possible SlidingRange index.

94e28c4

Allow generating SlidingRanges for joins that have multiple filters.

bd3755a

Avoid panic when input to SlidingRange isn't a table.

b1f28f8

Add additional tests for both tables with and without primary keys.

2e64fdb

Use index for left table in sliding range join if available.

ca729b0

Store in the SlidingRange iter whether its ranges are open or closed.

b1eb7c7

Add cost for SlidingRangeJoin

2bac676

Update inaccurate coster docstring.

1f785e9

nicktobey force-pushed the nicktobey/sliding-join branch from 4821280 to 3a58b0b Compare July 25, 2023 01:05

Fix bug: A lookup with a non-null key for a NullSafeEq inadvertently returns all non-null rows.

9acbd79

Remove the TODO about removing filters if they're provably not required.

3403ed9

Now that we can build RangeJoins over an index that matches multiple filters, the logic for detecting unneeded filters is no longer trivial.

nicktobey force-pushed the nicktobey/sliding-join branch from 3a58b0b to 3403ed9 Compare July 25, 2023 01:12

nicktobey requested a review from jycor July 25, 2023 01:17

max-hoffman approved these changes Jul 25, 2023

nicktobey added 2 commits July 25, 2023 12:15

Rename SlidingJoin to RangeHeap

19bfbf1

Add docstring and examples to getRangeFilters and `getColumnRefFromScalar`

57784d6

Also fix the typo that the examples uncovered.

nicktobey force-pushed the nicktobey/sliding-join branch from 333db24 to 57784d6 Compare July 25, 2023 23:29

nicktobey added 5 commits July 25, 2023 16:39

Remove unnecessary check.

72836c5

Add docstring for memo.RangeHeap, and rename fields for clarity.

9dc95ec

Add docstring and examples for addRangeHeapJoin

9d556bf

Update coster for RangeHeapJoin.

e1c3f8a

This means that RangeHeapJoin will always cost better than InnerJoin, but a lookup join with multiple key expressions can cost even better.

Update JoinType strings to accommodate renaming SlidingRange to RangeHeap.

6405061

nicktobey force-pushed the nicktobey/sliding-join branch from 56aff54 to 6405061 Compare July 26, 2023 05:51

nicktobey added 7 commits July 26, 2023 15:07

When sorting the input for RangeHeapJoin, match the current behavior of putting Nulls after non-Nulls.

11ff9dd

Don't push RangeHeap into a TableAlias.

7418941

Add additional tests for sliding join.

ab05cf5

Move all of RangeHeap's mutating state into its own iterator class, `rangeHeapJoinIter`.

d3ed9cf

Nodes represent physical plans, not executions. It should be possible to spawn multiple executions from a physical plan. To that end, Nodes should not have state that changes during execution.

Use Schema.IndexOf instead of Schema.IndexOfColName when making RangeHeaps.

52541e9

This guarantees correct behavior even when multiple tables have columns with the same name.

Add additional planning test.

70260ff

Merge branch 'main' into nicktobey/sliding-join

1d2455f

nicktobey force-pushed the nicktobey/sliding-join branch from 6e6181f to 1d2455f Compare July 28, 2023 17:17

nicktobey and others added 6 commits July 28, 2023 10:19

Move constant to top of coster.go.

038f099

[ga-format-pr] Run ./format_repo.sh to fix formatting

e766ae3

Correctly handle an empty RHS in RangeHeapJoins.

1901ea6

Merge branch 'nicktobey/sliding-join' of github.com:dolthub/go-mysql-server into nicktobey/sliding-join

cfb69f0

Fix typo in planning test.

a05358d

Add op tests to match planning tests.

b9f3fc6

nicktobey merged commit f28984e into main Jul 28, 2023

nicktobey deleted the nicktobey/sliding-join branch July 28, 2023 20:08

nicktobey mentioned this pull request Aug 3, 2023

Slow Range Join dolthub/dolt#6298

Closed
