Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CostedIndexScan outline, non-functional #2093

Closed
wants to merge 1 commit into from

Conversation

max-hoffman
Copy link
Contributor

@max-hoffman max-hoffman commented Oct 20, 2023

This is an outline for proposed new costed index scan. I stopped here because I got most of the shaped flushed out enough to get the first component (flatten expression tree) tested and functional.

This is also explained in the docstring but the process is:

  • Match Filter->ResolvedTable patterns. Filter pushdown is good enough so that all filters usable in range scans should now sit on top of their target tables.
  • Get the statistics for the table indexes, or create a dummy set of statistics that assumes uniform distribution of values.
  • Convert the parent filter into a new format for index costing. We (1) separate the index-supported from non-supported filters, (2) collapse AND and OR binary trees into arbitrary width trees, and (3) collect "leaf" comparisons into a format that makes it easy to iterate using index column ordering.
  • Run each index against a costing process based on the new filter tree. In general we will find all of the AND and OR subsets of filters we can use as prefix scans of the index. We maintain metadata moving up the tree to maintain a set of histogram buckets (which get INTERSECTed for AND, UNIONed for OR) and filters (id sets) that have been pushed into the hypothetical scan. We track the cost and set of included filters for the best indexScan.
  • At the end, we convert the set of included filters into a sql.RangeCollection->sql.IndexLookup->plan.IndexedTableAccess. We walk the expression root the same way as we did in the other steps, using a bitmask to determine which filters should be converted to ranges, and using the existing an sql.IndexBuilder to convert leaf expressions into ranges (plus manually INTERSECT/UNIONing the children of AND/OR filters).

TODO:

  • test creating uniform histogram buckets
  • implement the stats helper functions that emulate leaf filter behavior (truncate, intersect, union statistics)
  • implement and test the costing DFS (filter tree + index + stats => filters + accurate histogram bucket for the index scan)
  • implement and test converting a subset of filters from the flattened format into an sql.RangeCollection

Other:

  • I'm still considering whether to separate a lot of the logic into a package outside of analyzer.
  • I might simplify some of the statistics manipulation behavior at first.
  • There are some edge cases around negated indexes/IS NULL filters.
  • I need to test all of this with a more useful set of statistics, probably manually injected on the GMS side and then also in Dolt.
  • I'm sure there are other edge cases I am missing, there are a lot of indexed access plans that will light up.

@max-hoffman max-hoffman changed the title starter for index scan replacement CostedIndexScan outline, non-functional Oct 24, 2023
@max-hoffman max-hoffman requested a review from fulghum October 24, 2023 20:14
Copy link
Contributor

@fulghum fulghum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic makes sense. Just a couple of small comments/questions.

@@ -0,0 +1,63 @@
package stats
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
package stats
// Copyright 2023 Dolthub, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package stats

}
}

iat := rt.Table.(sql.IndexAddressableTable)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor) is this type cast worth checking? Seems like some edge cases (some system tables?) may not implement this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this any non-ResolvedTable sql.TableNode types ended up being bugs

Comment on lines +85 to +91
(4: and
(3: xy.x = 1))
(5: and
(4: xy.y = 2))
(5: and
(6: xy.y > 0)
(7: xy.y < 7)))`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we create the parent and expressions on line 85 and 87 over the ors? In particular, it seemed odd that 5: and contained expression child number 4 when it doesn't seem to in the original expression.

I was expecting a flattened form more like:

(1: or
  (3: xy.x = 1))
  (4: xy.y = 2))
  (5: and
    (6: xy.y > 0)
    (7: xy.y < 7)))`,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally I did this is because it's easy to say an OR's children are a list of ANDs. If that's confusing I can make the tree look more similar to the original version, the code becomes a little more complicated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The numbering being weird might be another good reason to do it the way you suggest

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the number overlaps were getting in the way of filter pruning, index_query_plans.go is now -11k

@max-hoffman max-hoffman force-pushed the max/costed-index-scan branch from de5d0f1 to 98924b6 Compare November 2, 2023 18:59
@Hydrocharged Hydrocharged deleted the max/costed-index-scan branch February 7, 2024 13:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants