CostedIndexScan outline, non-functional #2093

max-hoffman · 2023-10-20T23:09:55Z

This is an outline for proposed new costed index scan. I stopped here because I got most of the shaped flushed out enough to get the first component (flatten expression tree) tested and functional.

This is also explained in the docstring but the process is:

Match Filter->ResolvedTable patterns. Filter pushdown is good enough so that all filters usable in range scans should now sit on top of their target tables.
Get the statistics for the table indexes, or create a dummy set of statistics that assumes uniform distribution of values.
Convert the parent filter into a new format for index costing. We (1) separate the index-supported from non-supported filters, (2) collapse AND and OR binary trees into arbitrary width trees, and (3) collect "leaf" comparisons into a format that makes it easy to iterate using index column ordering.
Run each index against a costing process based on the new filter tree. In general we will find all of the AND and OR subsets of filters we can use as prefix scans of the index. We maintain metadata moving up the tree to maintain a set of histogram buckets (which get INTERSECTed for AND, UNIONed for OR) and filters (id sets) that have been pushed into the hypothetical scan. We track the cost and set of included filters for the best indexScan.
At the end, we convert the set of included filters into a sql.RangeCollection->sql.IndexLookup->plan.IndexedTableAccess. We walk the expression root the same way as we did in the other steps, using a bitmask to determine which filters should be converted to ranges, and using the existing an sql.IndexBuilder to convert leaf expressions into ranges (plus manually INTERSECT/UNIONing the children of AND/OR filters).

TODO:

test creating uniform histogram buckets
implement the stats helper functions that emulate leaf filter behavior (truncate, intersect, union statistics)
implement and test the costing DFS (filter tree + index + stats => filters + accurate histogram bucket for the index scan)
implement and test converting a subset of filters from the flattened format into an sql.RangeCollection

Other:

I'm still considering whether to separate a lot of the logic into a package outside of analyzer.
I might simplify some of the statistics manipulation behavior at first.
There are some edge cases around negated indexes/IS NULL filters.
I need to test all of this with a more useful set of statistics, probably manually injected on the GMS side and then also in Dolt.
I'm sure there are other edge cases I am missing, there are a lot of indexed access plans that will light up.

fulghum

Logic makes sense. Just a couple of small comments/questions.

fulghum · 2023-11-02T17:15:53Z

sql/stats/union.go

@@ -0,0 +1,63 @@
+package stats


Suggested change

package stats

// Copyright 2023 Dolthub, Inc.

//

// Licensed under the Apache License, Version 2.0 (the "License");

// you may not use this file except in compliance with the License.

// You may obtain a copy of the License at

//

// http://www.apache.org/licenses/LICENSE-2.0

//

// Unless required by applicable law or agreed to in writing, software

// distributed under the License is distributed on an "AS IS" BASIS,

// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

// See the License for the specific language governing permissions and

// limitations under the License.

package stats

fulghum · 2023-11-02T17:28:48Z

sql/analyzer/costed_index_scan.go

+			}
+		}
+
+		iat := rt.Table.(sql.IndexAddressableTable)


(minor) is this type cast worth checking? Seems like some edge cases (some system tables?) may not implement this?

yeah, this any non-ResolvedTable sql.TableNode types ended up being bugs

fulghum · 2023-11-02T18:33:16Z

sql/analyzer/costed_index_scan_test.go

+  (4: and
+    (3: xy.x = 1))
+  (5: and
+    (4: xy.y = 2))
+  (5: and
+    (6: xy.y > 0)
+    (7: xy.y < 7)))`,


Why do we create the parent and expressions on line 85 and 87 over the ors? In particular, it seemed odd that 5: and contained expression child number 4 when it doesn't seem to in the original expression.

I was expecting a flattened form more like:

(1: or (3: xy.x = 1)) (4: xy.y = 2)) (5: and (6: xy.y > 0) (7: xy.y < 7)))`,

Originally I did this is because it's easy to say an OR's children are a list of ANDs. If that's confusing I can make the tree look more similar to the original version, the code becomes a little more complicated

The numbering being weird might be another good reason to do it the way you suggest

I think the number overlaps were getting in the way of filter pruning, index_query_plans.go is now -11k

max-hoffman changed the title ~~starter for index scan replacement~~ CostedIndexScan outline, non-functional Oct 24, 2023

max-hoffman requested a review from fulghum October 24, 2023 20:14

max-hoffman mentioned this pull request Oct 26, 2023

Index optimize extension with dynamic index support in embed db #2036

Closed

fulghum reviewed Nov 2, 2023

View reviewed changes

Costed index scan partial impl

98924b6

max-hoffman force-pushed the max/costed-index-scan branch from de5d0f1 to 98924b6 Compare November 2, 2023 18:59

fulghum approved these changes Nov 7, 2023

View reviewed changes

max-hoffman mentioned this pull request Nov 7, 2023

Costed index scan framework #2134

Merged

BrewTestBot mentioned this pull request Nov 10, 2023

dolt 1.26.0 Homebrew/homebrew-core#153860

Merged

max-hoffman closed this Nov 16, 2023

Hydrocharged deleted the max/costed-index-scan branch February 7, 2024 13:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CostedIndexScan outline, non-functional #2093

CostedIndexScan outline, non-functional #2093

max-hoffman commented Oct 20, 2023 •

edited

Loading

fulghum left a comment

fulghum Nov 2, 2023

fulghum Nov 2, 2023

max-hoffman Nov 3, 2023

fulghum Nov 2, 2023

max-hoffman Nov 3, 2023

max-hoffman Nov 3, 2023

max-hoffman Nov 3, 2023

-package stats
+// Copyright 2023 Dolthub, Inc.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+package stats

CostedIndexScan outline, non-functional #2093

CostedIndexScan outline, non-functional #2093

Conversation

max-hoffman commented Oct 20, 2023 • edited Loading

fulghum left a comment

Choose a reason for hiding this comment

fulghum Nov 2, 2023

Choose a reason for hiding this comment

fulghum Nov 2, 2023

Choose a reason for hiding this comment

max-hoffman Nov 3, 2023

Choose a reason for hiding this comment

fulghum Nov 2, 2023

Choose a reason for hiding this comment

max-hoffman Nov 3, 2023

Choose a reason for hiding this comment

max-hoffman Nov 3, 2023

Choose a reason for hiding this comment

max-hoffman Nov 3, 2023

Choose a reason for hiding this comment

max-hoffman commented Oct 20, 2023 •

edited

Loading