smarter nested column index utilization #13977
Conversation
changes:
* adds skipValueRangeIndexScale and skipValuePredicateIndexScale to ColumnConfig (e.g. DruidProcessingConfig), available as system config via druid.processing.indexes.skipValueRangeIndexScale and druid.processing.indexes.skipValuePredicateIndexScale
* NestedColumnIndexSupplier uses skipValueRangeIndexScale and skipValuePredicateIndexScale, multiplied by the total number of rows to be processed, to determine the threshold at which we should no longer consider using bitmap indexes because it will be too many operations
* default values for skipValueRangeIndexScale and skipValuePredicateIndexScale have been initially set to 0.08, but are separate to allow independent tuning
* these are not documented on purpose yet because they are kind of hard to explain; they mainly exist to help conduct larger scale experiments than the jmh benchmarks used to derive the initial set of values
* these changes provide a pretty sweet performance boost for filter processing on nested columns
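For a rough sense of how the scale values translate into a decision, here is an illustrative arithmetic sketch; the row count, range size, and variable names below are made up for the example and are not from the PR:

```java
// Illustrative only: how a scale-based skip threshold works in principle.
// The configured scale is multiplied by the number of rows the query will scan;
// if the number of dictionary values whose bitmaps would need to be combined
// exceeds that threshold, the bitmap index is skipped in favor of a ValueMatcher scan.
public class SkipThresholdExample
{
  public static void main(String[] args)
  {
    int numRowsToScan = 1_000_000;         // hypothetical number of rows to process
    double rangeScale = 0.08;              // default skipValueRangeIndexScale proposed here
    int threshold = (int) Math.ceil(rangeScale * numRowsToScan);  // 80,000

    int valuesInRange = 250_000;           // dictionary values a wide range filter would union
    boolean useBitmapIndex = valuesInRange <= threshold;          // false -> full scan instead

    System.out.println("threshold=" + threshold + ", useBitmapIndex=" + useBitmapIndex);
  }
}
```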
I decided to do the
Also looks like I might have some legitimate test failures, possibly due to mismatch of behavior between indexes and value matchers on some nested columns 😓 investigating
the different test results seem to be because of different behavior of the predicate index on the bound filter vs the value matcher, the latter looks like it is matching nulls. The results now actually look more consistent with other numeric filters using predicates due to lexicographic sorting. I think there is a bug with the bound filter, it needs to wrap the predicate index in the null matching lower bound in the same way that the range index does.
Actually the mismatch is between the double predicate being used for the
* This is indicated by returning null from {@link ColumnIndexSupplier#as(Class)} even though it would have otherwise
* been able to create a {@link BitmapColumnIndex}. For range indexes on columns where every value has an index, the
I'm scared of exposing this behavior through adjusting the return of .as(). I think it would be better to have the columns continue to return a meaningful object for .as(), but have those interfaces that benefit from this be able to opt themselves out of being used for bitmaps. (Or, in some future world, return an equivalent ValueMatcher.)
This was already true of as itself: returning null has always been an indicator that the index we asked for is not available. I just made it a bit clearer in the docs, and also leaned into allowing the things as returns to also return null to indicate that an index is not available.
For the types of short circuits done in this PR, I feel like the index supplier has better information to make the decision than anything else above it, since it has direct access to the size of the dictionary, the number of rows to be scanned, etc., but I will think on it.
* This can make many standalone filters faster in cases where the overhead of doing bitmap operations to construct a
* {@link org.apache.druid.segment.BitmapOffset} or {@link org.apache.druid.segment.vector.BitmapVectorOffset} can
* often exceed the cost of just doing a full scan and using a
* {@link org.apache.druid.query.filter.ValueMatcher}.
I think the "many" in "This can make many standalone filters faster" is a bit of a hyperbola. What about "Some standalone filters speed up with this optimization in cases where the overhead of walking the dictionary and combining bitmaps to construct ..."?
"SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 2000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0", | ||
// 30, 31 | ||
"SELECT SUM(long1) FROM foo WHERE double3 < 3000.0 AND double3 > 1000.0", | ||
"SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 3000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0", | ||
// 32,33 | ||
"SELECT SUM(long1) FROM foo WHERE double3 < 5000.0 AND double3 > 1000.0", | ||
"SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 5000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0", | ||
// 34,35 smaller cardinality like range filter | ||
"SELECT SUM(long1) FROM foo WHERE string1 LIKE '1%'", | ||
"SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') LIKE '1%'", | ||
// 36,37 smaller cardinality like predicate filter | ||
"SELECT SUM(long1) FROM foo WHERE string1 LIKE '%1%'", | ||
"SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') LIKE '%1%'", | ||
// 38-39 moderate cardinality like range | ||
"SELECT SUM(long1) FROM foo WHERE string5 LIKE '1%'", | ||
"SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '1%'", | ||
// 40, 41 big cardinality lex range | ||
"SELECT SUM(long1) FROM foo WHERE string5 > '1'", | ||
"SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') > '1'", | ||
// 42, 43 big cardinality like predicate filter | ||
"SELECT SUM(long1) FROM foo WHERE string5 LIKE '%1%'", | ||
"SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '%1%'", |
Can you add various sized IN clauses to your benchmark? 10, 100, 1000, 10000, 100000?
After reading the PR, I don't think they should be impacted, but it would be good to confirm that they aren't, since they are an example of something that falls back to predicate-like behavior, and which we have also seen cause performance issues when run as a ValueMatcher.
InDimFilter typically only uses Utf8ValueSetIndex and StringValueSetIndex, which cannot be impacted by the changes currently in this PR, unless it is using an ExtractionFn, which causes it to use a DruidPredicateIndex. I need to figure out how to get something that would plan to an extractionFn in here to see how this PR would impact this filter.
ColumnarInts intsColumn = ints.get();
final int size = intsColumn.size();
intsColumn.close();
This type of pattern is a bit scary. In this instance, given that it happens while deserializing the object, it's probably fine, but we've had issues where we thought we were getting a thing and then closing it, only to find out that we were getting a thing that was cached and supposed to live for the lifetime of a query, and the immediate close was closing things.
That said, I think we can make this cleaner by making the Supplier<> itself be a class that has a method that knows the number of rows represented by the thing that it supplies. It should be easy enough to implement this for the suppliers used in this case (the number is already on the jvm heap for those objects anyway) and this avoids any need to create an object and close it immediately.
columnConfig.skipValueRangeIndexScale(),
columnConfig.skipValuePredicateIndexScale()
Why not just pass the columnConfig object itself?
if (skipPredicateIndex && clazz.equals(DruidPredicateIndex.class)) {
  return null;
}
Technically, this would indicate that this column cannot implement the given interface, so the filter in question should fall back to another interface (i.e. it might make the filter cascade down into trying an even worse strategy). Instead, we should return a DruidPredicateIndex that returns null when asked to compute the index, indicating that it was unable to compute the index. This will make it easier to do this "correctly" when we make the change to have this object able to also return ValueMatcher objects.
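A rough sketch of that alternative, assuming forPredicate(DruidPredicateFactory) is the interface's abstract method (as in the interface snippet quoted later in this review); the constant name is made up:

```java
// Hypothetical: an "always unable" predicate index. Returning this from
// as(DruidPredicateIndex.class) keeps the contract of as() intact, while still
// steering callers toward a ValueMatcher instead of an even worse index strategy.
private static final DruidPredicateIndex SKIP_PREDICATE_INDEX = new DruidPredicateIndex()
{
  @Nullable
  @Override
  public BitmapColumnIndex forPredicate(DruidPredicateFactory matcherFactory)
  {
    // null here means "could not compute an index for this predicate",
    // not "this column does not support predicate indexes at all"
    return null;
  }
};
```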
processing/src/main/java/org/apache/druid/segment/AutoTypeColumnMerger.java
@@ -199,7 +207,9 @@
       Supplier<FrontCodedIntArrayIndexed> arrayDictionarySupplier,
       Supplier<ColumnarInts> encodedValueColumnSupplier,
       GenericIndexed<ImmutableBitmap> valueIndexes,
-      @SuppressWarnings("unused") BitmapFactory bitmapFactory
+      @SuppressWarnings("unused") BitmapFactory bitmapFactory,
+      @SuppressWarnings("unused") ColumnConfig columnConfig,
Check notice (Code scanning / CodeQL): Useless parameter
@SuppressWarnings("unused") BitmapFactory bitmapFactory | ||
@SuppressWarnings("unused") BitmapFactory bitmapFactory, | ||
@SuppressWarnings("unused") ColumnConfig columnConfig, | ||
@SuppressWarnings("unused") int numRows |
Check notice (Code scanning / CodeQL): Useless parameter
this.skipRangeIndexThreshold = (int) Math.ceil(columnConfig.skipValueRangeIndexScale() * numRows);
this.skipPredicateIndex = localDictionarySupplier.get().size() > Math.ceil(columnConfig.skipValuePredicateIndexScale() * numRows);
I think it's a bit nicer to make these methods instead of fields that are computed in every constructor call. The computation being done is minimal, that's true, but it's better to only do it when needed. I think that if you just store the reference to the ColumnConfig you can pretty easily move this check into private methods. All of the places where you would check for skipping the predicate index, you would probably also be grabbing a localDictionarySupplier anyway, so you can reuse that singular reference too.
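A minimal sketch of that refactor, assuming the supplier keeps columnConfig and numRows as fields; the method names are made up and not from the PR:

```java
// Hypothetical: compute the skip decision lazily, only when an index is actually
// requested, instead of eagerly in the constructor.
private boolean shouldSkipRangeIndex(int rangeSize)
{
  return rangeSize > (int) Math.ceil(columnConfig.skipValueRangeIndexScale() * numRows);
}

private boolean shouldSkipPredicateIndex(int dictionaryCardinality)
{
  return dictionaryCardinality > Math.ceil(columnConfig.skipValuePredicateIndexScale() * numRows);
}
```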
@@ -36,4 +36,12 @@
    */
   @Nullable
   BitmapColumnIndex forPredicate(DruidPredicateFactory matcherFactory);
+
+  static boolean checkSkipThreshold(@Nullable ColumnConfig columnConfig, int numRowsToScan, int dictionaryCardinality)
shouldSkip? checkSkipThreshold doesn't really tell me how I'm supposed to interpret true/false. Does true mean that it exceeded the threshold and therefore should be skipped? Or does it mean that it did not exceed the threshold and should be used?
static boolean checkSkipThreshold(ColumnConfig columnConfig, int numRowsToScan, int rangeSize)
{
  return rangeSize > (int) Math.ceil(columnConfig.skipValueRangeIndexScale() * numRowsToScan);
You aren't doing the null check on this one. That intentional?
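A hedged sketch of one way to address both comments; the name shouldSkipRangeIndex and the "no config means never skip" behavior are assumptions for illustration, not what the PR currently does:

```java
// Hypothetical: a name that states what "true" means, plus the same null guard
// used by the other overload of this helper.
static boolean shouldSkipRangeIndex(@Nullable ColumnConfig columnConfig, int numRowsToScan, int rangeSize)
{
  if (columnConfig == null) {
    return false;  // no config available -> never skip the bitmap index
  }
  return rangeSize > (int) Math.ceil(columnConfig.skipValueRangeIndexScale() * numRowsToScan);
}
```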
Description
A sort of follow-up to #12830, which added NumericRangeIndex to improve the performance of nested column bound filters on numeric columns, but noted that there were still cases where the cost of bitmap operations overran the cost of just doing a full scan and using value matchers. This is also more or less the same thing described in #3878. This PR attempts to improve this situation by adding cardinality-based thresholds for range and predicate indexes to choose when to skip using bitmap indexes.
changes:
* adds skipValueRangeIndexScale and skipValuePredicateIndexScale to ColumnConfig (e.g. DruidProcessingConfig), available as system config via druid.processing.indexes.skipValueRangeIndexScale and druid.processing.indexes.skipValuePredicateIndexScale
* NestedColumnIndexSupplier uses skipValueRangeIndexScale and skipValuePredicateIndexScale, multiplied by the total number of rows to be processed, to determine the threshold at which we should no longer consider using bitmap indexes because it will be too many operations
* default values for skipValueRangeIndexScale and skipValuePredicateIndexScale have been initially set to 0.08, but are separate to allow independent tuning

We might want to consider making these same changes to DictionaryEncodedStringIndexSupplier, but I have not done so at this time.

Benchmarks comparing to regular columns
The really big wins here will be queries like the last one, where some selector or other cheap filters are combined with high cardinality ranges or predicates, since the filters get pushed into being 'post' filters.
Release note
Key changed/added classes in this PR
* ColumnConfig
* NestedFieldLiteralColumnIndexSupplier
This PR has: