
smarter nested column index utilization #13977

Merged

Conversation

@clintropolis (Member) commented Mar 25, 2023

Description

A sort of follow-up to #12830, which added NumericRangeIndex to improve the performance of bound filters on numeric nested columns, but noted that there were still cases where the cost of bitmap operations exceeded the cost of simply doing a full scan with value matchers. This is also more or less the problem described in #3878.

This PR attempts to improve the situation by adding cardinality-based thresholds for range and predicate indexes, used to decide when to skip bitmap indexes entirely.

changes:

  • adds skipValueRangeIndexScale and skipValuePredicateIndexScale to ColumnConfig (e.g. DruidProcessingConfig), available as system config via druid.processing.indexes.skipValueRangeIndexScale and druid.processing.indexes.skipValuePredicateIndexScale
  • NestedColumnIndexSupplier multiplies skipValueRangeIndexScale and skipValuePredicateIndexScale by the total number of rows to be processed to determine the threshold above which bitmap indexes should no longer be considered, because performing that many bitmap operations would cost more than a full scan (see the sketch below)
  • default values for both skipValueRangeIndexScale and skipValuePredicateIndexScale are initially set to 0.08, but they are separate settings to allow independent tuning
  • these are deliberately undocumented for now because they are hard to explain; they mainly exist to help conduct larger-scale experiments than the JMH benchmarks used to derive the initial values
  • these changes provide a substantial performance boost for filter processing on nested columns

We might want to consider making these same changes to DictionaryEncodedStringIndexSupplier, but I have not done so at this time.
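
For illustration, the math behind the scale settings reduces to the following sketch (the variable names and the example rangeSize are illustrative; the real checks live in the nested column index suppliers):

double skipValueRangeIndexScale = 0.08; // druid.processing.indexes.skipValueRangeIndexScale
int numRows = 5_000_000;                // segment size used in the benchmarks below
int rangeSize = 450_000;                // hypothetical: distinct dictionary values the range filter matches
int skipRangeIndexThreshold = (int) Math.ceil(skipValueRangeIndexScale * numRows); // 400,000
// unioning 450,000 per-value bitmaps would cost more than a scan, so the
// bitmap index is skipped and the filter falls back to a ValueMatcher
boolean useBitmapIndex = rangeSize <= skipRangeIndexThreshold; // false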

Benchmarks comparing nested columns to regular columns

SELECT SUM(long1) FROM foo WHERE double3 < 10000.0 AND double3 > 1000.0
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 10000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0

Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
SqlNestedDataBenchmark.querySql       14           5000000              none        false  avgt    5  102.947 ±   0.875  ms/op
SqlNestedDataBenchmark.querySql       14           5000000              none        force  avgt    5   63.557 ±   0.477  ms/op
SqlNestedDataBenchmark.querySql       15           5000000              none        false  avgt    5  122.576 ±   0.919  ms/op
SqlNestedDataBenchmark.querySql       15           5000000              none        force  avgt    5   66.158 ±   0.667  ms/op
SELECT SUM(long1) FROM foo WHERE double3 < 1005.0 AND double3 > 1000.0
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 1005.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0

Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
SqlNestedDataBenchmark.querySql       26           5000000              none        false  avgt    5   89.212 ±   4.634  ms/op
SqlNestedDataBenchmark.querySql       26           5000000              none        force  avgt    5   60.594 ±   3.423  ms/op
SqlNestedDataBenchmark.querySql       27           5000000              none        false  avgt    5   19.526 ±   0.543  ms/op
SqlNestedDataBenchmark.querySql       27           5000000              none        force  avgt    5   23.740 ±   0.404  ms/op
SELECT SUM(long1) FROM foo WHERE double3 < 2000.0 AND double3 > 1000.0
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 2000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0

Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
SqlNestedDataBenchmark.querySql       28           5000000              none        false  avgt    5  113.113 ±  14.400  ms/op
SqlNestedDataBenchmark.querySql       28           5000000              none        force  avgt    5   62.150 ±   0.211  ms/op
SqlNestedDataBenchmark.querySql       29           5000000              none        false  avgt    5   60.687 ±   0.549  ms/op
SqlNestedDataBenchmark.querySql       29           5000000              none        force  avgt    5   50.789 ±   0.935  ms/op
SELECT SUM(long1) FROM foo WHERE double3 < 3000.0 AND double3 > 1000.0
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 3000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0

Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
SqlNestedDataBenchmark.querySql       30           5000000              none        false  avgt    5  104.038 ±   1.316  ms/op
SqlNestedDataBenchmark.querySql       30           5000000              none        force  avgt    5   62.930 ±   0.327  ms/op
SqlNestedDataBenchmark.querySql       31           5000000              none        false  avgt    5  126.925 ±   1.966  ms/op
SqlNestedDataBenchmark.querySql       31           5000000              none        force  avgt    5   67.254 ±   2.807  ms/op
SELECT SUM(long1) FROM foo WHERE double3 < 5000.0 AND double3 > 1000.0
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 5000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0

Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
SqlNestedDataBenchmark.querySql       32           5000000              none        false  avgt    5  111.034 ±   1.047  ms/op
SqlNestedDataBenchmark.querySql       32           5000000              none        force  avgt    5   73.748 ±   0.749  ms/op
SqlNestedDataBenchmark.querySql       33           5000000              none        false  avgt    5  116.868 ±   2.659  ms/op
SqlNestedDataBenchmark.querySql       33           5000000              none        force  avgt    5   75.269 ±   0.208  ms/op
34,35 smaller cardinality like range filter
SELECT SUM(long1) FROM foo WHERE string1 LIKE '1%'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') LIKE '1%'

Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
SqlNestedDataBenchmark.querySql       34           5000000              none        false  avgt    5   34.077 ±   0.365  ms/op
SqlNestedDataBenchmark.querySql       34           5000000              none        force  avgt    5   25.754 ±   0.384  ms/op
SqlNestedDataBenchmark.querySql       35           5000000              none        false  avgt    5   59.226 ±   0.300  ms/op
SqlNestedDataBenchmark.querySql       35           5000000              none        force  avgt    5   29.900 ±   0.301  ms/op
36,37 smaller cardinality like predicate filter
SELECT SUM(long1) FROM foo WHERE string1 LIKE '%1%'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') LIKE '%1%'

Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
SqlNestedDataBenchmark.querySql       36           5000000              none        false  avgt    5   81.823 ±   0.609  ms/op
SqlNestedDataBenchmark.querySql       36           5000000              none        force  avgt    5   40.584 ±   0.703  ms/op
SqlNestedDataBenchmark.querySql       37           5000000              none        false  avgt    5   80.930 ±   4.364  ms/op
SqlNestedDataBenchmark.querySql       37           5000000              none        force  avgt    5   37.533 ±   0.464  ms/op
38, 39 moderate cardinality like range
SELECT SUM(long1) FROM foo WHERE string5 LIKE '1%'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '1%'

Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
SqlNestedDataBenchmark.querySql       38           5000000              none        false  avgt    5   77.782 ±   0.898  ms/op
SqlNestedDataBenchmark.querySql       38           5000000              none        force  avgt    5   44.718 ±   0.882  ms/op
SqlNestedDataBenchmark.querySql       39           5000000              none        false  avgt    5   68.806 ±   0.952  ms/op
SqlNestedDataBenchmark.querySql       39           5000000              none        force  avgt    5   50.848 ±   0.975  ms/op
40, 41 big cardinality lex range
SELECT SUM(long1) FROM foo WHERE string5 > '1'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') > '1'

Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
SqlNestedDataBenchmark.querySql       40           5000000              none        false  avgt    5  245.087 ±   1.706  ms/op
SqlNestedDataBenchmark.querySql       40           5000000              none        force  avgt    5  222.619 ±   3.206  ms/op
SqlNestedDataBenchmark.querySql       41           5000000              none        false  avgt    5  354.551 ±   8.115  ms/op
SqlNestedDataBenchmark.querySql       41           5000000              none        force  avgt    5  211.717 ±   9.943  ms/op
42, 43 big cardinality like predicate filter
SELECT SUM(long1) FROM foo WHERE string5 LIKE '%1%'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '%1%'

Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
SqlNestedDataBenchmark.querySql       42           5000000              none        false  avgt    5  796.396 ± 119.324  ms/op
SqlNestedDataBenchmark.querySql       42           5000000              none        force  avgt    5  648.279 ±   4.008  ms/op
SqlNestedDataBenchmark.querySql       43           5000000              none        false  avgt    5  504.493 ±  92.658  ms/op
SqlNestedDataBenchmark.querySql       43           5000000              none        force  avgt    5  383.268 ±  14.283  ms/op
44, 45 big cardinality like filter + selector filter
SELECT SUM(long1) FROM foo WHERE string5 LIKE '%1%' AND string1 = '1000'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '%1%' AND JSON_VALUE(nested, '$.nesteder.string1') = '1000'

Benchmark                        (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt    Score     Error  Units
SqlNestedDataBenchmark.querySql       44           5000000              none        false  avgt    5  641.740 ±  13.625  ms/op
SqlNestedDataBenchmark.querySql       44           5000000              none        force  avgt    5  598.289 ±   9.617  ms/op
SqlNestedDataBenchmark.querySql       45           5000000              none        false  avgt    5   21.523 ±   0.254  ms/op
SqlNestedDataBenchmark.querySql       45           5000000              none        force  avgt    5   19.543 ±   0.231  ms/op

The really big wins here will be queries like the last one, where selector or other cheap filters are combined with high-cardinality ranges or predicates, since the expensive filters get pushed into being 'post' filters.

Release note


Key changed/added classes in this PR
  • ColumnConfig
  • NestedFieldLiteralColumnIndexSupplier

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@clintropolis (Member, Author) commented:

I decided to base the DruidPredicateIndex threshold on the overall value cardinality instead of evaluating each predicate, on the assumption that the predicates themselves are not that cheap. However, there might be room to explore evaluating the predicates below a certain dictionary size to determine the actual number of bitmap operations, like we do for range indexes (well, the LexicographicalRangeIndex for strings has a predicate form that further reduces the actual number of bitmaps, but the threshold currently considers the full range before the predicate is applied).

@clintropolis (Member, Author) commented:

Also, it looks like I might have some legitimate test failures, possibly due to a mismatch in behavior between indexes and value matchers on some nested columns 😓 investigating

@clintropolis (Member, Author) commented:

The different test results seem to be caused by differing behavior between the predicate index used by the bound filter and the value matcher; the latter looks like it is matching nulls. The results now actually look more consistent with other numeric filters that use predicates, due to lexicographic sorting.

I think there is a bug with the bound filter: it needs to wrap the predicate index in the null-matching lower bound in the same way that the range index does.

@clintropolis (Member, Author) commented:

> I think there is a bug with the bound filter: it needs to wrap the predicate index in the null-matching lower bound in the same way that the range index does.

Actually, the mismatch is between the double predicate used by the DruidPredicateIndex of the nested column and the string predicate used by the value matcher. I'm not entirely certain how to fix this yet, since the virtual column is planned with an output type of 'string' while internally the nested field column is typed 'double', and so it uses the double predicate. The predicate index might need to further specify which predicate should be used, instead of just supplying the factory, so that I can make this better behaved.

Comment on lines 31 to 32
* This is indicated returning null from {@link ColumnIndexSupplier#as(Class)} even though it would have otherwise
* been able to create a {@link BitmapColumnIndex}. For range indexes on columns where every value has an index, the
Contributor commented:

I'm scared of exposing this behavior through adjusting the return of .as(). I think it would be better to have the columns continue to return a meaningful object for .as(), but have those interfaces that benefit from this be able to opt themselves out of being used for bitmaps. (Or, in some future world, return an equivalent ValueMatcher.)

@clintropolis (Member, Author) commented:

This was already true of as() itself; returning null has always been an indicator that the index we asked for is not available. I just made that a bit clearer in the docs, and also leaned into allowing the things as() returns to themselves return null to indicate that an index is not available.

For the types of short circuits done in this PR, I feel like the index supplier has better information to make the decision than anything above it, since it has direct access to the size of the dictionary, the number of rows to be scanned, etc., but I will think on it.
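
For context, the contract under discussion is ColumnIndexSupplier#as(Class), which is already declared nullable; roughly:

public interface ColumnIndexSupplier
{
  // null has always meant "the requested index type is not available for this
  // column", so callers already have to handle it
  @Nullable
  <T> T as(Class<T> clazz);
}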

Comment on lines 39 to 42
* This can make many standalone filters faster in cases where the overhead of doing bitmap operations to construct a
* {@link org.apache.druid.segment.BitmapOffset} or {@link org.apache.druid.segment.vector.BitmapVectorOffset} can
* often exceed the cost of just using doing a full scan and using a
* {@link org.apache.druid.query.filter.ValueMatcher}.
Contributor commented:

I think the "many" in "This can make many standalone filters faster" is a bit of hyperbole. What about "Some standalone filters speed up with this optimization in cases where the overhead of walking the dictionary and combining bitmaps to construct ..."?

Comment on lines +176 to +197
"SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 2000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
// 30, 31
"SELECT SUM(long1) FROM foo WHERE double3 < 3000.0 AND double3 > 1000.0",
"SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 3000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
// 32,33
"SELECT SUM(long1) FROM foo WHERE double3 < 5000.0 AND double3 > 1000.0",
"SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 5000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
// 34,35 smaller cardinality like range filter
"SELECT SUM(long1) FROM foo WHERE string1 LIKE '1%'",
"SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') LIKE '1%'",
// 36,37 smaller cardinality like predicate filter
"SELECT SUM(long1) FROM foo WHERE string1 LIKE '%1%'",
"SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') LIKE '%1%'",
// 38-39 moderate cardinality like range
"SELECT SUM(long1) FROM foo WHERE string5 LIKE '1%'",
"SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '1%'",
// 40, 41 big cardinality lex range
"SELECT SUM(long1) FROM foo WHERE string5 > '1'",
"SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') > '1'",
// 42, 43 big cardinality like predicate filter
"SELECT SUM(long1) FROM foo WHERE string5 LIKE '%1%'",
"SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '%1%'",
Contributor commented:

Can you add various sized IN clauses to your benchmark? 10, 100, 1000, 10000, 100000?

After reading the PR, I don't think they should be impacted, but it would be good to confirm that they aren't: IN clauses are an example of a thing that falls back to predicate-like behavior, and we have also seen them cause performance issues when run as a ValueMatcher.

@clintropolis (Member, Author) commented Mar 27, 2023:

InDimFilter typically only uses Utf8ValueSetIndex and StringValueSetIndex, which cannot be impacted by the changes currently in this PR, unless it is using an ExtractionFn, which causes it to use a DruidPredicateIndex. I need to figure out how to get a query in here that plans to a filter with an extractionFn to see how this PR would impact it.

Comment on lines 537 to 539
ColumnarInts intsColumn = ints.get();
final int size = intsColumn.size();
intsColumn.close();
Contributor commented:

This type of pattern is a bit scary. In this instance, given that it happens when deserializing the object, it's probably fine, but we've had issues where we thought we were getting a thing and then closing it, only to find out that we were getting a thing that was cached and supposed to live for the lifetime of a query, and the immediate close was closing things out from under it.

That said, I think we can make this cleaner by making the Supplier<> itself be a class with a method that knows the number of rows represented by the thing it supplies. It should be easy enough to implement this for the suppliers used in this case (the number is already on the JVM heap for those objects anyway), and it avoids any need to create an object just to close it immediately.
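
A minimal sketch of that suggestion (the class name is hypothetical, and java.util.function.Supplier stands in for whatever supplier interface the columns actually use):

import java.util.function.Supplier;

// a supplier that also knows the size of what it supplies, so callers never
// need to get() and immediately close() a column just to ask for its size
abstract class SizedSupplier<T> implements Supplier<T>
{
  private final int size;

  SizedSupplier(int size)
  {
    this.size = size; // already on the JVM heap when the supplier is built
  }

  int size()
  {
    return size;
  }
}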

Comment on lines 550 to 551
columnConfig.skipValueRangeIndexScale(),
columnConfig.skipValuePredicateIndexScale()
Contributor commented:

Why not just pass the columnConfig object itself?

Comment on lines 134 to 136
if (skipPredicateIndex && clazz.equals(DruidPredicateIndex.class)) {
return null;
}
Contributor commented:

Technically, this would indicate that the column cannot implement the given interface, so the filter in question would fall back to another interface (i.e. it might make the filter cascade down into trying an even worse strategy). Instead, we should return a DruidPredicateIndex that returns null when asked to compute the index, indicating that it was unable to compute it. This will make it easier to do this "correctly" when we change this object to also be able to return ValueMatcher objects.
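
In code, the suggested shape would be roughly this sketch (assuming forPredicate is the interface's only abstract method, as in the snippet quoted further down in this review):

// keep answering as(DruidPredicateIndex.class) with a real object, but let
// that object decline to compute an index
DruidPredicateIndex skippingIndex = new DruidPredicateIndex()
{
  @Nullable
  @Override
  public BitmapColumnIndex forPredicate(DruidPredicateFactory matcherFactory)
  {
    // null here means "could not compute an index for this predicate", so the
    // filter uses a ValueMatcher instead of cascading to a worse index strategy
    return null;
  }
};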

@clintropolis clintropolis added this to the 26.0 milestone Apr 1, 2023
@@ -199,7 +207,9 @@
       Supplier<FrontCodedIntArrayIndexed> arrayDictionarySupplier,
       Supplier<ColumnarInts> encodedValueColumnSupplier,
       GenericIndexed<ImmutableBitmap> valueIndexes,
-      @SuppressWarnings("unused") BitmapFactory bitmapFactory
+      @SuppressWarnings("unused") BitmapFactory bitmapFactory,
+      @SuppressWarnings("unused") ColumnConfig columnConfig,

Check notice (Code scanning / CodeQL): Useless parameter: the parameter 'columnConfig' is never used.
@SuppressWarnings("unused") BitmapFactory bitmapFactory
@SuppressWarnings("unused") BitmapFactory bitmapFactory,
@SuppressWarnings("unused") ColumnConfig columnConfig,
@SuppressWarnings("unused") int numRows

Check notice (Code scanning / CodeQL): Useless parameter: the parameter 'numRows' is never used.
Comment on lines 124 to 125
this.skipRangeIndexThreshold = (int) Math.ceil(columnConfig.skipValueRangeIndexScale() * numRows);
this.skipPredicateIndex = localDictionarySupplier.get().size() > Math.ceil(columnConfig.skipValuePredicateIndexScale() * numRows);
Contributor commented:

I think it's a bit nicer to make these methods instead of fields computed in every constructor. The computation being done is minimal, that's true, but it's better to only do it when needed. If you just store a reference to the ColumnConfig, you can pretty easily move this check into private methods. All of the places where you would check for skipping the predicate index would probably also be grabbing a localDictionarySupplier anyway, so you can reuse that single reference too.
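
A sketch of that refactor (assuming columnConfig, numRows, and localDictionarySupplier are kept as fields; the method names are illustrative):

// compute lazily in private methods instead of eagerly in the constructor
private boolean skipRangeIndex(int rangeSize)
{
  return rangeSize > (int) Math.ceil(columnConfig.skipValueRangeIndexScale() * numRows);
}

private boolean skipPredicateIndex()
{
  // callers checking this typically already grab localDictionarySupplier.get(),
  // so that reference could be passed in and reused instead
  return localDictionarySupplier.get().size() > Math.ceil(columnConfig.skipValuePredicateIndexScale() * numRows);
}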

@clintropolis clintropolis removed the WIP label Apr 5, 2023
@@ -36,4 +36,12 @@
    */
   @Nullable
   BitmapColumnIndex forPredicate(DruidPredicateFactory matcherFactory);

+  static boolean checkSkipThreshold(@Nullable ColumnConfig columnConfig, int numRowsToScan, int dictionaryCardinality)
Contributor commented:

shouldSkip? checkSkipThreshold doesn't really tell me how I'm supposed to interpret true/false. Does true mean that it exceeded the threshold and therefore should be skipped? Or does it mean that it did not exceed the threshold and should be used?


static boolean checkSkipThreshold(ColumnConfig columnConfig, int numRowsToScan, int rangeSize)
{
  return rangeSize > (int) Math.ceil(columnConfig.skipValueRangeIndexScale() * numRowsToScan);
}
Contributor commented:

You aren't doing the null check on this one. That intentional?
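
For reference, a null-guarded variant mirroring the @Nullable predicate-side helper above might look like this (treating a missing config as 'never skip' is an assumption):

static boolean checkSkipThreshold(@Nullable ColumnConfig columnConfig, int numRowsToScan, int rangeSize)
{
  if (columnConfig == null) {
    // assumption: no config means no threshold is configured, so never skip
    return false;
  }
  return rangeSize > (int) Math.ceil(columnConfig.skipValueRangeIndexScale() * numRowsToScan);
}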

@clintropolis clintropolis merged commit b11c0bc into apache:master Apr 6, 2023
@clintropolis clintropolis deleted the smarter-nested-column-indexes branch April 6, 2023 11:09