-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support IN predicate in ColumnValue SegmentPruner (#6756) #6776
Conversation
private int _inPredicateThreshold; | ||
public static final int DEFAULT_VALUE_FOR_IN_PREDICATE = 10; | ||
public static final String CONFIG_MAX_VALUE_FOR_IN_PREDICATE = "pinot.segment.pruner.columnvalue.in.threshold"; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be moved to CommonConstants in inner Server class.
We should use pinot.server.query.executor.pruner
prefix since that is what is used by QueryExecutorConfig when it creates SegmentPrunerConfig. Please check the QueryExecutorConfig constructor.
Secondly, since this config is applicable to min-max column value pruner on the server, we should use the convention pinot.server.query.executor.pruner.columnvaluesegmentpruner.<your new in predicate config>
This is because we create the config for each type of pruner as seen in the following code in SegmentPrunerConfig
for (String segmentPrunerName : segmentPrunerNames) {
_segmentPrunerNames.add(segmentPrunerName);
_segmentPrunerConfigs.add(segmentPrunerConfig.subset(segmentPrunerName));
}
When init() is called on ColumnValueSegmentPruner, it is initialized with all configs under pinot.server.query.executor.pruner.columnvaluesegmentpruner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
* <li>Column min/max value</li> | ||
* </ul> | ||
*/ | ||
private boolean pruneInPredicate(IndexSegment segment, InPredicate inPredicate, Map<String, DataSource> dataSourceCache) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) please consider adding more details to javadoc highlighting when the pruning won't happen if number of values is greater than threshold. Also add the return value - true if the segment can be pruned , false otherwise ....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
return true; | ||
} | ||
|
||
private boolean checkMinMaxRange(DataSourceMetadata dataSourceMetadata, Comparable value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) please add brief javadoc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
Codecov Report
@@ Coverage Diff @@
## master #6776 +/- ##
=============================================
+ Coverage 43.35% 66.04% +22.69%
- Complexity 7 12 +5
=============================================
Files 1411 1411
Lines 68666 68685 +19
Branches 9918 9924 +6
=============================================
+ Hits 29768 45363 +15595
+ Misses 36373 20106 -16267
- Partials 2525 3216 +691
Flags with carried forward coverage won't be shown. Click here to find out more. Continue to review full report at Codecov.
|
@@ -36,6 +38,8 @@ | |||
import org.apache.pinot.spi.data.FieldSpec.DataType; | |||
import org.apache.pinot.spi.env.PinotConfiguration; | |||
|
|||
import static org.apache.pinot.common.utils.CommonConstants.Server.CONFIG_THRESHOLD_FOR_IN_PREDICATE; | |||
import static org.apache.pinot.common.utils.CommonConstants.Server.DEFAULT_VALUE_FOR_IN_PREDICATE_THRESHOLD; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) style check - please avoid static imports
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
bca5caf
to
ff73b5a
Compare
public static final int DEFAULT_VALUE_FOR_IN_PREDICATE_THRESHOLD = 10; | ||
public static final String CONFIG_THRESHOLD_FOR_IN_PREDICATE= "pinot.server.query.executor.pruner.columnvaluesegmentpruner.in.threshold"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public static final int DEFAULT_VALUE_FOR_IN_PREDICATE_THRESHOLD = 10; | |
public static final String CONFIG_THRESHOLD_FOR_IN_PREDICATE= "pinot.server.query.executor.pruner.columnvaluesegmentpruner.in.threshold"; | |
public static final String CONFIG_OF_VALUE_PRUNER_IN_PREDICATE_THRESHOLD = "pinot.server.query.executor.pruner.columnvaluesegmentpruner.inpredicate.threshold"; | |
public static final int DEFAULT_VALUE_PRUNER_IN_PREDICATE_THRESHOLD = 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
@Override | ||
public void init(PinotConfiguration config) { | ||
_inPredicateThreshold = config.getProperty(Server.CONFIG_THRESHOLD_FOR_IN_PREDICATE, Server.DEFAULT_VALUE_FOR_IN_PREDICATE_THRESHOLD); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we exclude the prefix pinot.server.query.executor.pruner.columnvaluesegmentpruner
from the config key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might make the name look cleaner. However, keeping the dot notation based name makes it consistent with all the config keys we have today. Also the name makes it self explanatory that it is for pruner on server query execution
What do you guys think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds okay to me to keep the prefix
* <li> false if the value is greater than min value or value is smaller than max value</li> | ||
* </ul> | ||
*/ | ||
private boolean checkMinMaxRange(DataSourceMetadata dataSourceMetadata, Comparable value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest returning false
if the value is not within the range, which is more intuitive.
(Optional) Also, we can slightly improve the performance by first perform the null check on the min/max value, then loop over the values to compare, instead of doing null checks for each value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done! for part 1. For the optional part, how do you suggest to do that as the function might be needed to be split into separate functions for minValue and maxValue check independently for cleaner approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep it this way then. This is cleaner, and should have travail performance impact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix the config key
@Override | ||
public void init(PinotConfiguration config) { | ||
_inPredicateThreshold = config.getProperty(Server.CONFIG_OF_VALUE_PRUNER_IN_PREDICATE_THRESHOLD, Server.DEFAULT_VALUE_PRUNER_IN_PREDICATE_THRESHOLD); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrect because when the config is passed from the SegmentPrunerProvider
, the prefix is already removed, and here you should use key inpredicate.threshold
to access the config.
The reason why the test works is because you directly pass the top-level config into this pruner class, which is not the case in the production code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
* <li> false if the value is greater than min value or value is smaller than max value</li> | ||
* </ul> | ||
*/ | ||
private boolean checkMinMaxRange(DataSourceMetadata dataSourceMetadata, Comparable value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep it this way then. This is cleaner, and should have travail performance impact.
* <ul> | ||
* <li>Column min/max value</li> | ||
* </ul> | ||
* Returns: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Javadoc appears to be very verbose for this function and is taking up too many vertical lines. Can we keep it concise by getting rid of <ul>
. Also I think @return
is the standard way of specifying function return values (unless something has changed recently). I would suggest just having the following as javadoc:
/**
* @return true if size of values is greater than threshold or if value is greater than min and less than max;
* otherwise, false.
*/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
/** | ||
* Check if the comparable value is within min/max range | ||
* <ul> | ||
* <li>Column min/max value</li> | ||
* </ul> | ||
* Returns: | ||
* <ul> | ||
* <li> true if the value is greater than min value or value is smaller than max value</li> | ||
* <li> false if the value is smaller than min value or value is greater than max value</li> | ||
* </ul> | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here. A single @return
should be sufficient and will describe what the function is doing well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
378fad9
to
968a815
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM otherwise
@@ -237,6 +237,8 @@ | |||
"pinot.server.instance.realtime.alloc.offheap.direct"; | |||
public static final String PREFIX_OF_CONFIG_OF_PINOT_FS_FACTORY = "pinot.server.storage.factory"; | |||
public static final String PREFIX_OF_CONFIG_OF_PINOT_CRYPTER = "pinot.server.crypter"; | |||
public static final String CONFIG_OF_VALUE_PRUNER_IN_PREDICATE_THRESHOLD = "inpredicate.threshold"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the full config key here which can be used to update the server config, and introduce another constant in ColumnValueSegmentPruner
for the config key without the prefix.
Server side segment pruning is currently supported for =, RANGE filter operators using min-max value stats (segment metadata). Similarly, bloom filter is also used for = filter.
For IN filter operator, we should add support for min-max value based pruning if the number of values in the IN clause are below a certain threshold.
Adding this support for large number of values in IN clause won't be helpful as the pruning may not happen (since values are likely to be spread across several segments) and the time to prune itself may negate the benefits. So, let's start with a configurable value with default being less than 10.
Issue #6756