Aggregation precomputation (rebased) #18927
Conversation
{"run-benchmark-test": "id_13"}

The Jenkins job URL is https://build.ci.opensearch.org/job/benchmark-pull-request/4009/ . Final results will be published once the job is completed.

The command above was used to test the benchmarking bot. I just wanted to see how that worked because I saw it on other PRs. Sorry for the confusion.

Benchmark Results for Job: https://build.ci.opensearch.org/job/benchmark-pull-request/4009/

Benchmark Baseline Comparison Results for Job: https://build.ci.opensearch.org/job/benchmark-compare/145/
sandeshkr419 left a comment
Thanks @ajleong623 for rebasing the changes. I am trying to review, but covering 3 different aggregations in a single PR makes this difficult to review.
Let's break this into multiple independent PRs: missing values, rare terms, significant/string terms.
I do have some comments on the missing terms section, which you can address when you break it up.
(You can put this PR in draft mode and close it once all the independent changes are merged.)
// The optimization does not work when there are subaggregations.
if (subAggregators.length > 0) {
    return false;
}

// When fieldName does not exist, we cannot collect through the precomputation.
if (fieldName == null || weight == null) {
    return false;
}

// We do not collect any documents through the missing aggregation when the missing parameter
// is set.
if (valuesSourceConfig != null && valuesSourceConfig.missing() != null) {
    return true;
}

// The optimization can only be used if there are no deleted documents and the top-level
// query matches all documents in the segment.
if (weight.count(ctx) == 0) {
    return true;
} else if (weight.count(ctx) != ctx.reader().maxDoc()) {
    return false;
Looking into the ordering of the different short-circuits, I think the true cases need to be prioritized first, because short-circuiting early will save a lot of time.
For example, say you have sub-aggs > 0 but weight.count == 0: in this case you will still report that pre-compute returned false and go on to create a leaf collector, even though the number of matching documents (weight.count) is 0.
So I guess prioritizing weight.count == 0 in the ordering makes sense.
Something like:

1. weight == null: return false
2. weight.count == 0: return true
3. valuesSourceConfig != null && valuesSourceConfig.missing() != null or subAggregators.length > 0
4. The remaining checks can stay in the same order after the above.

Between subAggregators.length > 0 and valuesSourceConfig != null && valuesSourceConfig.missing() != null, I'm wondering whether subAggregators matter at all if valuesSourceConfig != null && valuesSourceConfig.missing() != null is satisfied. If the valuesSourceConfig check works with subAggregators as well, then the subAggregators check can be moved after the valuesSourceConfig check.
You can probably start with:

if (weight == null) {
    // Weight not assigned - cannot use this optimization
    return false;
} else {
    if (weight.count(ctx) == 0) {
        // No documents match the top-level query on this segment, so we can skip the segment entirely
        return true;
    } else if (weight.count(ctx) != ctx.reader().maxDoc()) {
        // weight.count(ctx) == ctx.reader().maxDoc() implies there are no deleted documents and
        // the top-level query matches all docs in the segment
        return false;
    }
}
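For the full reordering, a hedged sketch of how the method might look (illustrative only — it assumes the weight/fieldName/valuesSourceConfig/subAggregators names from the diff above, and the method name is hypothetical):

import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;

// Sketch: suggested short-circuit ordering, with the cheap true cases first.
boolean tryPrecompute(LeafReaderContext ctx) throws IOException {
    if (weight == null) {
        // Weight not assigned - cannot use this optimization.
        return false;
    }
    if (weight.count(ctx) == 0) {
        // No documents match the top-level query on this segment; skip it entirely.
        return true;
    }
    if (valuesSourceConfig != null && valuesSourceConfig.missing() != null) {
        // With the missing parameter set, the missing aggregation collects nothing here.
        return true;
    }
    if (subAggregators.length > 0 || fieldName == null) {
        // Sub-aggregations and unmapped fields fall back to normal collection.
        return false;
    }
    if (weight.count(ctx) != ctx.reader().maxDoc()) {
        // Deleted docs or a partially matching query: counts cannot come from the terms dictionary.
        return false;
    }
    // ...proceed with the termsEnum-based precomputation.
    return true;
}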
if (this.valuesSource != null) {
    this.fieldName = valuesSource.getIndexFieldName();
} else {
    this.fieldName = null;
}
nit: just making this more concise and readable
this.fieldName = this.valuesSource != null ? valuesSource.getIndexFieldName() : null;
❌ Gradle check result for 5f69668: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

This PR is stalled because it has been open for 30 days with no activity.
Description
This change is the same as #18106 but rebased onto a more descriptively named branch.
This change expands on the techniques from @sandeshkr419's pull request #11643 to precompute aggregations for match-all or match-none queries. We can leverage reading from termsEnum to precompute the aggregation when the field is indexed and there are no deletions. We can check that no documents are deleted by using the weight and checking that its count matches the reader's maxDoc.
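As a rough illustration of the idea (a minimal sketch, not the PR's actual code — the incrementBucket helper is hypothetical), the precomputation walks the segment's terms dictionary and uses each term's docFreq() as the bucket count, which is only valid when every live document matches the query:

import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Sketch: read bucket counts straight from the terms dictionary instead of
// collecting documents one by one. Only valid when weight.count(ctx)
// == ctx.reader().maxDoc(), i.e. no deletions and the query matches all docs.
boolean precomputeFromTerms(LeafReaderContext ctx, String fieldName) throws IOException {
    Terms terms = ctx.reader().terms(fieldName);
    if (terms == null) {
        return false; // field is not indexed on this segment
    }
    TermsEnum termsEnum = terms.iterator();
    BytesRef term;
    while ((term = termsEnum.next()) != null) {
        // docFreq() is the number of documents containing this term, which equals
        // the bucket count when every document in the segment matches the query.
        incrementBucket(term, termsEnum.docFreq()); // hypothetical bucket helper
    }
    return true;
}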
Unfortunately, I was not able to use the same technique for numeric aggregators like LongRareTermsAggregator. This is because numeric points are not indexed by term frequency but through KD-trees, which are optimized for different types of operations: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/PointValues.java
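For contrast, a minimal sketch (assuming a single-dimension long field; the in-memory map is just for illustration) of what counting per value over PointValues would take — every matching document still has to be visited, which defeats the purpose of the precomputation:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PointValues;
import org.apache.lucene.util.NumericUtils;

// Sketch: PointValues exposes KD-tree cells, not per-value frequencies,
// so counting occurrences of each value still visits every document.
Map<Long, Long> countPointValues(LeafReaderContext ctx, String fieldName) throws IOException {
    Map<Long, Long> counts = new HashMap<>();
    PointValues points = ctx.reader().getPointValues(fieldName);
    if (points != null) {
        points.intersect(new PointValues.IntersectVisitor() {
            @Override
            public void visit(int docID) {
                // Only called for cells fully inside the query; no value available here.
            }

            @Override
            public void visit(int docID, byte[] packedValue) {
                long value = NumericUtils.sortableBytesToLong(packedValue, 0);
                counts.merge(value, 1L, Long::sum);
            }

            @Override
            public PointValues.Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
                // Force per-document visits so we can read each value.
                return PointValues.Relation.CELL_CROSSES_QUERY;
            }
        });
    }
    return counts;
}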
Please let me know if there are any comments, concerns or suggestions.
Related Issues
Resolves #13123
#13122
#10954
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.