Decouple pipeline reductions from final agg reduction #45796
Historically only two things happened in the final reduction: empty buckets were filled, and pipeline aggs were reduced (since it was the final reduction, this was safe). Usage of the final reduction is growing, however. Auto-date-histo might need to perform many reductions on final-reduce to merge down buckets, CCS may need to side-step the final reduction if sending to a different cluster, etc.

Having pipelines generate their output in the final reduce was convenient, but is becoming increasingly difficult to manage as the rest of the agg framework advances.

This commit decouples pipeline aggs from the final reduction:

1. Introduces a new "top level" reduce, which should be called at the beginning of the reduce cycle (e.g. from the SearchPhaseController)
2. Adds a `materializePipeline()` method to InternalAggs and InternalMultiBucket. This is essentially the final reduce for pipelines
3. Makes reductions on pipelines a no-op

By separating pipeline reduction into its own set of methods, aggregations are free to use the final reduction for whatever purpose without worrying about generating pipeline results which are non-reducible.
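To make the split concrete, here is a minimal toy sketch of the idea, not the actual Elasticsearch code. `DecoupledReduceSketch`, `Histogram`, and `maxBucketCount` are hypothetical names invented for illustration; only `materializePipeline()` echoes the method name used in this PR. The ordinary reduce merges bucket counts and can be called any number of times, while the pipeline output (here a "max bucket" style value) is computed once, after all reductions are done.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model: none of these classes exist in Elasticsearch.
final class DecoupledReduceSketch {

    /** Bucket counts keyed by bucket start, plus a pipeline output that stays null until materialized. */
    static final class Histogram {
        final TreeMap<Long, Long> buckets;
        final Long maxBucketCount; // pipeline output; null means "not computed yet"

        Histogram(Map<Long, Long> buckets, Long maxBucketCount) {
            this.buckets = new TreeMap<>(buckets);
            this.maxBucketCount = maxBucketCount;
        }

        /** Ordinary reduce: merges counts per bucket and never produces pipeline output. */
        static Histogram reduce(List<Histogram> partials) {
            TreeMap<Long, Long> merged = new TreeMap<>();
            for (Histogram partial : partials) {
                partial.buckets.forEach((key, count) -> merged.merge(key, count, Long::sum));
            }
            return new Histogram(merged, null);
        }

        /** Pipeline step: meant to run once, after every reduction has finished. */
        Histogram materializePipeline() {
            long max = buckets.values().stream().mapToLong(Long::longValue).max().orElse(0L);
            return new Histogram(buckets, max);
        }
    }
}
```

In this toy model a coordinator can call `Histogram.reduce(...)` on partial results as often as it likes (for example, once per batch of shard responses) and call `materializePipeline()` only on the fully reduced tree, so the pipeline value never has to be merged with another pipeline value.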
Pinging @elastic/es-analytics-geo
Note: the first commit contains the main changes, the second commit ( |
/cc @javanna since you've looked at the final reduce stuff for CCS
Someone who knows Significant Terms aggregation better should probably comment on that part, I'm just taking it on faith. Otherwise this looks good to me. I left two nits about the use of the Streams API, but they're both a matter of opinion, so don't feel like you have to fix them.
server/src/main/java/org/elasticsearch/search/aggregations/InternalAggregations.java
server/src/main/java/org/elasticsearch/search/aggregations/InternalMultiBucketAggregation.java
+1 to separate the aggs reduction and the pipeline reduction. However, I wonder if all the renamings are needed. Could we just change `InternalAggregations` to separate `reduce` and `reducePipeline`?
server/src/main/java/org/elasticsearch/search/aggregations/InternalAggregation.java
Review comments addressed. I renamed the "materialize" stuff back to "reduce", although there's a small hiccup. Because InternalAggregations's I first renamed that to In the second commit I changed it to I don't have a strong opinion either way, although I'd lean towards renaming to reduce() and just accepting that it touches every single aggregation :)
The change looks good to me.
> I don't have a strong opinion either way, although I'd lean towards renaming to reduce() and just accepting that it touches every single aggregation :)

+1 to move back to `reduce`, and no need for another round of review since the change should be straightforward ;).
This reverts commit ba627f6.
@elasticmachine update branch
Oh my goodness, I thought I merged this. Fixing up conflicts so that it can go in. Oops!
20ad569 to abbd05c
@elasticmachine run elasticsearch-ci/packaging-sample-matrix
Historically only two things happened in the final reduction: empty buckets were filled, and pipeline aggs were reduced (since it was the final reduction, this was safe). Usage of the final reduction is growing, however. Auto-date-histo might need to perform many reductions on final-reduce to merge down buckets, CCS may need to side-step the final reduction if sending to a different cluster, etc.

Having pipelines generate their output in the final reduce was convenient, but is becoming increasingly difficult to manage as the rest of the agg framework advances.

This commit decouples pipeline aggs from the final reduction by introducing a new "top level" reduce, which should be called at the beginning of the reduce cycle (e.g. from the SearchPhaseController). This will only reduce pipeline aggs on the final reduce after the non-pipeline agg tree has been fully reduced.

By separating pipeline reduction into its own set of methods, aggregations are free to use the final reduction for whatever purpose without worrying about generating pipeline results which are non-reducible.
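The "top level" reduce described in this commit message can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the real `InternalAggregations` API: `TopLevelReduceSketch`, `topLevelReduce`, `withTotal`, and the map-based representation of an agg tree are all hypothetical. The point it tries to show is the ordering: the non-pipeline tree is reduced first, and pipelines run only when the caller says this is the final reduce.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: an "agg tree" is modelled as a flat map of metric name to value.
final class TopLevelReduceSketch {

    /** Merge-only reduce: sums values by name. Safe to call on partial results repeatedly. */
    static Map<String, Double> reduce(List<Map<String, Double>> partials) {
        Map<String, Double> merged = new HashMap<>();
        for (Map<String, Double> partial : partials) {
            partial.forEach((name, value) -> merged.merge(name, value, Double::sum));
        }
        return merged;
    }

    /** Example pipeline: adds a "total" entry derived from the fully reduced values. */
    static Map<String, Double> withTotal(Map<String, Double> reduced) {
        Map<String, Double> out = new HashMap<>(reduced);
        out.put("total", reduced.values().stream().mapToDouble(Double::doubleValue).sum());
        return out;
    }

    /** Entry point for a reduce cycle: reduce the tree, then apply pipelines only on the final reduce. */
    static Map<String, Double> topLevelReduce(
            List<Map<String, Double>> partials,
            boolean isFinalReduce,
            List<UnaryOperator<Map<String, Double>>> pipelines) {
        Map<String, Double> reduced = reduce(partials);
        if (isFinalReduce) {
            for (UnaryOperator<Map<String, Double>> pipeline : pipelines) {
                reduced = pipeline.apply(reduced);
            }
        }
        return reduced;
    }
}
```

If `withTotal` were applied during an intermediate reduce in this model, the derived "total" entry would itself be summed in the next round and double-count; gating pipelines on the final reduce avoids that.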
Historically only two extra activities happened in the final reduction: empty buckets were filled, and pipeline aggs were reduced (since it was the final reduction, this was safe). Usage of the final reduction is growing, however. Auto-date-histo might need to perform many reductions on final-reduce to merge down buckets, CCS may need to side-step the final reduction if sending to a different cluster, etc.

Having pipelines generate their output in the final reduce was convenient, but is becoming increasingly difficult to manage as the rest of the agg framework advances.

This commit decouples pipeline aggs from the final reduction:

1. Introduces a new "top level" reduce, which should be called at the beginning of the reduce cycle (e.g. from the SearchPhaseController)
2. Adds a `materializePipeline()` method to InternalAggs and InternalMultiBucket. This is essentially the final reduce for pipelines
3. Makes reductions on pipelines a no-op

By separating pipeline reduction into its own set of methods, aggregations are free to use the final reduction for whatever purpose without worrying about generating pipeline results which are non-reducible.
Closes #44914, predecessor PR was #45359