Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decouple pipeline reductions from final agg reduction #45796

Merged
merged 11 commits into from
Dec 5, 2019

Conversation

polyfractal
Copy link
Contributor

Historically only two extra activities happened in the final reduction: empty buckets were filled, and pipeline aggs were reduced (since it was the final reduction, this was safe). Usage of the final reduction is growing however. Auto-date-histo might need to perform many reductions on final-reduce to merge down buckets, CCS may need to side-step the final reduction if sending to a different cluster, etc

Having pipelines generate their output in the final reduce was convenient, but is becoming increasingly difficult to manage as the rest of the agg framework advances.

This commit decouples pipeline aggs from the final reduction:

  1. Introduces a new "top level" reduce, which should be called at the beginning of the reduce cycle (e.g. from the SearchPhaseController)
  2. Adds a materializePipeline() method to InternalAggs and InternalMultiBucket. This is essentially the final reduce for pipelines
  3. Makes reductions on pipelines a no-op

By separating pipeline reduction into their own set of methods, aggregations are free to use the final reduction for whatever purpose without worrying about generating pipeline results which are non-reducible

Closes #44914, predecessor PR was #45359

Historically only two things happened in the final reduction:
empty buckets were filled, and pipeline aggs were reduced (since it
was the final reduction, this was safe).  Usage of the final reduction
is growing however.  Auto-date-histo might need to perform
many reductions on final-reduce to merge down buckets, CCS
may need to side-step the final reduction if sending to a
different cluster, etc

Having pipelines generate their output in the final reduce was
convenient, but is becoming increasingly difficult to manage
as the rest of the agg framework advances.

This commit decouples pipeline aggs from the final reduction:
1. Introduces a new "top level" reduce, which should be called
at the beginning of the reduce cycle (e.g. from the SearchPhaseController)
2. Adds a `materializePipeline()` method to InternalAggs and
InternalMultiBucket.  This is essentially the final reduce
for pipelines
3. Makes reductions on pipelines a no-op

By separating pipeline reduction into their own set of methods,
aggregations are free to use the final reduction for whatever
purpose without worrying about generating pipeline results
which are non-reducible
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo

@polyfractal
Copy link
Contributor Author

Note: the first commit contains the main changes, the second commit ("Remove unnecessary doReduce()") just does a bulk rename since doReduce() is no longer needed. So reviewing the first commit first would be less noisy.

@polyfractal
Copy link
Contributor Author

/cc @javanna since you've looked at the final reduce stuff for CCS
/cc @markharwood since I had to make some changes to SigTerms (e.g. the buckets get updated with scores, and I had to adjust some ctors to allow buckets to be recreated with the same score)

@colings86 colings86 added v7.5.0 and removed v7.4.0 labels Aug 30, 2019
Copy link
Member

@not-napoleon not-napoleon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Someone who knows Significant Terms aggregation better should probably comment on that part, I'm just taking it on faith. Otherwise this looks good to me. I left two nits about the use of the Streams API, but they're both a matter of opinion, so don't feel like you have to fix them.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to separate the aggs reduction and the pipeline reduction. However I wonder if all the renamings are needed. Could we just change InternalAggregations to separate reduce and reducePipeline ?

@polyfractal
Copy link
Contributor Author

Review comments addressed.

I renamed the "materialize" stuff back to "reduce", although there's a small hiccup. Because InternalAggregations's reduce() is essentially now reducePipelines(), we are left over with the abstract doReduce().

I first renamed that to reduce() which makes sense, although it modifies some 70+ files :)

In the second commit I changed it to doReduce() which has a much more minimal impact (because all the aggs implemented doReduce() abstract method before), although a bit strange conceptually as there isn't a reduce() anymore.

I don't have a strong opinion either way, although I'd lean towards renaming to reduce() and just accepting that it touches every single aggregation :)

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good to me.

I don't have a strong opinion either way, although I'd lean towards renaming to reduce() and just accepting that it touches every single aggregation :)

+1 to move back to reduce and no need for another round of review since the change should be straightforward ;).

This reverts commit ba627f6.
@polyfractal
Copy link
Contributor Author

@elasticmachine update branch

@jimczi jimczi added v7.6.0 and removed v7.5.0 labels Nov 12, 2019
@polyfractal
Copy link
Contributor Author

Oh my goodness, I thought I merged this. Fixing up conflicts so that it can go in.

Oops!

@polyfractal
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/packaging-sample-matrix

@polyfractal polyfractal merged commit 9c34ff9 into elastic:master Dec 5, 2019
polyfractal added a commit that referenced this pull request Dec 5, 2019
Historically only two things happened in the final reduction:
empty buckets were filled, and pipeline aggs were reduced (since it
was the final reduction, this was safe).  Usage of the final reduction
is growing however.  Auto-date-histo might need to perform
many reductions on final-reduce to merge down buckets, CCS
may need to side-step the final reduction if sending to a
different cluster, etc

Having pipelines generate their output in the final reduce was
convenient, but is becoming increasingly difficult to manage
as the rest of the agg framework advances.

This commit decouples pipeline aggs from the final reduction by
introducing a new "top level" reduce, which should be called
at the beginning of the reduce cycle (e.g. from the SearchPhaseController).
This will only reduce pipeline aggs on the final reduce after
the non-pipeline agg tree has been fully reduced.

By separating pipeline reduction into their own set of methods,
aggregations are free to use the final reduction for whatever
purpose without worrying about generating pipeline results
which are non-reducible
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
Historically only two things happened in the final reduction:
empty buckets were filled, and pipeline aggs were reduced (since it
was the final reduction, this was safe).  Usage of the final reduction
is growing however.  Auto-date-histo might need to perform
many reductions on final-reduce to merge down buckets, CCS
may need to side-step the final reduction if sending to a
different cluster, etc

Having pipelines generate their output in the final reduce was
convenient, but is becoming increasingly difficult to manage
as the rest of the agg framework advances.

This commit decouples pipeline aggs from the final reduction by
introducing a new "top level" reduce, which should be called
at the beginning of the reduce cycle (e.g. from the SearchPhaseController).
This will only reduce pipeline aggs on the final reduce after
the non-pipeline agg tree has been fully reduced.

By separating pipeline reduction into their own set of methods,
aggregations are free to use the final reduction for whatever
purpose without worrying about generating pipeline results
which are non-reducible
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

auto_date_histogram fails where date_histogram does not
6 participants