Skip to content

Conversation

@polyfractal
Copy link
Contributor

UnmappedTerms/UnmappedSampler/UnmappedSigTerms aggs try to delegate reduction to a sibling object that is not unmapped. That delegated agg will run the reductions, and also reduce any pipeline aggs. But because delegation comes before running pipelines, the Unmapped* agg also tries to run pipeline aggs.

This causes the pipeline to run twice, and potentially double it's output in buckets which can create invalid JSON and break when converting to maps.

This fixes the issue by toggling a flag in UnmappedTerms if it delegated away reduction so that it knows not to run pipeline aggs either.

Closes #33514

Zachary Tong added 2 commits September 7, 2018 14:05
UnmappedTerms aggs try to delegate reduction to a sibling object that is
not unmapped.  That delegated agg will run the reductions, and also
reduce any pipeline aggs.  But because delegation comes before running
pipelines, the UnmappedTerms _also_ tries to run pipeline aggs.

This causes the pipeline to run twice, and potentially double it's output
in buckets which can create invalid JSON and break when converting to
maps.

This fixes the issue by toggling a flag in UnmappedTerms if it delegated
away reduction so that it knows not to run pipeline aggs either.

Closes elastic#33514
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

Copy link
Contributor

@colings86 colings86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a huge fan of the fact that unmapped aggs need to do this delegation but this LGTM given thats how it works. I wonder if we can improve this to avoid having to delegate like this down the road?

@polyfractal
Copy link
Contributor Author

Another easy fix would be sorting the aggs before reduction, so that we preferentially choose mapped aggs as the first reduction object.

We'd probably need the unmapped aggs to implement a simple interface so we could tell them apart from other aggs, and would need to do the sort, but that might be overall cleaner than this sort of delegation?

@colings86
Copy link
Contributor

Another easy fix would be sorting the aggs before reduction, so that we preferentially choose mapped aggs as the first reduction object.

What would happen if all the aggs were unmapped, would we end up having to do the same delegation we do now?

We'd probably need the unmapped aggs to implement a simple interface so we could tell them apart from other aggs,

This could also be done by having a method on InternalAggregation that is something like isMapped() which an unmapped agg could return false for?

What would be great is if we could find a way to make it so Unmapped aggs are not special at all and so we don't have to avoid using them for reduce or being careful about double execution

@polyfractal
Copy link
Contributor Author

What would happen if all the aggs were unmapped, would we end up having to do the same delegation we do now?

Same delegation process, but no need for the flag to track state. If there are only unmapped aggs, the first one will get chosen to run. Then it will look through all the aggs, see they are also all unmapped and then just return itself skipping reduction entirely. Then it'll also run pipelines.

This could also be done by having a method on InternalAggregation that is something like isMapped() which an unmapped agg could return false for?

++ that'd work too, and be cleaner.

@polyfractal
Copy link
Contributor Author

@colings86 Ok, reworked to use a sorting approach. Has an overall smaller change to the codebase, at the small cost of a sort. I also removed the delegation ability from unmapped aggs, since it's not needed anymore. If they get called to lead the reduction they will just return themselves, because there are no other mapped aggs.

There is a bit of a danger here that the caller won't know to sort first, and won't know that unmapped aggs just return themselves. Right now there's only one place that calls this code so it's reasonably contained. But we could make the unmapped aggs throw a UnsupportedOperationException to avoid accidents.

Thoughts?

@polyfractal
Copy link
Contributor Author

Jenkins, run gradle build tests

1 similar comment
@polyfractal
Copy link
Contributor Author

Jenkins, run gradle build tests

@polyfractal
Copy link
Contributor Author

Huzzah, tests passing again. Think this is ready for another look.

Copy link
Contributor

@colings86 colings86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a minor comment but this LGTM. Thanks for spending the time to work to a better solution

*
* Applies to any pipeline agg, not just max.
*/
public void testFieldGetsWrittenOutTwice() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this testing that the field does NOT get written twice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, the name should be adjusted. Named it while it was failing, but that doesn't make sense now that it's fixed :)

@polyfractal
Copy link
Contributor Author

Jenkins, run gradle build tests

@polyfractal polyfractal merged commit 25d74bd into elastic:master Sep 26, 2018
polyfractal pushed a commit that referenced this pull request Sep 26, 2018
Previously, unmapped aggs try to delegate reduction to a sibling agg that is
mapped.  That delegated agg will run the reductions, and also
reduce any pipeline aggs.  But because delegation comes before running
pipelines, the unmapped agg _also_ tries to run pipeline aggs.

This causes the pipeline to run twice, and potentially double it's output
in buckets which can create invalid JSON (e.g. same key multiple times)
and break when converting to maps.

This fixes by sorting the list of aggregations ahead of time so that mapped
aggs appear first, meaning they preferentially lead the reduction.  If all aggs
are unmapped, the first unmapped agg simply creates a new unmapped object
and returns that for the reduction.

This means that unmapped aggs no longer defer and there is no chance for
a secondary execution of pipelines (or other side effects caused by deferring
execution).

Closes #33514
polyfractal pushed a commit that referenced this pull request Sep 26, 2018
Previously, unmapped aggs try to delegate reduction to a sibling agg that is
mapped.  That delegated agg will run the reductions, and also
reduce any pipeline aggs.  But because delegation comes before running
pipelines, the unmapped agg _also_ tries to run pipeline aggs.

This causes the pipeline to run twice, and potentially double it's output
in buckets which can create invalid JSON (e.g. same key multiple times)
and break when converting to maps.

This fixes by sorting the list of aggregations ahead of time so that mapped
aggs appear first, meaning they preferentially lead the reduction.  If all aggs
are unmapped, the first unmapped agg simply creates a new unmapped object
and returns that for the reduction.

This means that unmapped aggs no longer defer and there is no chance for
a secondary execution of pipelines (or other side effects caused by deferring
execution).

Closes #33514
kcm pushed a commit that referenced this pull request Oct 30, 2018
Previously, unmapped aggs try to delegate reduction to a sibling agg that is
mapped.  That delegated agg will run the reductions, and also
reduce any pipeline aggs.  But because delegation comes before running
pipelines, the unmapped agg _also_ tries to run pipeline aggs.

This causes the pipeline to run twice, and potentially double it's output
in buckets which can create invalid JSON (e.g. same key multiple times)
and break when converting to maps.

This fixes by sorting the list of aggregations ahead of time so that mapped
aggs appear first, meaning they preferentially lead the reduction.  If all aggs
are unmapped, the first unmapped agg simply creates a new unmapped object
and returns that for the reduction.

This means that unmapped aggs no longer defer and there is no chance for 
a secondary execution of pipelines (or other side effects caused by deferring
execution).

Closes #33514
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Aggregations: toXContent response contains duplicate fields

3 participants