Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify ordering support on terms aggregations #17588

Closed
jpountz opened this issue Apr 7, 2016 · 9 comments
Closed

Simplify ordering support on terms aggregations #17588

jpountz opened this issue Apr 7, 2016 · 9 comments
Assignees
Labels
:Analytics/Aggregations Aggregations >breaking stalled Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@jpountz
Copy link
Contributor

jpountz commented Apr 7, 2016

We try to be as flexible as possible when it comes to sorting terms aggregations. However, sorting by anything but by _term or descending _count makes it very hard to return the correct top buckets and counts, which is disappointing to users. Instead, I suggest that we only allow order options that result in
reasonably accurate results:

  • remove the ability to sort by ascending count
  • remove ordering by sub aggregations entirely, or only allow when the leaf of the path is a min or max aggregation: in that case counts will not be accurate but I believe that the top buckets will be correct.
@clintongormley
Copy link
Contributor

Is this something we can possibly support with two-phase aggregations? #12316

@jpountz
Copy link
Contributor Author

jpountz commented Apr 7, 2016

With multiple phases, we could get accurate results when sorting by min/max aggs (the first round could compute the top buckets and the second could ask all shards only for these values). This would also help refine counts when sorting by descending count since we can assume that shards have similar number of occurrences of each term. But we cannot make such assumptions for sub aggregations. For instance if you sort a terms aggregation by a sub avg aggregation, you could still get very different top terms for each shard eg. if the field that you compute the avg on has outliers.

@rashidkpc
Copy link

Please open an issue on the kibana repo if and when you start moving on this as it will be a breaking change for many of our users.

@colings86
Copy link
Contributor

Discussed in FixItFriday and we agreed to split out removing the ascending count option from this issue so the conversation can be separated from removing support for sorting by sub aggregations (see #17614)

@CristianWeiland
Copy link

I had this problem (described in #23108), and increasing size of terms aggregation gives a more accurate result. Increasing size to a value bigger than the number of documents seems to give the correct result.
Why does this happen?

@markharwood
Copy link
Contributor

As noted on #17614 (comment), one possible policy is to error if numTermsReturned < size requested and point out the use of partitioning as a way to get accurate-within-partition results.

cc @elastic/es-search-aggs

@rjernst rjernst added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 4, 2020
@nik9000
Copy link
Member

nik9000 commented Jul 29, 2020

Folks have long since come to rely on ordering by sub-agg so we think we're probably better off adding any optimizations that we can for it and doing what we can to make us more likely to produce correct answers.

The trouble with these orderings is that they make it very hard to prune results and still be sure that we have the right results, both on the data nodes and on the coordinating nodes. We try to work around this with shard_size so the data nodes return all of the important shards. We work around this on the coordinating node by never pruning results at all. Neither of these are great. The shard_size might not contain the important buckets so we could end up being wrong. Not pruning results from the shards could use a ton of memory. Both bad! But maybe we can do something about it? We're not really sure how yet though.

@nik9000
Copy link
Member

nik9000 commented Jun 23, 2021

Looking at this again, I think my comment from last time is still accurate. Folks rely on these flawed or not. I don't think we can drop this. And we don't have time in the short term to do these accurately. I think, at least for a while, we're not going to do anything with this.

@nik9000 nik9000 closed this as completed Jun 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations >breaking stalled Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)
Projects
None yet
Development

No branches or pull requests

9 participants