Support probability: auto for random sampler aggregation. #86559

Mpdreamz · 2022-05-09T09:33:37Z

Description

Right now the onus of choosing an appropriate sampling probability lies entirely with the consuming side.

This means the consuming side has to come up with multi-phase query patterns to ensure enough data is loaded (see e.g. elastic/kibana#127598).

It would be great if the random_sampler could explicitly support this multi-phase query behavior by automatically increasing the probability.

"probability": "auto",
"min_documents": 1000000,
"max_retries": 3

I am not sure if min_documents needs to be per shard or overall to statistically work out correctly.

This would also help ensure that if the query ends up running on a shard with very few documents, all of that shard's documents are automatically taken into account.

max_retries controls how many times the coordinating node may reissue the query with a higher probability.
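
For illustration, the multi-phase pattern that consumers currently have to implement themselves might look roughly like the sketch below. It is not how Elasticsearch or Kibana implement anything: `run_query` is a hypothetical callable standing in for a real search request, `growth_factor` and `initial_probability` are made-up knobs, and `min_documents` and `max_retries` simply mirror the parameter names proposed above.

```python
from typing import Callable, Tuple


def sample_with_retries(
    run_query: Callable[[float], int],
    initial_probability: float = 0.001,
    min_documents: int = 1_000_000,
    max_retries: int = 3,
    growth_factor: float = 10.0,
) -> Tuple[float, int]:
    """Reissue a sampled query with a higher probability until enough
    documents were sampled or the retries are used up.

    run_query is a hypothetical callable: it takes a sampling probability
    and returns the number of documents the sampler actually visited.
    Returns the last probability used and its sampled document count.
    """
    probability = initial_probability
    sampled = run_query(probability)
    retries = 0
    while sampled < min_documents and retries < max_retries and probability < 1.0:
        # Raise the probability, capped at 1.0 (every document is visited).
        probability = min(1.0, probability * growth_factor)
        sampled = run_query(probability)
        retries += 1
    return probability, sampled


def make_fake_backend(total_docs: int) -> Callable[[float], int]:
    """Stand-in for a real search: pretends exactly total_docs * p docs are sampled."""
    return lambda probability: round(total_docs * probability)
```

With a fake backend holding 50 million documents, the loop stops once the sampled count reaches `min_documents`; with only 500 documents it climbs to a probability of 1.0 and returns all of them, which is the "very few documents" case described above.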


elasticmachine · 2022-05-12T16:37:15Z

Pinging @elastic/es-analytics-geo (Team:Analytics)

benwtrent · 2022-06-01T15:14:08Z

> I am not sure if min_documents needs to be per shard or overall to statistically work out correctly.

There is no way to do a global view. It would have to be per shard.

There is currently no way for aggs to run again from scratch after the final merge. This gets especially tricky in the cross-cluster search scenario.

It could be a "min count per shard" or something. Though this doesn't really address the issue. The issue is "How can I get a representative sample, in the fastest way, with an acceptable error rate".

Even if you provided a min doc count, this doesn't address that issue. You would probably still need to run again (what if a particular date-histogram bucket or terms bucket had very few docs and thus those stats are less accurate?)

I do agree that this needs more thought. Some "acceptable error rate" is probably a better solution here (though that too will take significant work).
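
To make the point about small buckets concrete, here is a rough sketch of how the error of a scaled-up count depends on bucket size. It assumes each document is kept independently with probability p and that the count is estimated as sampled / p; under that assumption the relative standard error is sqrt((1 - p) / (N * p)) for a bucket of N documents. The function name is hypothetical, and this is a textbook approximation, not a statement about how random_sampler reports error.

```python
import math


def relative_standard_error(bucket_doc_count: int, probability: float) -> float:
    """Relative standard error of a count scaled up from a Bernoulli sample.

    Assumes each of bucket_doc_count documents is kept independently with
    the given probability and the count is estimated as sampled / probability.
    """
    if bucket_doc_count <= 0:
        raise ValueError("bucket_doc_count must be positive")
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return math.sqrt((1.0 - probability) / (bucket_doc_count * probability))
```

At a probability of 0.01, a bucket with 1,000,000 documents has a relative standard error of roughly 1%, while a bucket with 100 documents has one of roughly 100%, so a minimum document count per shard says little about the accuracy of individual buckets.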

elasticsearchmachine · 2022-07-20T19:07:52Z

Pinging @elastic/ml-core (Team:ML)

Mpdreamz added the >enhancement and needs:triage labels May 9, 2022

gwbrown added the :Analytics/Aggregations label and removed the needs:triage label May 12, 2022

elasticmachine added the Team:Analytics label May 12, 2022

not-napoleon added the :ml label Jul 20, 2022

elasticsearchmachine added the Team:ML label Jul 20, 2022

felixbarny mentioned this issue Nov 28, 2022

Constant query time for (some) metrics aggregations #91743

Open
