Support probability: auto for random sampler aggregation. #86559

Open
Mpdreamz opened this issue May 9, 2022 · 3 comments
Labels
:Analytics/Aggregations Aggregations >enhancement :ml Machine learning Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:ML Meta label for the ML team

Comments

@Mpdreamz
Member

Mpdreamz commented May 9, 2022

Description

Right now the onus of choosing an appropriate probability lies entirely with the consuming side.

This means the consuming side has to come up with multi-phase query patterns to ensure enough data is loaded (see e.g. elastic/kibana#127598).

It would be great if random_sampler could support this multi-phase query behavior explicitly by automatically increasing the probability.

"probability": "auto",
"min_documents": 1000000,
"max_retries": 3

I am not sure if min_documents needs to be per shard or overall to statistically work out correctly.

This would also help ensure that if the query ends up running on a shard with very few documents, all of those documents are automatically taken into account.

max_retries controls how many times the coordinating node may reissue the query with a higher probability.
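
The retry behavior proposed here can be sketched on the client side. This is only an illustration of the intended semantics, not an existing Elasticsearch feature: `min_documents` and `max_retries` are the parameters proposed in this issue, while `run_query`, `sample_until_enough`, `next_probability`, and the doubling factor are hypothetical names and choices made for the example. The jump from below 0.5 straight to 1.0 assumes the random_sampler constraint that probability must be less than 0.5 or exactly 1.

```python
from typing import Callable, Tuple


def next_probability(p: float, factor: float = 2.0) -> float:
    """Raise the sampling probability; jump to 1.0 once it would reach 0.5."""
    raised = p * factor
    return raised if raised < 0.5 else 1.0


def sample_until_enough(
    run_query: Callable[[float], int],
    initial_probability: float,
    min_documents: int,
    max_retries: int,
) -> Tuple[float, int, int]:
    """Reissue the query with a higher probability until enough docs are sampled.

    run_query(p) is a hypothetical callable that executes the sampled
    aggregation with probability p and returns the number of sampled docs.
    Returns (final probability, sampled doc count, retries used).
    """
    p = initial_probability
    sampled = run_query(p)
    retries = 0
    # Stop when enough docs were seen, when everything is already being
    # read (p == 1.0), or when the retry budget is exhausted.
    while sampled < min_documents and p < 1.0 and retries < max_retries:
        p = next_probability(p)
        sampled = run_query(p)
        retries += 1
    return p, sampled, retries


if __name__ == "__main__":
    # Stand-in for a real search: pretend 5M docs match and sampling
    # returns exactly the expected number of documents.
    total_docs = 5_000_000
    print(sample_until_enough(lambda p: round(total_docs * p), 0.05, 1_000_000, 3))
```

On a shard with very few documents the loop ends at probability 1.0, which matches the "take all documents into account" case described above.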

@Mpdreamz Mpdreamz added >enhancement needs:triage Requires assignment of a team area label labels May 9, 2022
@gwbrown gwbrown added :Analytics/Aggregations Aggregations and removed needs:triage Requires assignment of a team area label labels May 12, 2022
@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 12, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@benwtrent
Member

> I am not sure if min_documents needs to be per shard or overall to statistically work out correctly.

There is no way to do a global view. It would have to be per shard.

There is currently no way for aggs to run again from scratch after the final merge. This gets especially tricky in the cross-cluster search scenario.

It could be a "min count per shard" or something similar, though that doesn't really address the issue. The issue is: "How can I get a representative sample, in the fastest way, with an acceptable error rate?"

Even if you provided a min doc count, that wouldn't address the issue. You would probably still need to run the query again: what if a particular date-histogram bucket or terms bucket had very few docs, making its stats less accurate?
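
To make the small-bucket concern concrete, here is a rough sketch under a simplifying assumption: each document is kept independently with probability p. A bucket holding n documents then contributes k ~ Binomial(n, p) sampled documents, and the scaled count estimate k/p has relative standard error sqrt((1 - p) / (n * p)). The function name `relative_std_error` is made up for this example.

```python
import math


def relative_std_error(bucket_docs: int, probability: float) -> float:
    """Relative standard error of the scaled count k/p for one bucket.

    Assumes independent (Bernoulli) sampling of each document with the
    given probability; bucket_docs is the true number of docs in the bucket.
    """
    if bucket_docs <= 0:
        raise ValueError("bucket_docs must be positive")
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return math.sqrt((1.0 - probability) / (bucket_docs * probability))


if __name__ == "__main__":
    # Same sampling probability, very different bucket sizes.
    for n in (1_000_000, 10_000, 100):
        print(n, f"{relative_std_error(n, 0.01):.1%}")
```

At a probability of 0.01, a bucket with a million docs has a relative error of about 1%, while a bucket with only 100 docs has a relative error close to 100%, regardless of how many documents were sampled overall.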

I do agree that this needs more thought. Some kind of "acceptable error rate" is probably a better solution here (though that too will take significant work).
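
As a sketch of what an "acceptable error rate" setting might translate to, assume independent sampling of each document with probability p, so a bucket of n documents has a count estimate with relative standard error e = sqrt((1 - p) / (n * p)). Solving for p gives p = 1 / (1 + n * e^2). The function below is hypothetical and only illustrates that relationship; it is not a proposal for the actual implementation.

```python
import math


def probability_for_error(bucket_docs: int, target_rel_error: float) -> float:
    """Smallest sampling probability that meets a target relative error.

    Assumes independent (Bernoulli) sampling and that the size of the
    smallest bucket of interest (bucket_docs) is known up front.
    """
    if bucket_docs <= 0:
        raise ValueError("bucket_docs must be positive")
    if target_rel_error <= 0.0:
        raise ValueError("target_rel_error must be positive")
    return 1.0 / (1.0 + bucket_docs * target_rel_error ** 2)


if __name__ == "__main__":
    # Target a 1% relative error on the count for buckets of various sizes.
    for n in (1_000_000, 10_000, 100):
        p = probability_for_error(n, 0.01)
        achieved = math.sqrt((1.0 - p) / (n * p))
        print(n, round(p, 4), f"{achieved:.2%}")
```

For a 1% target, a million-doc bucket needs a probability of roughly 0.01, whereas a 100-doc bucket needs roughly 0.99, which is effectively no sampling. The smallest bucket size is generally not known before the query runs, so this is not a drop-in solution.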

@not-napoleon not-napoleon added the :ml Machine learning label Jul 20, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)
