Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Improve sampling and normalization of population chart. #24402

Merged
merged 2 commits into from
Oct 24, 2018

Conversation

walterra
Copy link
Contributor

@walterra walterra commented Oct 23, 2018

Summary

This PR optimizes how contextual data is fetched for the population analysis chart. The previous structure of nested aggregations to get the data was like this:

byTime (date histogram) -> sample (sampling on each time slice) -> entities

This is now changed to the following structure:

sample (sample on whole dataset) => byTime (date histogram) -> entities
  • Previously the sampling was done inside each time slice. That means the longer the time range queried, the more sampling would have been done. The updated version does the sampling in the outer most aggregation. This makes the sampling and performance more predictable.
  • Additionally, a random document score is now used to get a more random sample of documents. Sampling takes the top N documents based on their score. Because the previous query was a bool query with must clauses only, all documents had the same score.
  • And finally, another optimization is the normalization of values for metric functions based on count and sum. In contrast to functions like mean, the results of the former could be heavily skewed because of the sampling. The normalization adjusts the values to take into account the total amount of documents without sampling.

image

The screenshots above show results before (left) and after (right) the optimization.

  • The currently used sampling size is now hard coded at 50000, the examples above are just to give you an impression how the sampling and normalization work before and after.
  • The left versions did the sampling for each time slice so the results look better for lower sampling values, but the queries were also much heavier.
  • Looking at the left versions, you can see that the results for the contextual data grow with higher sampling, that's the result of the lack of normalization.
  • The right side with sampling of 200 show very limited results, because the sampling is now applied to the outer aggregation. However, you can already notice the normalization kicking in with some dots showing up higher than any in the left version.
  • For comparison, the center bottom chart is done without sampling.

Checklist

Checklist not applicable (existing tests pass, no DOM changes).


Part of #21163.

@walterra walterra added non-issue Indicates to automation that a pull request should not appear in the release notes v7.0.0 :ml Feature:ml-results legacy - do not use v6.5.0 labels Oct 23, 2018
@walterra walterra self-assigned this Oct 23, 2018
@elasticmachine
Copy link
Contributor

Pinging @elastic/ml-ui

@walterra walterra mentioned this pull request Oct 23, 2018
39 tasks
@elasticmachine
Copy link
Contributor

💚 Build Succeeded

{
random_score: {
// static seed to get same randomized results on every request
seed: 'ml'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you now need to use a field parameter here as well as seed to get reproducible results - see https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html#function-random

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out, I was working off some outdated docs. Push an update in 43ced43.

@alvarezmelissa87
Copy link
Contributor

This LGTM overall. 👍 I'm just waiting on responses to Pete's comments since I have less context on that bit.

Copy link
Contributor

@alvarezmelissa87 alvarezmelissa87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@peteharverson peteharverson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@elasticmachine
Copy link
Contributor

💔 Build Failed

@walterra
Copy link
Contributor Author

retest

@elasticmachine
Copy link
Contributor

💚 Build Succeeded

@walterra walterra merged commit 2c7caee into elastic:master Oct 24, 2018
@walterra walterra deleted the ml-population-chart-normalization branch October 24, 2018 15:37
walterra added a commit to walterra/kibana that referenced this pull request Oct 24, 2018
…#24402)

This optimizes how contextual data is fetched for the population analysis chart.
walterra added a commit that referenced this pull request Oct 24, 2018
…#24508)

This optimizes how contextual data is fetched for the population analysis chart.
@sophiec20 sophiec20 added the Feature:Anomaly Detection ML anomaly detection label Jun 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:Anomaly Detection ML anomaly detection Feature:ml-results legacy - do not use :ml non-issue Indicates to automation that a pull request should not appear in the release notes v6.5.0 v7.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants