Add additional box plot agg options #60466
Comments
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations) |
Summarizing from a previous comment -- a common style is for the whiskers to extend to the furthest points within [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]. Points that are outside this interval are marked as outliers and displayed on the plot. I suspect this is what many users would think of as a 'standard' box plot. A couple of questions/observations:
|
@jtibshirani - Couple of questions on this:
Thanks! |
Julie and I discussed this on Zoom, and have a few thoughts. It sounds like the biggest use case here is the 1.5IQR points. If we just want to add that, we can simply include it in the output we have now, going from returning 5 numbers to returning 7 numbers. This would also allow users to know if there were outliers, by checking if the upper 1.5IQR value was less than the max (or the lower one more than the min on the other side). It would then be possible to query the outliers with a range query if desired. In this proposal, we would not add a new parameter to the agg, and would not support the 9/91 or 2/98 quantile cases. We'd just be enhancing the current agg with two new output values, and let the user choose which to use for displaying the whisker end points. @benwtrent @mattfield @pmoust @tveasey - Tagging you folks as interested parties for feedback on this proposed solution. Would returning a set of 7 numbers - max, Q3 + 1.5 * IQR, Q3, Q2, Q1, Q1 - 1.5 * IQR, min - be enough to make this aggregation useful to you? |
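For readers following along, here is a minimal client-side sketch of the 7-number idea, assuming only the five values the agg returns today. The names `SevenNumberSummary`, `seven_number_summary` and `has_outliers` are hypothetical and are not part of any Elasticsearch API; the point is just to show how the two extra values and the outlier check described above relate to the existing output.

```python
from typing import NamedTuple, Tuple


class SevenNumberSummary(NamedTuple):
    # Hypothetical container, not an Elasticsearch response type.
    minimum: float
    lower_fence: float  # Q1 - 1.5 * IQR
    q1: float
    q2: float
    q3: float
    upper_fence: float  # Q3 + 1.5 * IQR
    maximum: float


def seven_number_summary(minimum: float, q1: float, q2: float,
                         q3: float, maximum: float) -> SevenNumberSummary:
    """Derive the two 1.5 * IQR values from the existing five-number output."""
    iqr = q3 - q1
    return SevenNumberSummary(
        minimum, q1 - 1.5 * iqr, q1, q2, q3, q3 + 1.5 * iqr, maximum
    )


def has_outliers(s: SevenNumberSummary) -> Tuple[bool, bool]:
    """Return (low_side, high_side): True where data lies beyond the 1.5 * IQR value."""
    return s.minimum < s.lower_fence, s.maximum > s.upper_fence
```

When either flag is true, the outliers themselves would still need a separate range query below `lower_fence` or above `upper_fence`, as suggested in the comment above.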
Yes. |
@blaklaybul @Winterflower @joshdevins what do y'all think of this proposal? |
This sounds like a good proposal to me. I agree that actual outliers should be retrieved separately and might conceivably need to be downsampled anyway for a very large data set. I think it would be worth mentioning this thinking in the docs for this agg as well. |
Hi, from a completely self-centred point of view, the ability to sort by the whisker values is important. If I'm understanding correctly, the proposal is to return the values of the 1.5IQR upper and lower bounds, but not to work out the highest and lowest values contained within them to produce the actual whisker values, leaving it to the user to execute follow-up queries in order to determine outliers and the actual whisker values. Is this understanding correct? Apologies if I've misunderstood. |
@leewadhams (and others) - I don't want to just return the 1.5 IQR values, since that's pretty trivial to compute client side and doesn't add much utility to the aggregation. Ideally, I'd like to return the closest contained value to the 1.5 IQR point, which I believe should be the whisker value, but in practice it's not that simple. Boxplot is built on a bounded error sketch of the data (it uses a t-digest internally), so the best I can do in the general case is to get close to the whisker value. I'm still playing around with methodologies, but I hope to be able to quantify "close" a bit more before I release this. By the same token, we can't return outlier values from this aggregation, because the sketch doesn't store exact values. |
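To make the distinction between the 1.5 IQR points and the whisker values concrete, here is a small sketch over exact values. It is an illustration of the definition only, not how the aggregation works: the agg sits on a t-digest sketch and does not have the raw values available. The function name `tukey_whiskers` is hypothetical, and `q1` and `q3` are taken as inputs so that the choice of quartile method stays outside the example.

```python
from typing import List, Sequence, Tuple


def tukey_whiskers(values: Sequence[float], q1: float,
                   q3: float) -> Tuple[float, float, List[float]]:
    """Return (lower_whisker, upper_whisker, outliers) for exact values.

    The whiskers are the furthest data points that still fall inside
    [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]; everything outside is an outlier.
    """
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    if not inside:
        raise ValueError("no values fall within the 1.5 * IQR interval")
    return min(inside), max(inside), outliers


# Fences are 4 and 20 here, but the whiskers land on the data points 9 and 15.
low, high, outliers = tukey_whiskers([1, 9, 10, 11, 12, 13, 14, 15, 40], q1=10, q3=14)
```

With a sketch of the data instead of the values themselves, `min(inside)` and `max(inside)` can only be approximated, which is the "close" referred to above.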
In #51948 we have basic support for box plot graphs with a minimal set of supported values (min, max, q1, q2, and q3). As we discussed before, data for some alternative methods of displaying whiskers can be derived from the 5 values we provide at the moment. Recently, we got a request from a user to add these calculations to the aggregation. I would like to discuss this, as well as adding support for some other styles of box plot, such as: