WIP: add ProportionalSumAggregator #71191
Pinging @elastic/es-analytics-geo (Team:Analytics) |
Sorry I let this sit for so long. Things have been busy for me. But I think I get this now! Say you have a document describing some change over a range of time:
If you were to aggregate like this:
Then each hour would get all the bytes in the range. That'd be weird. So you've implemented:
Which evenly distributes those bytes across the whole hour. We don't know how the bytes actually arrived but even distribution is a reasonable guess. This does seem useful. I wonder if it'd be better to plumb it through the values access stuff rather than named aggregation stuff. So, like, the agg would look more like:
If we could make something like that work then you could put aggs between the range/histogram agg and the metric aggs. Also, we'd get metric agg implementations "for free". @dgieselaar would something like that be useful for you? @not-napoleon, am I making sense? Or is this crazy? |
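To make the even-distribution idea concrete, here is a minimal sketch in plain Python. It is not the code from this PR: the function name `distribute_over_buckets` and the numbers are made up for illustration, and it assumes fixed-width numeric buckets with half-open ranges. Each bucket gets a share of the value proportional to how much of the document's range it overlaps, so partially covered buckets get partial shares.

```python
import math


def distribute_over_buckets(start, end, value, interval):
    """Split `value` across fixed-width buckets in proportion to each
    bucket's overlap with the half-open range [start, end).

    Hypothetical helper for illustration only, not an Elasticsearch API.
    """
    if end <= start:
        raise ValueError("range must have positive length")
    length = end - start
    shares = {}
    # Key (lower bound) of the first bucket the range touches.
    key = math.floor(start / interval) * interval
    while key < end:
        overlap = min(end, key + interval) - max(start, key)
        if overlap > 0:
            shares[key] = value * overlap / length
        key += interval
    return shares


# 1200 bytes over the range [30, 150) with buckets of width 60:
# the first and last buckets are only half covered.
shares = distribute_over_buckets(30, 150, 1200, 60)
print(shares)  # {0: 300.0, 60: 600.0, 120: 300.0}
```

The shares always add up to the original value, which is the property a plain sum over a range histogram does not have.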
@nik9000 Maybe! I definitely can see the use case. But in APM, we're not using something like this right now. And prior art would lead me to believe we would care more about correctness here. If only we could store three-dimensional histograms (value, count, timestamp) 😄 |
Thanks for looking into this.
Yes, that's the intention. (Small correction: bytes are distributed across the whole range and not the "whole hour".)
I agree that this would be better. I can try but would need some guidance (e.g. some pointers to code that does similar things). |
@nik9000 We'd talked about doing something like this when first implementing histograms over range fields, and I think it provides a useful feature. There's a pretty good write up of the netflow use case in #37642, which basically wants this feature. We ended up not addressing it when we first built out the range histogram stuff, in part because we felt assuming even distribution over the range wasn't great. But I'm not sure what we do now is all that much better, so I'm okay with exploring this again. I just glanced at the code, and it looks like it deals with partial overlaps in buckets, which is good. We should definitely make sure we have some good tests around that (which there might be, I didn't read the test cases in detail). Is this intended to only work with date ranges, or would we want to support numeric and IP ranges too? Not sure if we have use cases for those. |
Looks like it supports numeric ranges now. |
Actually this may be more useful than I originally thought. It would make it much easier to compare aggregates of events with pre-aggregated summary documents of those events. Suppose that I have a rule that records the error count per 5 minutes (the interval being configurable by the user). I can index those summary documents with a @timerange value of 16:00-16:05. If I want to display the pre-aggregated data next to the source data as num errors per minute, I think I would have to fetch a sample document to get the bucket size, and then convert the rate (and fill in gaps in the chart etc). But if I can use @timerange with a date range agg with a 1m interval and have es distribute it across the time range, and all the stuff like rate, sum etc just work, I think that would be incredibly useful. Is that what we are considering here? |
If you have documents that store the number of events and the time range these events fell into then a
would add 12/5 = 2.4 events to the "proportional sum" of the five 1 minute buckets starting at |
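The 12/5 arithmetic can be checked with a small Python sketch. The helper name `per_minute_shares`, the calendar date, and the 16:00-16:05 range are assumptions chosen for illustration (the range is borrowed from the summary-document example earlier in the thread); none of this is an Elasticsearch API.

```python
from datetime import datetime, timedelta


def per_minute_shares(range_start, range_end, count):
    """Spread `count` over 1-minute buckets in proportion to each bucket's
    overlap with [range_start, range_end). Hypothetical helper."""
    total_seconds = (range_end - range_start).total_seconds()
    if total_seconds <= 0:
        raise ValueError("range must have positive length")
    # Start at the minute boundary at or before the start of the range.
    bucket = range_start.replace(second=0, microsecond=0)
    shares = {}
    while bucket < range_end:
        next_bucket = bucket + timedelta(minutes=1)
        overlap = (min(range_end, next_bucket) - max(range_start, bucket)).total_seconds()
        if overlap > 0:
            shares[bucket] = count * overlap / total_seconds
        bucket = next_bucket
    return shares


# A document recording 12 events over an assumed 5-minute range.
shares = per_minute_shares(
    datetime(2021, 4, 14, 16, 0),
    datetime(2021, 4, 14, 16, 5),
    12,
)
for minute, share in shares.items():
    print(minute.strftime("%H:%M"), share)
```

Under these assumptions each of the five 1-minute buckets receives 2.4 events, and the bucket values still sum to 12.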
@wylieconlon is this useful for lens? |
Thanks to @nik9000 for giving an example. @dgieselaar I'm not sure yet, but I think this would be used in Lens because we can't calculate it client side. I added support for Range data in the next minor of Kibana, but not in any released versions. My main question is about defining the relationship between the metric and the aggregation to spread it over. Your examples @nik9000 are pretty simple, but I am imagining a more complex document structure and how it interacts with aggregations. Here are some examples that I'd like to understand:
- I have 2 aggs, date histogram and number histogram. Which one is the range spread across?
- Terms agg on an array field. Can I evenly spread out the range across each term...?
|
All good questions! I'm thinking about 'em. Might be the answer is "we don't distribute like that" but we certainly don't want to box ourselves into a corner. |
Are the two histogram aggs nested? The proportional sum aggregator considers the innermost ancestor histogram aggregation that is based on a range valued field.
The logic for the proportional sum aggregator could be extended to cover term aggregations over array fields, too: The newly introduced |
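The "innermost ancestor" rule can be sketched as a walk up the aggregation tree. This is only an illustration of the rule as described in this comment, not the PR's code: `AggNode`, its fields, and `innermost_range_histogram` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AggNode:
    """Hypothetical stand-in for a node in an aggregation tree."""
    name: str
    kind: str                      # e.g. "date_histogram", "histogram", "terms"
    on_range_field: bool = False   # True if the agg buckets a range-valued field
    parent: Optional["AggNode"] = None


HISTOGRAM_KINDS = {"histogram", "date_histogram"}


def innermost_range_histogram(node: AggNode) -> Optional[AggNode]:
    """Return the closest ancestor histogram agg over a range-valued field,
    or None if there is no such ancestor."""
    current = node.parent
    while current is not None:
        if current.kind in HISTOGRAM_KINDS and current.on_range_field:
            return current
        current = current.parent
    return None


# Two nested histograms, both over range fields: the inner one wins.
by_time = AggNode("by_time", "date_histogram", on_range_field=True)
by_size = AggNode("by_size", "histogram", on_range_field=True, parent=by_time)
metric = AggNode("bytes", "proportional_sum", parent=by_size)
print(innermost_range_histogram(metric).name)  # by_size
```

If the inner histogram were over an ordinary (non-range) field, the walk would skip it and pick the outer date histogram instead.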
@nik9000 I have some cycles left and could work on the ProportionalSumAggregator. WDYT? |
@swachter sorry to leave you hanging for so long. I don't think a new agg is the right way to do this. But I think it's something we should do. I have ideas, but I don't have time, sadly. I'm sorry this doesn't leave you in a good place. |
@nik9000 Can you reveal some of your ideas? I would like to see if our use case would be covered. Do you have a time horizon for your plans? Can I help on this? |
We don't see a clear way to extend this to a generalized use case. We believe we might be able to tackle this in our TSDB project. Right now we're focused on other priorities, and will close this one out in favor of addressing it in a future PR. |
Issue: #71189
This PR is far from complete. I just want to confirm that this additional kind of aggregator may have a chance to get merged.