WIP: add ProportionalSumAggregator #71191

swachter · 2021-04-01T15:39:41Z

This PR is far from complete. I just want to confirm that this additional kind of aggregator has a chance of getting merged.

elasticmachine · 2021-04-01T17:38:45Z

Pinging @elastic/es-analytics-geo (Team:Analytics)

nik9000 · 2021-04-12T12:19:23Z

Sorry I let this sit for so long. Things have been busy for me. But I think I get this now!

Say you have a document describing some change over a range of time:

PUT router/_doc/1?refresh
{
  "time_frame" : {
    "gte" : "2015-10-31 12:00:00", 
    "lte" : "2015-11-01"
  },
  "sent_bytes": 1324553423
}

If you were to aggregate like this:

POST router/_search
{
  "aggs": {
    "dates": {
      "date_histogram": {
        "field": "time_frame",
        "fixed_interval": "1h"
      },
      "aggs": {
        "bytes": {
          "sum": {
            "field": "sent_bytes"
          }
        }
      }
    }
  }
}

Then each hour would get all the bytes in the range. That'd be weird. So you've implemented:

POST router/_search
{
  "aggs": {
    "dates": {
      "date_histogram": {
        "field": "time_frame",
        "fixed_interval": "1h"
      },
      "aggs": {
        "bytes": {
          "proportional_sum": {
            "field": "sent_bytes"
          }
        }
      }
    }
  }
}

Which evenly distributes those bytes across the whole hour. We don't know how the bytes actually arrived but even distribution is a reasonable guess. This does seem useful.
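A minimal sketch of the difference, in Python rather than Elasticsearch code. It assumes the range is half-open and that the bytes arrived at a constant rate; `hourly_buckets`, `plain_sum`, and `proportional_sum` are hypothetical helper names for illustration, not the PR's API.

```python
from datetime import datetime, timedelta

HOUR = timedelta(hours=1)

def hourly_buckets(start, end):
    """Yield the start of every hourly bucket that [start, end) touches."""
    bucket = start.replace(minute=0, second=0, microsecond=0)
    while bucket < end:
        yield bucket
        bucket += HOUR

def plain_sum(start, end, value):
    """A plain sum under a range histogram: every touched bucket sees the full value."""
    return {b: value for b in hourly_buckets(start, end)}

def proportional_sum(start, end, value):
    """Give each bucket the share of value that matches its overlap with [start, end)."""
    if end <= start:
        raise ValueError("range must have positive length")
    total = (end - start).total_seconds()
    shares = {}
    for b in hourly_buckets(start, end):
        overlap = min(end, b + HOUR) - max(start, b)
        shares[b] = value * overlap.total_seconds() / total
    return shares

start = datetime(2015, 10, 31, 12)
end = datetime(2015, 11, 1)  # assumption: treated as an exclusive upper bound
sent_bytes = 1324553423

plain = plain_sum(start, end, sent_bytes)          # 12 buckets, each with all the bytes
spread = proportional_sum(start, end, sent_bytes)  # 12 buckets, each with 1/12 of the bytes
```

With the plain sum the bytes are counted once per bucket, so the bucket totals add up to twelve times the real figure; with the proportional variant they add back up to `sent_bytes`.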

I wonder if it'd be better to plumb it through the values access stuff rather than named aggregation stuff. So, like, the agg would look more like:

POST router/_search
{
  "aggs": {
    "dates": {
      "date_histogram": {
        "field": "time_frame",
        "fixed_interval": "1h"
      },
      "aggs": {
        "bytes": {
          "sum": {
            "field": "sent_bytes.proportional"
          }
        }
      }
    }
  }
}

If we could make something like that work then you could put aggs between the range/histogram agg and the metric aggs. Also, we'd get metric agg implementations "for free".
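The "for free" part can be illustrated with a small Python sketch. It is only a conceptual model, not how Elasticsearch's values access layer is implemented: `proportional_values` is a hypothetical function that scales each document's value by its overlap with the current bucket, after which ordinary metric functions consume the stream unchanged.

```python
from statistics import mean

def proportional_values(docs, bucket_lo, bucket_hi):
    """Yield each doc's value scaled by its overlap with the bucket [bucket_lo, bucket_hi).

    Each doc is (lo, hi, value) with a half-open range [lo, hi).
    """
    for lo, hi, value in docs:
        overlap = min(hi, bucket_hi) - max(lo, bucket_lo)
        if overlap > 0:  # docs that miss the bucket contribute nothing
            yield value * overlap / (hi - lo)

docs = [(0, 4, 8.0), (1, 2, 3.0), (5, 6, 1.0)]
values = list(proportional_values(docs, 0, 2))

# Any metric can consume the same stream; none needs a "proportional" variant.
metrics = {"sum": sum(values), "max": max(values), "avg": mean(values)}
```

Here the first document contributes half of its value, the second all of it, and the third none, and the metric code never needs to know that scaling happened.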

@dgieselaar would something like that be useful for you? @not-napoleon, am I making sense? Or is this crazy?

dgieselaar · 2021-04-12T12:48:53Z

@nik9000 Maybe! I definitely can see the use case. But in APM, we're not using something like this right now. And prior art would lead me to believe we would care more about correctness here. If only we could store three-dimensional histograms (value, count, timestamp) 😄

swachter · 2021-04-12T12:53:12Z

Thanks for looking into this.

> Which evenly distributes those bytes across the whole hour. We don't know how the bytes actually arrived but even distribution is a reasonable guess. This does seem useful.

Yes, that's the intention. (Small correction: bytes are distributed across the whole range and not the "whole hour".)

> I wonder if it'd be better to plumb it through the values access stuff rather than named aggregation stuff.

I agree that this would be better. I can try but would need some guidance (e.g. some pointers to code that does similar things).

not-napoleon · 2021-04-12T14:33:20Z

@nik9000 We'd talked about doing something like this when first implementing histograms over range fields, and I think it provides a useful feature. There's a pretty good write up of the netflow use case in #37642, which basically wants this feature. We ended up not addressing it when we first built out the range histogram stuff, in part because we felt assuming even distribution over the range wasn't great. But I'm not sure what we do now is all that much better, so I'm okay with exploring this again.

I just glanced at the code, and it looks like it deals with partial overlaps in buckets, which is good. We should definitely make sure we have some good tests around that (which there might be, I didn't read the test cases in detail).
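As a sketch of what such tests might pin down (a hypothetical Python helper, not the PR's Java code): a numeric range [lo, hi) spread over fixed-width buckets should give each bucket a share proportional to its overlap, and the shares should add back up to the original value.

```python
import math

def spread_numeric(lo, hi, value, interval):
    """Split value over fixed-width buckets by overlap with [lo, hi).

    Returns {bucket_key: share}, where bucket_key is the bucket's lower bound.
    """
    if hi <= lo:
        raise ValueError("range must have positive width")
    shares = {}
    key = math.floor(lo / interval) * interval
    while key < hi:
        overlap = min(hi, key + interval) - max(lo, key)
        shares[key] = value * overlap / (hi - lo)
        key += interval
    return shares

# Range [5, 25) over buckets of width 10:
# 5 units fall in [0, 10), 10 units in [10, 20), 5 units in [20, 30).
shares = spread_numeric(5, 25, 100.0, 10)

# A range that sits entirely inside one bucket keeps its whole value there.
inside = spread_numeric(12, 18, 100.0, 10)
```

The two invariants worth asserting are conservation (shares sum to the value) and that only buckets with a non-empty overlap appear in the result.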

Is this intended to only work with date ranges, or would we want to support numeric and IP ranges too? Not sure if we have use cases for those.

nik9000 · 2021-04-12T14:54:27Z

> Is this intended to only work with date ranges, or would we want to support numeric and IP ranges too? Not sure if we have use cases for those.

Looks like it supports numeric ranges now.

dgieselaar · 2021-04-12T20:27:50Z

Actually this may be more useful than I originally thought. It would make it much easier to compare aggregates of events with pre-aggregated summary documents of those events. Suppose that I have a rule that records the error count per 5 minutes (the interval being configurable by the user). I can index those summary documents with a @timerange value of 16:00-16:05. If I want to display the pre-aggregated data next to the source data as num errors per minute, I think I would have to fetch a sample document to get the bucket size, and then convert the rate (and fill in gaps in the chart etc). But if I can use @timerange with a date range agg with a 1m interval and have es distribute it across the time range, and all the stuff like rate, sum etc just work, I think that would be incredibly useful. Is that what we are considering here?

swachter · 2021-04-13T07:14:21Z

> But if I can use @timerange with a date range agg with a 1m interval and have es distribute it across the time range, and all the stuff like rate, sum etc just work, I think that would be incredibly useful. Is that what we are considering here?

If you have documents that store the number of events and the time range these events fell into then a proportional_sum aggregation nested in a 1 minute date range field histogram aggregation would divide the number of events proportionally into the corresponding 1 minute buckets. E.g. the events document:

{
  "timerange": {
    "gte": "2021-04-13 12:00",
    "lt": "2021-04-13 12:05"
  },
  "events": 12
}

would add 12/5 = 2.4 events to the "proportional sum" of the five 1 minute buckets starting at 12:00, 12:01, 12:02, 12:03, and 12:04.
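The arithmetic can be checked with a short Python sketch. It assumes ranges that start and end on whole minutes, and `per_minute` is a hypothetical helper, not part of the PR.

```python
from collections import defaultdict

def per_minute(summaries):
    """Spread each (start_min, end_min, events) summary evenly over its 1-minute buckets.

    Minutes are counted from midnight; end_min is exclusive, matching "lt" in the document.
    """
    buckets = defaultdict(float)
    for start_min, end_min, events in summaries:
        width = end_min - start_min
        for minute in range(start_min, end_min):
            buckets[minute] += events / width
    return dict(buckets)

# The document above: 12 events in [12:00, 12:05).
result = per_minute([(12 * 60, 12 * 60 + 5, 12)])
for minute, events in sorted(result.items()):
    print(f"{minute // 60:02d}:{minute % 60:02d} -> {events}")
```

Passing several summary documents to `per_minute` accumulates their shares in the same buckets, which is the shape needed to chart pre-aggregated data next to per-minute source data.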

dgieselaar · 2021-04-14T18:56:04Z

@wylieconlon is this useful for lens?

wylieconlon · 2021-04-14T22:18:43Z

Thanks to @nik9000 for giving an example. @dgieselaar I'm not sure yet, but I think this would be used in Lens because we can't calculate it client side. I added support for Range data in the next minor of Kibana, but not in any released versions.

My main question is about defining the relationship between the metric and the aggregation to spread it over. Your examples @nik9000 are pretty simple, but I am imagining a more complex document structure and how it interacts with aggregations. Here are some examples that I'd like to understand:

* I have 2 aggs, date histogram and number histogram. Which one is the range spread across?
* Terms agg on an array field. Can I evenly spread out the range across each term...?

nik9000 · 2021-04-15T00:43:23Z

All good questions! I'm thinking about 'em. Might be the answer is "we don't distribute like that" but we certainly don't want to box ourselves into a corner.

swachter · 2021-04-15T06:40:23Z

* I have 2 aggs, date histogram and number histogram. Which one is the range spread across?

Are the two histogram aggs nested? The proportional sum aggregator considers the innermost ancestor histogram aggregation that is based on a range valued field.
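The lookup rule can be sketched in Python as a walk up the aggregation tree. The `Agg` class and its fields are hypothetical stand-ins for illustration and do not mirror the Elasticsearch aggregator classes.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Agg:
    """Hypothetical stand-in for a node in the aggregation tree."""
    name: str
    kind: str                     # e.g. "date_histogram", "histogram", "terms"
    on_range_field: bool = False  # True if the agg's field is a range type
    parent: Optional["Agg"] = None

def innermost_range_histogram(agg: Agg) -> Optional[Agg]:
    """Walk up from agg and return the nearest ancestor histogram over a range field."""
    node = agg.parent
    while node is not None:
        if node.kind in ("date_histogram", "histogram") and node.on_range_field:
            return node
        node = node.parent
    return None

# dates (range field) > sizes (plain numeric field) > metric
dates = Agg("dates", "date_histogram", on_range_field=True)
sizes = Agg("sizes", "histogram", on_range_field=False, parent=dates)
metric = Agg("bytes", "proportional_sum", parent=sizes)
found = innermost_range_histogram(metric)
```

In this tree `found` is the `dates` aggregation, because `sizes` is skipped for not being based on a range field. If both histograms were over range fields, the inner one (`sizes`) would be chosen.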

* Terms agg on an array field. Can I evenly spread out the range across each term...?

The logic for the proportional sum aggregator could be extended to cover terms aggregations over array fields, too: the newly introduced RangeFieldBucketAggregatorCollectContext could be replaced by something more general. In particular, it could be reduced to an interface that provides only the proportionalValue method. In the case of a terms aggregation over an array-valued field, the method would divide the value by the length of the array.
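A rough Python analogue of that generalisation is sketched below. The class names are hypothetical, and `proportional_value` only mirrors the `proportionalValue` method mentioned above; the actual Java signature in the PR may differ.

```python
from typing import Protocol, Sequence

class CollectContext(Protocol):
    """Hypothetical interface: anything that can scale a value for the current bucket."""
    def proportional_value(self, value: float) -> float: ...

class RangeOverlapContext:
    """Scales by the fraction of the document's range that overlaps the bucket."""
    def __init__(self, overlap: float, range_width: float):
        self.fraction = overlap / range_width

    def proportional_value(self, value: float) -> float:
        return value * self.fraction

class ArrayTermsContext:
    """Scales by 1 / (number of terms in the document's array field)."""
    def __init__(self, terms: Sequence[str]):
        self.count = len(terms)

    def proportional_value(self, value: float) -> float:
        return value / self.count

def collect(total: float, ctx: CollectContext, value: float) -> float:
    """A sum-style collector that depends only on proportional_value."""
    return total + ctx.proportional_value(value)

from_range = collect(0.0, RangeOverlapContext(overlap=1, range_width=4), 8.0)
from_terms = collect(0.0, ArrayTermsContext(["a", "b", "c"]), 9.0)
```

The collector does not care whether the scaling comes from a range overlap or an array length, which is the point of reducing the context to a single method.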

swachter · 2021-05-04T14:47:56Z

@nik9000 I have some cycles left and could work on the ProportionalSumAggregator. WDYT?

nik9000 · 2021-05-06T19:11:32Z

@swachter sorry to leave you hanging for so long. I don't think a new agg is the right way to do this. But I think it's something we should do. I have ideas, but I don't have time, sadly. I'm sorry this doesn't leave you in a good place.

swachter · 2021-05-17T12:11:29Z

@nik9000 Can you reveal some of your ideas? I would like to see if our use case would be covered. Do you have a time horizon for your plans? Can I help on this?

wchaparro · 2021-09-20T17:13:41Z

We don't see a clear way to extend this to a generalized use case. We believe we might be able to tackle this in our TSDB project. Right now we're focused on other priorities, and will close this one out in favor of addressing it in a future PR.

add ProportionalSumAggregator

eac868e

elasticsearchmachine added the external-contributor Pull request authored by a developer outside the Elasticsearch team label Apr 1, 2021

swachter mentioned this pull request Apr 1, 2021

give access to some histogram related information #69059

Closed

swachter changed the title ~~add ProportionalSumAggregator~~ WIP: add ProportionalSumAggregator Apr 1, 2021

imotov requested a review from nik9000 April 1, 2021 17:38

imotov added the :Analytics/Aggregations Aggregations label Apr 1, 2021

elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Apr 1, 2021

imotov added >enhancement WIP labels Apr 1, 2021

wchaparro closed this Sep 20, 2021
