WIP: add ProportionalSumAggregator #71191

Closed
wants to merge 1 commit into from

@swachter swachter commented Apr 1, 2021

Issue: #71189

This PR is far from complete. I just want to confirm that this additional kind of aggregator has a chance of being merged.

@elasticsearchmachine elasticsearchmachine added the external-contributor Pull request authored by a developer outside the Elasticsearch team label Apr 1, 2021
@swachter swachter changed the title add ProportionalSumAggregator WIP: add ProportionalSumAggregator Apr 1, 2021
@imotov imotov requested a review from nik9000 April 1, 2021 17:38
@imotov imotov added the :Analytics/Aggregations Aggregations label Apr 1, 2021
@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Apr 1, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@nik9000
Member

nik9000 commented Apr 12, 2021

Sorry I let this sit for so long. Things have been busy for me. But I think I get this now!

Say you have a document describing some change over a range of time:

PUT router/_doc/1?refresh
{
  "time_frame" : {
    "gte" : "2015-10-31 12:00:00",
    "lte" : "2015-11-01"
  },
  "sent_bytes": 1324553423
}

If you were to aggregate like this:

POST router/_search
{
  "aggs": {
    "dates": {
      "date_histogram": {
        "field": "time_frame",
        "fixed_interval": "1h"
      },
      "aggs": {
        "total_sent_bytes": {
          "sum": {
            "field": "sent_bytes"
          }
        }
      }
    }
  }
}

Then each hour would get all the bytes in the range. That'd be weird. So you've implemented:

POST router/_search
{
  "aggs": {
    "dates": {
      "date_histogram": {
        "field": "time_frame",
        "fixed_interval": "1h"
      },
      "aggs": {
        "total_sent_bytes": {
          "proportional_sum": {
            "field": "sent_bytes"
          }
        }
      }
    }
  }
}

Which evenly distributes those bytes across the whole hour. We don't know how the bytes actually arrived but even distribution is a reasonable guess. This does seem useful.
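To make the even-distribution idea concrete, here is a minimal Python sketch of the arithmetic only. It is not the PR's Java implementation: the function name `proportional_buckets` is hypothetical, the range is treated as half-open `[start, end)`, and buckets are assumed to be aligned to multiples of the interval since the Unix epoch. Each bucket receives the share of the value that matches its overlap with the range, so partially covered buckets get proportionally less.

```python
from datetime import datetime, timedelta


def proportional_buckets(start, end, value, interval):
    """Split `value` over fixed-width buckets by their overlap with [start, end).

    Hypothetical helper for illustration; assumes an even distribution of
    `value` over the range and epoch-aligned buckets.
    """
    total = (end - start).total_seconds()
    if total <= 0:
        raise ValueError("range must have positive length")
    epoch = datetime(1970, 1, 1)
    # Round the range start down to the bucket boundary that contains it.
    bucket_start = start - (start - epoch) % interval
    shares = {}
    while bucket_start < end:
        bucket_end = bucket_start + interval
        overlap = (min(end, bucket_end) - max(start, bucket_start)).total_seconds()
        shares[bucket_start] = value * overlap / total
        bucket_start = bucket_end
    return shares


hour = timedelta(hours=1)

# The router document above, read as the 12 hours from noon to midnight.
router = proportional_buckets(
    datetime(2015, 10, 31, 12), datetime(2015, 11, 1), 1324553423, hour
)

# A range that only half covers its first hourly bucket.
partial = proportional_buckets(
    datetime(2021, 4, 13, 12, 30), datetime(2021, 4, 13, 14, 0), 90, hour
)
# partial maps 12:00 -> 30.0 and 13:00 -> 60.0
```

Under these assumptions the router document lands in twelve hourly buckets that each receive one twelfth of `sent_bytes`, and the per-bucket shares always add back up to the original value (up to floating point rounding).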

I wonder if it'd be better to plumb it through the values access stuff rather than named aggregation stuff. So, like, the agg would look more like:

POST router/_search
{
  "aggs": {
    "dates": {
      "date_histogram": {
        "field": "time_frame",
        "fixed_interval": "1h"
      },
      "aggs": {
        "total_sent_bytes": {
          "sum": {
            "field": "sent_bytes.proportional"
          }
        }
      }
    }
  }
}

If we could make something like that work then you could put aggs between the range/histogram agg and the metric aggs. Also, we'd get metric agg implementations "for free".

@dgieselaar would something like that be useful for you? @not-napoleon, am I making sense? Or is this crazy?

@dgieselaar
Member

@nik9000 Maybe! I definitely can see the use case. But in APM, we're not using something like this right now. And prior art would lead me to believe we would care more about correctness here. If only we could store three-dimensional histograms (value, count, timestamp) 😄

@swachter
Author

Thanks for looking into this.

Which evenly distributes those bytes across the whole hour. We don't know how the bytes actually arrived but even distribution is a reasonable guess. This does seem useful.

Yes, that's the intention. (Small correction: bytes are distributed across the whole range and not the "whole hour".)

I wonder if it'd be better to plumb it through the values access stuff rather than named aggregation stuff.

I agree that this would be better. I can try but would need some guidance (e.g. some pointers to code that does similar things).

@not-napoleon
Member

@nik9000 We'd talked about doing something like this when first implementing histograms over range fields, and I think it provides a useful feature. There's a pretty good write up of the netflow use case in #37642, which basically wants this feature. We ended up not addressing it when we first built out the range histogram stuff, in part because we felt assuming even distribution over the range wasn't great. But I'm not sure what we do now is all that much better, so I'm okay with exploring this again.

I just glanced at the code, and it looks like it deals with partial overlaps in buckets, which is good. We should definitely make sure we have some good tests around that (which there might be, I didn't read the test cases in detail).

Is this intended to only work with date ranges, or would we want to support numeric and IP ranges too? Not sure if we have use cases for those.

@nik9000
Member

nik9000 commented Apr 12, 2021

Is this intended to only work with date ranges, or would we want to support numeric and IP ranges too? Not sure if we have use cases for those.

Looks like it supports numeric ranges now.

@dgieselaar
Member

dgieselaar commented Apr 12, 2021

Actually this may be more useful than I originally thought. It would make it much easier to compare aggregates of events with pre-aggregated summary documents of those events. Suppose that I have a rule that records the error count per 5 minutes (the interval being configurable by the user). I can index those summary documents with a @timerange value of 16:00-16:05. If I want to display the pre-aggregated data next to the source data as num errors per minute, I think I would have to fetch a sample document to get the bucket size, and then convert the rate (and fill in gaps in the chart etc). But if I can use @timerange with a date range agg with a 1m interval and have es distribute it across the time range, and all the stuff like rate, sum etc just work, I think that would be incredibly useful. Is that what we are considering here?

@swachter
Author

swachter commented Apr 13, 2021

But if I can use @timerange with a date range agg with a 1m interval and have es distribute it across the time range, and all the stuff like rate, sum etc just work, I think that would be incredibly useful. Is that what we are considering here?

If you have documents that store the number of events and the time range these events fell into then a proportional_sum aggregation nested in a 1 minute date range field histogram aggregation would divide the number of events proportionally into the corresponding 1 minute buckets. E.g. the events document:

{
  "timerange": {
    "gte": "2021-04-13 12:00",
    "lt": "2021-04-13 12:05"
  },
  "events": 12
}

would add 12/5 = 2.4 events to the "proportional sum" of the five 1 minute buckets starting at 12:00, 12:01, 12:02, 12:03, and 12:04.
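The arithmetic for this document can be checked with a short Python sketch. The helper name `spread_over_minutes` is hypothetical, and times are simplified to integer minutes of the day (12:00 is minute 720). It uses `fractions.Fraction` rather than floats so that the shares can be seen to add up to exactly 12 again.

```python
from fractions import Fraction


def spread_over_minutes(start_minute, end_minute, events, bucket_minutes=1):
    """Exact per-bucket shares of `events` for the half-open range [start, end).

    Hypothetical helper for illustration; times are integer minutes.
    """
    length = end_minute - start_minute
    if length <= 0:
        raise ValueError("range must have positive length")
    shares = {}
    # Round the range start down to the bucket boundary that contains it.
    bucket = start_minute - start_minute % bucket_minutes
    while bucket < end_minute:
        overlap = min(end_minute, bucket + bucket_minutes) - max(start_minute, bucket)
        shares[bucket] = Fraction(events * overlap, length)
        bucket += bucket_minutes
    return shares


# "gte": 12:00 is minute 720, "lt": 12:05 is minute 725.
shares = spread_over_minutes(720, 725, 12)
# Five buckets (720..724), each holding Fraction(12, 5), i.e. 2.4 events.
```

Exact fractions are only a convenience for the check; an aggregator working on doubles would see small rounding differences when summing the shares.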

@dgieselaar
Member

@wylieconlon is this useful for lens?

@wylieconlon

Thanks to @nik9000 for giving an example. @dgieselaar I'm not sure yet, but I think this would be used in Lens because we can't calculate it client side. I added support for Range data in the next minor of Kibana, but not in any released versions.

My main question is about defining the relationship between the metric and the aggregation to spread it over. Your examples @nik9000 are pretty simple, but I am imagining a more complex document structure and how it interacts with aggregations. Here are some examples that I'd like to understand:

  • I have 2 aggs, date histogram and number histogram. Which one is the range spread across?
  • Terms agg on an array field. Can I evenly spread out the range across each term...?

@nik9000
Member

nik9000 commented Apr 15, 2021 via email

@swachter
Author

* I have 2 aggs, date histogram and number histogram. Which one is the range spread across?

Are the two histogram aggs nested? The proportional sum aggregator considers the innermost ancestor histogram aggregation that is based on a range valued field.
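As a rough illustration of that lookup rule, the following Python sketch walks up a chain of parent aggregations and stops at the first histogram whose field is range valued. The `AggNode` class, its fields, and the aggregation names are hypothetical stand-ins, not Elasticsearch classes.

```python
from dataclasses import dataclass
from typing import Optional

HISTOGRAM_KINDS = {"date_histogram", "histogram"}


@dataclass
class AggNode:
    """Hypothetical stand-in for one aggregation in the aggregation tree."""
    name: str
    kind: str                      # e.g. "date_histogram", "histogram", "terms"
    field_is_range: bool           # True if the agg runs on a range valued field
    parent: Optional["AggNode"] = None


def innermost_range_histogram(node):
    """Return the closest ancestor histogram over a range field, or None."""
    current = node.parent
    while current is not None:
        if current.kind in HISTOGRAM_KINDS and current.field_is_range:
            return current
        current = current.parent
    return None


# A date histogram over a range field, wrapping a number histogram over a
# plain numeric field, wrapping the metric.
dates = AggNode("dates", "date_histogram", field_is_range=True)
sizes = AggNode("sizes", "histogram", field_is_range=False, parent=dates)
metric = AggNode("bytes", "proportional_sum", field_is_range=False, parent=sizes)

chosen = innermost_range_histogram(metric)  # the "dates" aggregation
```

In this sketch, if both histograms were over range fields, the inner one (`sizes`) would be chosen, because the walk starts at the metric's direct parent.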

* Terms agg on an array field. Can I evenly spread out the range across each term...?

The logic for the proportional sum aggregator could be extended to cover term aggregations over array fields, too: The newly introduced RangeFieldBucketAggregatorCollectContext could be replaced by something more general. In particular, it could be reduced to an interface that provides the proportionalValue method only. In case of a term aggregation over an array valued field, the method would divide the value by the length of the array.
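A minimal Python sketch of that generalization might look like the following. It only mirrors the idea of a context reduced to a single method; every class and function name here is a hypothetical Python stand-in and does not reflect the actual Java API of RangeFieldBucketAggregatorCollectContext.

```python
from typing import Protocol, Sequence


class ProportionalContext(Protocol):
    """Hypothetical interface: the only thing a metric needs from its parent."""

    def proportional_value(self, value: float) -> float: ...


class RangeBucketContext:
    """Share for one histogram bucket: overlap with the range / range length."""

    def __init__(self, overlap: float, range_length: float) -> None:
        self.overlap = overlap
        self.range_length = range_length

    def proportional_value(self, value: float) -> float:
        return value * self.overlap / self.range_length


class ArrayTermsContext:
    """Share for one term bucket: the value divided by the number of terms."""

    def __init__(self, terms: Sequence[str]) -> None:
        self.term_count = len(terms)

    def proportional_value(self, value: float) -> float:
        return value / self.term_count


def collect(running_sum: float, value: float, ctx: ProportionalContext) -> float:
    """Add the document's proportional share to a bucket's running sum."""
    return running_sum + ctx.proportional_value(value)


# One 1-minute bucket of a 5-minute range holding 12 events.
minute_share = collect(0.0, 12, RangeBucketContext(overlap=1, range_length=5))

# One term bucket of a document whose array field has three terms.
term_share = collect(0.0, 9, ArrayTermsContext(["a", "b", "c"]))
```

The point of the design is that the summing code in `collect` does not need to know whether the share comes from a range overlap or from an array length.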

@swachter
Author

swachter commented May 4, 2021

@nik9000 I have some cycles left and could work on the ProportionalSumAggregator. WDYT?

@nik9000
Member

nik9000 commented May 6, 2021

@swachter sorry to leave you hanging for so long. I don't think a new agg is the right way to do this. But I think it's something we should do. I have ideas, but I don't have time, sadly. I'm sorry this doesn't leave you in a good place.

@swachter
Author

@nik9000 Can you reveal some of your ideas? I would like to see if our use case would be covered. Do you have a time horizon for your plans? Can I help on this?

@wchaparro
Member

wchaparro commented Sep 20, 2021

We don't see a clear way to extend this to a generalized use case. We believe we might be able to tackle this in our TSDB project. Right now we're focused on other priorities, and will close this one out in favor of addressing it in a future PR.

@wchaparro wchaparro closed this Sep 20, 2021
Labels
:Analytics/Aggregations Aggregations >enhancement external-contributor Pull request authored by a developer outside the Elasticsearch team Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) WIP

9 participants