Telegraf should do some simple metric aggregation/rollup #380

Closed
ekini opened this issue Nov 19, 2015 · 28 comments
Comments

@ekini
Contributor

ekini commented Nov 19, 2015

Let's say I have 1k metrics per second, generated by one host, with the same tags, but different values.
I want to sum all values, aggregated by 1 minute.

I can send all of them to InfluxDB and do aggregation there. It works for a few hosts, but what if I have thousands of them? InfluxDB will just die.

I'm not speaking about complex functions, but some simple ones like sum(), count() and mean() would be nice to have.
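To make the request concrete, here is a minimal sketch of a one-minute rollup that groups points by measurement, tags, and time window, then computes sum, count, and mean. The point shape (a `(measurement, tags, value, unix_ts)` tuple) and the `rollup` function are assumptions made for illustration; nothing here is Telegraf code.

```python
from collections import defaultdict


def rollup(points, window_s=60):
    """Group points by (measurement, tags, time window) and compute sum, count, mean.

    Each point is a hypothetical (measurement, tags_dict, value, unix_ts) tuple.
    """
    buckets = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for name, tags, value, ts in points:
        window_start = ts - ts % window_s  # align to the start of the window
        key = (name, tuple(sorted(tags.items())), window_start)
        buckets[key][0] += value
        buckets[key][1] += 1
    return {
        key: {"sum": total, "count": n, "mean": total / n}
        for key, (total, n) in buckets.items()
    }


points = [
    ("requests", {"host": "web1"}, 2.0, 120),
    ("requests", {"host": "web1"}, 4.0, 150),
    ("requests", {"host": "web1"}, 6.0, 185),
]
result = rollup(points)
```

The first two points fall into the window starting at 120 and the third into the window starting at 180, so only two aggregated points would need to be written instead of three.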

@sparrc
Contributor

sparrc commented Nov 20, 2015

This is an interesting idea. Do you have any thoughts on how these aggregation functions could be configured? It would probably need to be a separate [aggregations] section of the config, where you could define different aggregation functions, maybe like this:

[aggregations]
[[aggregations.sum]]
measurement = "cpu_usage_*"
interval = "60s"
...
[[aggregations.mean]]
...

This would then need to be processed after collection. It'll be a little tricky I think because these measurements will need to be gathered, but then dropped before they get flushed (but flushed as part of the aggregate).

Another option could be putting the aggregate config as part of each plugin config, maybe something like this:

[cpu]
percpu = true
totalcpu = true
drop = ["cpu_time"]
[cpu.sum]
...
[cpu.mean]
...

@sparrc
Contributor

sparrc commented Nov 20, 2015

BTW, @ekini which plugin is generating that many metrics?

@ekini
Contributor Author

ekini commented Nov 20, 2015

The one that parses logs :)

I've been thinking about it a bit, and I'm still not sure how to configure aggregation. But there should be some grouping, by time and tags.

@sparrc
Contributor

sparrc commented Nov 20, 2015

Seems like aggregations could be their own special type of plugin. They could live in their own directory and have an interface to make it easy for contributors.

Mechanically, I'm thinking they would need to be run by the flusher goroutine in agent.go, on the slice of points, before flush gets called.

Doing it this way would support the former of the two config options I listed above.

@sparrc
Contributor

sparrc commented Nov 20, 2015

Actually we can aggregate stats as they arrive here: https://github.com/influxdb/telegraf/blob/master/agent.go#L397-L399

That way we don't need to deal with dropping metrics that shouldn't be flushed on their own; we can just add the aggregated stats directly to the slice of points.
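A rough sketch of that idea, assuming points are simple `(name, value)` pairs: matching points are absorbed into a running sum as they arrive and never reach the output batch on their own, while the aggregate is appended at flush time. The `SumAggregator` class and the `_sum` suffix are hypothetical names for illustration, not part of agent.go.

```python
from fnmatch import fnmatchcase


class SumAggregator:
    """Absorb matching points as they arrive; emit one summed point per flush."""

    def __init__(self, pattern):
        self.pattern = pattern  # glob on the measurement name, e.g. "cpu_usage_*"
        self.sums = {}  # measurement name -> running sum

    def add(self, name, value, out):
        """Sum points that match the pattern; pass all others through to out."""
        if fnmatchcase(name, self.pattern):
            self.sums[name] = self.sums.get(name, 0.0) + value
        else:
            out.append((name, value))

    def flush(self, out):
        """Append the aggregated points to out and reset for the next interval."""
        for name, total in sorted(self.sums.items()):
            out.append((name + "_sum", total))  # "_sum" suffix is an assumption
        self.sums.clear()


batch = []
agg = SumAggregator("cpu_usage_*")
incoming = [("cpu_usage_user", 1.5), ("mem_free", 10.0), ("cpu_usage_user", 2.5)]
for name, value in incoming:
    agg.add(name, value, batch)
agg.flush(batch)
```

Because the raw `cpu_usage_user` points are never appended to `batch`, there is nothing to drop later; only `mem_free` and the summed point are flushed.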

@erowan

erowan commented Jun 7, 2016

I need to sum bytes + duration to aggregate netflow stats. Looking at your statsd plugin it doesn't appear to perform a sum. Can this be added similar to etsy/statsd?

@sparrc
Contributor

sparrc commented Jun 7, 2016

@erowan please open a separate feature request for the statsd input if you have one. Although I'm not 100% sure I understand what you mean. The statsd protocol sums only if you are sending counters, doesn't it? Or are you talking about performing a sum on histogram/timer metrics? Can you link to some documentation on that if it exists in the etsy implementation?

@erowan

erowan commented Jun 7, 2016

Hello @sparrc, it's documented here https://github.com/etsy/statsd/blob/master/docs/metric_types.md

But I think I am going to write (bytes*8)/duration = bps directly as a timing metric to the Telegraf statsd input now.
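The arithmetic behind that workaround can be sketched as follows. The function names and the metric name `netflow.bps` are made up for illustration; the `name:value|ms` shape is the standard statsd timing line.

```python
def bits_per_second(byte_count, duration_s):
    """Convert a flow's byte count and duration into an average bit rate."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return byte_count * 8 / duration_s


def statsd_timing_line(name, value):
    """Format a value as a statsd timing metric line."""
    return f"{name}:{value}|ms"


rate = bits_per_second(1_500_000, 12.0)  # 1.5 MB transferred over 12 s
line = statsd_timing_line("netflow.bps", rate)  # hypothetical metric name
```

Keep in mind that the mean of per-flow rates is generally not the same as total bits divided by total duration, which is one reason a sum of bytes and a sum of durations was requested in the first place.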

@sparrc
Contributor

sparrc commented Jun 7, 2016

@erowan do you mean timing sums? https://github.com/etsy/statsd/blob/master/docs/metric_types.md#timing

can you open a separate feature request for that?

@erowan

erowan commented Jun 7, 2016

@sparrc yes, that was what I was referring to. I am still pondering it. I'll gladly open one later if required.
Cheers.

This was referenced Jun 8, 2016
@alimousazy
Contributor

alimousazy commented Jun 8, 2016

Can I work on aggregation? It's just a matter of moving code around, since I have a working version, but it's inside one of the input plugins. @sparrc?

@sparrc
Contributor

sparrc commented Jun 8, 2016

You can open a PR, but I can't guarantee I'll accept it. This is a difficult problem, and many of the stats require storing large amounts of data to be completely accurate. If you can, please try to use the statsd running_stats code for these as well: https://github.com/influxdata/telegraf/blob/master/plugins/inputs/statsd/running_stats.go

I'd prefer that over using an outside library.

Currently running_stats doesn't have a median or sum function, but that should be simple to add.
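For context, a running-statistics accumulator with a sum can be sketched like this. It uses Welford's method for mean and variance and is only an illustration of the general technique; it is not the code in running_stats.go, and the class and attribute names are assumptions.

```python
class RunningStats:
    """Streaming count, sum, mean, and variance using Welford's method."""

    def __init__(self):
        self.n = 0
        self.total = 0.0  # the sum is just one extra accumulator
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        self.n += 1
        self.total += x
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def variance(self):
        # Population variance; zero when no values have been added.
        return self._m2 / self.n if self.n else 0.0


stats = RunningStats()
for v in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.add(v)
```

A sum needs only constant memory, like the mean and variance here. An exact median is different: it needs the values themselves (or an approximation), which is the storage cost mentioned above.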

@alimousazy
Contributor

Here is a PR which addresses the issue: #1364

@jadbox

jadbox commented Jun 23, 2016

We're also looking for a way to do sum aggregations within Telegraf before the data is sent over to Influx, as our volume can be hundreds of thousands of updates per second.

@jadbox

jadbox commented Jun 23, 2016

An ideal solution for me is if the logparser plugin (#1320) supported aggregates in the way statsD works.

@alimousazy
Contributor

@jadbox If by aggregation you mean a sum of each field, this can easily be added to the histogram aggregation filter. I don't think it is right to put aggregation inside an input plugin, because that makes it really hard to apply to other input plugins.

@jadbox

jadbox commented Jun 24, 2016

@alimousazy
In my other example, I had "joe" as a key, but my data has an arbitrary number of keys that I wouldn't code into the histogram query.

userID | timestamp | doesActionA | doesActionB
joe, 1466550440, 50, 20
joe, 1466550440, 10, 15
terry, 1466550440, 5, 30

and I want to aggregate in Telegraf before sending to Influx:

# aggregate into 1s blocks, and send each block to Influx
joe, 1466550440, 60, 35
terry, 1466550440, 5, 30

These are the aggregate 1s slices I need to send directly to Influx. I'm not seeing how histogram solves this; can you explain it more? Note that I do not know the userID field values ahead of time... they are arbitrary data points.
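The desired behaviour can be written down as a small sketch using the rows from the example above. The function name and the row layout (`user_id, timestamp, action_a, action_b`) are assumptions for illustration; the key point is that the user IDs are discovered from the data rather than configured ahead of time.

```python
from collections import defaultdict


def sum_by_tag_and_second(rows):
    """Sum the two action fields per (user_id, timestamp) pair."""
    totals = defaultdict(lambda: [0, 0])  # (user_id, ts) -> [sum_a, sum_b]
    for user_id, ts, action_a, action_b in rows:
        totals[(user_id, ts)][0] += action_a
        totals[(user_id, ts)][1] += action_b
    return [(user, ts, a, b) for (user, ts), (a, b) in sorted(totals.items())]


rows = [
    ("joe", 1466550440, 50, 20),
    ("joe", 1466550440, 10, 15),
    ("terry", 1466550440, 5, 30),
]
aggregated = sum_by_tag_and_second(rows)
```

With the three input rows, `aggregated` holds one row for joe (60, 35) and one for terry (5, 30), matching the 1s block shown above.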

@alimousazy
Contributor

alimousazy commented Jun 24, 2016

@jadbox Could you please tell me whether Joe and Terry are tag names or metric names? If they are tags, aggregation is done per tag, so you will get two metrics with the same metric name but different tags, each aggregated per tag. That is already supported by the current implementation and is the result you want. I will also add "_ALL" as a reserved metric name that allows aggregating all metrics regardless of name, but that doesn't matter in your case. By the way, LogParser will emit all metrics under one metric name but, I think, with different tags, so you will get the expected result.

@sparrc
Contributor

sparrc commented Jun 25, 2016

see influxdata/influxdb#6910

@sparrc sparrc changed the title from "Telegraf should do some simple metrics aggregation" to "Telegraf should do some simple metric aggregation/rollup" Jun 25, 2016
@jadbox

jadbox commented Jun 26, 2016

@sparrc fyi, in my case I need aggregations before I send data to a DB. (400k/s writes)

@alimousazy Joe/Terry are tag names. The metric name would be a single static name as the data falls into a single category.

Okay, you're saying that this is supported with LogParser, but how do I tell LogParser to increment certain fields together by tag name, by 1 minute sliced batches? I don't see anything related to aggregations (either by tag or by time slice) in the docs:

https://github.com/influxdata/telegraf/tree/master/plugins/inputs/logparser

@sparrc
Contributor

sparrc commented Jun 26, 2016

It is not supported by logparser; there is currently no support for this except using the statsd input.

The solution for this will need to be generic and usable across all plugins, as well as supporting filtering of tag key/values, field names, and measurement names.

@alimousazy
Contributor

@jadbox You don't have to add anything to the logparser config. You just need to enable the histogram filter by adding this configuration (you can enable the filter for any kind of plugin):

[[filter.histogram]]
  bucketsize = 20  
  flush_interval = "1m"
  [filter.histogram.metrics]
    (replace with your metric name) = [0.90] 


*Note: you can tune the aggregation interval by modifying flush_interval (I may rename flush_interval to aggregation interval). If you don't need percentiles, just leave the array empty.

Note that this code is not merged yet, so you have to merge it yourself and build from source. Expect changes after code review.

Once you feel that the code solves your case, I will add sum.

@jadbox

jadbox commented Jun 26, 2016

@alimousazy Okay, I think adding sum to histogram may work for me. I don't need the percentile so my config would look like this I assume.

[[filter.histogram]]
  bucketsize = 20
  flush_interval = "1m"
  [filter.histogram.metrics]
    tracking_log = [] 

Might be useful to optionally specify to just export sum (when it has been added) instead of always including variance, mean, and count along with it. This may save a good chunk of performance when dealing with high volume of data. Of course, this breaks the notion of the filter plugin being a histogram versus just an aggregator.

@alimousazy
Contributor

@jadbox, I just added support for sum to the pull request.

Don't worry about performance: I'm using a special histogram implementation that is designed for streaming and a low memory footprint. Please let me know if you have any feedback.

I will spend tonight testing and solidifying the solution.

@pauldix
Member

pauldix commented Jun 27, 2016

I recently added an issue for InfluxDB to be able to do aggregations across many measurements. It would be good if the Telegraf method for doing this used a similar sort of structure and syntax. See influxdata/influxdb#6910

@alimousazy
Contributor

@pauldix I can map the syntax to something like this

[[filter.histogram]]
  [rollup]
    name = "foo"
    measurements = ["foo", "bar"] # leaving it empty means all metrics
    fields = [] # restrict to specific fields (empty means all)
    functions = ["mean", "count", "max", "percentile(90) as perc_90", "percentile(99) as perc_99"]
    periods = "5m" # flushing interval
    drop_original = true # drop the original metric only if it contains all the aggregated fields

However, I feel that adding the fields condition has a big cost, since we are dealing with streaming data.

Are there any other ideas for filters that sit between input and output plugins? I have the following filters that I might implement in the future if the infrastructure gets merged:

1- Rename filter for renaming tags or metrics (metric shaping).
2- Condition filter to drop metrics that don't meet a condition, such as exceeding a specific value or having a specific tag (useful for alerting).
3- Sampling filter that uses sampling tools and libraries to sample metrics and reduce bandwidth usage.
4- Bandwidth filter that specifies the maximum number of metrics that may be emitted in a given period of time; this can be a number of metrics or a metric size in bytes.
5- Remote control filter that uses an integrated messaging library like NanoMSG to accept commands from a centralized service; this can be used to enable or disable other filters on demand.

Any ideas on the syntax for these filters? (I might come up with other ideas in the future.)

@alimousazy
Contributor

alimousazy commented Aug 5, 2016

I just added support for:
1- Rollup.
2- Functions to be applied (mean, sum, etc.)
3- Glob matching for both metric name and tag name.
4- Flag to not drop the original metric after aggregation.

Example:

rollup = [
  "(Name new) (Tag interface en*) (Functions mean 0.90)",
  "(Name cpu_value) (Measurements cpu) (Functions mean sum) (pass)",
]

For more information, see pull request #1364.
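As a reading aid for the rule strings above, here is a small sketch that splits a rule into its parenthesized clauses and applies a glob to a tag value. It is inferred only from the two example rules; the real grammar and semantics are defined in the pull request, and `parse_rule` is a hypothetical helper.

```python
import re
from fnmatch import fnmatchcase


def parse_rule(rule):
    """Split a rule string into {clause_name: [arguments]}."""
    clauses = {}
    for body in re.findall(r"\(([^()]*)\)", rule):
        words = body.split()
        if words:
            # First word names the clause; the rest are its arguments.
            clauses[words[0].lower()] = words[1:]
    return clauses


first = parse_rule("(Name new) (Tag interface en*) (Functions mean 0.90)")
second = parse_rule("(Name cpu_value) (Measurements cpu) (Functions mean sum) (pass)")

# Glob matching on the tag value, as in "Tag interface en*".
tag_key, tag_glob = first["tag"]
matches_en0 = fnmatchcase("en0", tag_glob)
matches_lo0 = fnmatchcase("lo0", tag_glob)
```

In this reading, a clause with no arguments such as `(pass)` acts as a flag, which lines up with item 4 in the list above.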

@sparrc
Copy link
Contributor

sparrc commented Aug 29, 2016

closing this in favor of #1662

@sparrc sparrc closed this as completed Aug 29, 2016