telegraf dropped/purged/truncated its output buffer on SIGHUP #2679

jasonkeller · 2017-04-17T12:58:48Z

System info:

telegraf 1.2, RHEL7

Steps to reproduce:

Have inputs (like SNMP) running
pkill -1 telegraf or pkill -SIGHUP telegraf

Expected behavior:

Soft configuration reload

Actual behavior:

Truncated output buffer and soft configuration reload, causing huge derivative dips in Grafana.

Additional info:

https://community.influxdata.com/t/refreshing-telegraf-shows-dips-on-graphs-in-grafana/525/4
Opening this bug per daniel

First mentioned at the bottom of this bug by another user...
#69

And I think JZ even asked this on the forums but got no response...
https://community.influxdata.com/t/reload-config-telegraf-config-file-without-restarting-process/62

jasonkeller · 2017-05-18T18:40:41Z

Any motion on this?

danielnelson · 2017-07-06T01:49:23Z

This is somewhat tricky.

Each output currently has its own metric buffer and may not be able to flush at this time. The metric buffers for the outputs may have more differences among them than just a position, due to filtering options transforming the points.

It may be easiest to stop the world and allow the outputs a time period to flush everything, and then perform the reload. If the output could not complete in this time the buffered points would be lost for it.
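To make the stop-the-world idea concrete, here is a minimal sketch, not Telegraf's actual code. The `Output` class, `reload_with_grace`, and the `grace_seconds` parameter are hypothetical names made up for illustration. Each output retries its flush until a shared deadline, and whatever is still buffered at that point is what would be lost on reload.

```python
import time
from typing import Callable, Dict, List


class Output:
    """Hypothetical output plugin that owns its own metric buffer."""

    def __init__(self, name: str, write: Callable[[list], bool]):
        self.name = name
        self.write = write  # returns True when the batch was accepted
        self.buffer: list = []

    def flush(self) -> bool:
        """Try to send the buffer; return True if nothing is left."""
        if self.buffer and self.write(self.buffer):
            self.buffer = []
        return not self.buffer


def reload_with_grace(outputs: List[Output], grace_seconds: float,
                      retry_delay: float = 0.05) -> Dict[str, int]:
    """Retry flushes until the deadline (collection is assumed to be
    paused already), then report how many points each output still
    holds, i.e. how many it would lose when the reload proceeds."""
    deadline = time.monotonic() + grace_seconds
    pending = list(outputs)
    while pending and time.monotonic() < deadline:
        pending = [o for o in pending if not o.flush()]
        if pending:
            time.sleep(retry_delay)
    return {o.name: len(o.buffer) for o in outputs}
```

With one reachable and one unreachable output, only the unreachable one reports points that would be dropped once the grace period expires.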

Another way we could potentially handle this is by moving to a shared output buffer and performing filtering on flush. This would use less memory when there are multiple outputs, but filtering would need to be done each time in case of failure, or only failures could be buffered per output.
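A rough sketch of the shared-buffer alternative, again under assumptions: `SharedBuffer`, `register`, and the per-output cursor are invented here for illustration and do not describe Telegraf's implementation. Metrics are stored once, each output keeps a position in the shared list, and its filter is applied at flush time. A failed write leaves the position unchanged, so the filtering is repeated on the next attempt, which is the cost mentioned above.

```python
from typing import Callable, Dict, List

Metric = Dict[str, object]


class SharedBuffer:
    """Hypothetical single buffer shared by all outputs."""

    def __init__(self) -> None:
        self.metrics: List[Metric] = []
        self.cursors: Dict[str, int] = {}  # next unsent index per output

    def register(self, output_name: str) -> None:
        # A new output starts at the current end of the buffer.
        self.cursors[output_name] = len(self.metrics)

    def add(self, metric: Metric) -> None:
        self.metrics.append(metric)

    def flush(self, output_name: str,
              keep: Callable[[Metric], bool],
              write: Callable[[List[Metric]], bool]) -> bool:
        start = self.cursors[output_name]
        # Filtering happens here, on every flush attempt.
        batch = [m for m in self.metrics[start:] if keep(m)]
        if batch and not write(batch):
            return False  # cursor unchanged; retry (and re-filter) later
        self.cursors[output_name] = len(self.metrics)
        self._trim()
        return True

    def _trim(self) -> None:
        # Drop metrics that every registered output has already passed.
        done = min(self.cursors.values(), default=0)
        del self.metrics[:done]
        for name in self.cursors:
            self.cursors[name] -= done
```

The memory saving comes from `_trim`: a metric is held once until the slowest output has moved past it, rather than once per output.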

I think I'll do the stop-the-world reload first, and perhaps do the shared output buffer at some later date.

jasonkeller · 2017-10-20T14:41:16Z

@danielnelson are there any updates on this? I keep running into this and getting weird dips/spikes on my Grafana graphs from derivative functions, due to the missing datapoints.

If we don't have a good way to update the telegraf instance with new endpoints to poll without losing data, that really nerfs when we can realistically begin polling new devices.

jasonkeller · 2017-10-20T16:14:16Z

@danielnelson I'll back up a second and get this out there so people realize the other implications of restarting/refreshing the telegraf process.

So part of the issue is dropping data (which you can work around with careful timing if you flush more frequently than you poll), but another issue that may inevitably bite you is interval skew. If you don't restart at the same relative point in the interval as before, telegraf will begin polling at a different offset within your interval at the same cadence, leading to a frame-shift of points that will cause a spike/dip on your graph.
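The frame-shift can be illustrated with a small timing model. This is a simplification, not Telegraf's scheduler: the unaligned case assumes the first collection happens at process start, and the aligned case assumes collection snaps to multiples of the interval. All function names are made up for the example, and times are plain integer seconds.

```python
def next_boundary(t: int, interval: int) -> int:
    """Smallest multiple of interval that is >= t."""
    return -(-t // interval) * interval


def collection_times(start: int, interval: int, count: int,
                     aligned: bool) -> list:
    """Timestamps of the first `count` collections after `start`."""
    first = next_boundary(start, interval) if aligned else start
    return [first + k * interval for k in range(count)]


def gaps_across_restart(start: int, restart: int, interval: int,
                        aligned: bool) -> list:
    """Spacing between consecutive points when the process that began
    at `start` is restarted at `restart`."""
    before = [t for t in collection_times(start, interval, 10, aligned)
              if t < restart]
    after = collection_times(restart, interval, 2, aligned)
    series = before + after
    return [b - a for a, b in zip(series, series[1:])]


# Start at t=130, restart at t=500, 300s interval.
print(gaps_across_restart(130, 500, 300, aligned=False))  # [300, 70, 300]
print(gaps_across_restart(130, 500, 300, aligned=True))   # [300, 300]
```

In the unaligned case one pair of points is only 70 seconds apart, which is the kind of irregular spacing that a derivative over a fixed window would render as a spike or dip.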

danielnelson · 2017-10-21T01:02:40Z

I'm hoping to fix this as part of the configuration overhaul to support kv config stores. #272

I've heard about reload causing issues with plugin-specific collection intervals (#2839). Is it also happening with the global interval? Also, what is round_interval set to?

jasonkeller · 2017-10-23T14:40:40Z

Round interval is set to true, with default interval in our agent section set to 60s. All our probe intervals are set to 300s though. Does round_interval only interact with the global interval in the agent section?

#2839 sounds exactly like what has been happening. I wrote a shell script now to calculate and time process refresh/restart using 'at' to avoid further incident.
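The script itself is not shown in this thread, so the following is only a guess at the kind of calculation such a workaround needs, written in Python rather than shell. It assumes polling phase is fixed by the time the process started, and the function name is hypothetical. The result is how long to wait before restarting so that the new process begins at the same offset within the interval; handing that time to `at` is left out.

```python
def seconds_until_same_phase(now: int, previous_start: int,
                             interval: int) -> int:
    """Seconds to wait so that a restart lands on the same offset
    within the interval as the previous process start."""
    # Python's % is non-negative for a positive interval.
    return (previous_start - now) % interval


# Previous start at t=130, 300s interval, current time t=500:
# wait 230s and restart at t=730, which is 130 + 2 * 300.
print(seconds_until_same_phase(500, 130, 300))  # 230
```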

danielnelson · 2017-10-23T20:56:54Z

I haven't investigated the issue closely yet, but it is supposed to work in either case.

rdxmb · 2018-01-02T16:10:23Z

similar problem here: https://community.influxdata.com/t/telegraf-should-reconnect-after-influxdb-timeouts/3550 . I guess this is the same issue.

danielnelson · 2018-01-02T18:40:34Z

@rdxmb That doesn't look like a similar problem to the one reported on this issue.

voiprodrigo · 2018-12-08T03:18:47Z

@danielnelson Could this be an incentive to add support to persist buffers on disk? :)

rdxmb · 2018-12-10T10:44:30Z

I think I confused this issue with another. I am sorry.

srebhan · 2023-07-26T18:46:24Z

@jasonkeller is this still an issue with the latest version of Telegraf? If so, is there any simple way to reproduce the issue?

telegraf-tiger · 2023-08-10T18:09:52Z

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem. If not, please try posting this question in our Community Slack or Community Forums, or provide additional details in this issue and request that it be re-opened. Thank you!

danielnelson added the bug label (unexpected problem or unintended behavior) Apr 19, 2017

danielnelson mentioned this issue May 4, 2018

[[outputs.influxdb]] Auto update http client certificates. Feature request. #4086

Closed

danielnelson added the area/configuration label Jul 3, 2018

danielnelson self-assigned this Nov 12, 2018

russorat mentioned this issue Aug 13, 2020

Output buffer persistence #802

Open

monilshah98 mentioned this issue Sep 22, 2020

Data synchronization between two instances of InfluxDB. influxdata/influxdb#19607

Open

danielnelson removed their assignment Sep 1, 2021

srebhan added the waiting for response label (waiting for response from contributor) Jul 26, 2023

telegraf-tiger bot closed this as completed Aug 10, 2023
