telegraf dropped/purged/truncated its output buffer on SIGHUP #2679
Comments
Any motion on this?
This is somewhat tricky. Each output currently has its own metric buffer and may not be able to flush at this time. The metric buffers for the outputs may have more differences among them than just a position, due to filtering options transforming the points. It may be easiest to stop the world and allow the outputs a time period to flush everything, and then perform the reload. If the output could not complete in this time the buffered points would be lost for it. Another way we could potentially handle this is by moving to a shared output buffer and performing filtering on flush. This would use less memory when there are multiple outputs, but filtering would need to be done each time in case of failure, or only failures could be buffered per output. I think I'll do the stop-the-world reload first, and perhaps do the shared output buffer at some later date.
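To make the stop-the-world idea concrete, here is a minimal sketch of "give each output a bounded time to flush, then reload". It is not Telegraf's code: the `Output` class, `drain_before_reload`, and the `write` callback are hypothetical names invented for illustration, and the real agent is written in Go. The sketch only shows the control flow described above: each output keeps its own buffer, gets until a shared deadline to empty it, and whatever is still buffered at the deadline is what would be lost.

```python
from collections import deque
import time


class Output:
    """Hypothetical stand-in for an output plugin with its own metric buffer."""

    def __init__(self, name, write, metrics=()):
        self.name = name
        self.write = write              # callable(batch) -> bool, True on success
        self.buffer = deque(metrics)

    def flush(self, batch_size=1000):
        """Try to write one batch; on failure put it back in the original order."""
        n = min(batch_size, len(self.buffer))
        batch = [self.buffer.popleft() for _ in range(n)]
        if batch and not self.write(batch):
            self.buffer.extendleft(reversed(batch))
            return False
        return True


def drain_before_reload(outputs, timeout_s, retry_delay_s=0.1):
    """Give every output until a shared deadline to empty its buffer.

    Returns {output name: metrics still buffered}, i.e. the points that
    would be dropped if the reload went ahead at the deadline.
    """
    deadline = time.monotonic() + timeout_s
    pending = [o for o in outputs if o.buffer]
    while pending and time.monotonic() < deadline:
        # Attempt one batch per pending output on every pass.
        progressed = [o.flush() for o in pending]
        pending = [o for o in pending if o.buffer]
        if pending and not any(progressed):
            # Nothing could be written; wait briefly before retrying.
            remaining = deadline - time.monotonic()
            time.sleep(max(0.0, min(retry_delay_s, remaining)))
    return {o.name: len(o.buffer) for o in outputs}
```

A caller would run `drain_before_reload(outputs, timeout_s=...)` on SIGHUP and only then rebuild the plugins from the new configuration; the returned counts could be logged so that dropped points are at least visible.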
@danielnelson are there any updates to this? I keep running into this and getting weird dips/spikes on my graphs in Grafana on derivative functions due to the missing datapoints. If we don't have a good way to update the telegraf instance with new endpoints to poll without losing data, that really nerfs when we can realistically begin polling new devices.
@danielnelson I'll back up a second and get this out there so people realize the other implications of restarting/refreshing the telegraf process. Part of the issue is dropping data (which you can get around with careful timing if you flush more frequently than you poll), but another issue that may inevitably bite you is interval skew. If you don't restart at the same relative point in the interval as you did previously, telegraf will begin polling at a different point in your interval at the same cadence, leading to a frame-shift of points that will cause a spike/dip on your graph.
round_interval is set to true, with the default interval in our agent section set to 60s. All our probe intervals are set to 300s though. Does round_interval only interact with the global interval in the agent section? #2839 sounds exactly like what has been happening. I have now written a shell script to calculate and time the process refresh/restart using 'at' to avoid further incidents.
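The timing calculation mentioned above can be sketched roughly as follows. This is not the shell script from the comment, which was not posted; `next_restart_time` and its parameters are hypothetical. It assumes collections land on epoch multiples of the interval (the behaviour `round_interval` is meant to provide) and picks the next moment that sits a fixed offset past such a boundary, so that a restart happens after a collection has had time to flush and at the same phase every time.

```python
def next_restart_time(now, interval, offset):
    """Earliest epoch second >= now at which (t % interval) == offset.

    now      -- current time in whole epoch seconds
    interval -- collection interval in seconds (e.g. 300)
    offset   -- seconds past each interval boundary to restart at;
                assumed long enough for the last collection to flush
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    if not 0 <= offset < interval:
        raise ValueError("offset must satisfy 0 <= offset < interval")
    # Candidate inside the interval window that contains `now`.
    candidate = (now // interval) * interval + offset
    # If that moment has already passed, use the same phase in the next window.
    return candidate if candidate >= now else candidate + interval
```

For example, with a 300s interval and a 30s offset, `next_restart_time(1000, 300, 30)` returns 1230: the boundary before 1000 is 900, 930 has already passed, so the next slot is 1200 + 30. The result could be converted to a wall-clock time and handed to a scheduler such as 'at'.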
I haven't investigated the issue closely yet, but it is supposed to work in either case.
Similar problem here: https://community.influxdata.com/t/telegraf-should-reconnect-after-influxdb-timeouts/3550 . I guess this is the same issue.
@rdxmb That doesn't look like a similar problem to the one reported on this issue.
@danielnelson Could this be an incentive to add support to persist buffers on disk? :)
I think I confused this issue with another. I am sorry.
@jasonkeller is this still an issue with the latest version of Telegraf? If so, is there any simple way to reproduce the issue?
Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem. If not, please try posting this question in our Community Slack or Community Forums, or provide additional details in this issue and request that it be re-opened. Thank you!
System info:
telegraf 1.2, RHEL7
Steps to reproduce:
Expected behavior:
Soft configuration reload
Actual behavior:
Truncated output buffer and soft configuration reload, causing huge derivative dips in Grafana.
Additional info:
https://community.influxdata.com/t/refreshing-telegraf-shows-dips-on-graphs-in-grafana/525/4
Opening this bug per daniel
First mentioned at the bottom of this bug by another user...
#69
And I think JZ even asked this on the forums but got no response...
https://community.influxdata.com/t/reload-config-telegraf-config-file-without-restarting-process/62