Ensure buffer is written to Influx even when there's no network connection #4963
Comments
I'm not positive, but I believe the 1.9RC addresses this if you don't mind trying it out: 1.9.0-rc1 |
Metrics collected when Telegraf is offline are added to the metric buffer and sent when a connection is re-established. This is true in the current release (<=1.8.3) as well as in the latest release candidate. I assume this is not what you are seeing. Can you please add reproduction steps so we can look into what might cause your problem? |
Nope, this isn't what I'm seeing. I'm not really sure what repro steps to provide. No matter what configuration I use for Telegraf, when the device is disconnected from the InfluxDB server, no data is collected. I've tried increasing the buffer size, modifying the default buffer flush, etc. It appears that as soon as Telegraf attempts to send data to InfluxDB, the data is removed from the buffer. If the request fails, the data does not appear to be returned to the buffer to be sent later. The only repro steps I can recommend are running Telegraf, e.g. to collect CPU data, disconnecting the device, reconnecting a minute or two later, and then checking InfluxDB to see the gap in data during the downtime. Since Telegraf can consume quite a bit of data and is intended for use on lots of devices, including single-board computers, this feature would possibly require writing the buffer to disk until the connection is reestablished, due to the small amount of memory available on many devices. I've never seen Telegraf consume more memory or perform disk I/O when unable to reach InfluxDB. If buffering is the intended behavior, I'm not sure it works on a clean install with default config files. |
Can you show your configuration file and let me know what version of Telegraf you are using, and then also run your repro steps with Telegraf does keep all metrics in memory, it never saves them to disk, so it does use additional memory when the output can not send but only up the |
What are the units of |
Yes, each record is called a |
Got it. It's possible the buffer simply overflows too quickly for me to notice. I've set it pretty high (around 1 million) but still apparently missed all the data from the offline time window. I'll experiment with the buffer limit and confirm, while also running with the debug flag to generate the log. Will Telegraf gracefully flush the buffer if it runs out of memory? |
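A rough way to check the overflow hypothesis is to estimate how long an outage a given buffer limit can absorb. The sketch below assumes the limit counts individual metric records and that the collection rate is steady; the function name and the example numbers are hypothetical and not taken from this thread.

```python
def seconds_until_full(buffer_limit, metrics_per_interval, interval_seconds):
    """Estimate how long an outage can last before the buffer starts dropping.

    Assumes the limit is a count of metric records and that every
    collection interval adds the same number of records.
    """
    if metrics_per_interval <= 0:
        raise ValueError("metrics_per_interval must be positive")
    intervals = buffer_limit / metrics_per_interval
    return intervals * interval_seconds


# Hypothetical example: 50 records every 10 seconds, limit of 1,000,000.
hours = seconds_until_full(1_000_000, 50, 10) / 3600
```

Under those assumed numbers the buffer would take days, not minutes, to fill, so the debug log is the more useful next data point.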
No, either the oom_killer terminates the process, which cannot be handled by a process at all, or the process panics and exits. We don't try to handle the panic, as it is usually impossible to write data without the ability to allocate memory. |
I would really appreciate a feature that allows for longer caching (on disk). My current workaround is using the "exec" output plugin with a custom Python script that tries to transmit the data and caches it locally in the event of a connection loss. |
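The commenter's script is not shown in the thread. As a minimal sketch of the idea, assuming the script receives serialized metric lines and has some way to transmit them, a disk spool could look like the following. The names `spool_and_send`, `spool_path`, and `send` are hypothetical.

```python
import os


def spool_and_send(lines, spool_path, send):
    """Append new lines to a spool file, then try to send everything in it.

    `send` is a hypothetical callable that takes a list of lines and
    raises OSError (for example ConnectionError) when the server is
    unreachable. Returns the number of lines delivered.
    """
    # Persist first, so nothing is lost if the send below fails.
    with open(spool_path, "a", encoding="utf-8") as f:
        for line in lines:
            f.write(line.rstrip("\n") + "\n")

    with open(spool_path, encoding="utf-8") as f:
        pending = f.read().splitlines()
    if not pending:
        return 0

    try:
        send(pending)
    except OSError:
        return 0  # leave the spool on disk for the next run

    os.remove(spool_path)  # delivered, so the cache can be cleared
    return len(pending)
```

This trades memory for disk I/O and leaves the spool file unbounded, so a real script would likely also need a size cap.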
@natejgardner is the buffer working then? |
Relating #802 |
Feature Request
Opening a feature request kicks off a discussion.
Proposal:
For IOT and mobile devices, connectivity is not guaranteed to be consistent. It would be great if Telegraf could robustly handle sending events gathered while offline as soon as the connection is restored.
Current behavior:
Telegraf only writes to InfluxDB if InfluxDB is reachable. Events gathered when InfluxDB is unreachable are discarded and not written to InfluxDB when the connection is restored.
Desired behavior:
Telegraf holds all messages in the buffer while InfluxDB is not reachable and only removes them from the buffer when InfluxDB has responded that the writes were successful (as long as the buffer hasn't filled).
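The desired behavior can be sketched as a bounded buffer that removes a batch only after the write is acknowledged. This is an illustration of the request, not Telegraf's implementation; `RetryBuffer` and the `write` callable are hypothetical names.

```python
from collections import deque


class RetryBuffer:
    """Bounded in-memory buffer that drops a batch only after a confirmed write."""

    def __init__(self, limit):
        # deque(maxlen=...) silently discards the oldest item once full.
        self._items = deque(maxlen=limit)

    def add(self, metric):
        self._items.append(metric)

    def __len__(self):
        return len(self._items)

    def flush(self, write, batch_size):
        """Try to send the oldest batch; return True if it was acknowledged."""
        batch = list(self._items)[:batch_size]
        if not batch:
            return True
        try:
            write(batch)  # hypothetical callable; raises OSError on failure
        except OSError:
            return False  # keep everything for the next attempt
        # Acknowledged: remove exactly the metrics that were sent.
        for _ in batch:
            self._items.popleft()
        return True
```

Once the limit is reached the oldest metrics are lost, which matches the "as long as the buffer hasn't filled" condition in the request.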
Use case: [Why is this important (helps with prioritizing requests)]
IOT devices, mobile devices, and basically everything that uses wifi or mobile networks deal with inconsistent connectivity. It'd be nice if that didn't imply losing all the data from those moments when the client is disconnected-- especially when behavior while disconnected is what one is trying to analyze!