Multiple Output resilience broken #6694
Comments
The pausing behavior here is intended. The inputs […] Of course, the panic crash is a bug; I believe I have fixed this in #6806.
@danielnelson and what if I am OK with not delivering metrics to a faulty output? I have multiple outputs, some of which I do not control. Sometimes they go down, and I do not care if they do not receive data, as long as my own output destinations get data successfully.
I have the same need as @aurimasplu. @danielnelson, could you please suggest a workaround to ignore max_undelivered_messages? I am using Telegraf 1.14.4.
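For context, max_undelivered_messages is a per-plugin option on the queue-consumer inputs (mqtt_consumer, kafka_consumer, amqp_consumer, and similar) that caps how many metrics the input keeps in flight before pausing. A minimal sketch of where the setting lives; the plugin choice, broker address, topics, and value are illustrative and not taken from this thread:

```toml
[[inputs.mqtt_consumer]]
  servers = ["tcp://localhost:1883"]   # illustrative broker address
  topics = ["telegraf/#"]              # illustrative topic filter
  ## Maximum number of metrics read from the broker that have not yet been
  ## delivered to every configured output; once this many are in flight,
  ## the input pauses until deliveries are acknowledged.
  max_undelivered_messages = 1000
```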
Same here, but I think there’s another ticket for that already.
@Hipska, can you put a reference to that ticket, please?
I am using a fork of Telegraf with a postgres output (https://github.com/phemmer/telegraf/tree/postgres) on a Raspberry Pi to push to a local TimescaleDB and a remote one. It has worked fine for a long time, with low latency. I observed today that if the remote DB goes down or access to it is somehow lost, then the data does not go to any of the other locally accessible outputs. When I removed the problematic output from the conf, things started working fine. @phemmer, any ideas about this? This issue seems similar to the one reported here: #6694. However, it seems your source already has the code fix for this issue.
There is certainly some kind of issue here. I think most people would expect Telegraf to be robust to output issues when there is more than one output, and to recover gracefully even if the only output goes away and comes back (which I think you've already fixed?). In my case, it was an issue with the MQTT output (#10180): instead of only stopping that output, it crashed Telegraf completely. My argument is that Telegraf is often used for system monitoring; if your monitoring app crashes because of an external issue, that's a major problem. This is, after all, why we may configure more than one output channel. My view is that no output channel should ever cause Telegraf to crash. Complain loudly, yes; crash, no. Without this, Telegraf unfortunately cannot be used as the main/only monitoring system. And if I have to put in a second monitoring system to monitor the first, then it won't happen, which would be a shame, because I like Telegraf and would like to recommend it.
I do fully agree with you that Telegraf should not crash when a recoverable problem occurs with one of its outputs. On the other hand, Telegraf is NOT a monitoring solution. It is a data collection agent (mostly for time series data) which can obviously also be used to collect monitoring data. It does not handle or keep state, does not send alerts, does no remediation on the collected data, and so on. You would always need a 'real' monitoring tool to keep an eye on whether Telegraf is still running, whether the metrics are being collected/stored, or even to take actions (alerts/remediation) based on the values of the collected data.
Hi! Some years later, I have stumbled upon the same problem. I collect a lot of information to several outputs. One of these outputs is an InfluxDB instance reachable through a network connection (satellite) that is not always available. Having this data in InfluxDB in real time when the link is up is a plus, but it is not critical, since the data is also collected and transferred to another online location. I understand the rationale behind keeping all outputs synced, but I think that sync could be optional and configurable.
Relevant telegraf.conf:
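The original configuration was not captured in this extract. A minimal sketch of the kind of setup described in this report, assuming two InfluxDB-style outputs; the addresses, database names, and buffer size are inferred from the log excerpt below and are purely illustrative:

```toml
[agent]
  ## Matches the "x / 100000" buffer-fullness figures in the log excerpt below.
  metric_buffer_limit = 100000

## Local, normally reachable output.
[[outputs.influxdb]]
  urls = ["http://127.0.0.1:8086"]   # illustrative local address
  database = "telegraf"

## Remote output reached over a VPN tunnel; the one that intermittently
## becomes unreachable in this report.
[[outputs.influxdb]]
  urls = ["http://192.168.x.y:8086"]
  database = "pmi"

## The report notes the same behavior with an [[outputs.influxdb_v2]] output.
```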
System info:
telegraf 1.12.1-1.12.4 (earlier versions are probably affected as well; only tested with .1 and .4)
Steps to reproduce:
Expected behavior:
I would expect all other outputs to continue functioning unaffected, and the output that went away to queue up metrics until the buffer limit is reached.
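For reference, the queuing behavior expected above is bounded by metric_buffer_limit, which can be overridden per output in addition to the agent-level default. A minimal sketch with illustrative values:

```toml
[[outputs.influxdb]]
  urls = ["http://192.168.x.y:8086"]
  ## Per-output override: how many unwritten metrics this output may queue
  ## while unreachable before the oldest metrics start being dropped.
  metric_buffer_limit = 100000
```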
Actual behavior:
If one output goes away ungracefully (in my case a VPN tunnel going down; a firewall drop does the same) and packets run into timeouts, Telegraf starts dropping metrics for all outputs. If the lost output comes back, Telegraf panics and gets restarted by systemd. All buffered metrics are lost for all outputs. This happens with both the influxdb and influxdb_v2 outputs.
Additional info:
Nov 18 11:01:00 telegraf[11714]: 2019-11-18T10:01:00Z D! [outputs.influxdb] Wrote batch of 51 metrics in 61.231974ms
Nov 18 11:01:00 telegraf[11714]: 2019-11-18T10:01:00Z D! [outputs.influxdb] Buffer fullness: 1 / 100000 metrics
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [outputs.influxdb] when writing to [http://192.168.x.y:8086/]: Post http://192.168.x.y:8086/write?db=telegraf: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z D! [outputs.influxdb] Buffer fullness: 787 / 100000 metrics
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [agent] Error writing to outputs.influxdb: could not write any address
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [outputs.influxdb] when writing to [http://192.168.x.y:8086/]: Post http://192.168.x.y:8086/write?db=telegraf: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z D! [outputs.influxdb] Buffer fullness: 946 / 100000 metrics
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [agent] Error writing to outputs.influxdb: could not write any address
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [outputs.influxdb] when writing to [http://192.168.x.y:8086/]: Post http://192.168.x.y:8086/write?db=pmi: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z D! [outputs.influxdb] Buffer fullness: 1000 / 100000 metrics
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [agent] Error writing to outputs.influxdb: could not write any address
Nov 18 11:01:10 telegraf[11714]: 2019-11-18T10:01:10Z D! [outputs.influxdb] Buffer fullness: 0 / 100000 metrics
Nov 18 11:01:10 telegraf[11714]: 2019-11-18T10:01:10Z D! [outputs.influxdb] Wrote batch of 23 metrics in 96.375817ms
Nov 18 11:01:10 telegraf[11714]: 2019-11-18T10:01:10Z D! [outputs.influxdb] Buffer fullness: 17 / 100000 metrics
This keeps happening and data is not written to any of the outputs; I see a gap in the graphs.
This happens when the output comes back:
Nov 18 11:03:40 telegraf[11714]: panic: channel is full
Nov 18 11:03:40 telegraf[11714]: goroutine 10357 [running]:
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/agent.(*trackingAccumulator).onDelivery(0xc000292780, 0x2c11e80, 0xc002bff860)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/agent/accumulator.go:167 +0x7a
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/metric.(*trackingData).notify(…)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/metric/tracking.go:73
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/metric.(*trackingMetric).decr(0xc00174ea60)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/metric/tracking.go:163 +0x9e
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/metric.(*trackingMetric).Accept(0xc00174ea60)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/metric/tracking.go:144 +0x3a
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/internal/models.(*Buffer).metricWritten(0xc0001c2fa0, 0x2c72240, 0xc00174ea60)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/internal/models/buffer.go:93 +0x72
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/internal/models.(*Buffer).Accept(0xc0001c2fa0, 0xc002092000, 0x30, 0x30)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/internal/models/buffer.go:179 +0xa6
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/internal/models.(*RunningOutput).Write(0xc0001aa280, 0x0, 0xc000560660)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/internal/models/running_output.go:190 +0xf7
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/agent.(*Agent).flushOnce.func1(0xc001755b00, 0xc0016d7bc0)
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/agent/agent.go:597 +0x27
Nov 18 11:03:40 telegraf[11714]: created by github.com/influxdata/telegraf/agent.(*Agent).flushOnce
Nov 18 11:03:40 telegraf[11714]: #11/go/src/github.com/influxdata/telegraf/agent/agent.go:596 +0xc8
Nov 18 11:03:40 systemd[1]: telegraf.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Nov 18 11:03:40 systemd[1]: telegraf.service: Unit entered failed state.
Nov 18 11:03:40 systemd[1]: telegraf.service: Failed with result ‘exit-code’.
Nov 18 11:03:40 systemd[1]: telegraf.service: Service hold-off time over, scheduling restart.