telegraf does not reconnect to influxdb #5905
@rdxmb Not sure what is going on here; maybe stale DNS caching. Here are a couple of questions that could help us understand what is going on:
Telegraf is running natively under systemd
hmm... What would be the way to reproduce it? Completely without a reverse proxy? So what should the connection be instead? NodePort with HTTP (without SSL)? ...?
I will try next week.
telegraf -> DNS -> public IP -> firewall -> internal virtual IP with traefik/ingress -> kubernetes DNS -> service -> pod. Only the pod and probably its IP are changing, so I cannot see what could be a DNS problem here.
Maybe you could check with netstat or similar whether Telegraf is reconnecting to the correct IP?
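For reference, a check along those lines could look roughly like this (a sketch; both commands are generic, and elevated privileges are assumed so the owning process is shown):

```sh
# Show established TCP connections owned by the telegraf process,
# including the peer IP it is currently connected to.
sudo netstat -tnp | grep telegraf
# or, with the newer ss tool:
sudo ss -tnp | grep telegraf
```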
yes, it is:
yes, it does.
A couple of follow-up questions come to mind:
On 6 of the 9 nodes there are also containers running which include telegraf - writing to different domains, but the same IP. To simplify the debugging I will pick out the 3 nodes where only the telegraf from systemd is running:
What do you mean? When the connection disappeared or while it is running?
-> No. (So it simply reuses the established connection? - IMHO this is the point.)
Oh, after a ...
there are two active connections:
and after another ...
the oldest connection then dies:
... and then the other.
There will be an extra connection after SIGHUP; we have fixed this in #5912 but it's not released. We can see that it is reusing the connection during normal processing, but what about when InfluxDB goes offline? Can you show it in the three states: telegraf+influxdb working, influxdb down, and influxdb up but telegraf not able to send? Show 2-3 samples of each state if you can.
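One possible way to collect those samples (a hypothetical helper; the 10-second interval is arbitrary, and root is assumed so ss can show the owning process):

```sh
# Take a timestamped snapshot of Telegraf's TCP connections every 10 seconds,
# so the three states (working, influxdb down, influxdb back up) can be compared.
while true; do
  date
  sudo ss -tnp | grep telegraf
  echo "---"
  sleep 10
done
```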
Node1
Same on the two other nodes doing this at the same time.
It is still trying to use the same connection after InfluxDB is brought down. For comparison, here it is with all components on a local system; in this case I get a ...
Can this have something to do with the sizes of interval and flush and the time telegraf needs for reading and writing metrics at a specific time? Just an idea: when the first write to influx is not finished yet, will telegraf use the existing connection for its second write?
Telegraf never writes twice to an output at the same time. If the output takes so long to write that the next ... Probably a long shot, but could you try out one of the nightly builds to see if anything has changed?
Maybe on one of these nights. This is my production environment - I don't want to stop the influxdb during the day. Do you already have an idea of what is happening here?
Understandable. No, I'm not sure; it feels like after a timeout Go should be closing the connection because it would be dirty. Could it be a Go bug? We need simpler reproduction steps if we want to try reporting upstream; otherwise someone will need to replicate your setup and dig deeper into the code. I tried writing an HTTP server that never responds, but that wasn't enough to replicate it, so my next idea is to try to borrow a colleague's time so we can try this out on our internal Kubernetes cluster.
@danielnelson I can offer to deploy a separate influxdb instance in our kubernetes cluster. You could get write access via https -> traefik ingress -> influxdb, so you could install telegraf on your side and debug.
I may take you up on this, but give me a bit more time to research on my end.
Can you try setting this environment variable which should disable HTTP/2 support?
Where? In the systemd unit? Or ...?
Should work in either spot, but I recommend ...
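The variable and the recommended location are cut off in this excerpt. Assuming the variable meant is Go's GODEBUG=http2client=0 (which disables HTTP/2 in the Go HTTP client), a systemd drop-in would be one common way to set it; a sketch, not the maintainer's exact recommendation:

```sh
# Sketch: set the variable via a systemd drop-in for the telegraf service.
# GODEBUG=http2client=0 is an assumption about which variable is meant here.
sudo mkdir -p /etc/systemd/system/telegraf.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/telegraf.service.d/override.conf
[Service]
Environment="GODEBUG=http2client=0"
EOF
sudo systemctl daemon-reload
sudo systemctl restart telegraf   # full restart; a SIGHUP alone will not pick up new env vars
```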
In my actual test setup ...
In another test, all nodes with the var set reconnected. The one without that var set did not reconnect. Is this already the solution or just a workaround? I will try to get an ...
Thanks for the testing. This is just a workaround, though; I think the biggest downside is needing to set the variable.
looks good: without variable set
with variable set
@danielnelson to set this variable for telegraf in docker: Is it enough to set it as an env var via Docker?
I believe so.
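A hedged example of what that could look like (image name, config mount, and the variable itself are assumptions carried over from above):

```sh
# Pass the variable into a containerized Telegraf via the container environment.
docker run -d --name telegraf \
  -e GODEBUG=http2client=0 \
  -v /etc/telegraf/telegraf.conf:/etc/telegraf/telegraf.conf:ro \
  telegraf
```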
works 👍
I still have the problem that one node does not reconnect and quits writing metrics. All other nodes and pods (in the same and in other networks) are reconnecting ... This seems crazy to me.
Even with the environment variable set on ALL nodes?
yes. Here on the host with the problem:
And actually ...
Did you do a full restart at least once? SIGHUP won't be enough for Telegraf to get new environment variables.
Multiple restarts during the last week ... At the moment, telegraf is connected. It seems like it takes a very long time to come back. I will watch it over the next days - maybe the connection / metrics come back later than expected ...
This issue might be related to golang/go#21978.
@rdxmb what version of traefik are you using, and is there a way you can share your config? I'm still failing to reproduce.
see above. What I actually see is that there is no headless service for the statefulSet. Could that be a problem?
As I wrote above:
Definitely not. The VIP only switches when traefik does not answer on port 80.
If I can help you in any way, please let me know.
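Regarding the headless-service question above, a quick check could look like this (service name and namespace are assumptions):

```sh
# Print the ClusterIP of the service in front of the InfluxDB StatefulSet.
# "None" would indicate a headless service; an IP address means a normal ClusterIP service.
kubectl -n monitoring get svc influxdb -o jsonpath='{.spec.clusterIP}'
```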
I can confirm the same thing happening with telegraf ...
I'll note that it appears InfluxCloud 1.x DB instances are fronted by AWS ALBs speaking HTTP/2, so this might be an issue when AWS rotates ALB instances. I'm in the process of diagnosing some intermittent connectivity issues we've had for a while and will report here (and to influx support) with what I find.
This seems to have been stable for 5 months in my case 👍, no matter whether telegraf is running in a docker container or natively on an ubuntu host.
After collecting some more data, what we are seeing looks like the following.
I'll start to test disabling http/2, and if that doesn't cause any issues I'll make disabling it our default. Then we can see whether we continue to have 15 minutes of failed metric sending during step 3 above or not.
In testing with telegraf 1.11.4, I can mostly reproduce this behavior at will by blocking the traffic to the ALB IP address in use, e.g. using ... At this point, the connection is still in the ESTABLISHED state but no traffic can egress. Eventually it switches to CLOSE-WAIT and ... Upon the next write, telegraf starts spewing "Client.Timeout exceeded while awaiting headers" and other log messages mentioned earlier. I waited 30 minutes and this never recovered. This means I'm not perfectly reproducing the 15-minute hang then recovery that I see in the wild, but I am reproducing the behavior others have described in this issue. The behavior improves dramatically if I restart telegraf with ...
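The exact command used above is cut off; one way to simulate such an outage is sketched below (the ALB IP is a placeholder, and iptables is an assumption about the tooling used):

```sh
# Block egress to the load balancer IP so the existing TCP connection stays
# ESTABLISHED locally while all outgoing traffic is silently dropped.
sudo iptables -A OUTPUT -d 203.0.113.10 -j DROP
# ... observe Telegraf's write attempts, then remove the rule again:
sudo iptables -D OUTPUT -d 203.0.113.10 -j DROP
```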
This upstream issue describes the problem more closely than the issue I linked earlier: golang/go#36026. The workaround in #7517 should solve the issue for the ...
Relevant telegraf.conf:
System info:
Steps to reproduce:
Expected behavior:
telegraf does reconnect
Actual behavior:
telegraf is not able to reconnect
Additional info:
influxdb is running as a statefulSet in kubernetes
stopping influxdb is done by setting the replicas to 0
starting influxdb is done by setting the replicas to 1 - which starts a new pod
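Those stop/start steps presumably map to commands along these lines (StatefulSet name and namespace are assumptions):

```sh
# Stop InfluxDB by scaling the StatefulSet to zero replicas ...
kubectl -n monitoring scale statefulset influxdb --replicas=0
# ... and start it again, which creates a new pod (and usually a new pod IP).
kubectl -n monitoring scale statefulset influxdb --replicas=1
```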
telegraf.log:
influxdb is reachable and curl can write:
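The original curl command is not included in this excerpt; a manual write test of this kind would look roughly like the following (URL, database, and credentials are placeholders):

```sh
# Write a single test point to the InfluxDB 1.x /write endpoint over the same HTTPS ingress.
curl -i -XPOST 'https://influxdb.example.com/write?db=telegraf' \
  -u telegraf:secret \
  --data-binary 'reconnect_test,host=node1 value=1'
```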
a telegraf restart (although taking a long time) does fix the problem
After that, telegraf does write to influxdb again.