'systemctl restart telegraf' failing (related to logparser ?) #4610
Comments
Does it behave the same if you use reload?
That is what I tried after I discovered these problems. I changed to reload, but this was even worse: telegraf stopped sending metrics entirely. And again there was a weirdly cut-off entry in the log.
Then the logfile ends and nothing is written to it anymore until I changed back to restart. As you can see, telegraf was still running and performing the reloads, but it had stopped working entirely.
By the way, I was using restart because of all the other bugs with logparser: it does not notice that a file has been overwritten, a second logparser does not work when using reload, etc. So now my latest workaround is broken as well.
Now I tried:
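Purely as a hypothetical illustration of the kind of cron-driven restart/reload workaround being discussed here (the schedule and the use of /etc/cron.d are assumptions, not the actual setup from this issue):

```
# /etc/cron.d/telegraf-restart (hypothetical; schedule and path are assumptions)
# original workaround: periodic restart
20 * * * * root /bin/systemctl restart telegraf
# variant tried instead: periodic reload
#20 * * * * root /bin/systemctl reload telegraf
```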
On Ubuntu 14.04 the restart problem shows up differently. Here telegraf processes are not stopped properly (because of the lack of systemd kill mechanisms?) and keep piling up. The problem is that they still produce load, and after a couple of weeks the load rises dramatically. I need to use SIGKILL to get rid of these old processes. ps output:
On every telegraf restart this error message shows up:
Still the files are created on startup:
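Regarding the SIGKILL cleanup of the piled-up processes mentioned above, a minimal sketch of how one might list and force-kill leftover telegraf processes (illustrative commands only, not taken from the original report):

```
# list leftover telegraf processes (the [t] trick keeps grep itself out of the output)
ps -ef | grep '[t]elegraf'
# force-kill all processes whose executable name is exactly "telegraf"
pkill -KILL -x telegraf
```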
I migrated all my logparsers to the tail plugin with the grok data_format and that worked very well. Also, I am on telegraf 1.8 on all hosts now. Then I switched to
Glad to hear the tail plugin is working better than logparser, though I am somewhat surprised that it helped with the rotation issue. I am working on a redesign of the shutdown procedure for 1.9 which I think will fix this issue.
So far I have had telegraf 1.8 and the tail plugin running on 3 LTS versions of Ubuntu on 7 servers. On some machines I had neither reload nor restart enabled; on the others, either a periodic restart or a periodic reload. This mixed test setup has been running for over a week now, and on all machines the input files were overwritten with new preprocessed logfile entries every 10 minutes. If this is comparable to a normal log rotation, I have not seen the rotation issue anymore during this time.
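For reference, a minimal sketch of the kind of [[inputs.tail]] section with the grok data_format that this migration describes; the file path and pattern are assumptions, not the actual config from these hosts:

```toml
# hypothetical tail + grok input; path and pattern are assumptions
[[inputs.tail]]
  files = ["/var/log/nginx/access_preprocessed.log"]
  from_beginning = false
  data_format = "grok"
  grok_patterns = ["%{COMBINED_LOG_FORMAT}"]
```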
I think this is the same as reported in #4457
I now have 1.10.0~a50fb5a2-0 (and had 1.9.0 with the memory leak) but still see this behavior on Ubuntu 14.04. Before upgrading to 1.9.0 I did a periodic reload of the agent, which mostly worked (sometimes the agent just stopped sending metrics). I couldn't test 1.9+ on Ubuntu 18.04/16.04 yet. Also, both with and without a periodic reload of the agent, the tail plugin is losing a lot of metrics, which worked better in 1.8; that's why I switched back to restart.
In the metrics you can see:
Now I have downgraded to 1.8.3 with periodic reloads for this host to get some kind of stability for production. The most important metrics come from reading the webserver logs, which didn't work well with 1.9+.
Relevant telegraf.conf:
The agent runs some basic system metrics plus one logparser that parses a preprocessed (sorted) nginx access.log. The preprocessing script runs every 10 minutes and extracts 10 minutes of log data (from the past, to make sure the nginx buffer is flushed) into a sorted file for ingestion by the telegraf logparser (about 10,000 lines per run).
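A minimal sketch of what the agent and basic system-metric sections of such a telegraf.conf typically look like; the intervals and plugin selection below are assumptions, not the reporter's actual file:

```toml
# hypothetical agent and system-metric sections; values are assumptions
[agent]
  interval = "10s"
  flush_interval = "10s"

[[inputs.cpu]]
[[inputs.mem]]
[[inputs.disk]]
[[inputs.system]]
```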
Logparser config (one file in /etc/telegraf/telegraf.d):
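A minimal sketch of such a logparser section for the preprocessed nginx access log; the path and pattern are assumptions, not the file actually deployed in /etc/telegraf/telegraf.d:

```toml
# hypothetical logparser input; path and pattern are assumptions
[[inputs.logparser]]
  files = ["/var/log/nginx/access_preprocessed.log"]
  from_beginning = true
  [inputs.logparser.grok]
    patterns = ["%{COMBINED_LOG_FORMAT}"]
```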
System info:
telegraf 1.7.3 on Ubuntu 18.04 server
Steps to reproduce:
Expected behavior:
Restart succeeds
Actual behavior:
Sometimes (not always) the restart doesn't finish and the process is killed by systemd.
Additional info:
Journal:
In the log below one can see that the telegraf restart is triggered at 18:20:02 (from the script in CRON above) and that logparser then tries to parse a line even before it flushes the cached metrics. Also, that log entry does not exist in this form and is somehow cut off. This is always seen when the problem occurs, so it seems to be related to logparser.
After flushing the old metrics it takes the systemd TimeoutSec before the process gets killed and restarted (successfully).
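As a hedged illustration of the timeout behavior mentioned above, a systemd drop-in that raises the stop timeout could look like this; the value is an assumption, not a setting recommended anywhere in this thread:

```ini
# /etc/systemd/system/telegraf.service.d/override.conf (hypothetical; value is an assumption)
[Service]
TimeoutStopSec=90
```

After adding such a drop-in, `systemctl daemon-reload` is needed for it to take effect.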
telegraf.log: