Fix CPU system plugin that get stuck after suspend #3342
Signed-off-by: Pierre Fersing <pierre.fersing@bleemeo.com>
4b1953d to 57c75c5
(cherry picked from commit f5a9d1b)
Thank you, I'll include this in 1.4.3.
I upgraded from 1.4.2 to 1.4.3 to get this fix. Seems to work great for me. For reference, this fixes issue #721.
I don't know if I'm able to re-open this issue, but I am still seeing this problem. I have tried both telegraf 1.4.4 and 1.5.0-1. I see this error in the logs: E! Error in plugin [inputs.cpu]: Error: current total CPU time is less than previous total CPU time on telegraf version This has started to happen since my Ubuntu host in AWS decided to briefly suspend. I still don't know why it suspended in the first place. It's the telegraf stats that alerted me to the problem. The host is pending a reboot, which I will do tonight, but I've been unable to resolve the issue so far by re-installing different versions of telegraf.
@richm-ww Do you see this message once or does it repeat each interval?
It’s every interval:
Dec 28 18:01:00 telegraf[30685]: 2017-12-28T18:01:00Z E! Error in plugin [inputs.cpu]: Error: current total CPU time is less than previous total CPU time
Dec 28 18:02:00 telegraf[30685]: 2017-12-28T18:02:00Z E! Error in plugin [inputs.cpu]: Error: current total CPU time is less than previous total CPU time
Dec 28 18:03:00 telegraf[30685]: 2017-12-28T18:03:00Z E! Error in plugin [inputs.cpu]: Error: current total CPU time is less than previous total CPU time
Dec 28 18:04:00 telegraf[30685]: 2017-12-28T18:04:00Z E! Error in plugin [inputs.cpu]: Error: current total CPU time is less than previous total CPU time
Dec 28 18:05:00 telegraf[30685]: 2017-12-28T18:05:00Z E! Error in plugin [inputs.cpu]: Error: current total CPU time is less than previous total CPU time
Dec 28 18:06:00 telegraf[30685]: 2017-12-28T18:06:00Z E! Error in plugin [inputs.cpu]: Error: current total CPU time is less than previous total CPU time
Dec 28 18:07:00 telegraf[30685]: 2017-12-28T18:07:00Z E! Error in plugin [inputs.cpu]: Error: current total CPU time is less than previous total CPU time
Dec 28 18:08:00 telegraf[30685]: 2017-12-28T18:08:00Z E! Error in plugin [inputs.cpu]: Error: current total CPU time is less than previous total CPU time
Dec 28 18:09:00 telegraf[30685]: 2017-12-28T18:09:00Z E! Error in plugin [inputs.cpu]: Error: current total CPU time is less than previous total CPU time
Dec 28 18:10:00 telegraf[30685]: 2017-12-28T18:10:00Z E! Error in plugin [inputs.cpu]: Error: current total CPU time is less than previous total CPU time
Rich
@danielnelson I have the same problem as @richm-ww. It started on a couple of our Amazon Linux AWS instances at about the same time. What is strange is that the instances were working when the problem arose. Telegraf was updated to output from
after a couple of seconds
after another couple of seconds
If I'm counting properly, it looks like the decreasing column is cpu steal, and in fact it is going all over the place. You aren't suspending between these samples, are you @krise3k? Curious, what does top report as your cpu steal? It is the last item on a line like this:
@danielnelson No, I'm not doing anything with this host. Three days ago we had the same problem with another instance.
@danielnelson I wonder if adding option to ignore
I guess I would terminate and recreate any instance with -150635151360.0% cpu steal... this is just too weird for my taste. I think what we should do is remove the error message and just report what /proc/stat shows. If that means the percents are negative like top is showing, then at least you can see and alert on it.
First I thought that it's some kind of bug with reporting
I've been thinking about my idea to remove the error message and report percentages, and I don't think we can do this. There are a couple of known cases where the value will decrease, and I don't want to cause issues in these cases where the plugin can recover. What does everyone think about waiting to see if this clears up on AWS? If it continues, perhaps we could place guards around the logging message so that it will only log once. Another option would be to add an option to only report raw time, the opposite of
This fixes an issue where, after a Linux system is suspended and resumed, Telegraf no longer sends CPU metrics and logs:
This occurs after a suspend because the /proc/stat counters decrease [1].
In previous Telegraf versions this issue did not occur because the plugin updated lastStats and complained only ONCE. PR #3306 changed this behavior.
This PR restores the old behavior: if total CPU time decreases, still update lastStats (so the next metrics gather should work) and complain.
[1]: Result of cat /proc/stat before and after a system suspend:
The 4th field (idle) and 5th field (iowait) of cpu1, cpu2, and cpu3 are reset to 0 after the suspend, which causes them to decrease.
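To make the effect of that reset concrete, here is a small sketch that parses two `cpuN` lines in /proc/stat format and compares their totals. The helper names (`parseCPULine`, `totalTicks`, `decreased`) are hypothetical and not taken from Telegraf. The sketch sums only the first eight counters (user through steal), on the assumption that guest and guest_nice are already accounted for in user and nice.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPULine splits one "cpuN ..." line of /proc/stat into its name and
// numeric counters (user, nice, system, idle, iowait, irq, softirq, steal, ...).
func parseCPULine(line string) (string, []uint64, error) {
	parts := strings.Fields(line)
	if len(parts) < 2 || !strings.HasPrefix(parts[0], "cpu") {
		return "", nil, fmt.Errorf("not a cpu line: %q", line)
	}
	counters := make([]uint64, 0, len(parts)-1)
	for _, p := range parts[1:] {
		v, err := strconv.ParseUint(p, 10, 64)
		if err != nil {
			return "", nil, fmt.Errorf("bad counter %q: %w", p, err)
		}
		counters = append(counters, v)
	}
	return parts[0], counters, nil
}

// totalTicks sums at most the first eight counters (user through steal).
func totalTicks(counters []uint64) uint64 {
	n := len(counters)
	if n > 8 {
		n = 8
	}
	var sum uint64
	for _, v := range counters[:n] {
		sum += v
	}
	return sum
}

// decreased reports whether the total of the second line is lower than the first.
func decreased(before, after string) (bool, error) {
	_, b, err := parseCPULine(before)
	if err != nil {
		return false, err
	}
	_, a, err := parseCPULine(after)
	if err != nil {
		return false, err
	}
	return totalTicks(a) < totalTicks(b), nil
}
```

With made-up values, `decreased("cpu1 100 0 50 9000 40 0 0 0 0 0", "cpu1 120 0 60 0 0 0 0 0 0 0")` reports a decrease: user and system kept growing, but zeroing idle and iowait pulls the total far below the earlier sample.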