metrics-cpu-mpstat.rb inflating CPU when run from sensu #11
Comments
This will also be an issue with the CPU check: by default it samples for one second, though there you are only concerned with whether the CPU is over a high threshold. You can also mitigate the problem by increasing the sleep length, which lengthens the period the deltas are measured over and so reduces the weight of transient spikes in that period. If you are using metrics-cpu.rb and getting accurate stats into Graphite, it might be better to run the CPU check by querying the Graphite stats, rather than running a check on the client itself and trying to sample the CPU utilisation.
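To put rough numbers on the effect of the sleep length, here is a minimal sketch in Ruby. It is a toy model, not code from the plugin: `measured_utilisation` is a hypothetical helper, and the 25% baseline and 0.5 second burst are made-up figures. Under those assumptions a burst at 100% reads as 62.5% over a one second window and 32.5% over a five second window.

```ruby
# Toy model (hypothetical, not part of the plugin): a CPU running at
# `baseline` percent, plus one burst at 100% lasting `burst_seconds`,
# averaged over a sampling window of `window_seconds`.
def measured_utilisation(baseline, burst_seconds, window_seconds)
  # The burst cannot occupy more of the window than the window itself.
  burst = [burst_seconds, window_seconds].min
  quiet = window_seconds - burst
  ((burst * 100.0) + (quiet * baseline)) / window_seconds
end

# Made-up figures: 25% baseline load and a 0.5 second burst.
[1, 5, 10].each do |window|
  pct = measured_utilisation(25.0, 0.5, window)
  puts "window #{window}s -> #{pct}%"
end
```

The longer window dilutes the burst, at the cost of a check that takes that much longer to return.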
@dalesit could you put together a PR with some of the recommendations outlined above (such as increasing the default sleep) for getting more accurate results?
I don't think there is a real solution for this module. For metrics, the sampling approach is flawed: the sensu checks and metrics arrive at the same time, so the CPU will be artificially high at that point. It is worse for machines (or VMs) with a restricted number of cores. metrics-cpu.rb gives accurate figures, but its output is raw ticks rather than CPU usage, so it needs converting to CPU utilisation in the graphing solution. For the CPU check, at least there is the option to change the sampling window. For a multi-CPU system it is less of a concern, as the sensu checks are less likely to be hogging the CPU for that second. However, for smaller VMs with one or two cores there is a higher likelihood of false positives from a sampling approach, and extending the sampling window will mitigate this. It then becomes a tradeoff between the time taken to complete the check and how representative the sample is.
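The conversion from raw ticks to utilisation can be sketched as follows. This is an illustration under stated assumptions, not the plugin's code: the helper names are hypothetical, the two sample lines are made up, the field order is that of the aggregate `cpu` line in Linux `/proc/stat`, and counting iowait as idle time is a choice you may want to make differently.

```ruby
# Field order of the aggregate "cpu" line in Linux /proc/stat.
# The trailing guest fields are skipped because guest time is
# already counted in user time.
FIELDS = %i[user nice system idle iowait irq softirq steal].freeze

# Parse a "cpu ..." line into a hash of tick counters (hypothetical helper).
def parse_cpu_line(line)
  values = line.split[1, FIELDS.size].map(&:to_i)
  FIELDS.zip(values).to_h
end

# Percentage of non-idle ticks between two samples.
# Assumption: iowait is treated as idle time.
def utilisation(previous, current)
  delta = FIELDS.map { |f| [f, current[f] - previous[f]] }.to_h
  total = delta.values.sum
  return 0.0 if total.zero?

  idle = delta[:idle] + delta[:iowait]
  100.0 * (total - idle) / total
end

# Made-up sample lines standing in for two consecutive metric runs.
earlier = parse_cpu_line('cpu 1000 0 500 8000 100 0 0 0')
later   = parse_cpu_line('cpu 1200 0 600 8650 150 0 0 0')
puts utilisation(earlier, later)
```

Because the result depends only on the difference between two counter readings, a short burst caused by the metric run itself is averaged over the whole interval between runs rather than over a one second sample.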
@dalesit I see. When I have some time I will read through the code in depth and validate, but I am of the opinion that if a check is inherently flawed and no reasonable solution can be found, we should remove it in a major release. Have you played around at all with the sampling in the CPU check and determined a window that seems reasonable? This will obviously depend on the machine's hardware, but I was curious whether you had any findings to share. If not, when I have some time I will try to see what seems like a reasonable window with a couple of VMs.
I have a problem on one of my systems, which shows nearly 100% CPU utilisation according to metrics-cpu-mpstat but 75% idle time according to sar. The act of measuring the CPU over a second seems to be inflating the CPU stats, although not on other identical servers, which are 88% idle (according to sar). Running top -d 0.5 shows that the check-rabbitmq checks run at high CPU when they come through. If they coincide with the CPU metrics run, the stats will be skewed. By contrast, metrics-cpu.rb just takes the counter and compares it against the value the next time it is run, so it gives a true picture of CPU utilisation over that interval.