Metricbeat: Use system-wide ticks to calculate docker cpu usage #6608
Conversation
```go
	return u.Stats.CPUStats.SystemUsage - u.Stats.PreCPUStats.SystemUsage
}

func (u *cpuUsage) perCpu() common.MapStr {
```
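Based on the PR description, the new percentage comes from the ratio between the container's tick delta and the system-wide tick delta, scaled by the number of CPUs. A minimal sketch of that calculation (function and parameter names here are illustrative, not the actual Metricbeat code):

```go
package main

import "fmt"

// cpuUsagePercent sketches the calculation described in this PR:
// the ratio between the container's CPU tick delta and the
// system-wide tick delta, multiplied by the number of CPUs.
// All names are illustrative, not Metricbeat's real identifiers.
func cpuUsagePercent(containerDelta, systemDelta uint64, numCPUs int) float64 {
	if systemDelta == 0 {
		return 0
	}
	return float64(containerDelta) / float64(systemDelta) * float64(numCPUs) * 100
}

func main() {
	// A container using 50% of the ticks on a 4-CPU system
	// reports 200%, as in the PR description below.
	fmt.Println(cpuUsagePercent(50, 100, 4))
}
```

This matches the `docker stats` / `top` convention where the upper bound is 100 times the number of CPUs, not 100.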
`func perCpu` should be `perCPU`
Force-pushed from 31027c1 to 9b3a805
The reasoning looks good to me. Have a look at the tests, something is wrong there:
Force-pushed from 9b3a805 to 8a24b98
```diff
@@ -6,31 +6,37 @@
 fields:
   - name: kernel.pct
     type: scaled_float
     scaling_factor: 1000000
```
I remember that a scaling factor > 1000 no longer has any compression advantage over just using `float` instead of `scaled_float`. So I suggest we just change to `float` here. Also below.
Umm, if I understood correctly, `scaled_float` is stored as an integer; wouldn't it be a breaking change to change it to `float`? (OTOH, changing the `scaling_factor` could also be a breaking change.)
As each version has a new template, things should not break here. From a query perspective I think it was already like a float before, so having a float afterwards should just work.
If someone ingests with the old template, they would just have lower accuracy but not get an error.
@jpountz Is the above correct? The ingestion and querying of `float` and `scaled_float` look identical?
Assuming all values are between 0 and 1, this field will likely require only 20 bits per value, which is still better than a float. However, I'm very surprised you need so much precision for a percentage; why are 100 or 1000 not enough?
The thing is not precision, but that this is a percentage that can be over 100%. It is for CPU usage, as can be seen in `top` or `docker stats`, and this can be up to 100 times the number of CPUs.
Maybe the best is to use `float` directly.
@jpountz wdyt? Would it be OK to change to `float` for this?
The fact that it may go over 1 is not an issue; I still think `scaled_float` is a better fit. However, let's use a more reasonable scaling factor like 100?
Oh, now I wonder whether you might have interpreted the scaling factor as an upper bound of the values, while it's not. It just means that Elasticsearch will store `x * scaling_factor` as a `long` internally.
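The storage scheme described above (multiply by the scaling factor, round to a long) can be illustrated with a quick round-trip sketch. This is my own illustration of the idea, not Elasticsearch code:

```go
package main

import (
	"fmt"
	"math"
)

// encode mimics how scaled_float stores a value: multiply by the
// scaling factor and round to a long integer. Illustrative only.
func encode(x, scalingFactor float64) int64 {
	return int64(math.Round(x * scalingFactor))
}

// decode reverses the encoding, recovering the value at the
// precision the scaling factor allows.
func decode(stored int64, scalingFactor float64) float64 {
	return float64(stored) / scalingFactor
}

func main() {
	// With the default scaling_factor of 1000, a CPU percentage of
	// 123.456789 is stored as the long 123457 and read back as 123.457.
	// Note the value being over 100 is no problem for the encoding.
	stored := encode(123.456789, 1000)
	fmt.Println(stored, decode(stored, 1000))
}
```

This also shows why the scaling factor only controls precision, not range: larger input values simply produce larger stored longs.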
Oh, OK, understood now :) Then 100/1000 will probably be enough, yes. Thanks @jpountz!!
I'll leave it empty as it was before; the default seems to be 1000.
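Following that conclusion, the resulting `fields.yml` entry would simply omit `scaling_factor` so the default of 1000 applies. A sketch based on the diff above (the exact surrounding fields are assumptions):

```yaml
fields:
  - name: kernel.pct
    type: scaled_float
    # scaling_factor omitted: the default of 1000 gives enough
    # precision, and values above 100 are still fine since the
    # factor controls precision, not range
```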
This also cleans up the implementation a bit. A CHANGELOG entry should be added for this change. Perhaps you can add a few more details in the PR message about the previous value compared to the current one, more focused on users coming to this PR to understand the change?
Force-pushed from 8a24b98 to a256cad
Added a line to the changelog and more information to the commit message.
Force-pushed from a256cad to 807a57c
@jsoriano Needs another rebase.
Force-pushed from 807a57c to 3cae5b6
Rebased, but let's wait to make a decision on the type of percentage fields before merging.
This implementation is more coherent with the `docker stats` implementation. It is based on the ratio between the number of CPU ticks of the container and the number of ticks of the whole system during a fixed period of time. The previous implementation assumed that the whole system had a CPU time of one second during each measurement, which is not always 100% accurate. The new implementation also takes into account the number of CPUs, as other tools like `top` or `docker stats` do, so if a container is using 25% of the CPU time of a 4-CPU machine, the reported percentage will be 100%, with a maximum of 400% total usage if it is using all the CPU time. These changes can modify the resulting values of docker CPU metrics.
Force-pushed from 3cae5b6 to 82e4585
Changes to
@jsoriano Could you update the PR description with details on what the logic was before and what it is now? So users coming here to figure out how their values changed get it without reading the code.
PR description updated.
This PR continues #6608 in the aim to provide docker metrics more consistent with `docker stats`. Currently we report, for disk I/O, the number of operations per second. This PR adds accumulated I/O stats per number of operations and per data volume since the container started, as `docker stats` does. I think this information is more valuable, as it permits further aggregations when visualizing it. It also addresses some other issues in the metricset. Fields in events have been reorganized and old ones have been marked as deprecated. We could also add some extra functionality, such as obtaining metrics on Windows hosts too (as this module stands now, it would only work with docker daemons running on Linux, and not sure about Mac). On Linux we could also collect metrics per block device.

- [x] Add accumulated I/O stats to diskio metricset
- [x] Fix diskio metricset when docker reads from multiple devices
- [x] Some naming and documentation fixes
- [x] Fix memory leak in blkio service (last seen stats were being stored, but never removed)
- [x] More tests

To follow up in other issues:

- ~~Group by device in Linux?~~ To be done if requested in the future
- Support for Windows storage stats? Follow-up issue #6815
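The shift described above, from per-second rates to accumulated counters since container start, can be sketched roughly as follows; the struct and function names are illustrative assumptions, not the metricset's actual code:

```go
package main

import "fmt"

// blkioTotals holds accumulated I/O counters since the container
// started, in the spirit of `docker stats` (illustrative only).
type blkioTotals struct {
	ReadOps, WriteOps     uint64 // accumulated operation counts
	ReadBytes, WriteBytes uint64 // accumulated data volume in bytes
}

// rate derives an operations-per-second value from two samples of an
// accumulated counter, which is what the old metricset reported.
// Keeping the accumulated totals lets users compute this rate (or
// other aggregations) at query time instead.
func rate(prev, curr uint64, seconds float64) float64 {
	return float64(curr-prev) / seconds
}

func main() {
	prev := blkioTotals{ReadOps: 1000}
	curr := blkioTotals{ReadOps: 1300}
	// 300 reads over a 10-second window is 30 ops/sec.
	fmt.Println(rate(prev.ReadOps, curr.ReadOps, 10))
}
```

Reporting the totals also avoids keeping per-container "last seen" state forever, which relates to the memory leak fix mentioned in the task list.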
This implementation looks for better coherence with `docker stats`. It is based on the number of CPUs and the ratio between the number of CPU ticks of the container and the number of ticks of the whole system during a fixed period of time, as `docker stats`, `top` or similar tools do. This will make CPU percentage metrics greater than 100% in some cases (e.g. it will be 200% if a container is running during 50% of the ticks in a 4-CPU system). Before, it was based only on the ratio between container and system ticks.

Some documentation texts have been rewritten, and the scaling factor of percentages has been increased, as the CPU percentage can be over 100, up to 100 times the number of cores (as in `top`, `docker stats` or similar tools).

I think we might deprecate, or move to another part, the "system" CPU metric. It is used to calculate percentages of metrics for containers, but I don't think it is very useful beyond that. Also, with the current implementation its percentage should always be constant, as it is calculated as a ratio with itself.