Metricbeat: Use system-wide ticks to calculate docker cpu usage #6608

Merged (1 commit) on Mar 28, 2018

Conversation

@jsoriano (Member) commented Mar 20, 2018

This implementation aims for better coherence with docker stats.

It is based on the number of CPUs and the ratio between the number of CPU ticks used by the container and the number of ticks used by the whole system during a fixed period of time, as docker stats, top and similar tools do. This means CPU percentage metrics can be greater than 100% in some cases (e.g. 200% for a container running during 50% of the ticks on a 4-CPU system).
Before, it was based only on the ratio between container and system ticks.

Some documentation texts have been rewritten, and the scaling factor of percentages has been increased, as the CPU percentage can exceed 100, up to 100 times the number of cores (as in top, docker stats and similar tools).

I think we might deprecate the "system" CPU metric, or move it elsewhere. It is used to calculate the percentages of container metrics, but I don't think it is very useful beyond that. Also, with the current implementation its percentage is always constant, as it is calculated as a ratio with itself.
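The calculation described above can be sketched as follows. This is a minimal standalone sketch of the formula, not the actual Beats code; `cpuPercent` is a hypothetical helper name:

```go
package main

import "fmt"

// cpuPercent computes CPU usage the way docker stats does: the ratio of
// container ticks to system-wide ticks during the interval, scaled by the
// number of CPUs. The result can exceed 100% on multi-CPU hosts, up to
// 100 * numCPUs when the container uses all CPU time.
func cpuPercent(containerDelta, systemDelta uint64, numCPUs int) float64 {
	if systemDelta == 0 {
		return 0
	}
	return float64(containerDelta) / float64(systemDelta) * float64(numCPUs) * 100
}

func main() {
	// A container consuming 50% of all system ticks on a 4-CPU machine
	// is reported as 200%, matching the example in the description.
	fmt.Println(cpuPercent(500, 1000, 4))
}
```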

```go
	return u.Stats.CPUStats.SystemUsage - u.Stats.PreCPUStats.SystemUsage
}

func (u *cpuUsage) perCpu() common.MapStr {
```


`func perCpu` should be `perCPU` (Go convention for initialisms).

@jsoriano force-pushed the docker-stats-equivalence branch from 31027c1 to 9b3a805 on March 20, 2018 18:05
@jsoriano added the "in progress" label on Mar 20, 2018
@jsoriano assigned ruflin and exekias and unassigned ruflin and exekias on Mar 20, 2018
@jsoriano requested review from ruflin and exekias on March 20, 2018 18:44
@exekias (Contributor) commented Mar 21, 2018

The reasoning looks good to me; have a look at the tests, something is wrong there:

```
# github.com/elastic/beats/metricbeat/module/docker/cpu
module/docker/cpu/cpu_test.go:46:10: undefined: perCpuUsage
module/docker/cpu/cpu_test.go:73:10: undefined: totalUsage
module/docker/cpu/cpu_test.go:96:10: undefined: usageInKernelmode
module/docker/cpu/cpu_test.go:119:10: undefined: usageInUsermode
```

@jsoriano force-pushed the docker-stats-equivalence branch from 9b3a805 to 8a24b98 on March 21, 2018 11:59
```
@@ -6,31 +6,37 @@
  fields:
    - name: kernel.pct
      type: scaled_float
      scaling_factor: 1000000
```
Contributor:

I remember that a scaling factor > 1000 no longer has any compression advantage over just using float instead of scaled_float. So I suggest we just change to float here, and below as well.

Member Author:

Umm, if I understood correctly, scaled_float is stored as an integer; wouldn't changing it to float be a breaking change? (OTOH, changing the scaling_factor could also be a breaking change.)

Contributor:

As each version has a new template, things should not break here. From a query perspective it was already treated like a float before, so having a float afterwards should just work.

If someone ingests with the old template, they would just have lower accuracy but not get an error.

@jpountz Is the above correct? Are the ingestion and querying of float and scaled_float identical?

Contributor:

Assuming all values are between 0 and 1, this field will likely require only 20 bits per value, which is still better than a float. However, I'm very surprised you need so much precision for a percentage; why are 100 or 1000 not enough?

Member Author:

The issue is not precision, but that this is a percentage that can be over 100%. It is the CPU usage as seen in top or docker stats, and it can be up to 100 times the number of CPUs.
Maybe the best option is to use float directly.

Member Author:

@jpountz wdyt? Would it be ok to change to float for this?

Contributor:

The fact that it may go over 1 is not an issue; I still think scaled_float is a better fit. However, let's use a more reasonable scaling factor like 100?

Contributor:

Oh, now I wonder whether you might have interpreted the scaling factor as an upper bound on the values, which it is not. It just means that Elasticsearch will internally store x * scaling_factor as a long.
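The storage rule described here can be illustrated with a small sketch. This is a hypothetical helper mimicking the behavior, not Elasticsearch code:

```go
package main

import (
	"fmt"
	"math"
)

// scaledFloatStored mimics how Elasticsearch stores a scaled_float: the
// value is multiplied by the scaling factor and rounded to a long.
// The factor bounds the precision (1/scalingFactor), not the value itself.
func scaledFloatStored(value, scalingFactor float64) int64 {
	return int64(math.Round(value * scalingFactor))
}

func main() {
	// A 200% CPU reading with scaling_factor 1000 is stored as the long 200000;
	// values above the factor are fine, they are just stored as larger longs.
	fmt.Println(scaledFloatStored(200.0, 1000))
	// Digits beyond the factor's precision are lost on rounding.
	fmt.Println(scaledFloatStored(3.14159, 100)) // → 314
}
```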

Member Author:

Oh, ok, understood now :) Then 100/1000 will probably be enough, yes. Thanks @jpountz!!

I'll leave it empty as it was before; the default seems to be 1000.

@ruflin (Contributor) commented Mar 21, 2018

This also cleans up the implementation a bit. A CHANGELOG entry should be added for this change.

Perhaps you could add a few more details to the PR message comparing the previous values to the current ones, for users coming to this PR to understand the change?

@jsoriano force-pushed the docker-stats-equivalence branch from 8a24b98 to a256cad on March 22, 2018 08:59
@jsoriano (Member Author) commented Mar 22, 2018

Added a line to the changelog and more information to the commit message.

@jsoriano force-pushed the docker-stats-equivalence branch from a256cad to 807a57c on March 22, 2018 09:04
@ruflin (Contributor) commented Mar 27, 2018

@jsoriano Needs another rebase.

@jsoriano force-pushed the docker-stats-equivalence branch from 807a57c to 3cae5b6 on March 27, 2018 08:23
@jsoriano (Member Author):

Rebased, but let's wait to make a decision on the type of the percentage fields before merging.

This implementation is more coherent with the `docker stats`
implementation. It is based on the ratio between the number of CPU
ticks of the container and the number of ticks of the whole system
during a fixed period of time. The previous implementation assumed
that the whole system had a CPU time of one second during each
measurement, which is not always accurate.

The new implementation also takes into account the number of CPUs, as
other tools like `top` or `docker stats` do, so if a container is using
25% of the CPU time of a 4-CPU machine, the reported percentage will be
100%, with a maximum of 400% if it is using all the CPU time.

These changes can modify the resulting values of the docker CPU metrics.
@jsoriano force-pushed the docker-stats-equivalence branch from 3cae5b6 to 82e4585 on March 27, 2018 17:45
@jsoriano (Member Author):

The changes to the scaling_factor of percentages have been removed.

@ruflin ruflin merged commit fa39f47 into elastic:master Mar 28, 2018
@ruflin (Contributor) commented Mar 28, 2018

@jsoriano Could you update the PR description with details on what the logic was before and what it is now? That way, users coming here to figure out how their values changed can understand it without reading the code.

@jsoriano (Member Author):

PR description updated.

ruflin pushed a commit that referenced this pull request on Apr 13, 2018
This PR continues #6608 with the aim of providing docker metrics more consistent with `docker stats`. Currently we report the number of disk I/O operations per second.

This PR adds accumulated I/O stats, per number of operations and per data volume, since the container started, as `docker stats` does. I think this information is more valuable, as it permits further aggregations when visualizing it. It also addresses some other issues in the metricset.

Fields in events have been reorganized and old ones have been marked as deprecated.

We could also add some extra functionality, such as obtaining metrics on Windows hosts (as the module is now, it only works with docker daemons running on Linux, and I am not sure about Mac). On Linux we could also collect metrics per block device.

- [x] Add accumulated I/O stats to diskio metricset
- [x] Fix diskio metricset when docker reads from multiple devices
- [x] Some naming and documentation fixes
- [x] Fix memory leak in blkio service (last seen stats were being stored, but never removed)
- [x] More tests

To follow-up in other issues:

- ~~Group by device in Linux?~~ To be done if requested in the future
- Support for Windows storage stats? Follow-up issue #6815
Labels: enhancement, feedback needed, in progress, Metricbeat, Metricbeat module

5 participants