Metricbeat: Use system-wide ticks to calculate docker cpu usage #6608

Merged (1 commit) on Mar 28, 2018

Conversation

@jsoriano (Member) commented Mar 20, 2018

This implementation aims for better coherence with docker stats.

It is based on the number of CPUs and the ratio between the number of CPU ticks used by the container and the number of ticks used by the whole system during a fixed period of time, as docker stats, top and similar tools do. This means CPU percentage metrics can be greater than 100% in some cases (e.g. 200% for a container running during 50% of the ticks on a 4-CPU system).
Before, it was based only on the ratio between container and system ticks.

Some documentation texts have been rewritten, and the scaling factor of percentages has been increased, as the CPU percentage can exceed 100, up to 100 times the number of cores (as in top, docker stats and similar tools).

I think we might deprecate the "system" CPU metric, or move it elsewhere. It is used to calculate the percentages of container metrics, but I don't think it is very useful beyond that. Also, with the current implementation its percentage is always constant, as it is calculated as a ratio with itself.
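The calculation described above can be sketched as follows. This is a minimal standalone sketch of the formula, not the actual Beats code; `cpuPercent` is a hypothetical helper name:

```go
package main

import "fmt"

// cpuPercent computes CPU usage the way docker stats does: the ratio of
// container ticks to system-wide ticks during the interval, scaled by the
// number of CPUs. The result can exceed 100% on multi-CPU hosts, up to
// 100 * numCPUs when the container uses all CPU time.
func cpuPercent(containerDelta, systemDelta uint64, numCPUs int) float64 {
	if systemDelta == 0 {
		return 0
	}
	return float64(containerDelta) / float64(systemDelta) * float64(numCPUs) * 100
}

func main() {
	// A container consuming 50% of all system ticks on a 4-CPU machine
	// is reported as 200%, matching the example in the description.
	fmt.Println(cpuPercent(500, 1000, 4))
}
```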

```go
	return u.Stats.CPUStats.SystemUsage - u.Stats.PreCPUStats.SystemUsage
}

func (u *cpuUsage) perCpu() common.MapStr {
```


`func perCpu` should be `perCPU` (Go convention for initialisms).

@jsoriano force-pushed the docker-stats-equivalence branch from 31027c1 to 9b3a805 on March 20, 2018 18:05
@jsoriano added the "in progress" label on Mar 20, 2018
@jsoriano assigned ruflin and exekias and unassigned ruflin and exekias on Mar 20, 2018
@jsoriano requested review from ruflin and exekias on March 20, 2018 18:44
@exekias (Contributor) commented Mar 21, 2018

The reasoning looks good to me; have a look at the tests, something is wrong there:

```
# github.com/elastic/beats/metricbeat/module/docker/cpu
module/docker/cpu/cpu_test.go:46:10: undefined: perCpuUsage
module/docker/cpu/cpu_test.go:73:10: undefined: totalUsage
module/docker/cpu/cpu_test.go:96:10: undefined: usageInKernelmode
module/docker/cpu/cpu_test.go:119:10: undefined: usageInUsermode
```

@jsoriano force-pushed the docker-stats-equivalence branch from 9b3a805 to 8a24b98 on March 21, 2018 11:59
```
@@ -6,31 +6,37 @@
  fields:
    - name: kernel.pct
      type: scaled_float
      scaling_factor: 1000000
```
Contributor:

I remember that a scaling factor > 1000 no longer has any compression advantage over just using float instead of scaled_float. So I suggest we just change to float here, and below as well.

Member Author:

Umm, if I understood correctly, scaled_float is stored as an integer; wouldn't changing it to float be a breaking change? (OTOH, changing the scaling_factor could also be a breaking change.)

Contributor:

As each version has a new template, things should not break here. From a query perspective it was already treated like a float before, so having a float afterwards should just work.

If someone ingests with the old template, they would just have lower accuracy but not get an error.

@jpountz Is the above correct? Are the ingestion and querying of float and scaled_float identical?

Contributor:

Assuming all values are between 0 and 1, this field will likely require only 20 bits per value, which is still better than a float. However, I'm very surprised you need so much precision for a percentage; why are 100 or 1000 not enough?

Member Author:

The issue is not precision, but that this is a percentage that can be over 100%. It is the CPU usage as seen in top or docker stats, and it can be up to 100 times the number of CPUs.
Maybe the best option is to use float directly.

Member Author:

@jpountz wdyt? Would it be ok to change to float for this?

Contributor:

The fact that it may go over 1 is not an issue; I still think scaled_float is a better fit. However, let's use a more reasonable scaling factor like 100?

Contributor:

Oh, now I wonder whether you might have interpreted the scaling factor as an upper bound on the values, which it is not. It just means that Elasticsearch will internally store x * scaling_factor as a long.
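The storage rule described here can be illustrated with a small sketch. This is a hypothetical helper mimicking the behavior, not Elasticsearch code:

```go
package main

import (
	"fmt"
	"math"
)

// scaledFloatStored mimics how Elasticsearch stores a scaled_float: the
// value is multiplied by the scaling factor and rounded to a long.
// The factor bounds the precision (1/scalingFactor), not the value itself.
func scaledFloatStored(value, scalingFactor float64) int64 {
	return int64(math.Round(value * scalingFactor))
}

func main() {
	// A 200% CPU reading with scaling_factor 1000 is stored as the long 200000;
	// values above the factor are fine, they are just stored as larger longs.
	fmt.Println(scaledFloatStored(200.0, 1000))
	// Digits beyond the factor's precision are lost on rounding.
	fmt.Println(scaledFloatStored(3.14159, 100)) // → 314
}
```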

Member Author:

Oh, ok, understood now :) Then 100/1000 will probably be enough, yes. Thanks @jpountz!!

I'll leave it empty as it was before; the default seems to be 1000.

@ruflin (Contributor) commented Mar 21, 2018

This also cleans up the implementation a bit. A CHANGELOG entry should be added for this change.

Perhaps you could add a few more details to the PR message comparing the previous values to the current ones, for users coming to this PR to understand the change?

@jsoriano force-pushed the docker-stats-equivalence branch from 8a24b98 to a256cad on March 22, 2018 08:59
@jsoriano (Member Author) commented Mar 22, 2018

Added a line to the changelog and more information to the commit message.

@jsoriano force-pushed the docker-stats-equivalence branch from a256cad to 807a57c on March 22, 2018 09:04
@ruflin (Contributor) commented Mar 27, 2018

@jsoriano Needs another rebase.

@jsoriano force-pushed the docker-stats-equivalence branch from 807a57c to 3cae5b6 on March 27, 2018 08:23
@jsoriano (Member Author):

Rebased, but let's wait to make a decision on the type of the percentage fields before merging.

This implementation is more coherent with the `docker stats`
implementation. It is based on the ratio between the number of CPU
ticks of the container and the number of ticks of the whole system
during a fixed period of time. The previous implementation assumed
that the whole system had a CPU time of one second during each
measurement, which is not always accurate.

The new implementation also takes into account the number of CPUs, as
other tools like `top` or `docker stats` do, so if a container is using
25% of the CPU time of a 4-CPU machine, the reported percentage will be
100%, with a maximum of 400% if it is using all the CPU time.

These changes can modify the resulting values of the docker CPU metrics.
@jsoriano force-pushed the docker-stats-equivalence branch from 3cae5b6 to 82e4585 on March 27, 2018 17:45
@jsoriano (Member Author):

The changes to the scaling_factor of percentages have been removed.

@ruflin ruflin merged commit fa39f47 into elastic:master Mar 28, 2018
@ruflin (Contributor) commented Mar 28, 2018

@jsoriano Could you update the PR description with details on what the logic was before and what it is now? That way, users coming here to figure out how their values changed can understand it without reading the code.

@jsoriano (Member Author):

PR description updated.

ruflin pushed a commit that referenced this pull request on Apr 13, 2018
This PR continues #6608 with the aim of providing docker metrics more consistent with `docker stats`. Currently we report the number of disk I/O operations per second.

This PR adds accumulated I/O stats, per number of operations and per data volume, since the container started, as `docker stats` does. I think this information is more valuable, as it permits further aggregations when visualizing it. It also addresses some other issues in the metricset.

Fields in events have been reorganized and old ones have been marked as deprecated.

We could also add some extra functionality, such as obtaining metrics on Windows hosts (as the module is now, it only works with docker daemons running on Linux, and I am not sure about Mac). On Linux we could also collect metrics per block device.

- [x] Add accumulated I/O stats to diskio metricset
- [x] Fix diskio metricset when docker reads from multiple devices
- [x] Some naming and documentation fixes
- [x] Fix memory leak in blkio service (last seen stats were being stored, but never removed)
- [x] More tests

To follow-up in other issues:

- ~~Group by device in Linux?~~ To be done if requested in the future
- Support for Windows storage stats? Follow-up issue #6815
Labels: enhancement, feedback needed, in progress, Metricbeat, Metricbeat module

5 participants