Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cpu: Add a 2nd label 'package' to metric node_cpu_core_throttles_total #871

Merged

Conversation

knweiss
Copy link
Contributor

@knweiss knweiss commented Mar 29, 2018

This fixes the node_cpu_core_throttles_total metrics on multi-socket systems as explained in detail in my comment to PR #836. As the core_ids are the same for each node (aka processor or package) we need to count them seperately. Otherwise we only count the last processor's cpu core throttles.

Refactor the core_throttles_count handling and move it from the cpu loop
to the NUMA node loop.

Reorganize the sys.ttar files and use the same symlinks as the Linux
kernel.

@knweiss knweiss force-pushed the node_cpu_core_throttles_total branch from 1f8b425 to f02f372 Compare March 29, 2018 10:42
@SuperQ
Copy link
Member

SuperQ commented Mar 29, 2018

Thanks! Can you add a fixture example for a second node?

@grobie
Copy link
Member

grobie commented Mar 29, 2018

Could we consider not using the term node for that label? It already has a prominent meaning in the node exporter.

@@ -74,7 +74,7 @@ func NewCPUCollector() (Collector, error) {
cpuCoreThrottle: prometheus.NewDesc(
prometheus.BuildFQName(namespace, cpuCollectorSubsystem, "core_throttles_total"),
"Number of times this cpu core has been throttled.",
[]string{"core"}, nil,
[]string{"node", "core"}, nil,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we call this package instead of node.

@discordianfish
Copy link
Member

Besie @SuperQ's comment, LGTM!

@knweiss
Copy link
Contributor Author

knweiss commented Apr 4, 2018

Sorry, for the late reply. I was thinking about the package vs node discussion and wanted to try a more complicated system first.

So I've tested a dual-socket AMD EPYC system (multi-chip module cpus) and I came to the conclusion that my commit is wrong. It only works on Intel x86_64 because (at the moment) the NUMA node number and the physical_package_id are identical. On more complicated systems like AMD EPYC that's not the case.

I.e. we can't iterate over the NUMA node number and then their local CPU / core_ids if we want this to work in the general case, too. (Important: AMD EPYC does not have the core_throttle_count file yet. So, right now, this is more a theoretical problem.)

I've created a gist to show the details (first Intel Haswell, then AMD EPYC). Please take a look at the EPYC output to see the difference.

Instead we should iterated the physical_package_ids and their local core_ids. And, yes, the label should be called package and not node. However, it's not a simple label rename but a different meaning altogether.

@SuperQ
Copy link
Member

SuperQ commented Apr 4, 2018

@knweiss Thanks for all the research to figure this out.

@SuperQ
Copy link
Member

SuperQ commented Apr 4, 2018

One additional request, I'd like to start including CHANGELOG entries in PRs. Would you please add a [BUGFIX] entry and mention it under **Breaking changes**.

@knweiss knweiss force-pushed the node_cpu_core_throttles_total branch from 7296f5a to 5d46d30 Compare April 5, 2018 15:30
@knweiss knweiss force-pushed the node_cpu_core_throttles_total branch from 5d46d30 to 77e1560 Compare April 5, 2018 15:31
@knweiss
Copy link
Contributor Author

knweiss commented Apr 5, 2018

I've tried to simplify and fix the code. This a first version (without much testing!).

@SuperQ I've added CHANGELOG entries.
@SuperQ I've also added a 2nd (actually a third) node. See the commit message for details.
@grobie The label of the metrics node_cpu_core_throttles_total and node_cpu_package_throttles_total is now called package instead of node.

@knweiss knweiss changed the title cpu: Add a 2nd label 'node' to metric node_cpu_core_throttles_total cpu: Add a 2nd label 'package' to metric node_cpu_core_throttles_total Apr 5, 2018
CHANGELOG.md Outdated
@@ -2,10 +2,10 @@

**Breaking changes**

* [CHANGE]
* [CHANGE] Rename `node_package_throttle` label `node` to `package`. #871
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be just listed under the breaking changes section, we typically list one item per PR.

@knweiss knweiss force-pushed the node_cpu_core_throttles_total branch from 77e1560 to 8394d3c Compare April 6, 2018 07:44
@mjtrangoni
Copy link
Contributor

👍 LGTM

This commit fixes the node_cpu_core_throttles_total metrics on
multi-socket systems as the core_ids are the same for each package.
I.e. we need to count them seperately.

Rename the node_package_throttles_total metric label `node` to `package`.

Reorganize the sys.ttar archive and use the same symlinks as the Linux
kernel. Also, the new fixtures now use a dual-socket dual-core cpu w/o
HT/SMT (node0: cpu0+1, node1: cpu2+3) as well as processor-less
(memory-only) NUMA node 'node2' (this is a very rare case).

Signed-off-by: Karsten Weiss <knweiss@gmail.com>
@knweiss knweiss force-pushed the node_cpu_core_throttles_total branch from 8394d3c to 4a07295 Compare April 6, 2018 11:01
knweiss added 2 commits April 6, 2018 14:16
Use the direct path /sys/devices/system/cpu/cpu[0-9]* (without symlinks)
instead of /sys/bus/cpu/devices/cpu[0-9]*.

The latter path also does not exist e.g. on RHEL 6.9's kernel.

Signed-off-by: Karsten Weiss <knweiss@gmail.com>
Signed-off-by: Karsten Weiss <knweiss@gmail.com>
@knweiss knweiss force-pushed the node_cpu_core_throttles_total branch from 4a07295 to 709bce6 Compare April 6, 2018 12:32
@knweiss
Copy link
Contributor Author

knweiss commented Apr 6, 2018

In the meantime I did some more testing on CentOS 6.9 and 7.4 machines with no, partial, and full core+package throttle information in /sys. Both Intel and AMD. So far it works fine and I even caught a couple of machines with a throttle count != 0 (both core and package). YMMV.

Signed-off-by: Karsten Weiss <knweiss@gmail.com>
@SuperQ
Copy link
Member

SuperQ commented Apr 6, 2018

Looking good. One note, we intentionally had cpu2 missing the cpufreq fixtures in order to test missing file handling. I guess we may want to change that whole setup to be more of a unit test, rather than fake unit testing with the e2e output.

@knweiss
Copy link
Contributor Author

knweiss commented Apr 6, 2018

@SuperQ It would be very useful to have fixtures from all the various cpu types, archs, and kernel versions and an e2e test that tries them all one after the other. This would allow making changes with more confidence...

On the other hand it would, of course, require more effort to keep all the e2e reference files up-to-date.

@SuperQ
Copy link
Member

SuperQ commented Apr 6, 2018

Yea, I think this is the case where we should move all of this code over to the procfs library and have a series of unit tests for various sample fixtures. This would be much more maintainable than doing it here in the node_exporter.

@mjtrangoni
Copy link
Contributor

@SuperQ I agree with you. I will be making a PR exporting /sys/devices/system/cpu/*. See #84.
IMHO you could merge this PR, and then in a separate PR use the procfs implementation.

@discordianfish
Copy link
Member

Let's merge it, we can use the updated procfs later.

Copy link
Member

@SuperQ SuperQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@SuperQ SuperQ requested a review from grobie April 9, 2018 11:54
@discordianfish discordianfish merged commit efc1fdb into prometheus:master Apr 9, 2018
@knweiss knweiss deleted the node_cpu_core_throttles_total branch April 11, 2018 08:00
oblitorum pushed a commit to shatteredsilicon/node_exporter that referenced this pull request Apr 9, 2024
prometheus#871)

* cpu: Add a 2nd label 'package' to metric node_cpu_core_throttles_total

This commit fixes the node_cpu_core_throttles_total metrics on
multi-socket systems as the core_ids are the same for each package.
I.e. we need to count them seperately.

Rename the node_package_throttles_total metric label `node` to `package`.

Reorganize the sys.ttar archive and use the same symlinks as the Linux
kernel. Also, the new fixtures now use a dual-socket dual-core cpu w/o
HT/SMT (node0: cpu0+1, node1: cpu2+3) as well as processor-less
(memory-only) NUMA node 'node2' (this is a very rare case).

Signed-off-by: Karsten Weiss <knweiss@gmail.com>

* cpu: Use the direct /sys path to the cpu files.

Use the direct path /sys/devices/system/cpu/cpu[0-9]* (without symlinks)
instead of /sys/bus/cpu/devices/cpu[0-9]*.

The latter path also does not exist e.g. on RHEL 6.9's kernel.

Signed-off-by: Karsten Weiss <knweiss@gmail.com>

* cpu: Reverse core+package throttle processing order

Signed-off-by: Karsten Weiss <knweiss@gmail.com>

* cpu: Add documentation URLs

Signed-off-by: Karsten Weiss <knweiss@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants