system/core: add CPU information for Linux hosts #31643

Merged
merged 13 commits into from
May 23, 2022

Conversation

belimawr
Contributor

@belimawr belimawr commented May 17, 2022

What does this PR do?

It adds the following information from /proc/cpuinfo to system.core metrics on Linux hosts:

  • model_number
  • model_name
  • mhz
  • core_id
  • physical_id

Below is an example of the information added to the events from a laptop CPU with 8 cores and 16 threads.

It's interesting to notice that our current system.core.id is something like a "virtual core ID" (that matches the processor from /proc/cpuinfo) and is distinct even across different CPU sockets. I'm adding a system.core.core_id that is the "physical core ID" for a given "physical CPU".

"@timestamp" "system.core.model_name" "system.core.model_num" "system.core.id" "system.core.core_id" "system.core.physical_id" "system.core.mhz"
"May 18, 2022 @ 15:15:52.326" "Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz" 165 0 0 0 "2,400"
"May 18, 2022 @ 15:15:52.326" "Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz" 165 1 1 0 "2,400"
"May 18, 2022 @ 15:15:52.326" "Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz" 165 2 2 0 "2,400"
"May 18, 2022 @ 15:15:52.326" "Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz" 165 3 3 0 "2,400"
"May 18, 2022 @ 15:15:52.326" "Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz" 165 4 4 0 "2,400"
"May 18, 2022 @ 15:15:52.326" "Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz" 165 5 5 0 "4,443.456"
"May 18, 2022 @ 15:15:52.326" "Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz" 165 6 6 0 "2,400"
"May 18, 2022 @ 15:15:52.326" "Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz" 165 7 7 0 "2,400"
"May 18, 2022 @ 15:15:52.326" "Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz" 165 8 0 0 "2,400"
"May 18, 2022 @ 15:15:52.326" "Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz" 165 9 1 0 "2,400"
"May 18, 2022 @ 15:15:52.326" "Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz" 165 10 2 0 "2,400"
"May 18, 2022 @ 15:15:52.326" "Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz" 165 11 3 0 "2,400"
"May 18, 2022 @ 15:15:52.326" "Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz" 165 12 4 0 "2,400"
"May 18, 2022 @ 15:15:52.326" "Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz" 165 13 5 0 "2,400"
"May 18, 2022 @ 15:15:52.326" "Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz" 165 14 6 0 "2,400"
"May 18, 2022 @ 15:15:52.326" "Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz" 165 15 7 0 "2,400"
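
To make the relationship between system.core.id and system.core.core_id in the table concrete, here is a minimal sketch that groups logical processors by their (physical_id, core_id) pair. The type and function names (`logicalCPU`, `coreKey`, `groupSiblings`) are hypothetical and only illustrate the idea; they are not taken from the PR code.

```go
package main

import (
	"fmt"
	"sort"
)

// logicalCPU mirrors three columns of the table above (hypothetical type).
type logicalCPU struct {
	ID         int // system.core.id, the "processor" entry in /proc/cpuinfo
	CoreID     int // system.core.core_id
	PhysicalID int // system.core.physical_id
}

// coreKey identifies one physical core on one socket.
type coreKey struct {
	PhysicalID int
	CoreID     int
}

// groupSiblings maps each physical core to the logical processors that share it.
func groupSiblings(cpus []logicalCPU) map[coreKey][]int {
	groups := make(map[coreKey][]int)
	for _, c := range cpus {
		k := coreKey{PhysicalID: c.PhysicalID, CoreID: c.CoreID}
		groups[k] = append(groups[k], c.ID)
	}
	for _, ids := range groups {
		sort.Ints(ids)
	}
	return groups
}

func main() {
	// Same layout as the table: ids 0-15, core_id repeating 0-7, one socket.
	cpus := make([]logicalCPU, 0, 16)
	for id := 0; id < 16; id++ {
		cpus = append(cpus, logicalCPU{ID: id, CoreID: id % 8, PhysicalID: 0})
	}
	groups := groupSiblings(cpus)
	fmt.Println("physical cores:", len(groups))
	fmt.Println("siblings on core 0:", groups[coreKey{PhysicalID: 0, CoreID: 0}])
}
```

With the data from the table, logical processors 0 and 8 end up in the same group, which is what you would expect from two hardware threads sharing one physical core.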

The Go port of Sigar we use (gosigar) does not support fetching the cpuinfo. After some research I decided to read /proc/cpuinfo directly. This works well for Linux on x86/x86_64.

Questions

A few questions I have:

  1. Should we add this information to system.cpu? There isn't a single value for the clock I can read. For now I'm averaging out the clock from all cores.
     • No. The values can differ considerably between CPUs/cores, so it does not make sense to aggregate them.
  2. If some processors have different core types (like the M1), having this info in system/cpu might be quite inaccurate. This might not be an immediate issue, but we should keep it in mind.
     • It won't be an issue, as we will not add this info to system.cpu.
  3. Anything against reading /proc/cpuinfo on Linux?
     • No.
  4. Do we need to consider non-x86 CPUs at the moment? Like ARM?
     • No, this PR focuses only on Linux x86/x86_64 CPUs.

Why is it important?

It gives more visibility into the CPUs in use; see the related issue for more details.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • Naming of the metrics, especially core_id
  • Linux

How to test this PR locally

Run Metricbeat with the system/core module enabled on one of the supported platforms and check for the new metrics.

Related issues


@botelastic botelastic bot added the needs_team label May 17, 2022
@elasticmachine
Collaborator

elasticmachine commented May 17, 2022

💚 Build Succeeded

Build stats

  • Start Time: 2022-05-23T15:06:49.615+0000

  • Duration: 57 min 35 sec

Test stats 🧪

Test Results
Failed 0
Passed 3588
Skipped 887
Total 4475

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@@ -91,7 +104,18 @@ func (m *Monitor) Fetch() (Metrics, error) {
oldLastSample := m.lastSample
m.lastSample = metric

return Metrics{previousSample: oldLastSample.totals, currentSample: metric.totals, count: len(metric.list), isTotals: true}, nil
// There isn't a 'total' for the CPU/Core frequency, so we average all the
Contributor

Why not report all of the values in separate metrics?

Contributor Author

system.cpu.* reports metrics aggregated across all cores, treating the CPU as a single unit. The fact that the OS already reports the metrics that way makes things much easier.

I just followed the same approach.

I think it would be odd to see metrics like:
system.cpu.core0.mhz
system.cpu.core1.mhz
system.cpu.core2.mhz
system.cpu.core3.mhz

We could also omit the clock frequency from system.cpu, and only report system.cpu.model_name and system.cpu.model_number if they're the same for all cores.

What do you think?

Contributor

You might want to consider looking at how the system/core metricset does things, that might be a better way to report a lot of this.

It's important to remember that CPU data that comes from sources like /proc/cpuinfo can be weirdly heterogeneous, particularly on multi-socket systems and VMs, I would strongly advise against trying to "sum" things.

MHz should probably be reported individually, even if we also wanted to have an averaging metric somewhere. Wildly different core speeds across a system could indicate inefficiencies, scheduling issues, etc. A precise reading from the system itself, as opposed to an average, is also useful for verifying that the CPU is turbo'ing as it should under load.
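
To illustrate the point about averaging, here is a small sketch that computes the mean next to the lowest and highest per-core clock. The input values are the mhz readings from the example table in the PR description; the `summarize` helper is a hypothetical name used only for this example.

```go
package main

import "fmt"

// summarize returns the mean, lowest and highest clock in mhz.
// It returns zeros for an empty slice.
func summarize(mhz []float64) (mean, lo, hi float64) {
	if len(mhz) == 0 {
		return 0, 0, 0
	}
	lo, hi = mhz[0], mhz[0]
	var sum float64
	for _, v := range mhz {
		sum += v
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	return sum / float64(len(mhz)), lo, hi
}

func main() {
	// Readings for system.core.id 0-15 from the example table.
	mhz := []float64{
		2400, 2400, 2400, 2400, 2400, 4443.456, 2400, 2400,
		2400, 2400, 2400, 2400, 2400, 2400, 2400, 2400,
	}
	mean, lo, hi := summarize(mhz)
	fmt.Printf("mean=%.3f min=%.3f max=%.3f\n", mean, lo, hi)
}
```

For this data the mean is roughly 2527.7 MHz, a value that no core is actually running at and that hides the one core boosting to 4443.456 MHz.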

Contributor Author

> You might want to consider looking at how the system/core metricset does things, that might be a better way to report a lot of this.

I was looking at it, and what it does is get the aggregated metrics provided by the OS (at least on Linux): /proc/stat has an aggregated line. It really makes things easy.

> It's important to remember that CPU data that comes from sources like /proc/cpuinfo can be weirdly heterogeneous, particularly on multi-socket systems and VMs, I would strongly advise against trying to "sum" things.

Indeed. I think I'll just keep them out of system/cpu. At least for now.
It feels pretty odd to introduce this 'core' concept into system/cpu.

metricbeat/internal/metrics/cpu/metrics.go (resolved)
@belimawr belimawr changed the title [WIP] system/cpu add cpuinfo [WIP] system/core: add cpuinfo May 18, 2022
@belimawr belimawr force-pushed the add-cpuinfo-metricset branch 3 times, most recently from cda193d to d8ec7cf Compare May 20, 2022 14:16
@@ -157,7 +156,6 @@ func (r *Reader) CgroupsVersion(pid int) (CgroupsVersion, error) {
// V1 and V2 controllers on a cgroup. If the V2 controller has no actual controllers associated with it,
// We revert to V1. If it does, report V2. In the future, we may want to "combine" V2 and V1 metrics somehow.
if len(controllers) > 0 {
fmt.Printf("fetching V2 controller: %#v for pid %d\n", controllers, pid)
Contributor Author

That looked like a print debug that was forgotten. Is it ok to remove it @fearful-symmetry ?

Contributor

Yah, just confused as to why the linter is touching this file to begin with. I don't see any changes besides this one?

Contributor Author

That's the only change on this file, but the linter runs on all files that had any changes.

Contributor

This file is going to be removed in #31615

Please move this fix to the elastic-agent-system-metrics repo.

@@ -280,7 +282,6 @@ def test_filesystem(self):
self.assertGreater(len(output), 0)

for evt in output:
print(evt)
Contributor Author

That also looked like a forgotten debug line, so I removed it.

@belimawr belimawr changed the title [WIP] system/core: add cpuinfo system/core: add CPU information for Linux hosts May 20, 2022
@belimawr belimawr added the review, Metricbeat, and Team:Elastic-Agent-Data-Plane labels May 20, 2022
@botelastic botelastic bot removed the needs_team label May 20, 2022
@belimawr belimawr marked this pull request as ready for review May 20, 2022 14:54
@belimawr belimawr requested a review from a team as a code owner May 20, 2022 14:54
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@belimawr
Contributor Author

@kvch @fearful-symmetry, it's ready for review. I reduced the scope to only include Linux so this PR can make it into 8.3.

I'm also ignoring the linter because it's mostly noise about pkg/errors in a file where I only removed what seemed to be a forgotten debug print.

Contributor

@fearful-symmetry fearful-symmetry left a comment

A few small nits. @kvch should we merge this here, and then later migrate all the code in internal/ over to the elastic-agent-system-metrics repo?

metricbeat/internal/metrics/cpu/metrics_procfs_common.go (outdated, resolved)
@kvch
Contributor

kvch commented May 23, 2022

Let's merge it here. We can move it around after FF.

Contributor

@kvch kvch left a comment

Please remove the fix from libbeat/metric/system/cgroup/reader.go and rather address it in the new repo elastic-agent-system-metrics.

@mergify
Contributor

mergify bot commented May 23, 2022

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b add-cpuinfo-metricset upstream/add-cpuinfo-metricset
git merge upstream/main
git push upstream add-cpuinfo-metricset

@belimawr
Contributor Author

fmt.Printf("fetching V2 controller: %#v for pid %d\n", controllers, pid)

Done on e329ee891f

@belimawr belimawr requested a review from kvch May 23, 2022 10:45
@belimawr belimawr merged commit 108be1d into elastic:main May 23, 2022
@belimawr belimawr deleted the add-cpuinfo-metricset branch May 23, 2022 16:13
kvch added a commit to kvch/elastic-agent-system-metrics that referenced this pull request Jun 9, 2022
kvch added a commit to kvch/elastic-agent-system-metrics that referenced this pull request Jun 9, 2022
kvch added a commit to elastic/elastic-agent-system-metrics that referenced this pull request Jun 9, 2022
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
This commit adds the following information from `/proc/cpuinfo`  to `system.core` metrics on Linux hosts:
- `model_number`
- `model_name`
- `mhz`
- `core_id`
- `physical_id`
Labels
Metricbeat, review, Team:Elastic-Agent-Data-Plane
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Metricbeat] Add support for cpuinfo metricset
4 participants