-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
system/core: add CPU information for Linux hosts #31643
Conversation
@@ -91,7 +104,18 @@ func (m *Monitor) Fetch() (Metrics, error) { | |||
oldLastSample := m.lastSample | |||
m.lastSample = metric | |||
|
|||
return Metrics{previousSample: oldLastSample.totals, currentSample: metric.totals, count: len(metric.list), isTotals: true}, nil | |||
// There isn't a 'total' for the CPU/Core frequency, so we average all the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not report all of the values in separate metrics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
system.cpu.*
report the aggregated metrics to all cores, handling the CPU as a single unit. The fact that the OS already reports the metrics like that makes things way easier.
I just followed the same approach.
I think it would be odd to see metrics like:
system.cpu.core0.mhz
system.cpu.core1.mhz
system.cpu.core2.mhz
system.cpu.core3.mhz
We could also omit the clock frequency from system.cpu
, and only report system.cpu.model_name
and system.cpu.model_number
if they're the same for all cores.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to consider looking at how the system/core
metricset does things, that might be a better way to report a lot of this.
It's important to remember that CPU data that comes from sources like /proc/cpuinfo
can be weirdly heterogeneous, particularly on multi-socket systems and VMs, I would strongly advise against trying to "sum" things.
Mhz should probably be reported individually, even if we also wanted to have an averaging metric somewhere. Wildly different core speeds across a system could indicate inefficiencies, scheduling issues, etc. A Precise count from the system itself, as opposed to an average, is also useful for verification that the CPU is turbo'ing as it should under load.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to consider looking at how the system/core metricset does things, that might be a better way to report a lot of this.
I was looking at it, and what it does is to get the aggregated metrics provided by the OS (at least on Linux), /proc/stat
has got an aggregated line there. It really makes things easy.
It's important to remember that CPU data that comes from sources like /proc/cpuinfo can be weirdly heterogeneous, particularly on multi-socket systems and VMs, I would strongly advise against trying to "sum" things.
Indeed. I think I'll just keep them out of system/cpu
. At least for now.
It feels pretty odd to introduce this 'core' concept into system/cpu
.
643876b
to
3432e92
Compare
@@ -91,7 +104,18 @@ func (m *Monitor) Fetch() (Metrics, error) { | |||
oldLastSample := m.lastSample | |||
m.lastSample = metric | |||
|
|||
return Metrics{previousSample: oldLastSample.totals, currentSample: metric.totals, count: len(metric.list), isTotals: true}, nil | |||
// There isn't a 'total' for the CPU/Core frequency, so we average all the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to consider looking at how the system/core
metricset does things, that might be a better way to report a lot of this.
It's important to remember that CPU data that comes from sources like /proc/cpuinfo
can be weirdly heterogeneous, particularly on multi-socket systems and VMs, I would strongly advise against trying to "sum" things.
Mhz should probably be reported individually, even if we also wanted to have an averaging metric somewhere. Wildly different core speeds across a system could indicate inefficiencies, scheduling issues, etc. A Precise count from the system itself, as opposed to an average, is also useful for verification that the CPU is turbo'ing as it should under load.
cda193d
to
d8ec7cf
Compare
@@ -157,7 +156,6 @@ func (r *Reader) CgroupsVersion(pid int) (CgroupsVersion, error) { | |||
// V1 and V2 controllers on a cgroup. If the V2 controller has no actual controllers associated with it, | |||
// We revert to V1. If it does, report V2. In the future, we may want to "combine" V2 and V1 metrics somehow. | |||
if len(controllers) > 0 { | |||
fmt.Printf("fetching V2 controller: %#v for pid %d\n", controllers, pid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That looked like a print debug that was forgotten. Is it ok to remove it @fearful-symmetry ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yah, just confused as to why the linter is touching this file to begin with. I don't see any changes besides this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the only change on this file, but the linter runs on all files that had any changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is going to be removed in #31615
Please move this fix to the elastic-agent-system-metrics
repo.
@@ -280,7 +282,6 @@ def test_filesystem(self): | |||
self.assertGreater(len(output), 0) | |||
|
|||
for evt in output: | |||
print(evt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That also looked like a forgotten debug line, so I removed it.
7608419
to
5a0a0ca
Compare
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane) |
@kvch @fearful-symmetry, it's ready for review. I reduced the scope to only include Linux so this PR can make it into 8.3. I'm also ignoring the linter because it's mostly noisy regarding |
5a0a0ca
to
b7a1706
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few small nits. @kvch should we merge this here, and then later migrate all the code in internal/
over to the elastic-agent-system-metrics
repo?
@@ -157,7 +156,6 @@ func (r *Reader) CgroupsVersion(pid int) (CgroupsVersion, error) { | |||
// V1 and V2 controllers on a cgroup. If the V2 controller has no actual controllers associated with it, | |||
// We revert to V1. If it does, report V2. In the future, we may want to "combine" V2 and V1 metrics somehow. | |||
if len(controllers) > 0 { | |||
fmt.Printf("fetching V2 controller: %#v for pid %d\n", controllers, pid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yah, just confused as to why the linter is touching this file to begin with. I don't see any changes besides this one?
Let's merge it here. We can move it around after FF. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove the fix from libbeat/metric/system/cgroup/reader.go
and rather address it in the new repo elastic-agent-system-metrics
.
ebfe830
to
f9f5ab7
Compare
This pull request is now in conflicts. Could you fix it? 🙏
|
Done on e329ee891f |
Some log debugs were removed
- fix error messge - better variable naming
e329ee8
to
38ec0f5
Compare
This commit adds the following information from `/proc/cpuinfo` to `system.core` metrics on Linux hosts: - `model_number` - `model_name` - `mhz` - `core_id` - `physical_id`
What does this PR do?
It adds the following information from
/proc/cpuinfo
tosystem.core
metrics on Linux hosts:model_number
model_name
mhz
core_id
pysical_id
Below is an example of the information added to the events from a laptop CPU with 8 cores and 16 threads.
It's interesting to notice that our current
system.core.id
is something like a "virtual core ID" (that matches theprocessor
from/proc/cpuinfo
) and is distinct even across different CPU sockets. I'm adding asystem.core.core_id
that is the "physical core ID" for a given "physical CPU".The Go port of Sigar we use (gosigar) we use does not support fetching the
cpuinfo
. After some research I decided to read directly/proc/cpuinfo
. This works well for Linux on x86/x86_64Questions
A few questions I have:
system.cpu
? There isn't a single value for the clock I can read. For now I'm averaging out the clock from all coressystem/cpu
might be quite inaccurate. This might not be an immediate issue, but we should keep that in mind.system.cpu
/proc/cpuinfo
on Linux?Why is it important?
It enables more visibility on the CPUs used, see the related issue for more details.
Checklist
CHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.Author's Checklist
core_id
How to test this PR locally
Run Metricbeat with
system/core
module enabled in one of the supported platforms, check for the metrics.Related issues
## Use cases## Screenshots## Logs