-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alternative cpu utilization reporting #1002
base: main
Are you sure you want to change the base?
Conversation
Old Energy EstimationEco-CI Output:
🌳 CO2 Data: |
Ok, I have it. it is the overhead of the X11 window compositor. If I turn that off the values strongly align. This is why the process was not spotted in the idle phase as existing overhead. It only activates once the windows are painted ... |
Reopening to finalize discussion. Also worth discussing: How to distribute power between containers with idea that some container "work" might be done outside of what the cpu utilization captures e.g. in X11 / Wayland |
Old Energy EstimationEco-CI Output:
🌳 CO2 Data: |
Old Energy EstimationEco-CI Output:
🌳 CO2 Data: |
* main: Adding measurement settings to the measurement view
Old Energy EstimationEco-CI Output:
🌳 CO2 Data: |
Eco-CI Output:
🌳 CO2 Data: |
* main: Tcp dump (#919) Hash must be decoded to understand spaces [skip ci] Allowing Deeplinks to specific phases [skip ci] (#1016) Bump python from 3.13.0-slim-bookworm to 3.13.1-slim-bookworm in /docker (#1015) Bump pydantic from 2.10.2 to 2.10.3 (#1010) Bump fastapi[standard] from 0.115.5 to 0.115.6 (#1011) Bump aiohttp from 3.11.9 to 3.11.10 (#1013) Bump redis from 5.2.0 to 5.2.1 (#1014) Bump python from 3.12.7-slim-bookworm to 3.13.0-slim-bookworm in /docker (#949) Bump hiredis from 3.0.0 to 3.1.0 (#1012) Bump pylint from 3.3.1 to 3.3.2 (#1008) Bump pytest from 8.3.3 to 8.3.4 (#1007) Bump aiohttp from 3.11.7 to 3.11.9 (#1009) Added kill script for GMT Adding cachetools as requirement EE Update
* main: Adding not implemented error Kill script for GMT must use full commandline when killing reporters +x for tcpdump
2 similar comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR Summary
This PR introduces system-level metric providers for CPU, memory, disk, and network monitoring using cgroups, with a focus on addressing CPU utilization reporting discrepancies between cgroup and procfs measurements.
- Identified ~6% discrepancy in CPU utilization between cgroup and procfs when running multiple containers, but <1% with single containers
- Added new cgroup path detection patterns in
detect_cgroup_path.c
for Window Managers and Session applications - Removed virtualization-related CPU times (steal_time, guest_time) from calculations in
source.c
files - Potential memory management issues in several providers'
source.c
files, including memory leaks and unsafe realloc usage - Incorrect super() calls in
provider.py
files could cause initialization problems in system metric providers
💡 (1/5) You can manually trigger the bot by mentioning @greptileai in a comment!
23 file(s) reviewed, 21 comment(s)
Edit PR Review Bot Settings | Greptile
@@ -179,7 +181,7 @@ def build_and_store_phase_stats(run_id, sci=None): | |||
if phase['name'] == '[RUNTIME]' and machine_carbon_in_ug is not None and sci is not None and sci.get('R', 0) != 0: | |||
csv_buffer.write(generate_csv_line(run_id, 'software_carbon_intensity_global', '[SYSTEM]', f"{idx:03}_{phase['name']}", (machine_carbon_in_ug + embodied_carbon_share_ug + network_io_carbon_in_ug) / sci['R'], 'TOTAL', None, None, f"ugCO2e/{sci['R_d']}")) | |||
|
|||
if machine_power_baseline and cpu_utilization_machine and cpu_utilization_containers: | |||
if machine_power_phase and machine_power_baseline and cpu_utilization_machine and cpu_utilization_containers: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: The condition should also check if machine_energy_phase is not None before calculations since it's used in line 186
if found_cgroups := len(lines) != 1: | ||
raise RuntimeError(f"Could not find GMT\'s own cgroup or found too many. Amount: {found_cgroups}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Logic error in condition. found_cgroups := len(lines) != 1
assigns a boolean, but error message refers to amount. Should be found_cgroups = len(lines)
followed by if found_cgroups != 1:
with open(f"/proc/{current_pid}/cgroup", 'r', encoding='utf-8') as file: | ||
lines = file.readlines() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: No error handling for file not found or permission denied when accessing /proc/{pid}/cgroup
raise RuntimeError(f"Could not find GMT\'s own cgroup or found too many. Amount: {found_cgroups}") | ||
return lines[0].split('/')[-1].strip() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Assumes cgroup path format is consistent. Should handle different cgroup hierarchy formats and empty lines
@@ -0,0 +1,4 @@ | |||
CFLAGS = -O3 -Wall -I../../../../../lib/c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: consider adding -Werror to catch potential issues at compile time
|
||
class MemoryUsedCgroupSystemProvider(MemoryUsedCgroupContainerProvider): | ||
def __init__(self, resolution, skip_check=False, cgroups: dict = None): | ||
super(MemoryUsedCgroupContainerProvider, self).__init__( # this will call BaseMetricProvider |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: super() call uses wrong class name - should be super().init() or super(MemoryUsedCgroupSystemProvider, self).init()
@@ -0,0 +1,6 @@ | |||
CFLAGS = -O3 -Wall -lc -I../../../../../lib/c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: -lc flag is unnecessary since libc is linked by default in gcc
sudo chown root $@ | ||
sudo chmod u+s $@ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: sudo commands in Makefile may fail in some CI environments - consider making these optional or documenting requirements
@@ -0,0 +1,3 @@ | |||
# Documentation | |||
|
|||
Please see https://docs.green-coding.io/docs/measuring/metric-providers/network-io-cgroup-container/ for details |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: URL points to network-io-cgroup-container instead of network-io-cgroup-system documentation
def start_profiling(self, containers=None): | ||
super().start_profiling(self._cgroups) # we hook here into the mechanism that can supply container names to the parent function | ||
|
||
def read_metrics(self, run_id, containers=None): | ||
return super().read_metrics(run_id, self._cgroups) # this will call NetworkIoCgroupContainerProvider |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: check if containers parameter is actually needed since it's always overridden with self._cgroups
@ribalba Can you please review. There is still the same discrepancy between the sum of cgroup reported values and the procfs one.
Example: https://metrics.green-coding.io/stats.html?id=c87ef224-9137-46d4-b5d2-60fca94dfc39
I also tried on another linux framebook and also tried with only one container.
If one container is on the machine the discrepancy is < 1% (absolute). But with multiple containers it stays at around 6%