
Monitors CPU, RAM, and disk usage #773

Closed · wants to merge 35 commits into from

Conversation

@AlexandreKempf (Contributor) commented Feb 6, 2024

New feature: monitoring for CPU

Link to GPU monitoring.
Link to fix Studio

Monitoring CPU hardware

In this PR we add the possibility for the user to monitor CPU, RAM, and disk usage during an experiment.

To use this feature, pass a single argument:

from dvclive import Live

with Live(monitor_system=True) as live:
    ...

If you want to use advanced features, you can specify each parameter this way:

from dvclive import Live
from dvclive.monitor_system import CPUMonitor

with Live(monitor_system=True) as live:
    live.cpu_monitor = CPUMonitor(
        interval=0.1,
        num_samples=15,
        directories_to_monitor={"data": "/path/to/data/disk", "home": "/home"},
    )

And this is how the editor help should look:
(Screenshot from 2024-02-15 12-27-04)

If you enable system monitoring, it will track:

  • system/cpu/count -> number of CPU cores.
  • system/cpu/usage (%) -> average usage across CPU cores.
  • system/cpu/parallelization (%) -> percentage of CPU cores using more than 20% of their capacity. Useful when you're trying to parallelize your code to train your model or process your data faster.
  • system/ram/usage (%) -> percentage of RAM used. Useful when deciding whether to increase the batch size or the amount of data processed in RAM at the same time.
  • system/ram/usage (GB) -> RAM used, in GB. Useful for the same reason.
  • system/ram/total (GB) -> total RAM in your system.
  • system/disk/usage (%) -> percentage of disk used at a given path. By default it uses ".", meaning the directory where the Python process was launched, but you can specify the paths you want to monitor; for instance, the code example above monitors /path/to/data/disk and /home. Data and code often live on very different paths/volumes, so it is useful to be able to specify your own paths. Since several paths can be specified, the full metric name is system/disk/usage (%)/<user-defined name>: for instance, system/disk/usage (%)/data for /path/to/data/disk and system/disk/usage (%)/home for /home.
  • system/disk/usage (GB) -> amount of disk used at a given path, in GB.
  • system/disk/total (GB) -> total disk storage at a given path, in GB.

All the values that can change during an experiment can be saved as plots. Timestamps are automatically recorded with the metrics. Other metrics (that don't change) such as CPU count, RAM total and disk total are saved as metrics but cannot be saved as plots.

I decided to split the usage into % and GB. First, because it is more consistent with the other loggers out there. Second, both are extremely relevant depending on which cloud instance you run your experiment on. If you always run your experiments on the same hardware, the distinction is not that interesting.

Files generated

The system metrics are stored with the log_metric function. This means the .tsv files are stored in the dvclive/plots folder. A specific folder, system, contains all of these metrics to distinguish them from the user-defined metrics. The metrics are also saved in the dvclive/metrics.json file.
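
For illustration, the resulting layout might look roughly like this (a sketch based on the description above, not taken verbatim from the PR):

dvclive/
    metrics.json      <- latest values of all metrics, including constants such as system/cpu/count and system/ram/total (GB)
    plots/
        system/       <- one .tsv time series per changing metric (CPU, RAM, and disk usage)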

Plot display

Here is what the VS Code extension looks like:
(screenshot)

Here is what Studio looks like:
(screenshots)

Note that the Studio update is a little buggy, but it is fixed in the Studio fix PR linked above.


@dberenbaum (Collaborator)

@AlexandreKempf Sorry for the delay here. For any PR like this where you introduce a new feature, it would be great if you could include a demo of the feature in the PR. Some useful ways to demo it can be:

Once it's merged, you can reuse it by posting it to https://iterativeai.slack.com/archives/C064XEWG2BD

@dberenbaum (Collaborator)

Also note that Windows tests are failing.

@AlexandreKempf AlexandreKempf force-pushed the monitor-cpu-ressources branch 4 times, most recently from e6f9c0c to f81e913 Compare February 9, 2024 13:35
@AlexandreKempf AlexandreKempf force-pushed the monitor-cpu-ressources branch 5 times, most recently from 411472b to a7cb774 Compare February 9, 2024 14:42
@AlexandreKempf AlexandreKempf requested review from dberenbaum, skshetry and shcheklein and removed request for skshetry February 16, 2024 16:29
@dberenbaum (Collaborator) left a comment

Would be good to get another look from @shcheklein and/or @skshetry, but product-wise it LGTM. Thanks @AlexandreKempf!

    monitor_system=False,
) as live:
    live.cpu_monitor = CPUMonitor(interval=interval, num_samples=num_samples)
    time.sleep(interval * num_samples + interval)  # log metrics once
Member:

this is still unreliable, I would do a while loop with a small sleep interval that reads from disk or checks some Live var to detect an update

to test that it respects intervals, etc - mock sleep, run thread itself from the test and check the result
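
(For illustration only, a polling helper along the lines suggested above might look like the sketch below; the function and parameter names are made up for this example, not code from this PR.)

import os
import time


def wait_for_metrics(metrics_file, timeout=10.0, poll_interval=0.05):
    # Poll until the monitor has written metrics to disk, instead of
    # relying on a fixed sleep that can make the test flaky.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(metrics_file) and os.path.getsize(metrics_file) > 0:
            return
        time.sleep(poll_interval)
    raise TimeoutError(f"no metrics written to {metrics_file} within {timeout}s")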

Member:

just some ideas ^^ that might be a better way, but we can't go with unreliable / flaky tests

Contributor Author:

I tried to respect what you explained. If there is still some misunderstanding between us, could you please refer to a link or a concrete example to illustrate your point? This way, I might be able to understand better what you meant :)

) as live:
    monitor = CPUMonitor(disks_to_monitor={"main": "/", "home": "/"})
    monitor(live)
    metrics = monitor._get_metrics()
Member:

what is the purpose of this test? why isn't a single one that runs Live at the bottom enough?

Contributor Author:

I wanted to ensure that the mock objects covered all the fields used in the CPU monitoring. But after your comment, I realized this test was redundant since it only tests behavior already covered by the unit tests. I removed it.

    num_samples: int = 10,
    plot: bool = True,
):
    if not isinstance(interval, (int, float)):
Member:

hmm, I don't think we usually check input types like this

Member:

in this case the idea was probably to check that values are meaningful and raise a ValueError? or ignore CPU monitoring

Contributor Author:

Oh ok I see.
I changed it to keep monitoring even if the values are out of range, but I added a warning message.
I took the min and max values from here: https://github.com/wandb/wandb/blob/852781a852a5ae63ea009b40fe923e1a80603fd0/wandb/sdk/internal/system/assets/interfaces.py#L120
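
(A minimal sketch of the warn-and-clamp behavior described here; the bounds and the logger name are assumptions for illustration, not the exact values or code used in the PR.)

import logging

logger = logging.getLogger("dvclive")

# Illustrative bounds only; the actual min/max come from the wandb code linked above.
MIN_INTERVAL, MAX_INTERVAL = 0.1, 30.0


def check_interval(interval: float) -> float:
    # Keep monitoring running: warn and clamp out-of-range values instead of raising.
    if not MIN_INTERVAL <= interval <= MAX_INTERVAL:
        clamped = min(max(interval, MIN_INTERVAL), MAX_INTERVAL)
        logger.warning(
            "interval=%s is outside [%s, %s]; using %s instead",
            interval, MIN_INTERVAL, MAX_INTERVAL, clamped,
        )
        return clamped
    return interval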

interval (float): interval in seconds between two measurements.
    Defaults to 0.5.
num_samples (int): number of samples to average. Defaults to 10.
disks_to_monitor (Optional[Dict[str, str]]): paths to the disks or
Member:

I'm not sure I understand this: "paths to the disks or partitions to monitor disk usage statistics." What is the difference between a disk and a partition?

Contributor Author:

I tried to change the description for this arg. Let me know what you think. I mentioned disk because I'm afraid our users don't know what a partition is.

Changes involve:

  • renaming to folders_to_monitor. While not perfectly accurate (two folders on the same partition, even with different content, will report identical statistics with this function), it is the most understandable name I could find. I'm afraid partitions_to_monitor wouldn't speak to our users.
  • mentioning the perfectly accurate definition in the rest of the docstring so people looking for deeper information find what they need.

for disk_name, disk_path in folders_to_monitor.items():
    if disk_name != os.path.normpath(disk_name):
        raise ValueError(  # noqa: TRY003
            "Keys for `partitions_to_monitor` should be a valid name"
Member:

hmm, what is a valid name in this case?

Member:

I would still move the try/except to where we use it ... e.g. if someone drops the directory, it should not cause other metrics to disappear
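
(A hedged sketch of what that isolation could look like, assuming the monitor reads disk statistics with psutil.disk_usage; the metric names follow the scheme from the PR description, and the function name is made up.)

import logging

import psutil

logger = logging.getLogger("dvclive")


def disk_metrics(folders_to_monitor):
    # Collect disk usage per monitored directory; a failure on one path
    # (e.g. a directory that was removed) is logged and skipped so the
    # other metrics are still reported.
    metrics = {}
    for name, path in folders_to_monitor.items():
        try:
            usage = psutil.disk_usage(path)
        except OSError:
            logger.warning("failed to read disk usage for %r (%s)", name, path)
            continue
        metrics[f"system/disk/usage (%)/{name}"] = usage.percent
        metrics[f"system/disk/usage (GB)/{name}"] = usage.used / 2**30
        metrics[f"system/disk/total (GB)/{name}"] = usage.total / 2**30
    return metrics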

_, latest = parse_metrics(live)

schema = {}
for name in [
Member:

do we check for actual values in both tests?

Contributor Author:

I changed one test to check for values :)

@shcheklein (Member) left a comment

Good iteration! We are getting there :) Some final 🤞 checks and questions.

@AlexandreKempf AlexandreKempf changed the title Monitors cpu ressources Monitors CPU, RAM, and disk usage Feb 20, 2024
@AlexandreKempf (Contributor Author) commented Feb 20, 2024

@shcheklein @dberenbaum
Please also consider this PR, because it will probably be merged into this branch before we merge to main.

@AlexandreKempf AlexandreKempf mentioned this pull request Feb 20, 2024
@AlexandreKempf (Contributor Author)

@shcheklein @dberenbaum

The following PR concerning the GPU metrics changes more code than I initially planned. It is not a simple addition to this PR; it needs to change a lot of this PR's code as well. So, to stop wasting your time on this PR, we will move the discussion to the final version of the code.

The next PR is here.

I'll close this PR; let's continue the discussion on #785.

@AlexandreKempf AlexandreKempf deleted the monitor-cpu-ressources branch February 22, 2024 09:20