
Monitors CPU, RAM, and disk usage #773

Closed · wants to merge 35 commits into from

Conversation

@AlexandreKempf (Contributor) commented Feb 6, 2024

New feature: monitoring for CPU

Link to GPU monitoring.
Link to fix Studio

Monitoring CPU hardware

In this PR we add the possibility for the user to monitor CPU, RAM, and disk usage during an experiment.

To use this feature, pass a single argument:

from dvclive import Live

with Live(monitor_system=True) as live:
    ...

If you want to use advanced features, you can specify each parameter this way:

from dvclive import Live
from dvclive.monitor_system import CPUMonitor

with Live(monitor_system=True) as live:
    live.cpu_monitor = CPUMonitor(
        interval=0.1,
        num_samples=15,
        directories_to_monitor={"data": "/path/to/data/disk", "home": "/home"},
    )

And this is how the editor help should look:
(Screenshot from 2024-02-15 12-27-04)

If you enable system monitoring, it will track:

  • system/cpu/count -> number of CPU cores.
  • system/cpu/usage (%) -> average usage across CPU cores.
  • system/cpu/parallelization (%) -> percentage of CPU cores using more than 20% of their capacity. Useful when you're trying to parallelize your code to train your model or process your data faster.
  • system/ram/usage (%) -> percentage of RAM used. Useful when deciding whether to increase the batch size or the amount of data processed in RAM at the same time.
  • system/ram/usage (GB) -> RAM used, in GB. Useful for the same reason.
  • system/ram/total (GB) -> total RAM in your system.
  • system/disk/usage (%) -> percentage of disk used at a given path. By default it uses ".", meaning the directory where the Python process was launched, but you can specify the paths you want to monitor; for instance, the code example above monitors /path/to/data/disk and /home. Data and code often live on very different paths/volumes, so it is useful to be able to specify your own paths. Since several paths can be specified, the full metric name is system/disk/usage (%)/<user-defined name>: for instance, system/disk/usage (%)/data for /path/to/data/disk and system/disk/usage (%)/home for /home.
  • system/disk/usage (GB) -> amount of disk used at a given path, in GB.
  • system/disk/total (GB) -> total disk storage at a given path, in GB.

All the values that can change during an experiment can be saved as plots. Timestamps are automatically recorded with the metrics. Other metrics (that don't change) such as CPU count, RAM total and disk total are saved as metrics but cannot be saved as plots.

I decided to split the usage into % and GB. First, because it is more consistent with the other loggers out there. Second, both are extremely relevant depending on which cloud instance you run your experiment on. If you always run your experiments on the same hardware, the distinction is not that interesting.

Files generated

The system metrics are stored with the log_metric function. This means the .tsv files are stored in the dvclive/plots folder. A specific folder, system, contains all of these metrics to distinguish them from the user-defined metrics. The metrics are also saved in the dvclive/metrics.json file.
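
For illustration, the resulting layout might look roughly like this (a sketch based on the description above, not taken verbatim from the PR):

dvclive/
    metrics.json      <- latest values of all metrics, including constants such as system/cpu/count and system/ram/total (GB)
    plots/
        system/       <- one .tsv time series per changing metric (CPU, RAM, and disk usage)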

Plot display

Here is what the VS Code extension looks like:
(screenshot)

Here is what Studio looks like:
(screenshots)

Note that the Studio update is a little buggy, but it is fixed in the Studio fix PR linked above.


@dberenbaum (Collaborator)

@AlexandreKempf Sorry for the delay here. For any PR like this where you introduce a new feature, it would be great if you could include a demo of the feature in the PR. Some useful ways to demo it can be:

Once it's merged, you can reuse it by posting it to https://iterativeai.slack.com/archives/C064XEWG2BD

@dberenbaum (Collaborator)

Also note that Windows tests are failing.

@AlexandreKempf AlexandreKempf force-pushed the monitor-cpu-ressources branch 4 times, most recently from e6f9c0c to f81e913 Compare February 9, 2024 13:35
@AlexandreKempf AlexandreKempf force-pushed the monitor-cpu-ressources branch 5 times, most recently from 411472b to a7cb774 Compare February 9, 2024 14:42
@AlexandreKempf AlexandreKempf requested review from dberenbaum, skshetry and shcheklein and removed request for skshetry February 16, 2024 16:29
@dberenbaum (Collaborator) left a comment

Would be good to get another look from @shcheklein and/or @skshetry, but product-wise it LGTM. Thanks @AlexandreKempf!

    monitor_system=False,
) as live:
    live.cpu_monitor = CPUMonitor(interval=interval, num_samples=num_samples)
    time.sleep(interval * num_samples + interval)  # log metrics once
Member:

this is still unreliable, I would do a while loop with a small sleep interval that reads from disk or checks some Live var to detect an update

to test that it respects intervals, etc - mock sleep, run thread itself from the test and check the result
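
(For illustration only, a polling helper along the lines suggested above might look like the sketch below; the function and parameter names are made up for this example, not code from this PR.)

import os
import time


def wait_for_metrics(metrics_file, timeout=10.0, poll_interval=0.05):
    # Poll until the monitor has written metrics to disk, instead of
    # relying on a fixed sleep that can make the test flaky.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(metrics_file) and os.path.getsize(metrics_file) > 0:
            return
        time.sleep(poll_interval)
    raise TimeoutError(f"no metrics written to {metrics_file} within {timeout}s")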

Member:

just some ideas ^^ that might be a better way, but we can't go with unreliable / flaky tests

Contributor Author:

I tried to respect what you explained. If there is still some misunderstanding between us, could you please refer to a link or a concrete example to illustrate your point? This way, I might be able to understand better what you meant :)

) as live:
    monitor = CPUMonitor(disks_to_monitor={"main": "/", "home": "/"})
    monitor(live)
    metrics = monitor._get_metrics()
Member:

what is the purpose of this test? why isn't a single one that runs Live at the bottom enough?

Contributor Author:

I wanted to ensure that the mock objects covered all the fields used in the CPU monitoring. But after your comment, I realized this test was redundant since it only tests behavior already covered by the unit tests. I removed it.

    num_samples: int = 10,
    plot: bool = True,
):
    if not isinstance(interval, (int, float)):
Member:

hmm, I don't think we usually check input types like this

Member:

in this case the idea was probably to check that values are meaningful and raise a ValueError? or ignore CPU monitoring

Contributor Author:

Oh ok I see.
I changed it to keep monitoring even if the values are out of range, but I added a warning message.
I took the min and max values from here: https://github.com/wandb/wandb/blob/852781a852a5ae63ea009b40fe923e1a80603fd0/wandb/sdk/internal/system/assets/interfaces.py#L120
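
(A minimal sketch of the warn-and-clamp behavior described here; the bounds and the logger name are assumptions for illustration, not the exact values or code used in the PR.)

import logging

logger = logging.getLogger("dvclive")

# Illustrative bounds only; the actual min/max come from the wandb code linked above.
MIN_INTERVAL, MAX_INTERVAL = 0.1, 30.0


def check_interval(interval: float) -> float:
    # Keep monitoring running: warn and clamp out-of-range values instead of raising.
    if not MIN_INTERVAL <= interval <= MAX_INTERVAL:
        clamped = min(max(interval, MIN_INTERVAL), MAX_INTERVAL)
        logger.warning(
            "interval=%s is outside [%s, %s]; using %s instead",
            interval, MIN_INTERVAL, MAX_INTERVAL, clamped,
        )
        return clamped
    return interval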

interval (float): interval in seconds between two measurements.
    Defaults to 0.5.
num_samples (int): number of samples to average. Defaults to 10.
disks_to_monitor (Optional[Dict[str, str]]): paths to the disks or
Member:

I'm not sure I understand this: "paths to the disks or partitions to monitor disk usage statistics." What is the difference between a disk and a partition?

Contributor Author:

I tried to change the description for this arg. Let me know what you think. I mentioned disk because I'm afraid our users don't know what a partition is.

Changes involve:

  • renaming to folders_to_monitor. While not perfectly accurate (two folders on the same partition, even with different content, will report identical statistics with this function), it is the most understandable name I could find. I'm afraid partitions_to_monitor wouldn't speak to our users.
  • mentioning the perfectly accurate definition in the rest of the docstring so people looking for deeper information find what they need.

for disk_name, disk_path in folders_to_monitor.items():
    if disk_name != os.path.normpath(disk_name):
        raise ValueError(  # noqa: TRY003
            "Keys for `partitions_to_monitor` should be a valid name"
Member:

hmm, what is a valid name in this case?

Member:

I would still move the try/except to where we use it ... e.g. if someone drops the directory, it should not cause other metrics to disappear
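
(A hedged sketch of what that isolation could look like, assuming the monitor reads disk statistics with psutil.disk_usage; the metric names follow the scheme from the PR description, and the function name is made up.)

import logging

import psutil

logger = logging.getLogger("dvclive")


def disk_metrics(folders_to_monitor):
    # Collect disk usage per monitored directory; a failure on one path
    # (e.g. a directory that was removed) is logged and skipped so the
    # other metrics are still reported.
    metrics = {}
    for name, path in folders_to_monitor.items():
        try:
            usage = psutil.disk_usage(path)
        except OSError:
            logger.warning("failed to read disk usage for %r (%s)", name, path)
            continue
        metrics[f"system/disk/usage (%)/{name}"] = usage.percent
        metrics[f"system/disk/usage (GB)/{name}"] = usage.used / 2**30
        metrics[f"system/disk/total (GB)/{name}"] = usage.total / 2**30
    return metrics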

_, latest = parse_metrics(live)

schema = {}
for name in [
Member:

do we check for actual values in both tests?

Contributor Author:

I changed one test to check for values :)

@shcheklein (Member) left a comment

Good iteration! We are getting there :) Some final 🤞 checks and questions.

@AlexandreKempf AlexandreKempf changed the title Monitors cpu ressources Monitors CPU, RAM, and disk usage Feb 20, 2024
@AlexandreKempf (Contributor Author) commented Feb 20, 2024

@shcheklein @dberenbaum
Please also consider this PR, because it will probably be merged into this branch before we merge to main.

@AlexandreKempf AlexandreKempf mentioned this pull request Feb 20, 2024
@AlexandreKempf (Contributor Author)

@shcheklein @dberenbaum

The following PR concerning the GPU metrics changes more code than I initially planned. It is not a simple addition to this PR; it needs to change a lot of this PR's code as well. So, to stop wasting your time on this PR, we will move the discussion to the final version of the code.

The next PR is here.

I'll close this PR; let's continue the discussion on #785.

@AlexandreKempf AlexandreKempf deleted the monitor-cpu-ressources branch February 22, 2024 09:20