Monitors CPU, RAM, and disk usage #773
Conversation
@AlexandreKempf Sorry for the delay here. For any PR like this where you introduce a new feature, it would be great if you could include a demo of the feature in the PR. Some useful ways to demo it can be:
Once it's merged, you can reuse it by posting it to https://iterativeai.slack.com/archives/C064XEWG2BD
Also note that the Windows tests are failing.
Would be good to get another look from @shcheklein and/or @skshetry, but product-wise it LGTM. Thanks @AlexandreKempf!
tests/test_monitor_system.py
Outdated
    monitor_system=False,
) as live:
    live.cpu_monitor = CPUMonitor(interval=interval, num_samples=num_samples)
    time.sleep(interval * num_samples + interval)  # log metrics once
this is still unreliable. I would do a while loop with a small sleep interval that reads from disk or checks some Live var to detect an update.
to test that it respects intervals, etc. - mock sleep, run the thread itself from the test, and check the result
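A minimal sketch of the polling idea, with a hypothetical `wait_for` helper (the timeout, poll interval, and the usage line are illustrative; `parse_metrics` is the helper used in the tests below):

```python
import time

def wait_for(condition, timeout=5.0, poll=0.05):
    """Poll `condition` with small sleeps instead of one fixed,
    timing-sensitive sleep; fail fast if it never becomes true."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(poll)
    raise TimeoutError("condition was never met")

# e.g. in the test above, instead of the fixed sleep:
# wait_for(lambda: "system/cpu/usage (%)" in parse_metrics(live)[1])
```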
just some ideas ^^ there might be a better way, but we can't go with unreliable / flaky tests
I tried to follow what you explained. If there is still some misunderstanding between us, could you please share a link or a concrete example to illustrate your point? That way I might better understand what you meant :)
tests/test_monitor_system.py
Outdated
) as live:
    monitor = CPUMonitor(disks_to_monitor={"main": "/", "home": "/"})
    monitor(live)
    metrics = monitor._get_metrics()
what is the purpose of this test? why is the single one that runs Live at the bottom not enough?
I wanted to ensure that the mock objects covered all the fields used in the CPU monitoring. But after your comment, I realized this test was redundant since it only tests behavior that is already covered by the unit tests. I removed it.
src/dvclive/monitor_system.py
Outdated
    num_samples: int = 10,
    plot: bool = True,
):
    if not isinstance(interval, (int, float)):
hmm, I don't think we check input types like this
in this case the idea was probably to check that the values are meaningful and raise a ValueError? or ignore CPU monitoring
Oh ok, I see.
I changed it to keep the monitoring even if the values are out of range, but I added a warning message.
I took the min and max values from here: https://github.com/wandb/wandb/blob/852781a852a5ae63ea009b40fe923e1a80603fd0/wandb/sdk/internal/system/assets/interfaces.py#L120
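A minimal sketch of that behavior (the helper name, logger, and bounds here are illustrative, not the PR's actual code; the real bounds come from the wandb link above):

```python
import logging

logger = logging.getLogger("dvclive")

# illustrative bounds only
MIN_INTERVAL, MAX_INTERVAL = 0.1, 30.0

def _validate_interval(interval: float) -> float:
    """Warn about an out-of-range interval but keep monitoring."""
    if not MIN_INTERVAL <= interval <= MAX_INTERVAL:
        logger.warning(
            "`interval` should be between %s and %s seconds, got %s; "
            "monitoring will continue with the given value.",
            MIN_INTERVAL, MAX_INTERVAL, interval,
        )
    return interval
```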
src/dvclive/monitor_system.py
Outdated
interval (float): interval in seconds between two measurements.
    Defaults to 0.5.
num_samples (int): number of samples to average. Defaults to 10.
disks_to_monitor (Optional[Dict[str, str]]): paths to the disks or
I'm not sure I understand "paths to the disks or partitions to monitor disk usage statistics". what is the difference between disk and partition?
I tried to change the description for this arg. Let me know what you think. I mentioned disk because I'm afraid our users don't know what a partition is.
Changes involve:
- renaming to `folders_to_monitor`. While not perfectly accurate (two folders on the same partition that have different content will have identical statistics with this function), it is the most understandable name I could find. I'm afraid `partition_to_monitor` doesn't speak to our users.
- mentioning the perfectly accurate definition in the rest of the docstring so people looking for deeper information find what they need.
src/dvclive/monitor_system.py
Outdated
for disk_name, disk_path in folders_to_monitor.items():
    if disk_name != os.path.normpath(disk_name):
        raise ValueError(  # noqa: TRY003
            "Keys for `partitions_to_monitor` should be a valid name"
hmm, what is a valid name in this case?
I would still move the try/except to where we use it ... e.g. if someone drops the directory, it should not cause the other metrics to disappear
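A minimal sketch of that suggestion, assuming `psutil` for the disk stats (an assumption; this excerpt doesn't show the actual implementation) and the metric names from the PR description:

```python
import logging
import psutil

logger = logging.getLogger("dvclive")

def _disk_metrics(folders_to_monitor):
    metrics = {}
    for name, path in folders_to_monitor.items():
        try:
            usage = psutil.disk_usage(path)
        except OSError:
            # a missing or unmounted path only skips this entry;
            # the other metrics keep being reported
            logger.warning("could not read disk usage for %r (%s)", name, path)
            continue
        metrics[f"system/disk/usage (%)/{name}"] = usage.percent
    return metrics
```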
tests/test_monitor_system.py
Outdated
_, latest = parse_metrics(live)

schema = {}
for name in [
do we check for actual values in both tests?
I changed one test to check for values :)
Good iteration! We are getting there :) Some final 🤞 checks and questions.
@shcheklein @dberenbaum
The following PR concerning the GPU metrics changes more code than I originally planned; it is not a simple addition to this PR and needs to change a lot of this PR's code as well. So, to stop wasting your time on this PR, we will move the discussion to the final version of the code. The next PR is here. I'll close this PR; let's continue the discussion on #785.
New feature: monitoring for CPU
Link to GPU monitoring.
Link to fix Studio
Monitoring CPU hardware
In this PR, we add the possibility for the user to monitor CPU, RAM, and disk usage during an experiment.
To use this feature, you can enable it with a single argument:
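A minimal sketch of the simple form, using the `monitor_system` flag that appears in this PR's tests (exact defaults are not shown in this excerpt):

```python
from dvclive import Live

# one flag turns on CPU/RAM/disk monitoring in the background
with Live(monitor_system=True) as live:
    ...  # your training loop
```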
If you want advanced control, you can specify each parameter this way:
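A hedged sketch of the advanced form, assembled from the `CPUMonitor` calls in this PR's tests; argument names changed during review (e.g. `disks_to_monitor` vs `folders_to_monitor`), so this may not match the final API:

```python
from dvclive import Live
from dvclive.monitor_system import CPUMonitor

with Live(monitor_system=False) as live:
    live.cpu_monitor = CPUMonitor(
        interval=0.5,    # seconds between two measurements
        num_samples=10,  # samples averaged into one logged point
        disks_to_monitor={"data": "/path/to/data/disk", "home": "/home"},
    )
```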
And this is what the editor help should look like:

If you allow the monitoring of your system, it will track:
- `system/cpu/count` -> number of CPU cores
- `system/cpu/usage (%)` -> the average usage of the CPUs
- `system/cpu/parallelization (%)` -> the share of CPU cores using more than 20% of their capacity (see the sketch after this list). It is useful when you're looking to parallelize your code to train your model or process your data faster.
- `system/ram/usage (%)` -> percentage of the RAM used. Useful to increase the batch size or the amount of data processed at the same time in RAM.
- `system/ram/usage (GB)` -> RAM used. Useful to increase the batch size or the amount of data processed at the same time.
- `system/ram/total (GB)` -> total RAM in your system
- `system/disk/usage (%)` -> percentage of disk used at a given path. By default it uses ".", meaning the directory where the Python executor was launched. You can specify the paths you want to monitor; for instance, the code example above monitors `/data` and `/home`. Data and code often live on very different paths/volumes, so it is useful for the user to be able to specify their own paths. Note that since several paths can be specified, the full metric name is `system/disk/usage (%)/<user defined name>`; for instance, it would be `system/disk/usage (%)/data` for `/path/to/data/disk` and `system/disk/usage (%)/home` for `/home`.
- `system/disk/usage (GB)` -> amount of disk used at a given path
- `system/disk/total (GB)` -> total disk storage at a given path

All the values that can change during an experiment can be saved as plots; timestamps are automatically recorded with the metrics. Metrics that don't change, such as CPU count, total RAM, and total disk, are saved as metrics but cannot be saved as plots.
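As a rough illustration of how a parallelization metric like the one above could be computed (a sketch assuming `psutil`, which this excerpt doesn't confirm; the 20% threshold comes from the description):

```python
import psutil

# sample per-core busy percentages over half a second
per_core = psutil.cpu_percent(interval=0.5, percpu=True)
busy_cores = sum(1 for pct in per_core if pct > 20)
parallelization = 100 * busy_cores / len(per_core)
```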
I decided to split the usage into % and GB. First, because it is more consistent with the other loggers out there. Second, both are extremely relevant depending on which cloud instance you run your experiment on. If you always run your experiments on the same hardware, the distinction is not really interesting.
Files generated
The metrics about the CPU are stored with the `log_metric` function. This means that the `.tsv` files are stored in the `dvclive/plots` folder. A specific folder, `system`, contains all the metrics about the CPU to distinguish them from the user-defined metrics. The metrics are also saved in the `dvclive/metrics.json` file.
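An illustrative layout of the generated files, based on the description above (exact file names may differ):

```
dvclive/
├── metrics.json            # latest value of every metric
└── plots/
    └── system/             # system metrics, separate from user-defined ones
        ├── cpu/
        └── ram/
```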
Plot display
Here is what the VS Code extension looks like:

Here is what Studio looks like:


Note that the Studio update is a little buggy, but it is fixed in this PR.