Monitoring system resources during training using MLFlow

### Discussed in https://github.com/Project-MONAI/MONAI/discussions/7404

<div type='discussions-op-text'>

<sup>Originally posted by **kavmar** January 18, 2024</sup>
Hi,

I found a cool feature in a recent MLFlow release that lets us monitor and log system resources (GPU, CPU, memory, network, disk, ...) during training. I am using it in Engine-based training as follows:

`import mlflow as resource_monitor`

`resource_monitor.set_tracking_uri(mlflow_uri)`
`resource_monitor.set_experiment(experiment_name=exp_name)`
`resource_monitor.set_system_metrics_sampling_interval(interval=2)`
`resource_monitor.start_run(log_system_metrics=True)`
`run_name = resource_monitor.active_run().info.run_name`

Then, for both training and validation, I create the handler with the same run name and end the run once training is done:

`mlflow_handler = MLFlowHandler(tracking_uri=mlflow_uri, experiment_name=exp_name, run_name=run_name, ....)`
`resource_monitor.end_run()`

This way, both the resource metrics and the training logs go to the same experiment and run. This suffices in a way, but the `resource_monitor` part in particular follows a linear approach rather than the Engine/Event paradigm.
I would love to hear whether it makes sense to think about enhancing this approach.
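
As a rough illustration of what an Engine/Event-style variant could look like, here is a minimal sketch. `SystemMetricsHandler`, its parameters, and the injectable `tracker` argument are hypothetical names made up for this example and are not part of MONAI or MLflow. It assumes an MLflow release that supports `log_system_metrics`, and Ignite for `attach`.

```python
from typing import Any, Optional


class SystemMetricsHandler:
    """Hypothetical handler (not part of MONAI): starts an MLflow run with
    system metrics logging when the Engine starts and ends it on completion."""

    def __init__(
        self,
        tracking_uri: Optional[str] = None,
        experiment_name: str = "monai_experiment",
        run_name: Optional[str] = None,
        sampling_interval: int = 2,
        tracker: Any = None,
    ) -> None:
        if tracker is None:
            # Imported lazily so the class can be exercised with a stand-in tracker.
            import mlflow

            tracker = mlflow
        self.tracker = tracker
        self.tracking_uri = tracking_uri
        self.experiment_name = experiment_name
        self.run_name = run_name
        self.sampling_interval = sampling_interval

    def attach(self, engine: Any) -> None:
        """Register the start/complete callbacks on an Ignite Engine."""
        from ignite.engine import Events

        engine.add_event_handler(Events.STARTED, self.start)
        engine.add_event_handler(Events.COMPLETED, self.complete)

    def start(self, engine: Any) -> None:
        """Same calls as the linear setup above, triggered by Events.STARTED."""
        if self.tracking_uri is not None:
            self.tracker.set_tracking_uri(self.tracking_uri)
        self.tracker.set_experiment(experiment_name=self.experiment_name)
        self.tracker.set_system_metrics_sampling_interval(self.sampling_interval)
        run = self.tracker.start_run(run_name=self.run_name, log_system_metrics=True)
        # Keep the name MLflow actually assigned (it generates one if none was given).
        self.run_name = run.info.run_name

    def complete(self, engine: Any) -> None:
        """End the MLflow run, triggered by Events.COMPLETED."""
        self.tracker.end_run()
```

Usage would be along the lines of `SystemMetricsHandler(tracking_uri=mlflow_uri, experiment_name=exp_name, run_name="my_run").attach(trainer)`. One open point in this sketch: an auto-generated run name is only known after `Events.STARTED` fires, so sharing it with `MLFlowHandler` would probably require choosing the run name up front. The sketch also does not end the run if training raises an exception.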

Thanks

PS: It might make sense to include this in the MLFlow integration tutorials.</div>

Monitoring system resources during training using MLFlow #7405
