Monitoring system resources during training using MLFlow #7405

Open
@KumoLiu

Description

Discussed in #7404

Originally posted by kavmar January 18, 2024
Hi,

I found a cool feature in a recent MLflow release that lets us monitor and log system resources (GPU, CPU, memory, network, disk, ...) during training. I am using it in Engine-based training as follows:

import mlflow as resource_monitor

resource_monitor.set_tracking_uri(mlflow_uri)
resource_monitor.set_experiment(experiment_name=exp_name)
resource_monitor.set_system_metrics_sampling_interval(interval=2)
resource_monitor.start_run(log_system_metrics=True)
run_name = resource_monitor.active_run().info.run_name

and then, for both training and validation, similarly:

mlflow_handler = MLFlowHandler(tracking_uri=mlflow_uri, experiment_name=exp_name, run_name=run_name, ....)
resource_monitor.end_run()

This way, both the resource metrics and the training logs go to the same experiment and run. This works, but the resource_monitor part follows a linear, procedural approach rather than the Engine/Event paradigm.
I would love to hear whether it makes sense to think about enhancing this approach.
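
To make the idea concrete, here is a minimal sketch of what an event-driven version might look like. `SystemMetricsHandler` is a hypothetical name, not an existing MONAI or MLflow class, and the `monitor` argument is only an assumption added so the class can be exercised with a stub instead of a real MLflow installation. It wraps the same MLflow calls as above and binds them to the Ignite `Events.STARTED` and `Events.COMPLETED` events:

```python
from typing import Any, Optional


class SystemMetricsHandler:
    """Hypothetical handler (not part of MONAI or MLflow).

    Starts an MLflow run with system metrics logging when the engine starts
    and ends the run when the engine completes.
    """

    def __init__(
        self,
        tracking_uri: str,
        experiment_name: str,
        sampling_interval: int = 2,
        monitor: Optional[Any] = None,
    ) -> None:
        if monitor is None:
            # Lazy import; a stub object can be injected for testing.
            import mlflow as monitor
        self.monitor = monitor
        self.tracking_uri = tracking_uri
        self.experiment_name = experiment_name
        self.sampling_interval = sampling_interval
        self.run_name: Optional[str] = None  # filled in once the run starts

    def attach(self, engine: Any, start_event: Any = None, stop_event: Any = None) -> None:
        """Register the callbacks; defaults to Ignite's STARTED/COMPLETED events."""
        if start_event is None or stop_event is None:
            from ignite.engine import Events

            if start_event is None:
                start_event = Events.STARTED
            if stop_event is None:
                stop_event = Events.COMPLETED
        engine.add_event_handler(start_event, self.start)
        engine.add_event_handler(stop_event, self.stop)

    def start(self, engine: Any) -> None:
        """Configure MLflow and open a run that logs system metrics."""
        self.monitor.set_tracking_uri(self.tracking_uri)
        self.monitor.set_experiment(experiment_name=self.experiment_name)
        self.monitor.set_system_metrics_sampling_interval(self.sampling_interval)
        run = self.monitor.start_run(log_system_metrics=True)
        self.run_name = run.info.run_name

    def stop(self, engine: Any) -> None:
        """Close the active MLflow run."""
        self.monitor.end_run()
```

Usage would then be something like `SystemMetricsHandler(mlflow_uri, exp_name).attach(trainer)`. This sketch does not handle `Events.EXCEPTION_RAISED`, and how the run it opens should be coordinated with the run used by `MLFlowHandler` is left open.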

Thanks

PS: It might make sense to include this in the MLflow integration tutorials.
