Description
Discussed in #7404
Originally posted by kavmar January 18, 2024
Hi,
I found a cool feature in the recent MLflow release that lets us monitor and log system resources (GPU, CPU, memory, network, disk, ...) during training. I am using it in Engine-based training as follows:
import mlflow as resource_monitor
resource_monitor.set_tracking_uri(mlflow_uri)
resource_monitor.set_experiment(experiment_name=exp_name)
resource_monitor.set_system_metrics_sampling_interval(interval=2)
resource_monitor.start_run(log_system_metrics=True)
run_name = resource_monitor.active_run().info.run_name
and then, for both the training and the validation engine, a handler is created similarly:
mlflow_handler = MLFlowHandler(tracking_uri=mlflow_uri, experiment_name=exp_name, run_name=run_name, ....)
resource_monitor.end_run()
This way both the resource metrics and the training logs go to the same experiment and run. In a way this suffices, but the resource_monitor part in particular follows a linear, script-style approach rather than the Engine/Event paradigm.
I would love to hear whether it makes sense to think about enhancing this approach.
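To make the idea concrete, here is a rough sketch of what an event-driven variant could look like. `SystemMetricsRun` is a hypothetical helper, not an existing MONAI or Ignite class; it assumes the run can be opened on `Events.STARTED` and closed on `Events.COMPLETED`, and it only uses the same MLflow calls as the snippet above plus `end_run()`.

```python
class SystemMetricsRun:
    """Hypothetical helper: ties an MLflow run with system metrics
    logging to the lifetime of an Ignite Engine."""

    def __init__(self, tracking_uri, experiment_name, sampling_interval=2, tracker=None):
        if tracker is None:
            # Imported lazily so a different tracker object can be injected.
            import mlflow as tracker
        self.tracker = tracker
        self.tracking_uri = tracking_uri
        self.experiment_name = experiment_name
        self.sampling_interval = sampling_interval
        self.run_name = None  # only known once the run has started

    def start(self, engine=None):
        # Same sequence of calls as the linear version, run on Events.STARTED.
        self.tracker.set_tracking_uri(self.tracking_uri)
        self.tracker.set_experiment(experiment_name=self.experiment_name)
        self.tracker.set_system_metrics_sampling_interval(self.sampling_interval)
        run = self.tracker.start_run(log_system_metrics=True)
        self.run_name = run.info.run_name

    def stop(self, engine=None):
        # Close the run when the engine completes.
        self.tracker.end_run()

    def attach(self, engine):
        from ignite.engine import Events

        engine.add_event_handler(Events.STARTED, self.start)
        engine.add_event_handler(Events.COMPLETED, self.stop)
```

Usage would be something like `SystemMetricsRun(mlflow_uri, exp_name).attach(trainer)`. One open point with this sketch: `run_name` only exists after the engine has started, so passing it to `MLFlowHandler` at construction time would need some extra thought. It also does not close the run if the engine stops on an exception.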
Thanks
PS: It might make sense to include this in the MLflow integration tutorials.