
Add resource (CPU, RAM, GPU, thread count) monitoring to AutoML experiments #6320

Closed
andrasfuchs opened this issue Sep 12, 2022 · 2 comments · Fixed by #6520
Labels: enhancement (New feature or request)
Milestone: ML.NET Future

Comments

@andrasfuchs (Contributor)

Is your feature request related to a problem? Please describe.
As others have also experienced, AutoML training is heavy on CPU and RAM, and it can cause slowdowns and crashes (#6175, #6286, #6288, #6297). I sometimes run into trials that run longer than expected, potentially because my system ran out of one of its resources. I have also had a few system crashes, when running AutoML forced Windows to start closing other applications.

Describe the solution you'd like
It would be great to have more information about the running AutoML trials, including how much CPU, RAM, and GPU they are using and on how many threads. Ideally this would be exposed through a new, periodically called method on AutoML's IMonitor interface.
If this were combined with extended experiment control (#5736), we could make clever decisions about a trial or experiment depending on its resource usage. For example, we could pause the experiment if the system is out of resources, or even cancel a trial that uses a suspiciously high amount of RAM to prevent a system failure (as sometimes happens with my experiments).
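
Something like this is what I have in mind; the `IResourceMonitor` interface and `ResourceUsage` type below are made-up names, not an existing Microsoft.ML.AutoML API:

```csharp
using Microsoft.ML.AutoML;

// Hypothetical shape -- none of these types exist in Microsoft.ML.AutoML today.
public interface IResourceMonitor
{
    // Called periodically (say, once per second) while a trial runs.
    void ReportResourceUsage(TrialSettings trial, ResourceUsage usage);
}

public class ResourceUsage
{
    public double CpuPercent { get; set; }        // process CPU load, 0-100
    public double RamMegabytes { get; set; }      // working set in MB
    public double? GpuPercent { get; set; }       // null when no GPU is involved
    public int ThreadCount { get; set; }          // threads used by the trial
}
```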

Describe alternatives you've considered
Well, theoretically I could monitor my system resources constantly on a separate thread, but I still couldn't determine whether AutoML is the cause of elevated CPU, RAM, or GPU usage, or whether something else running on the system, independently of AutoML, is responsible.
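
For reference, this is roughly the workaround I mean; it is a plain process-wide poller, which is exactly why it can't attribute usage to AutoML specifically:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Polls this process's resource usage once per second on a background timer.
// Limitation: these figures cover the whole process, so AutoML's share cannot
// be separated from anything else running alongside it.
var process = Process.GetCurrentProcess();
var lastCpu = process.TotalProcessorTime;

using var timer = new Timer(_ =>
{
    process.Refresh();
    var cpu = process.TotalProcessorTime;
    var cpuPercent = (cpu - lastCpu).TotalMilliseconds
                     / (1000.0 * Environment.ProcessorCount) * 100.0;
    lastCpu = cpu;

    Console.WriteLine($"CPU {cpuPercent:F1}%  " +
                      $"RAM {process.WorkingSet64 / (1024 * 1024)} MB  " +
                      $"threads {process.Threads.Count}");
}, null, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(1));

Console.ReadLine(); // keep the timer alive while the experiment runs elsewhere
```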

Additional context
This issue is related to AutoML experiment resource usage limiting (#6061) and AutoML experiment control (#5736).

@andrasfuchs andrasfuchs added the enhancement New feature or request label Sep 12, 2022
@ghost ghost added the untriaged New issue has not been triaged label Sep 12, 2022
@LittleLittleCloud (Contributor)

#6293

Also FYI we just add monitoring of CPU and memory usage in #6305. Monitoring GPU usage is not currently on roadmap since most of ML trainer doesn't run on GPU
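
A minimal sketch of how the per-trial figures could be read from an IMonitor implementation, assuming #6305 surfaces them on TrialResult as PeakCpu / PeakMemoryInMegaByte (verify the exact member names and the current IMonitor signature against the PR):

```csharp
using System;
using Microsoft.ML.AutoML;

// Sketch under the assumptions above -- not a confirmed API surface.
public class ResourceLoggingMonitor : IMonitor
{
    public void ReportCompletedTrial(TrialResult result)
        => Console.WriteLine(
            $"Trial {result.TrialSettings.TrialId}: " +
            $"peak CPU {result.PeakCpu:F2}, peak RAM {result.PeakMemoryInMegaByte:F0} MB");

    public void ReportBestTrial(TrialResult result) { }
    public void ReportFailTrial(TrialSettings settings, Exception exception = null) { }
    public void ReportRunningTrial(TrialSettings settings) { }
}
```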

@dakersnar dakersnar removed the untriaged New issue has not been triaged label Sep 12, 2022
@dakersnar dakersnar added this to the ML.NET Future milestone Sep 12, 2022
@andrasfuchs (Contributor, Author)

Excellent, thank you @LittleLittleCloud!!

@ghost ghost locked as resolved and limited conversation to collaborators Oct 13, 2022