Hi,
I have run into an issue when running Ray Tune experiments together with ACME distributed SAC training and the ASHA scheduler. The idea behind the ASHA scheduler is that it terminates underperforming trials early in order to find good hyperparameters faster. When the single-process experiment is used, no hanging processes are left behind after Ray terminates a trial. Here is an example of how the job is started:
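Roughly like this (a minimal sketch rather than my exact code; `make_sac_experiment_config` is a placeholder for my own helper, and the exact ACME APIs may differ between versions):

```python
import ray
from ray import tune

from acme.jax import experiments  # assumed import path; may differ per ACME version


def single_process_trainable(config):
    # Build the ACME experiment config from the sampled hyperparameters.
    # `make_sac_experiment_config` is a placeholder for my own helper.
    experiment_config = make_sac_experiment_config(
        learning_rate=config["learning_rate"],
    )
    # Everything runs inside the current trial process, so when Ray stops the
    # trial, the whole experiment dies with it -- no leftover processes.
    # (Metric reporting back to Tune is omitted here for brevity.)
    experiments.run_experiment(experiment=experiment_config)


tune.run(
    single_process_trainable,
    config={"learning_rate": tune.loguniform(1e-4, 1e-2)},
)
```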
In contrast, when the job is started in a multi-processing way, the termination of the trial by Ray does not affect the processes that Launchpad has spawned. What happens is that the ACME training job keeps running instead of being terminated. Here is an example of how the job function and the Ray Tune config are defined:
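Again a simplified sketch rather than my exact setup; `make_sac_experiment_config`, the `episode_return` metric name, and the resource numbers are placeholders:

```python
import launchpad as lp
import ray
from ray import tune
from ray.tune.schedulers import ASHAScheduler

from acme.jax import experiments  # assumed import path; may differ per ACME version


def distributed_trainable(config):
    # Placeholder helper that builds the experiment from sampled hyperparameters.
    experiment_config = make_sac_experiment_config(
        learning_rate=config["learning_rate"],
    )
    program = experiments.make_distributed_experiment(
        experiment=experiment_config,
        num_actors=4,
    )
    # Launchpad spawns separate worker processes here; these are the processes
    # that keep running after Ray terminates the trial.
    lp.launch(program, launch_type=lp.LaunchType.LOCAL_MULTI_PROCESSING)


tune.run(
    distributed_trainable,
    config={"learning_rate": tune.loguniform(1e-4, 1e-2)},
    scheduler=ASHAScheduler(metric="episode_return", mode="max"),
    resources_per_trial={"cpu": 8},
)
```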
My question: is there a way to somehow forward the termination signal from Ray (when it terminates its trial) to all the node processes?
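For illustration, this is roughly the kind of cleanup I am hoping is possible (purely hypothetical; it assumes Ray delivers SIGTERM to the trial process when stopping it and that the Launchpad workers share the trial's process group, neither of which I have verified):

```python
import os
import signal


def _terminate_children(signum, frame):
    # Restore the default handler so the kill below does not re-enter this one.
    signal.signal(signal.SIGTERM, signal.SIG_DFL)
    # Forward the signal to the whole process group (assuming the Launchpad
    # workers ended up in the same group as the trial process).
    os.killpg(os.getpgid(os.getpid()), signal.SIGTERM)


# Installed inside the trainable before launching the Launchpad program.
signal.signal(signal.SIGTERM, _terminate_children)
```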
Thank you in advance.