
Conversation

@JasonLi1909
Contributor

This PR addresses a bug that occurs when users abort a Train Run from within a Python notebook. When a train run is aborted by stopping a cell execution, the associated placement groups are not removed. Because the train job persists while the notebook kernel is still running, it is never cleaned up, which prevents subsequent train runs from progressing. This fix manually shuts down the worker group state, which includes the placement group, upon abort, allowing the user to immediately kick off another train run without having to restart the notebook.
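A minimal sketch of the idea, assuming a simplified `WorkerGroup` that owns a `_worker_group_state` object exposing `shutdown()`. The names follow this PR's discussion, but the surrounding structure is illustrative rather than Ray Train's actual implementation:

```python
# Illustrative sketch only: the real WorkerGroup lives in Ray Train's
# internals. `_worker_group_state.shutdown()` is assumed to remove the
# placement group and tear down the worker actors.

class WorkerGroup:
    def __init__(self, worker_group_state):
        # Holds the placement group and worker actor handles for this run.
        self._worker_group_state = worker_group_state

    def abort(self) -> None:
        # Before this fix, aborting left the placement group alive for as
        # long as the notebook kernel (the driver process) stayed up, so the
        # next run could never acquire its resources.
        if self._worker_group_state is not None:
            self._worker_group_state.shutdown()
```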

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
@JasonLi1909 JasonLi1909 requested a review from a team as a code owner August 27, 2025 19:56
@JasonLi1909 JasonLi1909 requested a review from justinvyu August 27, 2025 19:57
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses a resource leak by ensuring placement groups are removed when a train run is aborted. The change adds a call to _worker_group_state.shutdown() in the abort method, which is the right approach.

My review includes two main points for improvement:

  1. After aborting, the WorkerGroup object is left in an inconsistent state. I've suggested adding a call to _clear_state() to resolve this.
  2. A TODO comment has become outdated due to the change and should be updated for clarity and maintainability.

Overall, the change is in the right direction to fix the bug, and with these adjustments, it will be more robust.
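For illustration, the suggested follow-up might extend the sketch above roughly as below; `_clear_state()` is a hypothetical helper that resets the cached worker-group references so the object is not left half-initialized after an abort:

```python
    def abort(self) -> None:
        if self._worker_group_state is not None:
            # Removes the placement group so another run can start.
            self._worker_group_state.shutdown()
        # Hypothetical helper: drop references so the WorkerGroup reads as
        # "not running" instead of pointing at a shut-down state object.
        self._clear_state()

    def _clear_state(self) -> None:
        self._worker_group_state = None
```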

…oup.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Contributor

@justinvyu justinvyu left a comment


Comment on lines 477 to 480
# TODO: consider shutting down the workers in the future.
# We don't do this for now due to this risk of hanging e.g. when calling
# `destroy_process_group` on an active group. A solution is to use a timeout
# in TorchConfig.on_shutdown in case of a hang.

"TODO: add shutdown callback hooks"

@ray-gardener ray-gardener bot added the train Ray Train Related Issue label Aug 28, 2025
# TODO: consider shutting down the workers in the future.
# We don't do this for now due to this risk of hanging e.g. when calling
# `destroy_process_group` on an active group. A solution is to use a timeout
Contributor

@TimothySeah TimothySeah Aug 28, 2025


Wait, I think "consider shutting down the workers in the future" is no longer applicable because `worker_group_state.shutdown` does do that, right? Do we need to fix the `destroy_process_group` on an active group issue in this PR too?

Contributor Author


Ah, thanks for the catch! I will remove the comment regarding the worker shutdown. As for the `destroy_process_group` on an active group issue, that is not triggered unless we also perform the `before_worker_group_abort` callbacks, so it will not be included in this PR.

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
@JasonLi1909 JasonLi1909 added the go add ONLY when ready to merge, run all tests label Aug 29, 2025
JasonLi1909 and others added 5 commits August 29, 2025 11:14
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Contributor

@justinvyu justinvyu left a comment


Thanks!

Comment on lines +272 to +273
# TODO: Add a timeout in the case of a hang, particularly
# relevant when func is TorchConfig.on_shutdown

We can remove this; Tim added the timeout at a different layer:

#56182
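Not part of this PR, but for context, a generic timeout wrapper around a potentially hanging shutdown callback could look like the sketch below. The names and placement are illustrative; #56182 adds the actual timeout at a different layer in Ray Train.

```python
import concurrent.futures

def run_with_timeout(func, timeout_s: float):
    # Illustrative pattern only, not Ray Train's implementation. Runs a
    # callback (e.g. one that calls `destroy_process_group`) in a worker
    # thread and stops waiting for it after `timeout_s` seconds.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The hung callback keeps running in its thread, but the abort
        # path no longer blocks on it.
        return None
    finally:
        pool.shutdown(wait=False)
```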

@justinvyu justinvyu enabled auto-merge (squash) September 3, 2025 18:41
@justinvyu justinvyu merged commit a040e6a into ray-project:master Sep 3, 2025
6 checks passed
sampan-s-nayak pushed a commit to sampan-s-nayak/ray that referenced this pull request Sep 8, 2025
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
wyhong3103 pushed a commit to wyhong3103/ray that referenced this pull request Sep 12, 2025
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025