
Conversation

@JasonLi1909
Contributor

This PR addresses a bug that occurs when users abort a Train Run from within a Python notebook. When a train run is aborted by stopping a cell execution, the associated placement groups are not removed. Because the train job persists while the notebook kernel is still running, it is never cleaned up, which prevents subsequent train runs from progressing. This fix manually shuts down the worker group state, which includes the placement group, upon abort, allowing the user to immediately kick off another train run without having to restart the notebook.
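A minimal sketch of the idea, assuming a simplified `WorkerGroup` that owns a `_worker_group_state` object exposing `shutdown()`. The names follow this PR's discussion, but the surrounding structure is illustrative rather than Ray Train's actual implementation:

```python
# Illustrative sketch only: the real WorkerGroup lives in Ray Train's
# internals. `_worker_group_state.shutdown()` is assumed to remove the
# placement group and tear down the worker actors.

class WorkerGroup:
    def __init__(self, worker_group_state):
        # Holds the placement group and worker actor handles for this run.
        self._worker_group_state = worker_group_state

    def abort(self) -> None:
        # Before this fix, aborting left the placement group alive for as
        # long as the notebook kernel (the driver process) stayed up, so the
        # next run could never acquire its resources.
        if self._worker_group_state is not None:
            self._worker_group_state.shutdown()
```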

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
@JasonLi1909 JasonLi1909 requested a review from a team as a code owner August 27, 2025 19:56
@JasonLi1909 JasonLi1909 requested a review from justinvyu August 27, 2025 19:57
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses a resource leak by ensuring placement groups are removed when a train run is aborted. The change adds a call to _worker_group_state.shutdown() in the abort method, which is the right approach.

My review includes two main points for improvement:

  1. After aborting, the WorkerGroup object is left in an inconsistent state. I've suggested adding a call to _clear_state() to resolve this.
  2. A TODO comment has become outdated due to the change and should be updated for clarity and maintainability.

Overall, the change is in the right direction to fix the bug, and with these adjustments, it will be more robust.
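For illustration, the suggested follow-up might extend the sketch above roughly as below; `_clear_state()` is a hypothetical helper that resets the cached worker-group references so the object is not left half-initialized after an abort:

```python
    def abort(self) -> None:
        if self._worker_group_state is not None:
            # Removes the placement group so another run can start.
            self._worker_group_state.shutdown()
        # Hypothetical helper: drop references so the WorkerGroup reads as
        # "not running" instead of pointing at a shut-down state object.
        self._clear_state()

    def _clear_state(self) -> None:
        self._worker_group_state = None
```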

…oup.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Contributor

@justinvyu justinvyu left a comment


Comment on lines 477 to 480
# TODO: consider shutting down the workers in the future.
# We don't do this for now due to this risk of hanging e.g. when calling
# `destroy_process_group` on an active group. A solution is to use a timeout
# in TorchConfig.on_shutdown in case of a hang.

"TODO: add shutdown callback hooks"

@ray-gardener ray-gardener bot added the train Ray Train Related Issue label Aug 28, 2025
# TODO: consider shutting down the workers in the future.
# We don't do this for now due to this risk of hanging e.g. when calling
# `destroy_process_group` on an active group. A solution is to use a timeout
Contributor

@TimothySeah TimothySeah Aug 28, 2025


Wait, I think "consider shutting down the workers in the future" is no longer applicable because `worker_group_state.shutdown` does do that, right? Do we need to fix the `destroy_process_group` on an active group issue in this PR too?

Contributor Author


Ah, thanks for the catch! I will remove the comment regarding the worker shutdown. As for the `destroy_process_group` on an active group issue, that is not triggered unless we also perform the `before_worker_group_abort` callbacks, so it will not be included in this PR.

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
@JasonLi1909 JasonLi1909 added the go add ONLY when ready to merge, run all tests label Aug 29, 2025
JasonLi1909 and others added 5 commits August 29, 2025 11:14
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Contributor

@justinvyu justinvyu left a comment


Thanks!

Comment on lines +272 to +273
# TODO: Add a timeout in the case of a hang, particularly
# relevant when func is TorchConfig.on_shutdown

We can remove this; Tim added the timeout at a different layer:

#56182
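Not part of this PR, but for context, a generic timeout wrapper around a potentially hanging shutdown callback could look like the sketch below. The names and placement are illustrative; #56182 adds the actual timeout at a different layer in Ray Train.

```python
import concurrent.futures

def run_with_timeout(func, timeout_s: float):
    # Illustrative pattern only, not Ray Train's implementation. Runs a
    # callback (e.g. one that calls `destroy_process_group`) in a worker
    # thread and stops waiting for it after `timeout_s` seconds.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The hung callback keeps running in its thread, but the abort
        # path no longer blocks on it.
        return None
    finally:
        pool.shutdown(wait=False)
```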

@justinvyu justinvyu enabled auto-merge (squash) September 3, 2025 18:41
@justinvyu justinvyu merged commit a040e6a into ray-project:master Sep 3, 2025
6 checks passed
sampan-s-nayak pushed a commit to sampan-s-nayak/ray that referenced this pull request Sep 8, 2025
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
wyhong3103 pushed a commit to wyhong3103/ray that referenced this pull request Sep 12, 2025
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025