@JasonLi1909 JasonLi1909 commented Oct 30, 2025

Following a worker failure or a user abort during a Train job, the execution of sharded datasets (provided through get_dataset_shard) is shut down ungracefully. Consequently, any ongoing resource request made by a sharded dataset's SplitCoordinator to the AutoscalingRequester is not cancelled. The requested resources can then be held until a preset timeout expires, leading to inefficient cluster utilization and slower Train job turnaround.

To address the issue, this PR:

  • Implements an eager shutdown path to clean up resource requests made to the AutoscalingRequester (depicted below)
  • Adds new WorkerGroupCallback hooks (after_worker_group_abort and after_worker_group_shutdown) to DatasetsSetupCallback for the new shutdown path
  • Implements tests for the new cleanup path

Note on new WorkerGroupCallback hooks:
The new WorkerGroupCallback hooks after_worker_group_abort and after_worker_group_shutdown were added to ensure that the StreamingExecutor shutdown logic runs prior to the shutdown of train workers. This helps avoid race conditions and the additional complexity of timing executor shutdown while train workers are still alive.

Diagram of the new cleanup path:

Screenshot 2025-11-18 at 3 37 33 PM
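
For readers without the diagram, the hook-based cleanup described above can be sketched in plain Python. This is a minimal illustration, not the PR's implementation: `FakeCoordinator` and `DatasetsCallbackSketch` are hypothetical stand-ins, the `worker_group_context` parameter is an assumed signature, and the real code issues asynchronous `.remote()` actor calls rather than the synchronous method used here.

```python
from typing import Iterable, List


class FakeCoordinator:
    """Hypothetical stand-in for a SplitCoordinator actor handle."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.executor_running = True

    def shutdown_executor(self) -> str:
        # In the PR this is an asynchronous `.remote()` actor call;
        # here it is a plain synchronous method.
        self.executor_running = False
        return f"{self.name}:stopped"


class DatasetsCallbackSketch:
    """Hypothetical callback mirroring the two hooks named in the PR."""

    def __init__(self, coordinators: Iterable[FakeCoordinator]) -> None:
        self._coordinator_actors: List[FakeCoordinator] = list(coordinators)
        self._shutdown_results: List[str] = []

    def _shutdown_data_executors(self) -> None:
        # Ask every coordinator to stop its data executor.
        self._shutdown_results.extend(
            coord.shutdown_executor() for coord in self._coordinator_actors
        )

    # The parameter name below is an assumption, not taken from the PR.
    def after_worker_group_shutdown(self, worker_group_context=None) -> None:
        self._shutdown_data_executors()

    def after_worker_group_abort(self, worker_group_context=None) -> None:
        self._shutdown_data_executors()


coords = [FakeCoordinator("train"), FakeCoordinator("valid")]
callback = DatasetsCallbackSketch(coords)
callback.after_worker_group_abort()
print([c.executor_running for c in coords])  # prints [False, False]
```

Both hooks route to the same helper, so a normal shutdown and an abort take the same cleanup path.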

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
@JasonLi1909 JasonLi1909 requested a review from a team as a code owner October 30, 2025 21:56
@JasonLi1909 JasonLi1909 changed the title renamed DatasetsSetupCallback to DatasetsCallback Renaming DatasetsSetupCallback to DatasetsCallback Oct 30, 2025
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR correctly renames DatasetsSetupCallback to DatasetsCallback. I've added a few suggestions to improve consistency in related parts of the code, such as a variable name, a test function name, and a docstring, to fully align with this change.

JasonLi1909 and others added 2 commits October 30, 2025 14:59
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
cursor[bot]

This comment was marked as outdated.

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
@ray-gardener ray-gardener bot added the train Ray Train Related Issue label Oct 31, 2025
@github-actions
This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Nov 14, 2025
@JasonLi1909 JasonLi1909 changed the title Renaming DatasetsSetupCallback to DatasetsCallback Eager Resource Cleanup on Train Run Failures and Aborts Nov 17, 2025
@JasonLi1909 JasonLi1909 removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Nov 17, 2025
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
@JasonLi1909 JasonLi1909 requested a review from a team as a code owner November 18, 2025 23:16
…creation

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
@JasonLi1909 JasonLi1909 changed the title Eager Resource Cleanup on Train Run Failures and Aborts Eager Data Resource Cleanup on Train Run Failures and Aborts Nov 18, 2025
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
JasonLi1909 and others added 3 commits November 18, 2025 19:23
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
@justinvyu justinvyu left a comment

Thanks! Should be good after this. Nice tests

Comment on lines +57 to +59
# Two coordinator actors, one for each sharded dataset
coordinator_actors = callback._coordinator_actors
assert len(coordinator_actors) == 2

let's also add a call to after_worker_group_shutdown/abort and check the liveness of the actors?

@JasonLi1909 JasonLi1909 Nov 22, 2025

The callbacks don't kill the SplitCoordinator actors as of now; they only shut down their data executors. Ref counting should take care of the actors themselves. That said, I added two more tests for the after_worker_group_shutdown/abort hooks.

Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
JasonLi1909 and others added 3 commits November 21, 2025 17:12
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
@justinvyu justinvyu left a comment

🚢

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
"""Eagerly shutdown the data executors of the split coordinator actors."""
self._shutdown_refs = [
    coord.shutdown_executor.remote() for coord in self._coordinator_actors
]

Bug: Shutdown refs overwritten across worker group restarts

When a worker group is restarted due to failure or rescaling, _shutdown_data_executors is called for each worker group instance, but it overwrites self._shutdown_refs instead of appending to it. This means shutdown refs from previous worker group attempts are lost and never awaited in before_controller_shutdown, potentially causing incomplete resource cleanup from earlier failed attempts.
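
The difference between overwriting and accumulating the refs can be shown with a small, Ray-free sketch. Everything here is hypothetical: the class names, the `coordinators` argument, and the string "refs" are stand-ins, and accumulating is only one possible way to address the concern, not necessarily what the PR ended up doing.

```python
from typing import List


class OverwritingCallback:
    """Mirrors the quoted snippet: each call replaces the stored refs."""

    def __init__(self) -> None:
        self._shutdown_refs: List[str] = []

    def _shutdown_data_executors(self, coordinators: List[str]) -> None:
        # Refs from any earlier worker group attempt are dropped here.
        self._shutdown_refs = [f"shutdown:{c}" for c in coordinators]


class AccumulatingCallback(OverwritingCallback):
    """Hypothetical variant that keeps refs from every attempt."""

    def _shutdown_data_executors(self, coordinators: List[str]) -> None:
        self._shutdown_refs.extend(f"shutdown:{c}" for c in coordinators)

    def before_controller_shutdown(self) -> List[str]:
        # Hand back everything still pending and reset the list,
        # standing in for awaiting the refs in the real callback.
        pending, self._shutdown_refs = self._shutdown_refs, []
        return pending


for cls in (OverwritingCallback, AccumulatingCallback):
    cb = cls()
    cb._shutdown_data_executors(["attempt0-a", "attempt0-b"])  # first worker group
    cb._shutdown_data_executors(["attempt1-a", "attempt1-b"])  # after a restart
    print(cls.__name__, len(cb._shutdown_refs))
```

With two attempts of two coordinators each, the overwriting version ends up tracking two refs and the accumulating version four.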

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
with self._lock:
    # Call shutdown on the executor
    if self._executor is not None:
        self._executor.shutdown(force=False)

Is it safe to shut down multiple times?

Yes, it is. That said, we only ever call it once per SplitCoordinator.
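
As a rough illustration of why a repeated call can be harmless, here is a minimal sketch of a lock-guarded shutdown. `CountingExecutor` and `CoordinatorSketch` are hypothetical stand-ins. Note one deliberate difference from the quoted snippet: the sketch clears the executor reference after the first shutdown, whereas the PR relies on the executor's own shutdown being safe to repeat.

```python
import threading
from typing import Optional


class CountingExecutor:
    """Hypothetical executor that records how often shutdown is called."""

    def __init__(self) -> None:
        self.shutdown_calls = 0
        self.running = True

    def shutdown(self, force: bool = False) -> None:
        self.shutdown_calls += 1
        self.running = False


class CoordinatorSketch:
    """Hypothetical coordinator with a lock-guarded executor shutdown."""

    def __init__(self, executor: CountingExecutor) -> None:
        self._lock = threading.Lock()
        self._executor: Optional[CountingExecutor] = executor

    def shutdown_executor(self) -> None:
        with self._lock:
            if self._executor is not None:
                self._executor.shutdown(force=False)
                # Assumption added for this sketch: drop the reference so
                # later calls become no-ops regardless of the executor.
                self._executor = None


executor = CountingExecutor()
coordinator = CoordinatorSketch(executor)
coordinator.shutdown_executor()
coordinator.shutdown_executor()  # second call does nothing
print(executor.shutdown_calls)  # prints 1
```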

@justinvyu justinvyu changed the title Eager Data Resource Cleanup on Train Run Failures and Aborts [train] Eager Data Resource Cleanup on Train Run Failures and Aborts Nov 25, 2025
@justinvyu justinvyu enabled auto-merge (squash) November 25, 2025 23:32
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Nov 25, 2025
@justinvyu justinvyu merged commit 75f8562 into ray-project:master Nov 26, 2025
8 checks passed
KaisennHu pushed a commit to KaisennHu/ray that referenced this pull request Nov 26, 2025
…ject#58325)

Following a worker failure or a user abort during a Train job, the
execution of sharded datasets (provided through get_dataset_shard) is
ungracefully shutdown. Consequently, any ongoing resource request made
by a sharded dataset's SplitCoordinator to the AutoscalingRequester is
not cancelled. This can result in resources being held for a preset
timeout, leading to inefficient cluster utilization and slower train job
turnarounds.

- Implements an eager shutdown path to cleanup resource requests made to
the AutoscalingRequester (depicted below)
- Adds new WorkerGroupCallback hooks(`after_worker_group_abort` and
`after_worker_group_shutdown`) to DatasetsSetupCallback for the new
shutdown path
- Implements tests for the new cleanup path

---------

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
…ject#58325)

matthewdeng pushed a commit that referenced this pull request Dec 18, 2025
PR #58325 added shutdown and abort hooks to enhance the resource-cleanup
logic in DatasetsSetupCallback, so the callback’s responsibilities have
expanded beyond initial setup. Accordingly, this PR renames it to
DatasetsCallback to better match its behavior.

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Yicheng-Lu-llll pushed a commit to Yicheng-Lu-llll/ray that referenced this pull request Dec 22, 2025
…#59423)

lee1258561 pushed a commit to pinterest/ray that referenced this pull request Jan 10, 2026
…ject#58325)
