[train] Eager Data Resource Cleanup on Train Run Failures and Aborts #58325

JasonLi1909 · 2025-10-30T21:56:45Z

Following a worker failure or a user abort during a Train job, the execution of sharded datasets (provided through get_dataset_shard) is shut down ungracefully. Consequently, any ongoing resource request made by a sharded dataset's SplitCoordinator to the AutoscalingRequester is not cancelled. Resources can then be held until a preset timeout expires, leading to inefficient cluster utilization and slower Train job turnarounds.

To address the issue, this PR:

- Implements an eager shutdown path to clean up resource requests made to the AutoscalingRequester (depicted below)
- Adds new WorkerGroupCallback hooks (after_worker_group_abort and after_worker_group_shutdown) to DatasetsSetupCallback for the new shutdown path
- Implements tests for the new cleanup path

Note on new WorkerGroupCallback hooks:
The new WorkerGroupCallback hooks after_worker_group_abort and after_worker_group_shutdown were added to ensure that the StreamingExecutor shutdown logic ran prior to the shutdown of train workers. This helps to avoid any race conditions and additional complexity related to timing executor shutdown while train workers are still alive.
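
As a rough illustration of the hook arrangement described above, here is a minimal sketch in plain Python with no Ray dependency. All class names (`WorkerGroupCallback` as written here, `FakeCoordinator`, `DatasetsCleanupCallback`) are hypothetical stand-ins, not the real Ray classes. The commit history suggests the real hooks receive a `WorkerGroupContext`; this sketch omits that argument and does not model when the worker group invokes each hook.

```python
from typing import List


class WorkerGroupCallback:
    """Hypothetical stand-in for the base class: every hook defaults to a no-op."""

    def after_worker_group_shutdown(self) -> None:
        pass

    def after_worker_group_abort(self) -> None:
        pass


class FakeCoordinator:
    """Hypothetical stand-in for a SplitCoordinator; records whether shutdown ran."""

    def __init__(self) -> None:
        self.executor_shut_down = False

    def shutdown_executor(self) -> None:
        self.executor_shut_down = True


class DatasetsCleanupCallback(WorkerGroupCallback):
    """Routes both hooks to the same eager cleanup helper."""

    def __init__(self, coordinators: List[FakeCoordinator]) -> None:
        self._coordinators = coordinators

    def _shutdown_data_executors(self) -> None:
        for coord in self._coordinators:
            coord.shutdown_executor()

    def after_worker_group_shutdown(self) -> None:
        self._shutdown_data_executors()

    def after_worker_group_abort(self) -> None:
        self._shutdown_data_executors()


coordinators = [FakeCoordinator(), FakeCoordinator()]
callback = DatasetsCleanupCallback(coordinators)
callback.after_worker_group_abort()  # an abort triggers the same cleanup as a shutdown
```

Keeping the hooks as no-ops on the base class means existing callbacks need no changes, and only the datasets callback opts in to the cleanup.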

Diagram of the new cleanup path:

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

gemini-code-assist

Code Review

This PR correctly renames DatasetsSetupCallback to DatasetsCallback. I've added a few suggestions to improve consistency in related parts of the code, such as a variable name, a test function name, and a docstring, to fully align with this change.

python/ray/train/v2/_internal/callbacks/datasets.py

python/ray/train/v2/api/data_parallel_trainer.py

python/ray/train/v2/tests/test_data_integration.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

github-actions · 2025-11-14T12:26:03Z

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

python/ray/train/v2/_internal/execution/worker_group/worker_group.py

…creation Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

python/ray/train/v2/_internal/callbacks/datasets.py

python/ray/data/_internal/iterator/stream_split_iterator.py

python/ray/train/v2/_internal/execution/callback.py

python/ray/data/_internal/iterator/stream_split_iterator.py

python/ray/train/v2/_internal/callbacks/datasets.py

Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>

python/ray/train/v2/_internal/callbacks/datasets.py

python/ray/train/v2/_internal/execution/worker_group/worker_group.py

Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

python/ray/train/v2/_internal/callbacks/datasets.py

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

python/ray/data/_internal/iterator/stream_split_iterator.py

python/ray/train/v2/_internal/callbacks/datasets.py

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

python/ray/train/v2/_internal/callbacks/datasets.py

justinvyu

Thanks! Should be good after this. Nice tests

python/ray/data/_internal/iterator/stream_split_iterator.py

python/ray/train/v2/_internal/callbacks/datasets.py

justinvyu · 2025-11-21T22:24:23Z

python/ray/train/v2/tests/test_data_resource_cleanup.py

+    # Two coordinator actors, one for each sharded dataset
+    coordinator_actors = callback._coordinator_actors
+    assert len(coordinator_actors) == 2


Let's also add a call to after_worker_group_shutdown/abort and check the liveness of the actors?

The callbacks don't kill the SplitCoordinator actors as of now; they only shut down their data executors. Ref counting should take care of the actors. That said, I added two more tests for the after_worker_group_shutdown/abort hooks.
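
A test along these lines could be sketched with `unittest.mock`, without starting a Ray cluster. This is not the test added in the PR: `CallbackUnderTest` is a hypothetical, simplified callback that mirrors the `shutdown_executor.remote()` call pattern quoted in this thread, and `MagicMock` objects stand in for the actor handles.

```python
from unittest.mock import MagicMock


class CallbackUnderTest:
    """Hypothetical, simplified callback mirroring the call pattern in this PR."""

    def __init__(self, coordinator_actors):
        self._coordinator_actors = coordinator_actors
        self._shutdown_refs = []

    def after_worker_group_shutdown(self):
        self._shutdown_refs = [
            coord.shutdown_executor.remote() for coord in self._coordinator_actors
        ]


def test_hook_shuts_down_executors_without_killing_actors():
    # MagicMock stands in for actor handles; no Ray cluster is started.
    actors = [MagicMock(name="coord_0"), MagicMock(name="coord_1")]
    callback = CallbackUnderTest(actors)

    callback.after_worker_group_shutdown()

    # Each coordinator receives exactly one shutdown request.
    for actor in actors:
        actor.shutdown_executor.remote.assert_called_once_with()
    assert len(callback._shutdown_refs) == 2
    # The handles stay referenced by the callback; nothing here terminates them.
    assert callback._coordinator_actors is actors


test_hook_shuts_down_executors_without_killing_actors()
```

The sketch checks only that a shutdown request is issued per coordinator, which matches the reply above: the hooks stop the executors and leave actor lifetime to reference counting.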

python/ray/train/v2/tests/test_data_resource_cleanup.py

Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>

python/ray/train/v2/_internal/callbacks/datasets.py

Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

python/ray/data/_internal/iterator/stream_split_iterator.py

justinvyu

🚢

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

cursor · 2025-11-24T20:03:35Z

python/ray/train/v2/_internal/callbacks/datasets.py

+        """Eagerly shutdown the data executors of the split coordinator actors."""
+        self._shutdown_refs = [
+            coord.shutdown_executor.remote() for coord in self._coordinator_actors
+        ]


Bug: Shutdown refs overwritten across worker group restarts

When a worker group is restarted due to failure or rescaling, _shutdown_data_executors is called for each worker group instance, but it overwrites self._shutdown_refs instead of appending to it. This means shutdown refs from previous worker group attempts are lost and never awaited in before_controller_shutdown, potentially causing incomplete resource cleanup from earlier failed attempts.
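
The difference between overwriting and accumulating the refs can be shown with a small stand-alone sketch. The class names are hypothetical and strings stand in for Ray object refs. Whether the PR adopted the accumulating form is not stated in this thread.

```python
from typing import List


class OverwritingCallback:
    def __init__(self) -> None:
        self._shutdown_refs: List[str] = []

    def shutdown_data_executors(self, coordinator_names: List[str]) -> None:
        # Assignment discards refs collected for earlier worker groups.
        self._shutdown_refs = [f"shutdown:{name}" for name in coordinator_names]


class AccumulatingCallback:
    def __init__(self) -> None:
        self._shutdown_refs: List[str] = []

    def shutdown_data_executors(self, coordinator_names: List[str]) -> None:
        # extend() keeps refs from every attempt so all can be awaited later.
        self._shutdown_refs.extend(f"shutdown:{name}" for name in coordinator_names)


overwriting, accumulating = OverwritingCallback(), AccumulatingCallback()
for attempt in (["a0", "a1"], ["b0", "b1"]):  # two worker group attempts
    overwriting.shutdown_data_executors(attempt)
    accumulating.shutdown_data_executors(attempt)

print(len(overwriting._shutdown_refs))   # 2: only the latest attempt survives
print(len(accumulating._shutdown_refs))  # 4: refs from both attempts are kept
```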

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

iamjustinhsu · 2025-11-24T20:42:55Z

python/ray/data/_internal/iterator/stream_split_iterator.py

+        with self._lock:
+            # Call shutdown on the executor
+            if self._executor is not None:
+                self._executor.shutdown(force=False)


Is it safe to shut down multiple times?

Yes, it is. That said, we only ever call it once per SplitCoordinator.
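
For readers who want repeat calls to be a guaranteed no-op regardless of the executor's own behavior, one possible pattern is to drop the executor reference after the first shutdown. This is a sketch under assumptions, not the PR's code: `FakeExecutor` and `Coordinator` are hypothetical, and the PR snippet above does not clear `_executor`; it relies on the executor's shutdown being safe to repeat.

```python
import threading


class FakeExecutor:
    """Hypothetical executor that counts how often shutdown is requested."""

    def __init__(self) -> None:
        self.shutdown_calls = 0

    def shutdown(self, force: bool) -> None:
        self.shutdown_calls += 1


class Coordinator:
    """Hypothetical coordinator whose shutdown is idempotent by construction."""

    def __init__(self, executor: FakeExecutor) -> None:
        self._lock = threading.Lock()
        self._executor = executor

    def shutdown_executor(self) -> None:
        with self._lock:
            if self._executor is not None:
                self._executor.shutdown(force=False)
                # Assumption for this sketch: clear the reference so later
                # calls skip the shutdown entirely.
                self._executor = None


executor = FakeExecutor()
coordinator = Coordinator(executor)
coordinator.shutdown_executor()
coordinator.shutdown_executor()  # second call finds no executor and does nothing
print(executor.shutdown_calls)  # 1
```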

…ject#58325)

Following a worker failure or a user abort during a Train job, the execution of sharded datasets (provided through get_dataset_shard) is ungracefully shutdown. Consequently, any ongoing resource request made by a sharded dataset's SplitCoordinator to the AutoscalingRequester is not cancelled. This can result in resources being held for a preset timeout, leading to inefficient cluster utilization and slower train job turnarounds.

- Implements an eager shutdown path to cleanup resource requests made to the AutoscalingRequester (depicted below)
- Adds new WorkerGroupCallback hooks(`after_worker_group_abort` and `after_worker_group_shutdown`) to DatasetsSetupCallback for the new shutdown path
- Implements tests for the new cleanup path

---------

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>

PR #58325 adds shutdown and abort hooks to enhance the resource-cleanup logic in DatasetsSetupCallback, so the callback’s responsibilities have expanded beyond initial setup. Accordingly, this PR renames it to DatasetsCallback to better match its behavior.

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

renamed DatasetsSetupCallback to DatasetsCallback

f80cc16

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

JasonLi1909 requested a review from a team as a code owner October 30, 2025 21:56

JasonLi1909 changed the title ~~renamed DatasetsSetupCallback to DatasetsCallback~~ Renaming DatasetsSetupCallback to DatasetsCallback Oct 30, 2025

gemini-code-assist bot reviewed Oct 30, 2025

python/ray/train/v2/_internal/callbacks/datasets.py (outdated, resolved)

python/ray/train/v2/api/data_parallel_trainer.py (outdated, resolved)

python/ray/train/v2/tests/test_data_integration.py (outdated, resolved)

JasonLi1909 and others added 2 commits October 30, 2025 14:59

Update python/ray/train/v2/tests/test_data_integration.py

4296e1f

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>

Update python/ray/train/v2/api/data_parallel_trainer.py

2ab1d2b

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>

JasonLi1909 added 2 commits October 30, 2025 15:00

updated docstrings for callback

6fd653e

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

fixed variable usage

973217a

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

ray-gardener bot added the train Ray Train Related Issue label Oct 31, 2025

github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Nov 14, 2025

JasonLi1909 changed the title ~~Renaming DatasetsSetupCallback to DatasetsCallback~~ Eager Resource Cleanup on Train Run Failures and Aborts Nov 17, 2025

JasonLi1909 removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Nov 17, 2025

JasonLi1909 added 2 commits November 17, 2025 02:35

added shutdown path

caa5075

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

added after worker group shutdown/abort hooks

4d20091

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

JasonLi1909 requested a review from a team as a code owner November 18, 2025 23:16

cursor bot reviewed Nov 18, 2025

python/ray/train/v2/_internal/execution/worker_group/worker_group.py (outdated, resolved)

python/ray/train/v2/_internal/execution/worker_group/worker_group.py (resolved)

JasonLi1909 added 2 commits November 18, 2025 15:19

added additional shutdown max_concurrency thread to SplitCoordinator …

de4f811

…creation Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

added callbacks to WorkerGroupCallback base class

4b8a633

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

JasonLi1909 changed the title ~~Eager Resource Cleanup on Train Run Failures and Aborts~~ Eager Data Resource Cleanup on Train Run Failures and Aborts Nov 18, 2025

justinvyu reviewed Nov 19, 2025

Update python/ray/data/_internal/iterator/stream_split_iterator.py

69520f1

Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>

cursor bot reviewed Nov 19, 2025

python/ray/train/v2/_internal/callbacks/datasets.py (outdated, resolved)

python/ray/train/v2/_internal/execution/worker_group/worker_group.py (outdated, resolved)

JasonLi1909 and others added 3 commits November 18, 2025 19:23

Update python/ray/train/v2/_internal/callbacks/datasets.py

efe27e6

Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>

fixes

ed46a79

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

passing WorkerGroupContext into hooks instead

8c0fed7

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

cursor bot reviewed Nov 19, 2025

python/ray/train/v2/_internal/callbacks/datasets.py (outdated, resolved)

fix

cd15b82

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

cursor bot reviewed Nov 19, 2025

python/ray/data/_internal/iterator/stream_split_iterator.py (outdated, resolved)

cursor bot reviewed Nov 19, 2025

python/ray/train/v2/_internal/callbacks/datasets.py (resolved)

JasonLi1909 added 4 commits November 20, 2025 13:22

added new WorkerGroupCallback hooks and tests

f14047d

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

tests

609842d

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

fix nits

d454b47

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

desc

bd0d8c7

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

cursor bot reviewed Nov 21, 2025

python/ray/train/v2/_internal/callbacks/datasets.py (resolved)

justinvyu reviewed Nov 21, 2025

Update python/ray/train/v2/_internal/callbacks/datasets.py

feaea93

Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>

cursor bot reviewed Nov 21, 2025

python/ray/train/v2/_internal/callbacks/datasets.py (resolved)

JasonLi1909 and others added 3 commits November 21, 2025 17:12

Update python/ray/data/_internal/iterator/stream_split_iterator.py

c7ef5b9

Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>

updated and added new tests

eab654a

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

Merge branch 'master' into rename-datasets-callback

c57e29d

cursor bot reviewed Nov 24, 2025

python/ray/data/_internal/iterator/stream_split_iterator.py (resolved)

justinvyu approved these changes Nov 24, 2025

JasonLi1909 added 2 commits November 24, 2025 11:45

renamed DatasetsCallback back to DatasetsSetupCallback

882ad73

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

cleanup

43bba53

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

cursor bot reviewed Nov 24, 2025

added tests to bazel build

c7221cb

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

iamjustinhsu reviewed Nov 24, 2025

iamjustinhsu approved these changes Nov 25, 2025

justinvyu changed the title ~~Eager Data Resource Cleanup on Train Run Failures and Aborts~~ [train] Eager Data Resource Cleanup on Train Run Failures and Aborts Nov 25, 2025

justinvyu enabled auto-merge (squash) November 25, 2025 23:32

github-actions bot added the go add ONLY when ready to merge, run all tests label Nov 25, 2025

justinvyu merged commit 75f8562 into ray-project:master Nov 26, 2025
8 checks passed

JasonLi1909 mentioned this pull request Dec 14, 2025

[train] Rename DatasetsSetupCallback to DatasetsCallback #59423

Merged
