[train v2][doc] Add updated fault tolerance user guide #51083

justinvyu · 2025-03-05T01:39:35Z

Summary

`<Framework>Trainer.restore` has been deprecated. This PR updates the fault tolerance user guide to use the new pattern: instantiate the `<Framework>Trainer` from scratch again with the same `(storage_path, name)`.
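The pattern in the summary can be sketched roughly as follows. This is a minimal illustration, not the code added by this PR: `build_trainer`, the placeholder training function, and `num_workers=2` are assumptions, and `TorchTrainer` stands in for any `<Framework>Trainer`. Ray is imported inside the function so the snippet loads even where Ray is not installed.

```python
def build_trainer(storage_path: str, name: str):
    """Create a trainer for the run identified by (storage_path, name).

    Calling this again with the same pair after a failure is the intended
    replacement for the deprecated `<Framework>Trainer.restore`.
    """
    # Lazy imports: Ray (and PyTorch) are only needed when a trainer is built.
    import ray.train
    from ray.train import RunConfig, ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_fn_per_worker(config):
        # Expected to be None on a fresh start and the latest reported
        # checkpoint when an earlier run with this identity exists.
        checkpoint = ray.train.get_checkpoint()
        if checkpoint is not None:
            with checkpoint.as_directory() as checkpoint_dir:
                pass  # Placeholder: load model/optimizer state from checkpoint_dir.
        # Placeholder: run the training loop and report checkpoints
        # with ray.train.report(...).

    return TorchTrainer(
        train_fn_per_worker,
        scaling_config=ScalingConfig(num_workers=2),  # illustrative value
        run_config=RunConfig(storage_path=storage_path, name=name),
    )


# Usage (requires a Ray cluster and access to the storage location):
# trainer = build_trainer("s3://my_bucket/", "unique_run_name")
# result = trainer.fit()
```

The same call is used for the first launch and for every retry. Only the `(storage_path, name)` pair has to stay constant.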

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

matthewdeng · 2025-03-06T01:38:04Z

doc/source/train/user-guides/fault-tolerance-deprecated-api.rst

+Handling Failures and Node Preemption (Deprecated API)
+======================================================


Same as the other PR.

Suggested change

-Handling Failures and Node Preemption (Deprecated API)
-======================================================
+(Deprecated) Handling Failures and Node Preemption
+==================================================

matthewdeng · 2025-03-06T01:48:10Z

doc/source/train/user-guides/fault-tolerance.rst

-        :dedent:
-        :start-after: __ft_restore_from_cloud_restored_start__
-        :end-before: __ft_restore_from_cloud_restored_end__
+    python entrypoint.py --storage_path s3://my_bucket/ --run_name unique_run_name-id=da823d5


nit: `unique_run_name-id=da823d5` was kind of confusing to me; at first I thought it was setting a run ID. Maybe we can just call it `unique_run_name` here?

Also, IDK if the user will have the wandb/mlflow run ID at this point, or if it would be generated inside the entrypoint script.

justinvyu · 2025-03-06

If you want to connect to the same wandb run when the job retries, you should set the run ID outside the script and pass it in, e.g. `wandb.init(..., id=run_id)`. Otherwise it's randomly generated.

I am setting a run ID here as a best practice for the user submitting a job.
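As a rough sketch of what this thread describes, an entrypoint script could accept the run identity from the outside and derive a stable experiment-tracking ID from it. The `--storage_path` and `--run_name` flags come from the command in the diff. The `--tracking_run_id` flag, both helper names, and the fallback of reusing the run name as the tracking ID are assumptions for illustration, not part of this PR.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags shown in the guide's example command."""
    parser = argparse.ArgumentParser(description="Training job entrypoint (sketch)")
    parser.add_argument("--storage_path", required=True)
    parser.add_argument("--run_name", required=True)
    # Hypothetical extra flag: an explicit experiment-tracking run ID.
    parser.add_argument("--tracking_run_id", default=None)
    return parser.parse_args(argv)


def tracking_init_kwargs(args):
    """Build keyword arguments for a tracker such as wandb.init.

    Assumption: when no explicit ID is given, the run name doubles as the
    tracking ID so that a retried job reconnects to the same tracked run.
    """
    run_id = args.tracking_run_id or args.run_name
    # `id` and `resume` are wandb.init parameters; "allow" resumes the run
    # if it already exists and creates it otherwise.
    return {"id": run_id, "resume": "allow"}


# Example with an explicit argument list instead of sys.argv:
args = parse_args(["--storage_path", "s3://my_bucket/", "--run_name", "unique_run_name"])
print(tracking_init_kwargs(args))
```

Because the ID is fixed by the caller rather than generated inside the script, resubmitting the job with the same flags would attach to the existing tracked run instead of creating a new one.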

justinvyu added 6 commits March 4, 2025 16:25

update fault tolerance guide

6a8a9f9

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

some copy edits

16b665b

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

add illustrated example

047c159

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

start job level fault tolerance edits

c2578aa

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

add captions to illustrated example

c074a4a

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

add example script + usage pattern

d8eadf7

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu assigned matthewdeng Mar 5, 2025

justinvyu requested review from hongpeng-guo, matthewdeng, raulchen, woshiyyya and a team as code owners March 5, 2025 01:39

justinvyu added 3 commits March 4, 2025 22:36

scaling config

53a5861

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

add orphan page of old docs + warnings at top of user guides

e4dbfe2

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

small typo

1fb600f

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

matthewdeng reviewed Mar 6, 2025

justinvyu added 4 commits March 6, 2025 12:00

Merge branch 'master' of https://github.com/ray-project/ray into trai…

e2566a6

…n_v2/doc/fault_tolerance

split up images

81b9ae6

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

fix image

5d9d619

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

move to deprecated-user-guides

29f1de9

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

matthewdeng approved these changes Mar 7, 2025

justinvyu added 3 commits March 7, 2025 11:45

fix end section

452b741

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

edits

52f351a

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Merge branch 'master' of https://github.com/ray-project/ray into trai…

8874ee4

…n_v2/doc/fault_tolerance

justinvyu enabled auto-merge (squash) March 7, 2025 20:04

github-actions bot added the `go` label (add ONLY when ready to merge, run all tests) Mar 7, 2025

justinvyu merged commit 5d2c0e3 into ray-project:master Mar 7, 2025
6 of 7 checks passed

justinvyu deleted the train_v2/doc/fault_tolerance branch March 7, 2025 21:10
