[train v2][doc] Add updated fault tolerance user guide #51083

Merged: 16 commits merged into ray-project:master from train_v2/doc/fault_tolerance on Mar 7, 2025

@justinvyu justinvyu commented Mar 5, 2025

Summary

`<Framework>Trainer.restore` has been deprecated. Update the fault tolerance user guide to use the new pattern of just instantiating the `<Framework>Trainer` from scratch again with the same `(storage_path, name)`.
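
A minimal sketch of the pattern described above, assuming Ray Train's `TorchTrainer`, `RunConfig`, `ScalingConfig`, and `ray.train.get_checkpoint`. The helper names (`run_identity`, `train_fn`, `build_trainer`), the worker count, and the training loop body are hypothetical placeholders, not taken from the guide. Ray is imported lazily inside the functions so the `(storage_path, name)` helper can be inspected without a Ray install.

```python
def run_identity(storage_path: str, name: str) -> dict:
    """The (storage_path, name) pair that identifies a training run."""
    return {"storage_path": storage_path, "name": name}


def train_fn() -> None:
    # Hypothetical training loop, executed on each worker.
    import ray.train

    # None on a fresh run; the latest reported checkpoint on a retried run.
    checkpoint = ray.train.get_checkpoint()
    if checkpoint is not None:
        with checkpoint.as_directory() as checkpoint_dir:
            pass  # load model/optimizer state from checkpoint_dir here
    # ... train, and report checkpoints with ray.train.report(...)


def build_trainer(storage_path: str, name: str):
    # Lazy imports are an assumption of this sketch, not a Ray requirement.
    from ray.train import RunConfig, ScalingConfig
    from ray.train.torch import TorchTrainer

    return TorchTrainer(
        train_fn,
        scaling_config=ScalingConfig(num_workers=2),  # placeholder value
        run_config=RunConfig(**run_identity(storage_path, name)),
    )
```

Calling `build_trainer("s3://my_bucket/", "unique_run_name").fit()` would start the run. Per the summary, running the same call again after a failure, with the same `(storage_path, name)`, is the intended replacement for `restore`.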

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Comment on lines 5 to 6
Handling Failures and Node Preemption (Deprecated API)
======================================================
Contributor

same as other pr

Suggested change
- Handling Failures and Node Preemption (Deprecated API)
- ======================================================
+ (Deprecated) Handling Failures and Node Preemption
+ ==================================================

:dedent:
:start-after: __ft_restore_from_cloud_restored_start__
:end-before: __ft_restore_from_cloud_restored_end__
python entrypoint.py --storage_path s3://my_bucket/ --run_name unique_run_name-id=da823d5
Contributor

nit: unique_run_name-id=da823d5 was kind of confusing to me, I thought this was setting a run ID at first. Maybe we can just call it unique_run_name here?

Also IDK if the user will have the wandb/mlflow run ID at this point, or if it would be generated inside the entrypoint script.

Contributor Author

If you want to connect to the same wandb run when the job retries, set the run ID outside the script and pass it in: wandb.init(..., id=run_id). Otherwise it's randomly generated.

I am setting a run id here as a best practice for the user submitting a job.
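
A rough sketch of what such an entrypoint could look like, assuming the `--storage_path` and `--run_name` flags from the command above and the `id` and `resume` parameters of `wandb.init`. The function names and the choice to reuse the run name as the wandb run ID are illustrative assumptions, not something the guide prescribes.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    # Flags mirror the example command:
    #   python entrypoint.py --storage_path s3://my_bucket/ --run_name <name>
    parser = argparse.ArgumentParser()
    parser.add_argument("--storage_path", required=True)
    parser.add_argument("--run_name", required=True)
    return parser.parse_args(argv)


def wandb_init_kwargs(run_id: str) -> dict:
    # A fixed id lets a retried job attach to the same wandb run;
    # resume="allow" resumes that run if it exists, else creates it.
    return {"id": run_id, "resume": "allow"}


def main(argv=None) -> None:
    args = parse_args(argv)
    # Imported lazily so the argument handling can be tested without wandb.
    import wandb

    # Assumption of this sketch: the externally supplied run name doubles
    # as the wandb run ID.
    wandb.init(**wandb_init_kwargs(args.run_name))
    # ... build the trainer with args.storage_path and args.run_name
```

Because the ID comes from the command line rather than being generated inside the script, every retry of the same job submission sees the same value.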

@justinvyu justinvyu enabled auto-merge (squash) March 7, 2025 20:04
@github-actions github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) Mar 7, 2025
@justinvyu justinvyu merged commit 5d2c0e3 into ray-project:master Mar 7, 2025
6 of 7 checks passed
@justinvyu justinvyu deleted the train_v2/doc/fault_tolerance branch March 7, 2025 21:10
elimelt pushed a commit to elimelt/ray that referenced this pull request Mar 9, 2025
…1083)

`<Framework>Trainer.restore` has been deprecated. Update the fault
tolerance user guide to use the new pattern of just instantiating the
`<Framework>Trainer` from scratch again with the same `(storage_path,
name)`.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>