[train v2][doc] Add updated fault tolerance user guide #51083
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Handling Failures and Node Preemption (Deprecated API)
======================================================
same as other pr
Suggested change:
- Handling Failures and Node Preemption (Deprecated API)
- ======================================================
+ (Deprecated) Handling Failures and Node Preemption
+ ==================================================
:dedent:
:start-after: __ft_restore_from_cloud_restored_start__
:end-before: __ft_restore_from_cloud_restored_end__
python entrypoint.py --storage_path s3://my_bucket/ --run_name unique_run_name-id=da823d5
nit: `unique_run_name-id=da823d5` was kind of confusing to me; I thought this was setting a run ID at first. Maybe we can just call it `unique_run_name` here?
Also, I don't know whether the user will have the wandb/mlflow run ID at this point, or whether it would be generated inside the entrypoint script.
If you want to connect to the same wandb run when the job retries, you should set the run ID outside the script and pass it in: `wandb.init(..., id=run_id)`. Otherwise it's randomly generated.
I am setting a run ID here as a best practice for the user submitting a job.
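For illustration, here is a minimal sketch of how an entrypoint script could accept the run name from the outside and reuse it as the tracking run ID. It assumes the `--storage_path` and `--run_name` flags from the example command above; the `init_tracking` helper and the `my_project` project name are hypothetical, and reusing the run name as the wandb ID is an assumption rather than something this PR prescribes.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags used in the example command from the guide."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--storage_path", required=True)
    parser.add_argument("--run_name", required=True)
    return parser.parse_args(argv)


def init_tracking(run_name):
    """Attach to the tracking run tied to this job (hypothetical helper)."""
    import wandb  # lazy import: only needed when tracking is enabled

    # Assumption: reuse the externally supplied run name as the wandb run ID.
    # resume="allow" resumes the run if it exists and creates it otherwise.
    return wandb.init(project="my_project", id=run_name, resume="allow")
```

Because the ID is chosen by whoever submits the job, a retried job that receives the same `--run_name` would log to the same wandb run instead of starting a new one.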
…n_v2/doc/fault_tolerance
…n_v2/doc/fault_tolerance
…1083) `<Framework>Trainer.restore` has been deprecated. Update the fault tolerance user guide to use the new pattern of just instantiating the `<Framework>Trainer` from scratch again with the same `(storage_path, name)`. --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Summary
`<Framework>Trainer.restore` has been deprecated. Update the fault tolerance user guide to use the new pattern of just instantiating the `<Framework>Trainer` from scratch again with the same `(storage_path, name)`.
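
The pattern described in the summary could look roughly like the sketch below. It is not the code added by this PR: `TorchTrainer` stands in for `<Framework>Trainer`, the `save_state` / `load_next_epoch` helpers and the `epoch.txt` file are hypothetical, and the worker count and epoch count are placeholders. Ray is imported lazily inside the functions so the helpers can be exercised without it.

```python
import os
import tempfile


def save_state(ckpt_dir, epoch):
    """Write the last finished epoch into a checkpoint directory."""
    with open(os.path.join(ckpt_dir, "epoch.txt"), "w") as f:
        f.write(str(epoch))


def load_next_epoch(ckpt_dir):
    """Return the epoch to resume from, given a checkpoint directory."""
    with open(os.path.join(ckpt_dir, "epoch.txt")) as f:
        return int(f.read()) + 1


def train_fn_per_worker(config):
    import ray.train
    from ray.train import Checkpoint

    # On a fresh run there is no checkpoint; on a retry with the same
    # (storage_path, name), the latest reported checkpoint is returned.
    start_epoch = 0
    checkpoint = ray.train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            start_epoch = load_next_epoch(ckpt_dir)

    is_rank_zero = ray.train.get_context().get_world_rank() == 0
    for epoch in range(start_epoch, config["num_epochs"]):
        # ... one epoch of training would go here ...
        with tempfile.TemporaryDirectory() as tmp:
            ckpt = None
            if is_rank_zero:
                save_state(tmp, epoch)
                ckpt = Checkpoint.from_directory(tmp)
            # Every worker calls report; only rank 0 attaches a checkpoint.
            ray.train.report({"epoch": epoch}, checkpoint=ckpt)


def build_trainer(storage_path, run_name):
    """Instantiate the trainer from scratch; no Trainer.restore call."""
    from ray.train import RunConfig, ScalingConfig
    from ray.train.torch import TorchTrainer

    return TorchTrainer(
        train_fn_per_worker,
        train_loop_config={"num_epochs": 10},
        scaling_config=ScalingConfig(num_workers=2),
        run_config=RunConfig(storage_path=storage_path, name=run_name),
    )
```

Calling `build_trainer("s3://my_bucket/", "unique_run_name").fit()` again after a failure, with the same two arguments, is what lets the training function pick up from the last checkpoint.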