-
Notifications
You must be signed in to change notification settings - Fork 7k
[train] Unblock get_all_reported_checkpoints if reporting only metrics
#58870
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…orted_checkpoints Signed-off-by: Timothy Seah <tseah@anyscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request addresses a bug where get_all_reported_checkpoints could hang if no worker reports a checkpoint. The fix in CheckpointManager correctly notifies waiting threads even when no checkpoint is provided, which is the right approach. The changes are supported by a new unit test that specifically covers this scenario, and another test was moved for better organization. My feedback includes suggestions to improve the new tests for better clarity and robustness by making the test logic more symmetric and adding explicit assertions.
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
justinvyu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you give a motivating example for this PR?
python/ray/train/v2/_internal/execution/checkpoint/checkpoint_manager.py
Outdated
Show resolved
Hide resolved
Added to PR description: |
get_all_reported_checkpoints if reporting only metrics
Signed-off-by: Timothy Seah <tseah@anyscale.com>
get_all_reported_checkpoints if reporting only metricsget_all_reported_checkpoints if reporting only metrics
justinvyu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
…ics (ray-project#58870) When reporting a checkpoint to Ray Train, every worker needs to form a barrier with a `ray.train.report` call. If every worker reports an empty checkpoint, we should notify the condition to unblock `ray.train.get_all_reported_checkpoint` calls. Before this fix, reporting an empty checkpoint and calling `get_all_reported_checkpoints` would result in a hang. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com>
…ics (ray-project#58870) When reporting a checkpoint to Ray Train, every worker needs to form a barrier with a `ray.train.report` call. If every worker reports an empty checkpoint, we should notify the condition to unblock `ray.train.get_all_reported_checkpoint` calls. Before this fix, reporting an empty checkpoint and calling `get_all_reported_checkpoints` would result in a hang. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com>
Summary
When reporting a checkpoint to Ray Train, every worker needs to form a barrier with a
ray.train.reportcall. If every worker reports an empty checkpoint, we should notify the condition to unblockray.train.get_all_reported_checkpointcalls.Before this fix, reporting an empty checkpoint and calling
get_all_reported_checkpointswould result in a hang.Testing
Unit tests