@TimothySeah TimothySeah commented Nov 21, 2025

Summary

When reporting a checkpoint to Ray Train, every worker needs to form a barrier with a ray.train.report call. If every worker reports an empty checkpoint, we should still notify the condition so that pending ray.train.get_all_reported_checkpoints calls are unblocked.

Before this fix, reporting an empty checkpoint and calling get_all_reported_checkpoints would result in a hang.
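
A minimal sketch of the pattern described above, assuming a condition variable guards a counter of processed reports. ToyCheckpointManager, register_report, and expected_reports are hypothetical names used for illustration only; this is not Ray Train's actual CheckpointManager code.

```python
import threading


class ToyCheckpointManager:
    """Hypothetical stand-in used to illustrate the hang; not Ray's implementation."""

    def __init__(self):
        self._cond = threading.Condition()
        self._num_reports_processed = 0
        self._checkpoints = []

    def register_report(self, checkpoint=None):
        """Record one completed report barrier; checkpoint may be None (metrics only)."""
        with self._cond:
            if checkpoint is not None:
                self._checkpoints.append(checkpoint)
            self._num_reports_processed += 1
            # Notify unconditionally. If this call sat inside the
            # `if checkpoint is not None` branch, a waiter that is already
            # blocked would never wake up after a metrics-only report.
            self._cond.notify_all()

    def get_all_reported_checkpoints(self, expected_reports, timeout=None):
        """Block until `expected_reports` reports were processed, then return checkpoints."""
        with self._cond:
            done = self._cond.wait_for(
                lambda: self._num_reports_processed >= expected_reports,
                timeout=timeout,
            )
            if not done:
                raise TimeoutError("reports were not processed in time")
            return list(self._checkpoints)
```

In this sketch, a metrics-only report still advances the counter and wakes any waiter, so the waiter returns an empty list instead of blocking indefinitely.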

Testing

Unit tests

…orted_checkpoints

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah requested a review from a team as a code owner November 21, 2025 01:39

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a bug where get_all_reported_checkpoints could hang if no worker reports a checkpoint. The fix in CheckpointManager correctly notifies waiting threads even when no checkpoint is provided, which is the right approach. The changes are supported by a new unit test that specifically covers this scenario, and another test was moved for better organization. My feedback includes suggestions to improve the new tests for better clarity and robustness by making the test logic more symmetric and adding explicit assertions.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@ray-gardener ray-gardener bot added the "train" (Ray Train Related Issue) label Nov 21, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>

@justinvyu justinvyu left a comment

Can you give a motivating example for this PR?

@TimothySeah
Contributor Author

Can you give a motivating example for this PR?

Added to PR description: Before this fix, reporting an empty checkpoint and calling get_all_reported_checkpoints would result in a hang.

@justinvyu justinvyu changed the title from "[train] When reporting empty checkpoint, warn and unblock get_all_reported_checkpoints" to "[train] Warn and unblock get_all_reported_checkpoints if reporting only metrics" Nov 24, 2025
@TimothySeah TimothySeah changed the title from "[train] Warn and unblock get_all_reported_checkpoints if reporting only metrics" to "[train] Unblock get_all_reported_checkpoints if reporting only metrics" Nov 24, 2025
@TimothySeah TimothySeah added the "go" (add ONLY when ready to merge, run all tests) label Nov 24, 2025

@justinvyu justinvyu left a comment

Thanks!

@justinvyu justinvyu merged commit 0c4f9d2 into ray-project:master Nov 26, 2025
6 checks passed
KaisennHu pushed a commit to KaisennHu/ray that referenced this pull request Nov 26, 2025
…ics (ray-project#58870)

SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
…ics (ray-project#58870)
