Skip to content

Conversation

@justinvyu
Copy link
Contributor

Description

Ray Train's framework agnostic collective utilities (ray.train.collective.barrier, ray.train.collective.broadcast_from_rank_zero) currently timeout after 30 minutes if not all ranks join the operation. ray.train.report uses these collective utilities internally, so users who don't call report on every rank can run into deadlocks. For example, the report barrier can deadlock with another worker waiting on others to join a backward pass collective.

This PR changes the default Ray Train collective behavior to never timeout and to only log warning messages about the missing ranks. User code typically already has timeouts such as NCCL timeouts (also 30 minutes by default), so the extra timeout here doesn't really help and increases the user burden of keeping track of environment variables to set when debugging hanging jobs.

New default: RAY_TRAIN_COLLECTIVE_TIMEOUT_S=-1

This PR also generalizes the environment variable name: RAY_TRAIN_REPORT_BARRIER -> RAY_TRAIN_COLLECTIVE.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu requested a review from a team as a code owner October 27, 2025 22:57
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly implements the plan to disable Ray Train collective utility timeouts by default, which should improve user experience by reducing deadlocks and configuration overhead. The renaming of environment variables from REPORT_BARRIER to a more general COLLECTIVE is a good improvement. The logic changes and the addition of a new test case to verify the behavior with no timeout are solid. I've added a couple of suggestions: one to fix a formatting issue in an error message and another to make a new test more explicit and robust.

@ray-gardener ray-gardener bot added the train Ray Train Related Issue label Oct 28, 2025
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu enabled auto-merge (squash) November 7, 2025 01:31
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Nov 7, 2025
@justinvyu justinvyu merged commit 959602f into ray-project:master Nov 7, 2025
7 of 8 checks passed
@justinvyu justinvyu deleted the remove_collective_timeout branch November 7, 2025 02:12
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
…project#58229)

Ray Train's framework agnostic collective utilities
(`ray.train.collective.barrier`,
`ray.train.collective.broadcast_from_rank_zero`) currently timeout after
30 minutes if not all ranks join the operation. `ray.train.report` uses
these collective utilities internally, so users who don't call report on
every rank can run into deadlocks. For example, the report barrier can
deadlock with another worker waiting on others to join a backward pass
collective.

This PR changes the default Ray Train collective behavior to never
timeout and to only log warning messages about the missing ranks. User
code typically already has timeouts such as NCCL timeouts (also 30
minutes by default), so the extra timeout here doesn't really help and
increases the user burden of keeping track of environment variables to
set when debugging hanging jobs.

New default: `RAY_TRAIN_COLLECTIVE_TIMEOUT_S=-1`

This PR also generalizes the environment variable name:
`RAY_TRAIN_REPORT_BARRIER` -> `RAY_TRAIN_COLLECTIVE`.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…project#58229)

Ray Train's framework agnostic collective utilities
(`ray.train.collective.barrier`,
`ray.train.collective.broadcast_from_rank_zero`) currently timeout after
30 minutes if not all ranks join the operation. `ray.train.report` uses
these collective utilities internally, so users who don't call report on
every rank can run into deadlocks. For example, the report barrier can
deadlock with another worker waiting on others to join a backward pass
collective.

This PR changes the default Ray Train collective behavior to never
timeout and to only log warning messages about the missing ranks. User
code typically already has timeouts such as NCCL timeouts (also 30
minutes by default), so the extra timeout here doesn't really help and
increases the user burden of keeping track of environment variables to
set when debugging hanging jobs.

New default: `RAY_TRAIN_COLLECTIVE_TIMEOUT_S=-1`

This PR also generalizes the environment variable name:
`RAY_TRAIN_REPORT_BARRIER` -> `RAY_TRAIN_COLLECTIVE`.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…project#58229)

Ray Train's framework agnostic collective utilities
(`ray.train.collective.barrier`,
`ray.train.collective.broadcast_from_rank_zero`) currently timeout after
30 minutes if not all ranks join the operation. `ray.train.report` uses
these collective utilities internally, so users who don't call report on
every rank can run into deadlocks. For example, the report barrier can
deadlock with another worker waiting on others to join a backward pass
collective.

This PR changes the default Ray Train collective behavior to never
timeout and to only log warning messages about the missing ranks. User
code typically already has timeouts such as NCCL timeouts (also 30
minutes by default), so the extra timeout here doesn't really help and
increases the user burden of keeping track of environment variables to
set when debugging hanging jobs.

New default: `RAY_TRAIN_COLLECTIVE_TIMEOUT_S=-1`

This PR also generalizes the environment variable name:
`RAY_TRAIN_REPORT_BARRIER` -> `RAY_TRAIN_COLLECTIVE`.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
…project#58229)

Ray Train's framework agnostic collective utilities
(`ray.train.collective.barrier`,
`ray.train.collective.broadcast_from_rank_zero`) currently timeout after
30 minutes if not all ranks join the operation. `ray.train.report` uses
these collective utilities internally, so users who don't call report on
every rank can run into deadlocks. For example, the report barrier can
deadlock with another worker waiting on others to join a backward pass
collective.

This PR changes the default Ray Train collective behavior to never
timeout and to only log warning messages about the missing ranks. User
code typically already has timeouts such as NCCL timeouts (also 30
minutes by default), so the extra timeout here doesn't really help and
increases the user burden of keeping track of environment variables to
set when debugging hanging jobs.

New default: `RAY_TRAIN_COLLECTIVE_TIMEOUT_S=-1`

This PR also generalizes the environment variable name:
`RAY_TRAIN_REPORT_BARRIER` -> `RAY_TRAIN_COLLECTIVE`.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: YK <1811651+ykdojo@users.noreply.github.com>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
…project#58229)

Ray Train's framework agnostic collective utilities
(`ray.train.collective.barrier`,
`ray.train.collective.broadcast_from_rank_zero`) currently timeout after
30 minutes if not all ranks join the operation. `ray.train.report` uses
these collective utilities internally, so users who don't call report on
every rank can run into deadlocks. For example, the report barrier can
deadlock with another worker waiting on others to join a backward pass
collective.

This PR changes the default Ray Train collective behavior to never
timeout and to only log warning messages about the missing ranks. User
code typically already has timeouts such as NCCL timeouts (also 30
minutes by default), so the extra timeout here doesn't really help and
increases the user burden of keeping track of environment variables to
set when debugging hanging jobs.

New default: `RAY_TRAIN_COLLECTIVE_TIMEOUT_S=-1`

This PR also generalizes the environment variable name:
`RAY_TRAIN_REPORT_BARRIER` -> `RAY_TRAIN_COLLECTIVE`.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
…project#58229)

Ray Train's framework agnostic collective utilities
(`ray.train.collective.barrier`,
`ray.train.collective.broadcast_from_rank_zero`) currently timeout after
30 minutes if not all ranks join the operation. `ray.train.report` uses
these collective utilities internally, so users who don't call report on
every rank can run into deadlocks. For example, the report barrier can
deadlock with another worker waiting on others to join a backward pass
collective.

This PR changes the default Ray Train collective behavior to never
timeout and to only log warning messages about the missing ranks. User
code typically already has timeouts such as NCCL timeouts (also 30
minutes by default), so the extra timeout here doesn't really help and
increases the user burden of keeping track of environment variables to
set when debugging hanging jobs.

New default: `RAY_TRAIN_COLLECTIVE_TIMEOUT_S=-1`

This PR also generalizes the environment variable name:
`RAY_TRAIN_REPORT_BARRIER` -> `RAY_TRAIN_COLLECTIVE`.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests train Ray Train Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants