@justinvyu

Summary

This PR fixes a race condition where an exception raised directly from the user target function doesn't get propagated to the TrainController, which results in the run finishing successfully when it shouldn't.

The fix is to join the monitor queue before considering the target function finished. This ensures that any outstanding exception is processed. Once is_running() returns False, thread_runner.get_error() always returns the final value.
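
To make the ordering guarantee concrete, here is a minimal sketch of the idea, not the actual Ray Train code. The class name `SketchThreadRunner`, the `_exc_queue` attribute, and the helper functions are hypothetical; only `is_running()` and `get_error()` mirror names used in this description. The training thread calls `queue.Queue.join()` before exiting, so it cannot be observed as finished until the monitor thread has recorded the result.

```python
import queue
import threading
import time


class SketchThreadRunner:
    """Hypothetical stand-in for the runner described above (not Ray's class)."""

    def __init__(self):
        self._exc = None
        self._lock = threading.Lock()
        self._exc_queue = queue.Queue()
        self._thread = None

    def run(self, target):
        def _monitor():
            # Take the single result item (an exception or None) and record it
            # under the lock, then mark the queue item as processed.
            exc = self._exc_queue.get()
            with self._lock:
                self._exc = exc
            self._exc_queue.task_done()

        def _run_target():
            try:
                target()
                self._exc_queue.put(None)
            except Exception as e:
                self._exc_queue.put(e)
            # Block until the monitor has called task_done(). Only after this
            # returns does the training thread exit.
            self._exc_queue.join()

        threading.Thread(target=_monitor, daemon=True).start()
        self._thread = threading.Thread(target=_run_target, daemon=True)
        self._thread.start()

    def is_running(self):
        # Liveness is derived from the thread itself rather than a manual flag.
        return self._thread is not None and self._thread.is_alive()

    def get_error(self):
        with self._lock:
            return self._exc


def failing_train_fn():
    raise ValueError("boom")


runner = SketchThreadRunner()
runner.run(failing_train_fn)
while runner.is_running():
    time.sleep(0.01)
error = runner.get_error()
```

In this sketch, a poller that sees `is_running()` return `False` is guaranteed to get the `ValueError` from `get_error()`, because the training thread only exits after the monitor has stored it.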

Problem

  1. The user target function which runs in the TrainingThread raises an error. This adds the error to a queue to be processed by the MonitorThread. The training thread exits and sets is_running = False.
  2. The controller polls this worker actor, which calls is_running() and get_error() from the main thread.
  3. get_error() acquires the lock, and then reads the error which is currently unset.
  4. The monitor thread also tries acquiring the lock, but loses the race and only updates the _exc attribute after the poll status call has finished. The controller sees (finished=True, error=None) and thinks that the run succeeded even though the worker errored.
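
The four steps above can be reproduced deterministically with a small sketch. This is an illustration under assumptions, not the pre-fix Ray code: `RacyRunner`, `poll_status()`, and the `monitor_gate` event are hypothetical names, and the event is used only to force the monitor thread to lose the race.

```python
import queue
import threading


class RacyRunner:
    """Hypothetical pre-fix runner: the training thread does not wait for the monitor."""

    def __init__(self, monitor_gate):
        self._exc = None
        self._lock = threading.Lock()
        self._exc_queue = queue.Queue()
        self._is_running = False
        self._monitor_gate = monitor_gate  # test hook that delays the monitor

    def run(self, target):
        def _monitor():
            exc = self._exc_queue.get()
            self._monitor_gate.wait()  # simulate losing the race (step 4)
            with self._lock:
                self._exc = exc

        def _run_target():
            try:
                target()
                self._exc_queue.put(None)
            except Exception as e:
                self._exc_queue.put(e)  # step 1: hand the error to the monitor
            self._is_running = False  # step 1: exit without waiting

        self._is_running = True
        self._monitor = threading.Thread(target=_monitor, daemon=True)
        self._target = threading.Thread(target=_run_target, daemon=True)
        self._monitor.start()
        self._target.start()

    def poll_status(self):
        # Steps 2 and 3: the poll reads both values from the calling thread.
        with self._lock:
            return (not self._is_running, self._exc)


def failing_train_fn():
    raise ValueError("boom")


gate = threading.Event()
runner = RacyRunner(gate)
runner.run(failing_train_fn)
runner._target.join()  # the training thread has exited

finished, error = runner.poll_status()  # the poll happens before the monitor updates

gate.set()  # now let the monitor record the exception
runner._monitor.join()
_, late_error = runner.poll_status()
```

Here the first poll observes `finished=True` with `error=None`, while the second poll, taken after the monitor runs, sees the `ValueError`. A controller that acts on the first result would treat the failed run as successful.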

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu requested a review from a team as a code owner October 7, 2025 00:42

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively resolves a race condition in the ThreadRunner where an exception in the user's target function could be missed, causing a failed run to appear successful. The fix, which involves joining the monitor thread before the target function's thread completes, is correct and robustly implemented. The removal of the manual _is_running flag in favor of thread.is_alive() simplifies the code and improves reliability. The accompanying tests are excellent, using threading.Event to deterministically reproduce the race condition and validate the fix. I have one minor suggestion to improve code clarity in the tests.

@ray-gardener ray-gardener bot added the train Ray Train Related Issue label Oct 7, 2025

@TimothySeah TimothySeah left a comment

Nice, thanks for the fix!

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: matthewdeng <matthew.j.deng@gmail.com>
@matthewdeng matthewdeng added go add ONLY when ready to merge, run all tests and removed go add ONLY when ready to merge, run all tests labels Oct 7, 2025
@justinvyu justinvyu enabled auto-merge (squash) October 7, 2025 19:00
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Oct 7, 2025
@justinvyu justinvyu merged commit 2ad6d83 into ray-project:master Oct 7, 2025
7 of 8 checks passed
liulehui pushed a commit to liulehui/ray that referenced this pull request Oct 9, 2025
…ct#57249)

This PR fixes a race condition where an exception raised directly from
the user target function doesn't get propagated to the
`TrainController`, which results in the run finishing successfully when
it shouldn't.

The fix is to join the monitor queue before considering the
target function finished. This ensures that any outstanding exception is
processed. If is_running=False, then `thread_runner.get_error()` always
returns the final value.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
joshkodi pushed a commit to joshkodi/ray that referenced this pull request Oct 13, 2025
Signed-off-by: Josh Kodi <joshkodi@gmail.com>
ArturNiederfahrenhorst pushed a commit to ArturNiederfahrenhorst/ray that referenced this pull request Oct 13, 2025
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
Signed-off-by: xgui <xgui@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
Signed-off-by: Future-Outlier <eric901201@gmail.com>