-
Notifications
You must be signed in to change notification settings - Fork 7.1k
[train] Fix exception queue race condition in ThreadRunner #57249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request effectively resolves a race condition in the ThreadRunner where an exception in the user's target function could be missed, causing a failed run to appear successful. The fix, which involves joining the monitor thread before the target function's thread completes, is correct and robustly implemented. The removal of the manual _is_running flag in favor of thread.is_alive() simplifies the code and improves reliability. The accompanying tests are excellent, using threading.Event to deterministically reproduce the race condition and validate the fix. I have one minor suggestion to improve code clarity in the tests.
TimothySeah
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks for the fix!
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: matthewdeng <matthew.j.deng@gmail.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…into fix_exc_queue_race
…ct#57249) This PR fixes a race condition where an exception raised directly from the user target function doesn't get propagated to the `TrainController`, which results in the run finishing successfully when it shouldn't. The fix is to join the monitor queue before before considering the target function finished. This ensures that any outstanding exception is processed. If is_running=False, then `thread_runner.get_error()` always returns the final value. --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: matthewdeng <matt@anyscale.com>
…ct#57249) This PR fixes a race condition where an exception raised directly from the user target function doesn't get propagated to the `TrainController`, which results in the run finishing successfully when it shouldn't. The fix is to join the monitor queue before before considering the target function finished. This ensures that any outstanding exception is processed. If is_running=False, then `thread_runner.get_error()` always returns the final value. --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: matthewdeng <matt@anyscale.com> Signed-off-by: Josh Kodi <joshkodi@gmail.com>
…ct#57249) This PR fixes a race condition where an exception raised directly from the user target function doesn't get propagated to the `TrainController`, which results in the run finishing successfully when it shouldn't. The fix is to join the monitor queue before before considering the target function finished. This ensures that any outstanding exception is processed. If is_running=False, then `thread_runner.get_error()` always returns the final value. --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: matthewdeng <matt@anyscale.com>
…ct#57249) This PR fixes a race condition where an exception raised directly from the user target function doesn't get propagated to the `TrainController`, which results in the run finishing successfully when it shouldn't. The fix is to join the monitor queue before before considering the target function finished. This ensures that any outstanding exception is processed. If is_running=False, then `thread_runner.get_error()` always returns the final value. --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: matthewdeng <matt@anyscale.com>
…ct#57249) This PR fixes a race condition where an exception raised directly from the user target function doesn't get propagated to the `TrainController`, which results in the run finishing successfully when it shouldn't. The fix is to join the monitor queue before before considering the target function finished. This ensures that any outstanding exception is processed. If is_running=False, then `thread_runner.get_error()` always returns the final value. --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: matthewdeng <matt@anyscale.com> Signed-off-by: xgui <xgui@anyscale.com>
…ct#57249) This PR fixes a race condition where an exception raised directly from the user target function doesn't get propagated to the `TrainController`, which results in the run finishing successfully when it shouldn't. The fix is to join the monitor queue before before considering the target function finished. This ensures that any outstanding exception is processed. If is_running=False, then `thread_runner.get_error()` always returns the final value. --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: matthewdeng <matt@anyscale.com>
…ct#57249) This PR fixes a race condition where an exception raised directly from the user target function doesn't get propagated to the `TrainController`, which results in the run finishing successfully when it shouldn't. The fix is to join the monitor queue before before considering the target function finished. This ensures that any outstanding exception is processed. If is_running=False, then `thread_runner.get_error()` always returns the final value. --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: matthewdeng <matt@anyscale.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
…ct#57249) This PR fixes a race condition where an exception raised directly from the user target function doesn't get propagated to the `TrainController`, which results in the run finishing successfully when it shouldn't. The fix is to join the monitor queue before before considering the target function finished. This ensures that any outstanding exception is processed. If is_running=False, then `thread_runner.get_error()` always returns the final value. --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: matthewdeng <matt@anyscale.com> Signed-off-by: Future-Outlier <eric901201@gmail.com>
Summary
This PR fixes a race condition where an exception raised directly from the user target function doesn't get propagated to the
TrainController, which results in the run finishing successfully when it shouldn't.The fix is to join the monitor queue before before considering the target function finished. This ensures that any outstanding exception is processed. If is_running=False, then
thread_runner.get_error()always returns the final value.Problem
is_running = False._excattribute after the poll status call has finished. The controller sees(finished=True, error=None)and thinks that the run succeeded even though the worker errored.