P2P shuffle restart may deadlock if no running workers exist #8088
Comments
There's another race hidden in this edge case that causes the new test to flake due to a problem with the transition logic:
This was referenced Aug 15, 2023
An edge case exists where attempting to restart a P2P shuffle may deadlock.
In this scenario, some non-running workers hold inputs to the shuffle, but no running workers remain when the shuffle is restarted. This causes a `processing` `shuffle-transfer` task to transition `processing -> released -> waiting -> waiting -> no-worker -> released`. The task then remains stuck in `released`, causing a deadlock.

There are two possible ways of fixing this:
- Recommend `waiting` when transitioning a task `no-worker -> released` if it has tasks or clients waiting on its results (similar to the `processing -> released` and `waiting -> released` transitions).

Reproducer:
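
Separately from the reproducer, the transition chain and the `no-worker -> released` fix described above can be illustrated with a toy state machine. This is only a sketch under stated assumptions: `ToyTask`, `recommend`, `release`, and the `fix_no_worker` flag are hypothetical names invented for illustration, the waiter key `"shuffle-barrier"` is a placeholder, and none of this is the scheduler's actual code. The model assumes a task parked in `no-worker` is picked up again once a worker starts running, while a task left in `released` with no further recommendation is never rescheduled.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class ToyTask:
    """Hypothetical stand-in for a scheduler task; not a distributed class."""
    key: str
    state: str
    waiters: Set[str] = field(default_factory=set)    # dependent tasks
    who_wants: Set[str] = field(default_factory=set)  # interested clients


def recommend(start: str, task: ToyTask, running_workers: int,
              fix_no_worker: bool) -> Optional[str]:
    """Return the next recommended state after a `start -> task.state` step."""
    wanted = bool(task.waiters or task.who_wants)
    if task.state == "released":
        if start in ("processing", "waiting"):
            # These transitions send a still-wanted task back to waiting.
            return "waiting" if wanted else None
        if start == "no-worker":
            # Assumed fix: do the same for no-worker -> released.
            return "waiting" if (fix_no_worker and wanted) else None
        return None
    if task.state == "waiting":
        # Toy rule: without running workers the task is parked in no-worker.
        return "processing" if running_workers > 0 else "no-worker"
    return None  # processing and no-worker recommend nothing in this model


def release(task: ToyTask, running_workers: int,
            fix_no_worker: bool) -> List[str]:
    """Release a task (as a shuffle restart would) and follow recommendations."""
    history = [task.state]
    target: Optional[str] = "released"
    while target is not None:
        start, task.state = task.state, target
        history.append(target)
        target = recommend(start, task, running_workers, fix_no_worker)
    return history


if __name__ == "__main__":
    task = ToyTask("shuffle-transfer", "processing", waiters={"shuffle-barrier"})
    print(release(task, running_workers=0, fix_no_worker=False))
    # A second release, while the task sits in no-worker, shows the difference.
    print(release(task, running_workers=0, fix_no_worker=False))
    task.state = "no-worker"
    print(release(task, running_workers=0, fix_no_worker=True))
```

In this toy model, the second release without the fix leaves the task in `released` with nothing left to recommend, which corresponds to the deadlock. With the fix enabled, the task cycles back through `waiting` and ends parked in `no-worker`, where the model assumes it can be rescheduled once a worker is running.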