[CELEBORN-2166] Fast fail reduce stage if shuffle data is lost because of worker lost #3496
…e of worker crash/lost
cxzl25
left a comment
LGTM
client/src/main/scala/org/apache/celeborn/client/commit/ReducePartitionCommitHandler.scala
RexXiong
left a comment
Thanks, LGTM
SteNicholas
left a comment
LGTM.
Merged to main (v0.7.0).
We can merge this PR into branch-0.6 as well; I will update the config version.
…e of worker lost

- Fix the WorkerStatusTracker logic so that unknown workers are marked correctly in excluded workers.
- Trigger shuffle data lost if the worker hosting the shuffle data is lost.

This can be extended to:
- fast-fail mapper stages as well, before the commit starts.
- the case of multiple worker losses with push replicate enabled.

Currently, even if a worker crashes or becomes unavailable for some reason and is marked as lost by the Master, the reduce stage still tries to read data from it and fails only after running for some time, which is inefficient. We can detect this early and fail the reduce stage with SHUFFLE_DATA_LOST before the stage starts.

NA

WIP

Closes #3496 from s0nskar/CELEBORN-2166.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
(cherry picked from commit 1157d6a)
Signed-off-by: Wang, Fei <fwang12@ebay.com>
Thanks, merged to 0.6.3 as well.
…stOnUnknownWorker.enabled version to 0.6.3

### What changes were proposed in this pull request?
Update the version of config celeborn.client.shuffleDataLostOnUnknownWorker.enabled to 0.6.3.

### Why are the changes needed?
Follow-up for #3496; it is better to merge it into branch-0.6 as well.

### Does this PR resolve a correctness bug?
No.

### Does this PR introduce _any_ user-facing change?
No, it has not been released yet.

### How was this patch tested?
GA.

Closes #3576 from turboFei/update_conf.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
…stOnUnknownWorker.enabled version to 0.6.3

### What changes were proposed in this pull request?
Update the version of config celeborn.client.shuffleDataLostOnUnknownWorker.enabled to 0.6.3.

### Why are the changes needed?
Follow-up for #3496; it is better to merge it into branch-0.6 as well.

### Does this PR resolve a correctness bug?
No.

### Does this PR introduce _any_ user-facing change?
No, it has not been released yet.

### How was this patch tested?
GA.

Closes #3576 from turboFei/update_conf.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
(cherry picked from commit 38532d7)
Signed-off-by: Wang, Fei <fwang12@ebay.com>
What changes were proposed in this pull request?
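- Fix the WorkerStatusTracker logic so that unknown workers are marked correctly in excluded workers.
- Trigger shuffle data lost if the worker hosting the shuffle data is lost.

This can be extended to:
- fast-fail mapper stages as well, before the commit starts.
- the case of multiple worker losses with push replicate enabled.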
Why are the changes needed?
Currently, even if a worker crashes or becomes unavailable for some reason and is marked as lost by the Master, the reduce stage still tries to read data from it and fails only after running for some time, which is inefficient. We can detect this early and fail the reduce stage with SHUFFLE_DATA_LOST before the stage starts.
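To make the idea concrete, here is a minimal, language-neutral sketch of the early check, written in Python rather than the project's Scala. It is not the code from this PR: `PartitionLocation`, `partitions_on_lost_workers`, `pre_reduce_check`, and the worker addresses are hypothetical names chosen for illustration, and the sketch assumes each partition has a single copy (no push replicate). The `enabled` flag loosely stands in for a switch like celeborn.client.shuffleDataLostOnUnknownWorker.enabled, whose exact semantics are not described here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class PartitionLocation:
    """Hypothetical record of where one shuffle partition is stored."""
    partition_id: int
    worker: str  # hypothetical "host:port" identifier of the hosting worker


def partitions_on_lost_workers(
    locations: Iterable[PartitionLocation], lost_workers: Set[str]
) -> List[int]:
    """Return the ids of partitions hosted on a worker reported as lost."""
    return sorted({loc.partition_id for loc in locations if loc.worker in lost_workers})


def pre_reduce_check(
    locations: Iterable[PartitionLocation],
    lost_workers: Set[str],
    enabled: bool = True,
) -> str:
    """Decide, before the reduce stage starts, whether shuffle data is already lost.

    Assumes no replication: one lost worker means the partition is unreadable.
    """
    if enabled and partitions_on_lost_workers(locations, lost_workers):
        return "SHUFFLE_DATA_LOST"
    return "OK"


locations = [
    PartitionLocation(0, "worker-a:9097"),
    PartitionLocation(1, "worker-b:9097"),
]
# worker-b is reported lost, so partition 1 cannot be read by any reducer.
print(pre_reduce_check(locations, lost_workers={"worker-b:9097"}))  # prints SHUFFLE_DATA_LOST
```

The point of checking up front is that the failure is reported before any reducer spends time retrying fetches against a worker that is already known to be gone.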
Does this PR introduce any user-facing change?
NA
How was this patch tested?
WIP