[CELEBORN-2166] Fast fail reduce stage if shuffle data is lost because of worker lost #3496

s0nskar · 2025-10-07T18:05:31Z

What changes were proposed in this pull request?

- Fix the WorkerStatusTracker logic, so unknown workers are marked correctly in excluded workers.
- Trigger shuffle data lost if the worker hosting the shuffle data is lost.

This can be extended to:

- fast fail mapper stages as well, before the commit starts.
- cover the case where push replication is enabled and multiple workers are lost.

Why are the changes needed?

Currently, even if a worker crashes or becomes unavailable for some reason and is marked as lost by the Master, the reduce stage still tries to read data from it and fails only after running for some time, which is inefficient. We can detect this early and fail the reduce stage with SHUFFLE_DATA_LOST before the stage starts.
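The check described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `Worker`, `WorkerState`, `FastFailSketch`, and the `featureEnabled` flag are hypothetical names, and replication (where a peer worker may still hold a copy of the data) is deliberately not modeled.

```scala
// Hypothetical model for illustration only; these are not Celeborn's actual classes.
final case class Worker(host: String, port: Int)

sealed trait WorkerState
case object Lost extends WorkerState    // reported as lost by the Master
case object Unknown extends WorkerState // no longer known to the Master

object FastFailSketch {
  /** Workers that host shuffle data but currently appear in the excluded set. */
  def lostHosts(
      hostingWorkers: Set[Worker],
      excluded: Map[Worker, WorkerState]): Set[Worker] =
    hostingWorkers.filter(excluded.contains)

  /**
   * True when the reduce stage should be failed up front with SHUFFLE_DATA_LOST,
   * i.e. the feature is on and at least one hosting worker is excluded.
   * Assumption: no replica exists, so losing one hosting worker loses data.
   */
  def shuffleDataLost(
      hostingWorkers: Set[Worker],
      excluded: Map[Worker, WorkerState],
      featureEnabled: Boolean): Boolean =
    featureEnabled && lostHosts(hostingWorkers, excluded).nonEmpty
}
```

The point of the sketch is that the decision needs only two inputs the client already tracks (which workers host the shuffle's data, and which workers are excluded), so it can be made before any reduce task attempts a fetch.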

Does this PR introduce any user-facing change?

NA

How was this patch tested?

WIP


cxzl25

LGTM

client/src/main/scala/org/apache/celeborn/client/commit/ReducePartitionCommitHandler.scala

client/src/main/scala/org/apache/celeborn/client/LifecycleManager.scala

RexXiong

Thanks, LGTM

SteNicholas

LGTM.

SteNicholas · 2025-10-29T02:04:54Z

Merged to main (v0.7.0).

turboFei · 2025-12-30T03:03:40Z

We can merge this PR into branch-0.6 as well; I will update the config version.

…e of worker lost

- Fix the WorkerStatusTracker logic, so unknown workers are marked correctly in excluded workers.
- Trigger shuffle data lost if the worker hosting the shuffle data is lost.

This can be extended to:

- fast fail mapper stages as well, before the commit starts.
- cover the case where push replication is enabled and multiple workers are lost.

Currently, even if a worker crashes or becomes unavailable for some reason and is marked as lost by the Master, the reduce stage still tries to read data from it and fails only after running for some time, which is inefficient. We can detect this early and fail the reduce stage with SHUFFLE_DATA_LOST before the stage starts.

NA

WIP

Closes #3496 from s0nskar/CELEBORN-2166.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
(cherry picked from commit 1157d6a)
Signed-off-by: Wang, Fei <fwang12@ebay.com>

turboFei · 2025-12-30T05:22:46Z

thanks, merged to 0.6.3 as well

…stOnUnknownWorker.enabled version to 0.6.3

### What changes were proposed in this pull request?

Update config celeborn.client.shuffleDataLostOnUnknownWorker.enabled version to 0.6.3

### Why are the changes needed?

Followup for #3496, it is better to merge into branch-0.6 as well.

### Does this PR resolve a correctness bug?

No.

### Does this PR introduce _any_ user-facing change?

No, it has not been released yet.

### How was this patch tested?

GA.

Closes #3576 from turboFei/update_conf.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>

…stOnUnknownWorker.enabled version to 0.6.3

### What changes were proposed in this pull request?

Update config celeborn.client.shuffleDataLostOnUnknownWorker.enabled version to 0.6.3

### Why are the changes needed?

Followup for #3496, it is better to merge into branch-0.6 as well.

### Does this PR resolve a correctness bug?

No.

### Does this PR introduce _any_ user-facing change?

No, it has not been released yet.

### How was this patch tested?

GA.

Closes #3576 from turboFei/update_conf.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
(cherry picked from commit 38532d7)
Signed-off-by: Wang, Fei <fwang12@ebay.com>
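For readers who want to try the flag named in the commit message, here is a minimal sketch of building the corresponding `spark-submit` arguments. It assumes the usual Celeborn convention that client configs are passed to Spark with a `spark.` prefix; `ShuffleDataLostConf` and `sparkConfArgs` are hypothetical helper names, and the flag's default value is not stated in this thread.

```scala
object ShuffleDataLostConf {
  // Config key exactly as named in the follow-up commit message.
  val Key = "celeborn.client.shuffleDataLostOnUnknownWorker.enabled"

  /**
   * Builds spark-submit arguments for the flag.
   * Assumption: Celeborn client configs are set on Spark with a "spark." prefix.
   */
  def sparkConfArgs(enabled: Boolean): Seq[String] =
    Seq("--conf", s"spark.$Key=$enabled")
}
```

Per the follow-up PR, the documented version of this config is 0.6.3, so releases on branch-0.6 before that would not be expected to recognize it.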

[CELEBORN-2166] Fast fail reduce stage if shuffle data is lost becaus…

09a33c7

…e of worker crash/lost

github-actions bot added the module:client label Oct 7, 2025

fixed code

1e14832

github-actions bot added kind:documentation module:common labels Oct 8, 2025

s0nskar marked this pull request as ready for review October 8, 2025 12:02

fix tests

84c75b8

cxzl25 approved these changes Oct 9, 2025


SteNicholas reviewed Oct 10, 2025


client/src/main/scala/org/apache/celeborn/client/commit/ReducePartitionCommitHandler.scala (resolved)

SteNicholas reviewed Oct 10, 2025


client/src/main/scala/org/apache/celeborn/client/LifecycleManager.scala (resolved)

SteNicholas requested a review from RexXiong October 20, 2025 02:07

RexXiong approved these changes Oct 24, 2025


SteNicholas approved these changes Oct 29, 2025


SteNicholas changed the title ~~[CELEBORN-2166] Fastfail reduce stage if shuffle data is lost because of worker lost~~ [CELEBORN-2166] Fast fail reduce stage if shuffle data is lost because of worker lost Oct 29, 2025

SteNicholas closed this in 1157d6a Oct 29, 2025

turboFei mentioned this pull request Dec 30, 2025

[CELEBORN-2166][FOLLOWUP] Update config celeborn.client.shuffleDataLostOnUnknownWorker.enabled version to 0.6.3 #3576

Closed
