[CELEBORN-2264] Support cancel shuffle when write bytes exceeds threshold #3601

Open
yew1eb wants to merge 4 commits into apache:main from yew1eb:CELEBORN_2264

Conversation

@yew1eb yew1eb commented Feb 11, 2026

What changes were proposed in this pull request?

This patch adds a configurable threshold check for shuffle write bytes.

Why are the changes needed?

A shuffle is canceled automatically once its written bytes exceed the threshold, which avoids exhausting cluster resources.

Does this PR resolve a correctness bug?

No

Does this PR introduce any user-facing change?

No

How was this patch tested?

CI and manual testing.

@yew1eb yew1eb marked this pull request as draft February 11, 2026 03:47
codecov bot commented Feb 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 66.91%. Comparing base (2dd1b7a) to head (d15feca).
⚠️ Report is 13 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3601      +/-   ##
==========================================
- Coverage   67.13%   66.91%   -0.22%     
==========================================
  Files         357      357              
  Lines       21860    21932      +72     
  Branches     1943     1949       +6     
==========================================
  Hits        14674    14674              
- Misses       6166     6244      +78     
+ Partials     1020     1014       -6     

@yew1eb yew1eb marked this pull request as ready for review February 11, 2026 07:21
@eolivelli eolivelli left a comment

I have left one question about protocol compatibility

Can you please add some unit tests ?

  bytesWrittenPerPartition: Array[Long],
- serdeVersion: SerdeVersion)
+ serdeVersion: SerdeVersion,
+ bytesWritten: Long)
IIUC, this patch changes the protocol.

How is this handled? Will a new client be compatible with an old server version, and vice versa?

It's not a big deal since MapperEnd is only utilized on the engine side.

@RexXiong RexXiong left a comment

Thanks. BTW, when the configuration is changed, you should also run:

UPDATE=1 build/mvn clean test -pl common -am -Dtest=none -DwildcardSuites=org.apache.celeborn.ConfigurationSuite

private val shuffleWriteLimitEnabled = conf.shuffleWriteLimitEnabled
private val shuffleWriteLimitThreshold = conf.shuffleWriteLimitThreshold
private val shuffleTotalWrittenBytes = JavaUtils.newConcurrentHashMap[Int, AtomicLong]()
The shuffleId-related data should be cleaned up when the shuffle expires.

Contributor Author

yes.
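
The cleanup requested above could look roughly like the following. This is a minimal sketch under stated assumptions, not the PR's code: `ShuffleWriteTracker`, `addWrittenBytes`, `removeShuffle`, and `trackedShuffles` are hypothetical names, a plain `ConcurrentHashMap` stands in for `JavaUtils.newConcurrentHashMap`, and the lambda passed to `computeIfAbsent` assumes Scala 2.12 or later.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

// Hypothetical helper, not part of Celeborn: tracks total written bytes per shuffle.
final class ShuffleWriteTracker(limitEnabled: Boolean, thresholdBytes: Long) {
  private val totalWrittenBytes = new ConcurrentHashMap[Int, AtomicLong]()

  /** Adds a finished mapper's bytes; returns true once the shuffle exceeds the threshold. */
  def addWrittenBytes(shuffleId: Int, bytesWritten: Long): Boolean =
    limitEnabled && {
      val counter = totalWrittenBytes.computeIfAbsent(shuffleId, _ => new AtomicLong(0L))
      counter.addAndGet(bytesWritten) > thresholdBytes
    }

  /** Drops the counter when the shuffle expires so the map does not grow without bound. */
  def removeShuffle(shuffleId: Int): Unit = {
    totalWrittenBytes.remove(shuffleId)
    ()
  }

  /** Number of shuffles that still hold a counter; useful for asserting cleanup in tests. */
  def trackedShuffles: Int = totalWrittenBytes.size()
}
```

Calling `removeShuffle` from whatever path already unregisters an expired shuffle would keep the counter map bounded by the number of live shuffles.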

crc32PerPartition = crc32PerPartition,
bytesWrittenPerPartition = bytesWrittenPerPartition)

if (mapperAttemptFinishedSuccess && shuffleWriteLimitEnabled) {
Should we consider adding a negative test case for this feature?
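
One possible shape for such negative cases is sketched below. `exceedsWriteLimit` is a hypothetical stand-in for the check guarded by `mapperAttemptFinishedSuccess && shuffleWriteLimitEnabled`; a real test would exercise the actual code path in the PR instead. The strict `>` comparison is an assumption based on the word "exceeds" in the PR title.

```scala
object WriteLimitNegativeCases {
  // Hypothetical stand-in for the guarded check in the diff above; not the PR's code.
  def exceedsWriteLimit(
      limitEnabled: Boolean,
      mapperSucceeded: Boolean,
      totalWrittenBytes: Long,
      thresholdBytes: Long): Boolean =
    limitEnabled && mapperSucceeded && totalWrittenBytes > thresholdBytes

  def main(args: Array[String]): Unit = {
    // Feature disabled: never cancel, even far above the threshold.
    assert(!exceedsWriteLimit(
      limitEnabled = false, mapperSucceeded = true,
      totalWrittenBytes = 10000L, thresholdBytes = 100L))

    // Mapper attempt did not finish successfully: its bytes must not trigger a cancel.
    assert(!exceedsWriteLimit(
      limitEnabled = true, mapperSucceeded = false,
      totalWrittenBytes = 10000L, thresholdBytes = 100L))

    // Exactly at the threshold: "exceeds" is read as strictly greater, so no cancel.
    assert(!exceedsWriteLimit(
      limitEnabled = true, mapperSucceeded = true,
      totalWrittenBytes = 100L, thresholdBytes = 100L))

    // Positive control: one byte over the threshold does cancel.
    assert(exceedsWriteLimit(
      limitEnabled = true, mapperSucceeded = true,
      totalWrittenBytes = 101L, thresholdBytes = 100L))
  }
}
```

The positive control at the end guards against a test that passes only because the check never fires.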

.createWithDefault(false)

val SHUFFLE_WRITE_LIMIT_ENABLED: ConfigEntry[Boolean] =
buildConf("celeborn.client.shuffle.write.limit.enabled")
It seems this only takes effect when using Spark, so it might be better to change the key to celeborn.client.spark.shuffle.write.limit.enabled.

Contributor Author

ok
