[SPARK-54312][CORE] Avoid repeatedly scheduling tasks for SendHeartbeat/WorkDirClean in standalone worker #53054

ivoson · 2025-11-14T01:55:20Z

What changes were proposed in this pull request?

Currently, the worker schedules the tasks that forward SendHeartbeat and WorkDirCleanup in handleRegisterResponse.

However, worker registration can happen multiple times, for example after a heartbeat timeout or a disconnect from the master. In these cases the tasks are scheduled multiple times.

To fix the issue:

- Add heartbeatTask and workDirCleanupTask to the worker to track whether these tasks have been scheduled.
- Initialize heartbeatTask and workDirCleanupTask on the first registration, and skip scheduling these tasks on later registrations.
- Cancel the tasks and reset heartbeatTask and workDirCleanupTask when the worker stops.
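The steps above can be sketched as follows. This is a minimal illustration and not the code from the patch: the class `GuardedScheduler`, its `onRegistered`, `isScheduled`, and `stop` methods, and the `scheduleCount` counter are hypothetical names, and the periodic task is a no-op standing in for the one that forwards SendHeartbeat. The `synchronized` calls are a choice made for this standalone sketch only.

```scala
import java.util.concurrent.{Executors, ScheduledFuture, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for the worker; NOT the real Worker class.
class GuardedScheduler(intervalMs: Long) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  // None: not scheduled yet. Some(handle): a periodic task already exists.
  private var heartbeatTask: Option[ScheduledFuture[_]] = None

  // Counts how many periodic tasks were actually created (illustration only).
  val scheduleCount = new AtomicInteger(0)

  // Called on every successful registration, including re-registrations.
  def onRegistered(): Unit = synchronized {
    if (heartbeatTask.isEmpty) {
      val tick = new Runnable {
        // The real task would forward SendHeartbeat; this one does nothing.
        override def run(): Unit = ()
      }
      heartbeatTask = Some(
        scheduler.scheduleAtFixedRate(tick, 0L, intervalMs, TimeUnit.MILLISECONDS))
      scheduleCount.incrementAndGet()
    }
  }

  def isScheduled: Boolean = synchronized { heartbeatTask.isDefined }

  // Cancel the periodic task and reset the handle when stopping.
  def stop(): Unit = synchronized {
    heartbeatTask.foreach(_.cancel(true))
    heartbeatTask = None
    scheduler.shutdownNow()
  }
}
```

In this sketch, calling `onRegistered()` several times creates the periodic task only once, and `stop()` cancels it and clears the handle.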

Why are the changes needed?

To stop the SendHeartbeat/WorkDirCleanup tasks from being scheduled again after each worker re-registration.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a unit test.

Was this patch authored or co-authored using generative AI tooling?

No

ivoson · 2025-11-14T02:10:42Z

cc @Ngone51 @LuciferYang @dongjoon-hyun, could you please take a look? Thanks.

Ngone51

Good catch! LGTM.

LuciferYang · 2025-11-17T09:54:54Z

core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala


  private var registerMasterFutures: Array[JFuture[_]] = null
  private var registrationRetryTimer: Option[JScheduledFuture[_]] = None
+  private[worker] var heartbeatTask: Option[JScheduledFuture[_]] = None


The private[worker] modifier seems to exist only so that the field can be accessed from test cases, right? Given that WorkerSuite already mixes in PrivateMethodTester, can we use invokePrivate for testing instead?

ivoson

Thanks, updated.

LuciferYang · 2025-11-17T09:58:58Z

core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala

    cleanupThreadExecutor.shutdownNow()
    metricsSystem.report()
    cancelLastRegistrationRetry()
+    heartbeatTask.foreach(_.cancel(true))


The body of handleRegisterResponse is a synchronized block. Don't the operations on heartbeatTask and workDirCleanupTask in onStop also require synchronized protection?

ivoson

The synchronized block was introduced by #9138 to avoid race conditions in a very early implementation that used asynchronous callbacks. It no longer looks necessary.

Ngone51

Worker is already a ThreadSafeRpcEndpoint, so the synchronized protection seems unnecessary today.

LuciferYang

OK, so can we remove the unnecessary synchronized in a separate PR?

ivoson

I will create a separate task to revisit the synchronized usage here.

dongjoon-hyun

+1, LGTM for Apache Spark 4.1.0. Thank you, @ivoson , @Ngone51 , @LuciferYang .
Merged to master/4.1.

…at/WorkDirClean in standalone worker

### What changes were proposed in this pull request?

Currently, the [worker](https://github.com/apache/spark/blob/87b3b94232436528f88c9a7aa7ee70758b85a33a/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L495) schedules the tasks that forward `SendHeartbeat` and `WorkDirCleanup` in `handleRegisterResponse`.

However, worker registration can happen multiple times, for example after a heartbeat timeout or a disconnect from the master. In these cases the tasks are scheduled multiple times.

To fix the issue:
- Add `heartbeatTask` and `workDirCleanupTask` to the worker to track whether these tasks have been scheduled.
- Initialize `heartbeatTask` and `workDirCleanupTask` on the first registration, and skip scheduling these tasks on later registrations.
- Cancel the tasks and reset `heartbeatTask` and `workDirCleanupTask` when the worker stops.

### Why are the changes needed?

To stop the SendHeartbeat/WorkDirCleanup tasks from being scheduled again after each worker re-registration.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a unit test.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #53054 from ivoson/duplicate-worker-heartbeat.

Authored-by: Tengfei Huang <tengfei.h@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit d51b433)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

Avoid repeatedly schedule tasks forwarding SendHeartbeat/WorkDirCleanup after registration

68872a4

github-actions bot added the CORE label Nov 14, 2025

ivoson marked this pull request as ready for review November 14, 2025 02:08

Ngone51 approved these changes Nov 17, 2025

LuciferYang reviewed Nov 17, 2025

ivoson added 2 commits November 18, 2025 05:45

address comments

8d935d7

Change access modifier of handleRegisterResponse method

f27b7f6

LuciferYang approved these changes Nov 18, 2025

dongjoon-hyun approved these changes Nov 20, 2025

dongjoon-hyun closed this in d51b433 Nov 20, 2025

ivoson deleted the duplicate-worker-heartbeat branch November 21, 2025 06:40
