# [SPARK-25231] Fix synchronization of executor heartbeat receiver in TaskSchedulerImpl (#22221)
```diff
@@ -19,7 +19,7 @@ package org.apache.spark.scheduler

 import java.nio.ByteBuffer
 import java.util.{Locale, Timer, TimerTask}
-import java.util.concurrent.TimeUnit
+import java.util.concurrent.{ConcurrentHashMap, TimeUnit}
 import java.util.concurrent.atomic.AtomicLong

 import scala.collection.Set
```
|
```diff
@@ -91,7 +91,7 @@ private[spark] class TaskSchedulerImpl(
   private val taskSetsByStageIdAndAttempt = new HashMap[Int, HashMap[Int, TaskSetManager]]

   // Protected by `this`
-  private[scheduler] val taskIdToTaskSetManager = new HashMap[Long, TaskSetManager]
+  private[scheduler] val taskIdToTaskSetManager = new ConcurrentHashMap[Long, TaskSetManager]
   val taskIdToExecutorId = new HashMap[Long, String]

   @volatile private var hasReceivedTask = false
```
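The core of the change is swapping the plain `HashMap` for a `ConcurrentHashMap`, so the heartbeat receiver can read the map without holding the `TaskSchedulerImpl` lock. A minimal sketch in plain Java (`java.util.concurrent.ConcurrentHashMap` is the same JDK class the Scala code uses; the map contents here are illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentMapSketch {
    public static void main(String[] args) {
        // Per-operation thread safety: put/get/remove are each individually
        // atomic, so a reader thread needs no external lock.
        ConcurrentHashMap<Long, String> taskIdToManager = new ConcurrentHashMap<>();
        taskIdToManager.put(1L, "taskSetA");

        // Unlike Scala's mutable.HashMap.get (which returns an Option),
        // ConcurrentHashMap.get returns null when the key is absent.
        System.out.println(taskIdToManager.get(1L)); // taskSetA
        System.out.println(taskIdToManager.get(2L)); // null
    }
}
```

This null-on-miss behavior is why the call sites below wrap lookups in `Option(...)`.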
|
```diff
@@ -315,7 +315,7 @@ private[spark] class TaskSchedulerImpl(
         for (task <- taskSet.resourceOffer(execId, host, maxLocality)) {
           tasks(i) += task
           val tid = task.taskId
-          taskIdToTaskSetManager(tid) = taskSet
+          taskIdToTaskSetManager.put(tid, taskSet)
           taskIdToExecutorId(tid) = execId
           executorIdToRunningTaskIds(execId).add(tid)
           availableCpus(i) -= CPUS_PER_TASK
```
|
```diff
@@ -465,7 +465,7 @@ private[spark] class TaskSchedulerImpl(
     var reason: Option[ExecutorLossReason] = None
     synchronized {
       try {
-        taskIdToTaskSetManager.get(tid) match {
+        Option(taskIdToTaskSetManager.get(tid)) match {
           case Some(taskSet) =>
             if (state == TaskState.LOST) {
               // TaskState.LOST is only used by the deprecated Mesos fine-grained scheduling mode,
```
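Scala's `Option.apply` turns the `null` returned by `ConcurrentHashMap.get` into `None`, so the existing `match` keeps working. The Java equivalent of this wrapping pattern is `Optional.ofNullable` (sketch only; the map and keys are illustrative):

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public class OptionWrapSketch {
    public static void main(String[] args) {
        ConcurrentHashMap<Long, String> taskIdToManager = new ConcurrentHashMap<>();
        taskIdToManager.put(42L, "taskSetA");

        // Equivalent of Scala's Option(map.get(tid)): an absent key becomes
        // an empty Optional instead of a raw null that could NPE downstream.
        Optional<String> present = Optional.ofNullable(taskIdToManager.get(42L));
        Optional<String> absent  = Optional.ofNullable(taskIdToManager.get(7L));

        System.out.println(present.isPresent()); // true
        System.out.println(absent.isPresent());  // false
    }
}
```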
|
```diff
@@ -517,10 +517,10 @@ private[spark] class TaskSchedulerImpl(
       accumUpdates: Array[(Long, Seq[AccumulatorV2[_, _]])],
       blockManagerId: BlockManagerId): Boolean = {
     // (taskId, stageId, stageAttemptId, accumUpdates)
-    val accumUpdatesWithTaskIds: Array[(Long, Int, Int, Seq[AccumulableInfo])] = synchronized {
+    val accumUpdatesWithTaskIds: Array[(Long, Int, Int, Seq[AccumulableInfo])] = {
       accumUpdates.flatMap { case (id, updates) =>
         val accInfos = updates.map(acc => acc.toInfo(Some(acc.value), None))
-        taskIdToTaskSetManager.get(id).map { taskSetMgr =>
+        Option(taskIdToTaskSetManager.get(id)).map { taskSetMgr =>
           (id, taskSetMgr.stageId, taskSetMgr.taskSet.stageAttemptId, accInfos)
         }
       }
```

**Member:** Just leaving a small concern here: the original code locked the whole scope of ids in

**Contributor:** I agree this could happen, but it shouldn't cause issues: before this change, the executor could have been removed right before this function was called (it's all timing dependent), so this does not change that behavior. This code only updates accumulators for running tasks. If the tasks had finished, the accumulator updates would have been processed via the task end events.
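The `flatMap` in `accumUpdatesWithTaskIds` is what makes the lock-free read safe: an update whose task id has already been cleaned up simply contributes nothing. A plain-Java sketch of that shape (the map, task ids, and stage ids are hypothetical stand-ins):

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class HeartbeatSketch {
    public static void main(String[] args) {
        // Stand-in for taskIdToTaskSetManager; values are stage ids.
        ConcurrentHashMap<Long, Integer> taskIdToStage = new ConcurrentHashMap<>();
        taskIdToStage.put(1L, 10);
        // Task 2 has already been cleaned up: no entry.

        // Mirrors the flatMap in accumUpdatesWithTaskIds: entries whose task
        // id is gone are silently dropped, so no lock is needed for the read.
        List<String> updates = Stream.of(1L, 2L)
            .flatMap(tid -> Optional.ofNullable(taskIdToStage.get(tid))
                .map(stage -> Stream.of("task " + tid + " -> stage " + stage))
                .orElseGet(Stream::empty))
            .collect(Collectors.toList());

        System.out.println(updates); // [task 1 -> stage 10]
    }
}
```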
Good catch. I like this direction.

I have a question about the change of semantics. By removing the `synchronized` at `accumUpdatesWithTaskIds()`, the pair of operations `get()` and `remove()` in `cleanupTaskState()` is no longer atomic with respect to this `get`. Is this change OK?
I think your question is about the `get()` and `remove()` inside `cleanupTaskState` both being inside a single `synchronized` block, correct?

I don't see that as a problem here, since `taskIdToTaskSetManager` is a `ConcurrentHashMap`. Each of its operations is individually atomic, and calling `remove` on a key that isn't there does nothing. There is no other code that removes entries from the map, so I don't think that can happen anyway. With this change, the only thing outside a `synchronized` block is the `get` in `accumUpdatesWithTaskIds`, which is harmless if the entry has already been removed.
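The "remove on a missing key does nothing" point can be checked directly against the JDK class; `ConcurrentHashMap.remove` returns the previous value, or `null` if there was none, and never throws (sketch with illustrative contents):

```java
import java.util.concurrent.ConcurrentHashMap;

public class RemoveRaceSketch {
    public static void main(String[] args) {
        ConcurrentHashMap<Long, String> m = new ConcurrentHashMap<>();
        m.put(1L, "taskSetA");

        // Each call is individually atomic; removing a key twice (or a key
        // another thread already removed) is harmless and returns null.
        System.out.println(m.remove(1L)); // taskSetA
        System.out.println(m.remove(1L)); // null
    }
}
```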
`ConcurrentHashMap` makes each individual operation, like `get()`, `remove()`, and others, atomic. So I reviewed the places where more than one operation sits inside a single `synchronized` block, and this is that place. When we apply this PR, the `get` in `accumUpdatesWithTaskIds` can be executed between the `get()` and `remove()` in `cleanupTaskState()`. My question is just a confirmation of whether that is safe.
Yes, as far as I can see it's safe. If the `get` happens before the entry is removed, it calculates the accumulators; if it happens after, it just gets an empty array back. That isn't any different from when it was synchronized. Nothing in `statusUpdate` between the `get` and the call to `cleanupTaskState` (where the entry is removed) depends on accumulators or anything else, as far as I can see.
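Both interleavings described above can be sketched concretely: a read before the cleanup's `remove` yields the manager (accumulators get computed), and a read after yields nothing (the update is dropped, as if the task had already ended). A plain-Java illustration with hypothetical contents:

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public class InterleavingSketch {
    public static void main(String[] args) {
        ConcurrentHashMap<Long, String> m = new ConcurrentHashMap<>();
        m.put(1L, "taskSetA");

        // Interleaving 1: the heartbeat's get runs before cleanup's remove.
        Optional<String> before = Optional.ofNullable(m.get(1L));
        m.remove(1L); // stands in for cleanupTaskState
        // Interleaving 2: the heartbeat's get runs after the remove.
        Optional<String> after = Optional.ofNullable(m.get(1L));

        System.out.println(before.isPresent()); // true  -> accumulators computed
        System.out.println(after.isPresent());  // false -> update silently dropped
    }
}
```

Either outcome is one that could also occur under the old `synchronized` version, just with different timing.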