[SPARK-27112] : Spark Scheduler encounters two independent Deadlocks … #24035
Conversation
…when trying to kill executors either due to dynamic allocation or blacklisting
There are two deadlocks as a result of the interplay between three different threads:
- task-result-getter thread
- spark-dynamic-executor-allocation thread
- dispatcher-event-loop thread (makeOffers())
The fix for the deadlock between the dynamic allocation thread and the result getter thread involves moving the call to isExecutorBusy outside of the lock on CoarseGrainedSchedulerBackend.this. The fix for the deadlock between the event loop thread and the result getter thread involves removing the synchronized on CoarseGrainedSchedulerBackend.this; that synchronization has been replaced by a dummy lock to ensure synchronization between the dynamic allocation thread and the event loop thread.
ok to test
Test build #103238 has finished for PR 24035 at commit
@pgandhi999 I think the InvalidUseOfMatchersException suggests that an any() should be added to the killExecutors calls in a couple of the failing suites, since you added an argument. The verify calls also expect the blacklistingOnTaskCompletion argument. Note that in ExecutorAllocationManagerSuite we need the extra argument on the when(killExecutor...) calls; it looks like Mockito doesn't like it when the default arguments are not specified, but I am not 100% sure on this.
The other failure is a bug in killExecutors, commented on below.
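To illustrate the Mockito matcher point, here is a self-contained, hypothetical sketch; the trait and signature below are stand-ins and may not match the real killExecutors in this PR. Once one matcher is used, every argument (including the new flag) must be a matcher, otherwise Mockito throws InvalidUseOfMatchersException:

import org.mockito.ArgumentMatchers.{any, anyBoolean}
import org.mockito.Mockito.{mock, verify, when}

object MatcherExample {
  // Hypothetical stand-in for the backend under test.
  trait FakeBackend {
    def killExecutors(ids: Seq[String], adjustTargetNumExecutors: Boolean,
                      countFailures: Boolean, force: Boolean,
                      blacklistingOnTaskCompletion: Boolean): Seq[String]
  }

  def main(args: Array[String]): Unit = {
    val backend = mock(classOf[FakeBackend])
    // Every argument needs a matcher once any() is used for one of them;
    // Mockito does not fill in Scala default arguments for us.
    when(backend.killExecutors(any(), anyBoolean(), anyBoolean(), anyBoolean(), anyBoolean()))
      .thenReturn(Seq.empty[String])
    backend.killExecutors(Seq("1"), adjustTargetNumExecutors = true,
      countFailures = false, force = false, blacklistingOnTaskCompletion = false)
    verify(backend).killExecutors(any(), anyBoolean(), anyBoolean(), anyBoolean(), anyBoolean())
  }
}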
numPendingExecutors += executorsToKill.size
Future.successful(true)
var response: Future[Seq[String]] = null
val idleExecutorIds = executorIds.filter { id => !scheduler.isExecutorBusy(id) }
This should be force || !scheduler.isExecutorBusy(id). That's the test failure "org.apache.spark.deploy.StandaloneDynamicAllocationSuite.disable force kill for busy executors (SPARK-9552)"
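A tiny REPL-style sketch of the suggested fix (the helper name is made up for illustration): when force is true, busy executors must not be filtered out.

// Hypothetical helper mirroring the filter in killExecutors.
def selectKillable(executorIds: Seq[String], force: Boolean,
                   isExecutorBusy: String => Boolean): Seq[String] =
  executorIds.filter { id => force || !isExecutorBusy(id) }

val busy = Set("2")
assert(selectKillable(Seq("1", "2"), force = false, busy) == Seq("1"))
// With force = true the busy executor "2" is still selected for killing.
assert(selectKillable(Seq("1", "2"), force = true, busy) == Seq("1", "2"))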
Yes, I have fixed it. Thank you @abellina.
attilapiros left a comment
Could you please provide the stack traces in text format as well? I would like to open them in my IDE to follow the call stacks.
private val reviveThread =
  ThreadUtils.newDaemonSingleThreadScheduledExecutor("driver-revive-thread")

// YSPARK-1163: This lock is explicitly added here to keep the changes introduced by SPARK-19757
NIT, extra Y prefix: "YSPARK-1163" => "SPARK-1163"
And also at the end of this comment block.
Yep took care of that. Thank you.
Future.successful(true)
var response: Future[Seq[String]] = null
val idleExecutorIds = executorIds.filter { id => !scheduler.isExecutorBusy(id) }
if (!blacklistingOnTaskCompletion)
What would be the performance cost of always using makeOffersLock (and dropping the flag blacklistingOnTaskCompletion)? This code is already quite complex, and with the boolean-flag-dependent locking I think it will be even harder to follow.
I agree that the code fix is a little tricky here; however, as far as I have tested, I have not seen any performance degradation in job running time from adding the extra lock.
Sorry, there must be some misunderstanding here.
My suggestion is to remove this if condition completely and always use:
makeOffersLock.synchronized {
  response = synchronized {
    killExecutorsImpl(idleExecutorIds, adjustTargetNumExecutors, countFailures, force)
  }
}
And as you got rid of the if, you can remove blacklistingOnTaskCompletion from the method's arguments as well.
As the locking order then always starts with makeOffersLock, I think this should be enough to avoid the deadlock.
The flag blacklistingOnTaskCompletion is needed to ensure that the "task-result-getter-x" thread does not try to acquire makeOffersLock, which is a necessary condition to avoid the deadlock between the "task-result-getter" thread and the "dispatcher-event-loop" thread.
The reason is that when the "task-result-getter" thread reaches the method killExecutors(), it has already acquired the lock on TaskSchedulerImpl and would try to acquire makeOffersLock. The "dispatcher-event-loop" thread, on the other hand, acquires makeOffersLock and then waits to acquire the TaskSchedulerImpl lock in the method resourceOffers(), thus leading to the deadlock.
Ok I see. I checked the first deadlock and I think the problem is in org.apache.spark.scheduler.TaskSchedulerImpl#isExecutorBusy:
spark/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
Lines 824 to 826 in b154233
def isExecutorBusy(execId: String): Boolean = synchronized {
  executorIdToRunningTaskIds.get(execId).exists(_.nonEmpty)
}
That synchronised is too restrictive here for reading a snapshot state of the executorIdToRunningTaskIds map. For this problem a solution could be just using TrieMap, which is "A concurrent hash-trie or TrieMap is a concurrent thread-safe lock-free implementation of a hash array mapped trie".
If you change the type of executorIdToRunningTaskIds from HashMap to TrieMap then you can remove the synchronised from isExecutorBusy.
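A minimal, hypothetical sketch of that change (the class below is a stand-in, not the actual TaskSchedulerImpl):

import scala.collection.concurrent.TrieMap
import scala.collection.mutable.HashSet

// With a lock-free TrieMap, isExecutorBusy can read a snapshot without
// taking the TaskSchedulerImpl lock.
class BusyExecutorTracker {
  private val executorIdToRunningTaskIds = TrieMap[String, HashSet[Long]]()

  // No `synchronized` needed for this read-only snapshot check.
  def isExecutorBusy(execId: String): Boolean =
    executorIdToRunningTaskIds.get(execId).exists(_.nonEmpty)

  // Writers still mutate the per-executor HashSet, so the existing
  // TaskSchedulerImpl synchronization would be kept for the write paths.
  def addRunningTask(execId: String, taskId: Long): Unit =
    executorIdToRunningTaskIds.getOrElseUpdate(execId, HashSet.empty[Long]) += taskId
}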
I have checked, and isExecutorBusy is only used from two places:
- org.apache.spark.scheduler.TaskSchedulerImpl#resourceOffers, where we are already in a synchronised block, so with the type change the behaviour is the same as before
- org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend#killExecutors where we already lived with a snapshot state which could be outdated after the method call
Regarding the second deadlock, I will continue my analysis.
Yes, I just focused on saving the extra lock first.
But we could keep track of the executor IDs where tasks are scheduled/running separately in a concurrently accessible set (a volatile reference to an immutable Set, or a CopyOnWriteArraySet).
The method isExecutorBusy could use this new set. So we can keep the HashMap for executorIdToRunningTaskIds and still avoid introducing that lock.
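For example, a hypothetical sketch of that alternative (names made up; not the actual TaskSchedulerImpl code):

import java.util.concurrent.CopyOnWriteArraySet

// Keep the existing HashMap for executorIdToRunningTaskIds, but mirror the
// set of busy executor IDs in a concurrently readable structure so that
// isExecutorBusy never needs the TaskSchedulerImpl lock.
class BusyExecutorSet {
  private val busyExecutors = new CopyOnWriteArraySet[String]()

  // Called under the existing TaskSchedulerImpl lock when an executor gains
  // its first running task or loses its last one.
  def markBusy(execId: String): Unit = busyExecutors.add(execId)
  def markIdle(execId: String): Unit = busyExecutors.remove(execId)

  // Lock-free read, safe to call from the dynamic-allocation thread.
  def isExecutorBusy(execId: String): Boolean = busyExecutors.contains(execId)
}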
I don't think that makeOffersLock solves the deadlock here. You won't get a deadlock between the same two locks, but now it can happen with makeOffersLock instead. Consider this sequence (some simplification of the full call stack, but showing the important locks at least):
- taskresultgetter: handleFailedTask --> lock on taskSchedulerImpl
- taskresultgetter: BlacklistTracker.killExecutor
- dispatcher: receive --> lock on CoarseGrainedSchedulerBackend
- dispatcher: makeOffers --> lock on makeOffersLock
- dispatcher: blocked on TaskSchedulerImpl lock
- taskResultGetter: makeOffers, but blocked on makeOffersLock
As Attila suggested, I would consider creating an ordering between the TaskSchedulerImpl lock and the CoarseGrainedSchedulerBackend lock, so that we always get the TaskSchedulerImpl lock first. Of course that comes with a performance penalty, and we will have to audit all other uses of the CoarseGrainedSchedulerBackend lock too.
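To make the ordering idea concrete, here is a generic, hypothetical sketch (the lock names are placeholders, not the actual Spark classes): as long as every path that needs both locks acquires them in the same order, the cycle above cannot form.

object LockOrderingSketch {
  private val taskSchedulerLock = new Object  // plays the role of the TaskSchedulerImpl lock
  private val backendLock = new Object        // plays the role of the CoarseGrainedSchedulerBackend lock

  // Both paths take taskSchedulerLock first, then backendLock, so neither
  // thread can end up holding one lock while waiting for the other.
  def killExecutorsPath(): Unit = taskSchedulerLock.synchronized {
    backendLock.synchronized {
      // ... adjust executor bookkeeping ...
    }
  }

  def makeOffersPath(): Unit = taskSchedulerLock.synchronized {
    backendLock.synchronized {
      // ... build resource offers and launch tasks ...
    }
  }
}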
Still thinking about any other options ...
@squito I agree with you and @attilapiros about creating an ordering. I shall definitely follow the approach and try it out.
Regarding your comment on the deadlock between makeOffersLock and the task-result-getter thread, that should ideally not happen, as the task-result-getter thread will never compete for makeOffersLock. The reason I have added the flag blacklistingOnTaskCompletion is to ensure that the task-result-getter thread never acquires makeOffersLock.
Also, you are right in saying that makeOffersLock does not solve the deadlock. I have explained the purpose of makeOffersLock in my comment below. Quoting it here:
I can explain more about the makeOffersLock here.
PR #17091 introduced the part about acquiring synchronization on the CoarseGrainedSchedulerBackend object in the method makeOffers(). This particular piece of code introduced a deadlock between the task-result-getter thread and the dispatcher-event-loop thread. I could simply remove the synchronized statement in makeOffers() and the deadlock would be resolved; we really do not need makeOffersLock for that.
However, removing the synchronized statement will once again expose the race condition described in JIRA https://issues.apache.org/jira/browse/SPARK-19757 for which the fix in the corresponding PR was merged. makeOffersLock here serves as the solution to the above problem. By synchronizing on makeOffersLock, the race condition between the dynamic-executor-allocation thread and the dispatcher-event-loop thread is avoided. That is indeed its sole purpose. I am, however, open to discussing and working on better solutions to the above problem, if any. Thank you.
I am basically trying to solve the two deadlocks and also fix the race condition issue for SPARK-19757. I think the approach of lock ordering along with using a concurrently accessible separate Set, as suggested by @attilapiros and @squito, should work out. Let me work on that and get back to you.
ah I see, sorry I misread part of the logic, thanks
Test build #103333 has finished for PR 24035 at commit
@attilapiros Have attached the stack traces in text format here:
Deadlock between task-result-getter-thread and spark-dynamic-executor-allocation thread:
Deadlock between task-result-getter-thread and dispatcher-event-loop thread:
Test build #103335 has finished for PR 24035 at commit
squito left a comment
haven't fully digested this yet, but I do at least agree about the bug, thanks for reporting.
I don't understand the use of makeOffersLock yet, but I will give it a more careful read later
@squito Thank you for your response. I can explain more about the makeOffersLock here.
PR #17091 introduced the part about acquiring synchronization on the CoarseGrainedSchedulerBackend object in the method makeOffers(). This particular piece of code introduced a deadlock between the task-result-getter thread and the dispatcher-event-loop thread. I could simply remove the synchronized statement in makeOffers() and the deadlock would be resolved; we really do not need makeOffersLock for that.
However, removing the synchronized statement will once again expose the race condition described in JIRA https://issues.apache.org/jira/browse/SPARK-19757 for which the fix in the corresponding PR was merged. makeOffersLock here serves as the solution to that problem.
I think if we fix the lock ordering for the involved threads, this will solve the issue. The current order in which locks are being acquired for individual threads is:
TaskResultGetter Order:
DispatcherEventLoop Order:
SparkDynamicExecutorAllocation Order:
Solution:
These 2 changes should align the ordering sequence and seem to be simple to reason about. I think this should solve the issue, but it would be good to have more contributors eyeball this change.
@squito @attilapiros @abellina @dhruve @vanzin Thank you for all your suggestions and comments. Have combined @squito, @attilapiros, and @dhruve's suggestions into another PR: #24072. That PR uses ordering on lock acquisition and fixes the deadlock issue. It is a lot simpler to reason about than the current PR. Request you to have a look and let me know so that I can close this PR. Thank you.
I guess we can close this one?
…when trying to kill executors either due to dynamic allocation or blacklisting
Recently, a few Spark users in the organization have reported that their jobs were getting stuck. On further analysis, it was found that there exist two independent deadlocks, each of which occurs under different circumstances. The screenshots for these two deadlocks are attached here.
We were able to reproduce the deadlocks with the following piece of code:
Additionally, to ensure that the deadlock surfaces up soon enough, I also added a small delay in the Spark code here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala#L256
Also make sure that the following configs are set when launching the above spark job:
spark.blacklist.enabled=true
spark.blacklist.killBlacklistedExecutors=true
spark.blacklist.application.maxFailedTasksPerExecutor=1
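Equivalently, the same configs can be set programmatically (a hypothetical, REPL-style snippet; dynamic allocation settings would presumably be needed as well to reproduce the dynamic-allocation variant of the deadlock):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")
  .set("spark.blacklist.killBlacklistedExecutors", "true")
  .set("spark.blacklist.application.maxFailedTasksPerExecutor", "1")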
Screenshots for deadlock between task-result-getter-thread and spark-dynamic-executor-allocation thread:

Screenshots for deadlock between task-result-getter-thread and dispatcher-event-loop thread:
What changes were proposed in this pull request?
There are two deadlocks as a result of the interplay between three different threads:
- task-result-getter thread
- spark-dynamic-executor-allocation thread
- dispatcher-event-loop thread (makeOffers())
The fix for the deadlock between the dynamic allocation thread and the result getter thread involves moving the call to the method isExecutorBusy outside of the lock on CoarseGrainedSchedulerBackend.this.
The fix for the deadlock between the event loop thread and the result getter thread involves removing the synchronized on CoarseGrainedSchedulerBackend.this. That synchronization has been replaced by a dummy lock which keeps the dynamic allocation thread and the event loop thread synchronized, preserving the fix for https://issues.apache.org/jira/browse/SPARK-19757 while avoiding the deadlock scenario introduced by the code changes for that JIRA.
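Roughly, a sketch assembled from the snippets quoted in the review comments above (not the exact diff; scheduler and killExecutorsImpl belong to the surrounding CoarseGrainedSchedulerBackend class, and the else branch is an assumption based on the discussion of the blacklistingOnTaskCompletion flag):

import scala.concurrent.Future

def killExecutors(executorIds: Seq[String], adjustTargetNumExecutors: Boolean,
                  countFailures: Boolean, force: Boolean,
                  blacklistingOnTaskCompletion: Boolean): Future[Seq[String]] = {
  // The busy check runs before any CoarseGrainedSchedulerBackend lock is
  // taken, so this thread never waits on the TaskSchedulerImpl lock while
  // holding the backend's lock.
  val idleExecutorIds = executorIds.filter { id => force || !scheduler.isExecutorBusy(id) }
  var response: Future[Seq[String]] = null
  if (!blacklistingOnTaskCompletion) {
    // Dynamic-allocation path: take the dedicated makeOffersLock first, then
    // the backend lock, mirroring the order used by makeOffers().
    makeOffersLock.synchronized {
      response = synchronized {
        killExecutorsImpl(idleExecutorIds, adjustTargetNumExecutors, countFailures, force)
      }
    }
  } else {
    // task-result-getter path: never touch makeOffersLock here, because that
    // thread already holds the TaskSchedulerImpl lock.
    response = synchronized {
      killExecutorsImpl(idleExecutorIds, adjustTargetNumExecutors, countFailures, force)
    }
  }
  response
}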
How was this patch tested?
The code used to reproduce the deadlock issue is documented above.