[SPARK-27112]: Spark Scheduler encounters two independent Deadlocks … #24035
Closed
What would be the performance cost of always using makeOffersLock (and removing the flag blacklistingOnTaskCompletion)? This code is already quite complex, and with the boolean-flag-dependent locking I think it will be even harder to follow.
I agree that the code fix is a little tricky here; however, as far as I have tested, I have not seen any degradation in job running time from the addition of the extra lock.
Sorry, there must be some misunderstanding here.
My suggestion is to remove this if condition completely and always use makeOffersLock.
And as you got rid of the if, you can remove blacklistingOnTaskCompletion from the method's arguments as well. As the order of locking then always starts with makeOffersLock, I think this should be enough to avoid the deadlock.
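For illustration, a minimal sketch of the unconditional locking; the class name, method signature, and body below are assumptions made for the sketch, not the PR's actual diff:

```scala
// Sketch only: `makeOffersLock` mirrors the lock name discussed in this PR;
// the class and the method body are placeholders, not the real
// CoarseGrainedSchedulerBackend code.
class SchedulerBackendSketch {
  private val makeOffersLock = new Object

  // Take makeOffersLock unconditionally instead of guarding the acquisition
  // with a blacklistingOnTaskCompletion flag passed in by the caller.
  def makeOffers(executorId: String): Unit =
    makeOffersLock.synchronized {
      // ... build a WorkerOffer for executorId and hand it to resourceOffers ...
    }
}
```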
The flag blacklistingOnTaskCompletion is needed to ensure that the "task-result-getter-x" thread does not try to acquire makeOffersLock, which is a necessary condition for avoiding the deadlock between the "task-result-getter" thread and the "dispatcher-event-loop" thread.
The reason is that when the "task-result-getter" thread reaches the method killExecutors(), it has already acquired the TaskSchedulerImpl lock and will then try to acquire makeOffersLock. The "dispatcher-event-loop" thread, on the other hand, acquires makeOffersLock first and then waits to acquire the TaskSchedulerImpl lock in the method resourceOffers(), thus leading to the deadlock.
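To make the inversion concrete, here is a self-contained sketch; the thread bodies are stand-ins for the real Spark call paths, and running it will normally just hang, which is the point:

```scala
object DeadlockSketch {
  private val taskSchedulerImplLock = new Object
  private val makeOffersLock        = new Object

  def main(args: Array[String]): Unit = {
    // Models "task-result-getter": takes the TaskSchedulerImpl lock first,
    // then wants makeOffersLock (the killExecutors() path described above).
    val taskResultGetter = new Thread(new Runnable {
      def run(): Unit = taskSchedulerImplLock.synchronized {
        Thread.sleep(100) // widen the race window so the hang is reproducible
        makeOffersLock.synchronized(())
      }
    }, "task-result-getter")

    // Models "dispatcher-event-loop": takes makeOffersLock first, then wants
    // the TaskSchedulerImpl lock (the makeOffers() -> resourceOffers() path).
    val dispatcher = new Thread(new Runnable {
      def run(): Unit = makeOffersLock.synchronized {
        Thread.sleep(100)
        taskSchedulerImplLock.synchronized(())
      }
    }, "dispatcher-event-loop")

    taskResultGetter.start()
    dispatcher.start()
    // Each thread now holds the lock the other one needs; neither can proceed.
  }
}
```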
OK, I see. I checked the first deadlock and I think the problem is in org.apache.spark.scheduler.TaskSchedulerImpl#isExecutorBusy:
spark/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala, lines 824 to 826 in b154233
That synchronized is too restrictive here for reading a snapshot of the executorIdToRunningTaskIds map. For this problem a solution could be simply using TrieMap, which is "a concurrent thread-safe lock-free implementation of a hash array mapped trie". If you change the type of executorIdToRunningTaskIds from HashMap to TrieMap, then you can remove the synchronized from isExecutorBusy.
I have checked, and isExecutorBusy is only used from two places.
Regarding the second deadlock, I will continue my analysis.
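A minimal sketch of that change, using a simplified stand-in for TaskSchedulerImpl (the class below is not the real one):

```scala
import scala.collection.concurrent.TrieMap
import scala.collection.mutable

// Sketch only: a simplified stand-in for TaskSchedulerImpl showing the
// HashMap -> TrieMap swap, not the actual Spark class.
class TaskSchedulerSketch {
  // TrieMap is a lock-free concurrent map, so reads no longer need the
  // TaskSchedulerImpl lock.
  private val executorIdToRunningTaskIds =
    TrieMap.empty[String, mutable.HashSet[Long]]

  // Reads a snapshot of the map without any `synchronized` block.
  def isExecutorBusy(execId: String): Boolean =
    executorIdToRunningTaskIds.contains(execId)
}
```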
Yes, I just focused on saving the extra lock first.
But we could keep track of the executor IDs where tasks are scheduled/running separately, in a concurrently accessible set (a volatile reference to an immutable Set, or a CopyOnWriteArraySet).
The method isExecutorBusy could use this new set. So we can keep the HashMap for executorIdToRunningTaskIds and still avoid introducing that lock.
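A hedged sketch of that alternative; the class and method names below are placeholders, only the data-structure choice follows the suggestion:

```scala
import java.util.concurrent.CopyOnWriteArraySet

// Sketch only: keep executorIdToRunningTaskIds as a plain HashMap, but mirror
// the "busy" executor ids in a set that is safe to read from any thread.
class BusyExecutorsSketch {
  private val busyExecutors = new CopyOnWriteArraySet[String]()

  // Called under the existing TaskSchedulerImpl lock when tasks start/finish.
  def taskStarted(execId: String): Unit = busyExecutors.add(execId)
  def lastTaskFinished(execId: String): Unit = busyExecutors.remove(execId)

  // Can be called without taking the TaskSchedulerImpl lock.
  def isExecutorBusy(execId: String): Boolean = busyExecutors.contains(execId)
}
```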
I don't think that makeOffersLock solves the deadlock here. You won't get a deadlock between the same two locks, but now it can happen with makeOffersLock instead. Consider this sequence (a simplification of the full call stack, but showing the important locks at least):
taskresultgetter: handleFailedTask --> lock on TaskSchedulerImpl
taskresultgetter: BlacklistTracker.killExecutor
dispatcher: receive --> lock on CoarseGrainedSchedulerBackend
dispatcher: makeOffers --> lock on makeOffersLock
dispatcher: blocked on TaskSchedulerImpl lock
taskResultGetter: makeOffers, but blocked on makeOffersLock
As Attila suggested, I would consider creating an ordering between the TaskSchedulerImpl lock and the CoarseGrainedSchedulerBackend lock, so that we always get the TaskSchedulerImpl lock first. Of course that comes with a performance penalty, and we will have to audit all other uses of the CoarseGrainedSchedulerBackend lock too.
Still thinking about any other options ...
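For what the ordering could look like, a minimal sketch; the locks are passed in as plain objects and the method bodies are placeholders, so this is not the audited change itself:

```scala
// Sketch only: every path that needs both locks takes the TaskSchedulerImpl
// lock first and the CoarseGrainedSchedulerBackend lock second, so the
// inversion in the sequence above cannot occur.
class LockOrderingSketch(taskSchedulerImplLock: AnyRef, backendLock: AnyRef) {
  // e.g. the task-result-getter -> killExecutors path
  def killExecutorsPath(): Unit = taskSchedulerImplLock.synchronized {
    backendLock.synchronized {
      // ... kill executors / update backend state ...
    }
  }

  // e.g. the dispatcher -> makeOffers -> resourceOffers path: also takes the
  // TaskSchedulerImpl lock first, even though the work starts on the backend.
  def makeOffersPath(): Unit = taskSchedulerImplLock.synchronized {
    backendLock.synchronized {
      // ... build offers and hand them to resourceOffers ...
    }
  }
}
```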
@squito I agree with you and @attilapiros about creating an ordering. I shall definitely follow the approach and try it out.
Regarding your comment on the deadlock between makeOffersLock and the task-result-getter thread: that should ideally not happen, as the task-result-getter thread will never compete for acquiring makeOffersLock. The reason I have added the flag blacklistingForTaskCompletion is to ensure that the task-result-getter thread never acquires makeOffersLock.
Also, you are right in saying that makeOffersLock does not solve the deadlock. I have explained the purpose of makeOffersLock in my comment below.
I am basically trying to solve the two deadlocks and also fix the race-condition issue from SPARK-19757. I think the approach of lock ordering, along with a concurrently accessible separate Set as suggested by @attilapiros and @squito, should work out. Let me work on that and get back to you.
Ah, I see. Sorry, I misread part of the logic. Thanks!