[SPARK-46861][CORE] Avoid Deadlock in DAGScheduler #44882
Conversation
| * One thread might hold the lock on many of these, for a chain of RDD dependencies. Deadlocks
| * are possible if we try to lock another resource while holding the stateLock,
| * and the lock acquisition sequence of these locks is not guaranteed to be the same.
| * This can lead lead to a deadlock as one thread might first acquire the stateLock,
nit. lead lead -> lead
| // Note that this test is NOT perfectly reproducible when there is a deadlock as it uses
| // Thread.sleep, but it should never fail / flake when there is no deadlock.
| // If this test starts to flake, this shows that there is a deadlock!
Got it. Thank you for the warning.
dongjoon-hyun left a comment
+1, LGTM. Thank you for this fix, @fred-db.
What changes were proposed in this pull request?

* The DAGScheduler could currently run into a deadlock with another thread if both access the partitions of the same RDD at the same time.
* To make progress in getCacheLocs, we require both exclusive access to the RDD partitions and the location cache. We first lock on the location cache, and then on the RDD.
* When accessing partitions of an RDD, the RDD first acquires exclusive access on the partitions, and then might acquire exclusive access on the location cache.
* If thread 1 is able to acquire access on the RDD, while thread 2 holds the access to the location cache, we can run into a deadlock situation.
* To fix this, acquire locks in the same order: change the DAGScheduler to first acquire the lock on the RDD, and then the lock on the location cache.

Why are the changes needed?

This is a deadlock you can run into, which can prevent any progress on the cluster.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test that reproduces the issue.

Was this patch authored or co-authored using generative AI tooling?

No

Closes #44882 from fred-db/fix-deadlock.

Authored-by: fred-db <fredrik.klauss@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 617014c)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
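The fix described above is the classic lock-ordering rule: if every thread acquires the two monitors in the same order, a circular wait cannot form. The sketch below illustrates the pattern in Java with illustrative lock names (`rddLock`, `cacheLock`, and the two methods are hypothetical stand-ins, not Spark's actual fields or code).

```java
// Sketch of the lock-ordering fix: both code paths acquire the two monitors
// in the same order (first the "RDD" lock, then the "location cache" lock),
// so two threads running these paths concurrently can never deadlock.
// All names here are illustrative, not Spark's actual identifiers.
public class LockOrdering {
    private static final Object rddLock = new Object();    // stands in for the RDD partition lock
    private static final Object cacheLock = new Object();  // stands in for the location cache lock

    // Analogue of the fixed getCacheLocs: RDD lock first, then cache lock.
    // (Before the fix, this path took the cache lock first -> deadlock-prone.)
    static void getCacheLocs() {
        synchronized (rddLock) {
            synchronized (cacheLock) {
                // read/update cached locations
            }
        }
    }

    // Analogue of computing an RDD's partitions, which may also touch the cache.
    static void computePartitions() {
        synchronized (rddLock) {
            synchronized (cacheLock) {
                // compute partitions, possibly consulting the cache
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> { for (int i = 0; i < 1000; i++) getCacheLocs(); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 1000; i++) computePartitions(); });
        t1.start(); t2.start();
        t1.join(); t2.join();  // with consistent ordering, both threads always terminate
        System.out.println("no deadlock");
    }
}
```

If `getCacheLocs` instead took `cacheLock` before `rddLock` while `computePartitions` kept the opposite order, the two loops could each grab their first lock and wait forever on the second, which is exactly the hazard the PR removes.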
Merged to master/3.5/3.4.
Can you add details of the deadlock stack traces to the jira please?
+CC @dongjoon-hyun, since you merged this PR - any insights into how/why this deadlock can occur? It is not clear to me how this is happening.
Hi @mridulm, I'm able to reproduce the deadlock consistently when running the test I added and after removing the call to …. The issue exists essentially with CoalescedRDDs. When calling …
@dongjoon-hyun I meant the stack traces indicating a deadlock.
If somebody is interested in the original deadlock, I have executed the test without the fix (and without the timeout). The stack trace is:
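For readers who cannot run the Spark test themselves, the shape of the reported deadlock can be reproduced deterministically in miniature: one thread holds the cache-side lock while another, already holding the RDD-side lock, tries to take it. The sketch below is generic Java, not Spark code; the lock names are hypothetical, and `tryLock` with a timeout is used so the stuck acquisition is observable instead of hanging.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Deterministic miniature of the reported deadlock. The main thread plays
// "thread 2" from the PR description (holding the location-cache lock); the
// worker plays "thread 1" (holding the RDD lock and now needing the cache).
// Lock names are illustrative only.
public class DeadlockShape {
    public static void main(String[] args) throws InterruptedException {
        ReentrantLock rddLock = new ReentrantLock();
        ReentrantLock cacheLock = new ReentrantLock();

        cacheLock.lock();  // main thread holds the cache lock before the worker starts
        Thread worker = new Thread(() -> {
            rddLock.lock();  // worker acquires the RDD lock (uncontended)
            try {
                // Worker now needs the cache lock, but main holds it. With a
                // plain lock() under the old ordering this wait never ends;
                // tryLock with a timeout makes the stall observable.
                if (!cacheLock.tryLock(200, TimeUnit.MILLISECONDS)) {
                    System.out.println("would deadlock");
                } else {
                    cacheLock.unlock();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                rddLock.unlock();
            }
        });
        worker.start();
        worker.join();
        cacheLock.unlock();
    }
}
```

In the real deadlock both threads block on a plain monitor acquisition with no timeout, so neither side ever prints anything; a thread dump (jstack) of the hung JVM shows the two threads each owning one of the monitors and waiting on the other.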