# [SPARK-16533][CORE] resolve deadlocking in driver when executors die #14710
## Conversation
…ing messages" This reverts commit ea0bf91.
…llocatorManager.schedule to ease contention on locks.
cc @vanzin and @kayousterhout

Can you put a more descriptive title for the change?

Done, sorry!

ok to test

Test build #64292 has finished for PR 14710 at commit

@angolon you need to fix your code to get tests passing.

Test build #64394 has finished for PR 14710 at commit

Hrmm... SparkContextSuite passes all tests for me locally. Any idea what might be happening here?

retest this please

Test build #64429 has finished for PR 14710 at commit

wow a core dump in the build. retest this please
```scala
case Success(b) => context.reply(b)
case Failure(ie: InterruptedException) => // Cancelled
case Failure(NonFatal(t)) => context.sendFailure(t)
}(askAndReplyExecutionContext)
```
Do you need `askAndReplyExecutionContext` anymore? It seems all the heavy lifting is now being done in the RPC thread pool, and the `andThen` code could just use `ThreadUtils.sameThreadExecutionContext` since it doesn't do much.
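For illustration, a minimal sketch of that suggestion, assuming a same-thread execution context with those semantics; the `ReplyContext` trait and `relayResult` wrapper here are hypothetical stand-ins for the surrounding RPC handler code, not Spark's actual classes:

```scala
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}
import scala.util.control.NonFatal

// Stand-in for a same-thread execution context; the real code would use
// Spark's ThreadUtils helper rather than defining its own.
object SameThread extends ExecutionContext {
  override def execute(runnable: Runnable): Unit = runnable.run()
  override def reportFailure(cause: Throwable): Unit = throw cause
}

// Hypothetical handle mirroring the reply/sendFailure calls in the snippet above.
trait ReplyContext {
  def reply(response: Any): Unit
  def sendFailure(e: Throwable): Unit
}

// The completion callback only forwards the result, so it can run on whichever
// thread completes the future instead of a dedicated ask-and-reply pool.
def relayResult(result: Future[Boolean], context: ReplyContext): Future[Boolean] =
  result.andThen {
    case Success(b) => context.reply(b)
    case Failure(_: InterruptedException) => // cancelled; nothing to report
    case Failure(NonFatal(t)) => context.sendFailure(t)
  }(SameThread)
```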
Looks ok, a couple of minor suggestions that from my understanding should work now. I guess this is the next best thing without making all of these APIs properly asynchronous. Pinging @zsxwing also in case he wants to take a look.

Thanks for the feedback, @vanzin - all good points. I'll fix them up.

Test build #64442 has finished for PR 14710 at commit
```scala
// requests to master should fail immediately
assert(ci.client.requestTotalExecutors(3) === false)
whenReady(ci.client.requestTotalExecutors(3), timeout(0.seconds)) { success =>
```
nit: don't use a 0 timeout. It assumes `whenReady` runs the command first and then checks the timeout, but that could change in the future.
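As a sketch only, assuming ScalaTest's `ScalaFutures` and that `requestTotalExecutors` now returns a `Future[Boolean]` after this change, the fix is just to give `whenReady` a small but non-zero budget instead of relying on it evaluating the future before it checks the clock (the `assertRequestIsRejected` wrapper is hypothetical):

```scala
import scala.concurrent.Future
import org.scalatest.concurrent.ScalaFutures._
import org.scalatest.time.{Seconds, Span}

// The request is expected to come back quickly with `false`, so a short
// non-zero timeout is enough and avoids depending on whenReady's ordering.
def assertRequestIsRejected(requestTotalExecutors: Int => Future[Boolean]): Unit =
  whenReady(requestTotalExecutors(3), timeout(Span(10, Seconds))) { success =>
    assert(!success)
  }
```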
Looks pretty good overall.

@angolon - Kindly resolve the conflicts and update the PR.
```scala
this.localityAwareTasks = localityAwareTasks
this.hostToLocalTaskCount = hostToLocalTaskCount

numPendingExecutors =
```
I'll look at this more tomorrow, but what happens if the ask does fail and we have now incremented `numPendingExecutors`? That issue was there before, but now that we are doing `ask` instead of `askWithRetry` it might show up more often.
This is a longer discussion (and something I'd like to address thoroughly at some point when I find time), but `askWithRetry` is actually pretty useless with the new RPC implementation, and I'd say even harmful. An `ask` with a larger timeout has a much better chance of succeeding, and is cheaper than `askWithRetry`.

So I don't think that the change makes the particular situation you point out more common at all.
I guess I'll have to go look at the new implementation. Can you clarify why `ask` would be better?
Note I would still like to know what happens if it occurs, as it could have just been a bug before. If it's harmless then I'm OK with it.
In this particular case, it's not that `ask` would be better, it's just that it would be no worse. With the new RPC code, the only time `askWithRetry` will actually retry, barring bugs in the RPC handlers, is when a timeout occurs, since the RPC layer does not drop messages. So an `ask` with a longer timeout actually has a better chance of succeeding, since with `askWithRetry` the remote end will receive and process the first message before the retries, even if the sender has given up on it.

As for the bug you mention: yes, it exists, but it also existed before.
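Put as a sketch, the suggestion amounts to spending the whole time budget on one attempt rather than splitting it across retries. The `Endpoint` trait below is hypothetical and only mirrors the general ask-returns-a-Future shape being discussed; it is not Spark's `RpcEndpointRef` API:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.FiniteDuration

// Hypothetical endpoint handle, used only for illustration.
trait Endpoint {
  def ask[T](message: Any, timeout: FiniteDuration): Future[T]
}

// Because the RPC layer does not drop messages, a retry mostly re-sends while
// the receiver is still working on the first copy. Giving a single ask the
// full budget is therefore no worse, and usually cheaper, than several
// shorter attempts.
def requestOnce(endpoint: Endpoint, message: Any, budget: FiniteDuration): Boolean = {
  val reply = endpoint.ask[Boolean](message, budget) // enqueue while holding no locks
  Await.result(reply, budget)                        // then block outside any lock
}
```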
Test build #64613 has finished for PR 14710 at commit
```scala
_ => Future.successful(false)
}

adjustTotalExecutors.flatMap(killExecutors)(ThreadUtils.sameThread)
```
Please correct me if I'm wrong, as I'm not that familiar with `Future.flatMap`, but isn't this going to run `doRequestTotalExecutors`, and then, once that comes back, apply the result to `killExecutors`? I think that means `killExecutors` is called outside of the synchronized block, after we await the result of `doRequestTotalExecutors`?
I'm pretty sure you're correct, but at the same time I don't think there's a requirement that `doKillExecutors` needs to be called from a synchronized block. Current implementations just send RPC messages, which is probably better done outside the synchronized block anyway.
When I originally started working on this I thought I wouldn't be able to avoid blocking on that call within the synchronized block. However, my (admittedly novice) understanding of the code aligns with what @vanzin said: because all it does is send the kill message, there's no need to synchronize over it.
Thanks, I was mostly just trying to make sure I understood correctly. I'm not worried about the RPC call happening outside of the synchronized block because, as you say, it's best done outside, since it's safe to call from multiple threads. It was more to make sure other data structures weren't modified outside the synchronized block. In this case all it's accessing is the local `executorsToKill`, so it doesn't matter.
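A condensed sketch of the shape being discussed: the names follow the conversation (`doRequestTotalExecutors`, `doKillExecutors`, `executorsToKill`), but the class, its bookkeeping, and the execution context are stand-ins, not the real `CoarseGrainedSchedulerBackend`:

```scala
import scala.concurrent.{ExecutionContext, Future}

// Stand-ins for the real backend hooks; in Spark these just send RPC messages.
class BackendSketch(
    doRequestTotalExecutors: Int => Future[Boolean],
    doKillExecutors: Seq[String] => Future[Boolean])(implicit ec: ExecutionContext) {

  private var requestedTotalExecutors = 0

  def killExecutors(ids: Seq[String]): Future[Boolean] = synchronized {
    // All shared-state bookkeeping stays inside the synchronized block.
    val executorsToKill = ids.distinct
    requestedTotalExecutors -= executorsToKill.size
    val adjustTotalExecutors = doRequestTotalExecutors(requestedTotalExecutors)

    // flatMap only *registers* the continuation here; doKillExecutors runs
    // once the adjust future completes, after this block has been exited.
    // That is safe because it only touches the local executorsToKill.
    adjustTotalExecutors.flatMap { adjusted =>
      if (adjusted) doKillExecutors(executorsToKill)
      else Future.successful(false)
    }
  }
}
```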
LGTM.

Thanks @vanzin. I'm on mobile at the moment - I'll take care of your nit when I get back to my desk in a couple of hours.

Test build #64692 has finished for PR 14710 at commit

retest this please

Test build #64697 has finished for PR 14710 at commit

LGTM. Running again to make sure mesos tests run, since the build's now properly running them. retest this please

retest this please

@angolon there's a conflict now that needs to be resolved...

Test build #64751 has finished for PR 14710 at commit

...sigh

retest this please

Jenkins, test this please

Test build #64781 has finished for PR 14710 at commit

LGTM, merging to master and will try 2.0.

Didn't merge cleanly into 2.0, please open a separate PR if you want it in 2.0.1.

Let me take a look.

@zsxwing can you be more specific? Compiles fine for me. Is it a specific test?

ah, mesos. didn't have that enabled.
This change was backported to 2.0 in #14933 ("Backport changes from #14710 and #14925 to 2.0"). Author: Marcelo Vanzin <vanzin@cloudera.com>. Author: Angus Gerry <angolon@gmail.com>. Closes #14933 from angolon/SPARK-16533-2.0.
## What changes were proposed in this pull request?

This pull request reverts the changes made as a part of #14605, which simply side-steps the deadlock issue. Instead, I propose the following approach:
* Use `scheduleWithFixedDelay` when calling `ExecutorAllocationManager.schedule` for scheduling executor requests. The intent of this is that if invocations are delayed beyond the default schedule interval on account of lock contention, then we avoid a situation where calls to `schedule` are made back-to-back, potentially releasing and then immediately reacquiring these locks - further exacerbating contention (see the sketch after this list).
* Replace a number of calls to `askWithRetry` with `ask` inside of the message handling code in `CoarseGrainedSchedulerBackend` and its ilk. This allows us to queue messages with the relevant endpoints, release whatever locks we might be holding, and then block whilst awaiting the response. This change is made at the cost of being able to retry should sending the message fail, as retrying outside of the lock could easily cause race conditions if other conflicting messages have been sent whilst awaiting a response. I believe this to be the lesser of two evils, as in many cases these RPC calls are to process-local components, and so failures are more likely to be deterministic, and timeouts are more likely to be caused by lock contention.
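The sketch referenced in the first bullet, using a plain `ScheduledExecutorService` and a made-up interval (the real `ExecutorAllocationManager` has its own scheduling constant, and `schedule()` below is an empty stand-in):

```scala
import java.util.concurrent.{Executors, TimeUnit}

object ScheduleSketch {
  // Interval value is illustrative only.
  private val intervalMs = 100L
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  // Stand-in for ExecutorAllocationManager.schedule(), which may block on
  // contended locks and therefore overrun the interval.
  private def schedule(): Unit = ()

  def start(): Unit = {
    // scheduleWithFixedDelay measures the delay from the *end* of one run to
    // the start of the next, so an invocation held up by lock contention is
    // not immediately followed by another back-to-back run the way it could
    // be with scheduleAtFixedRate.
    scheduler.scheduleWithFixedDelay(
      new Runnable { override def run(): Unit = schedule() },
      intervalMs, intervalMs, TimeUnit.MILLISECONDS)
  }
}
```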
## How was this patch tested?

Existing tests, and manual tests under yarn-client mode.