
Conversation

@angolon (Contributor) commented Aug 19, 2016

What changes were proposed in this pull request?

This pull request reverts the changes made as a part of #14605, which simply side-steps the deadlock issue. Instead, I propose the following approach:

  • Use scheduleWithFixedDelay when calling ExecutorAllocationManager.schedule for scheduling executor requests. The intent of this is that if invocations are delayed beyond the default schedule interval on account of lock contention, then we avoid a situation where calls to schedule are made back-to-back, potentially releasing and then immediately reacquiring these locks - further exacerbating contention.
  • Replace a number of calls to askWithRetry with ask inside the message handling code in CoarseGrainedSchedulerBackend and its ilk. This allows us to queue messages with the relevant endpoints, release whatever locks we might be holding, and then block whilst awaiting the response. This change is made at the cost of being able to retry should sending the message fail, as retrying outside of the lock could easily cause race conditions if other conflicting messages have been sent whilst awaiting a response. I believe this to be the lesser of two evils, as in many cases these RPC calls are to process-local components, so failures are more likely to be deterministic, and timeouts are more likely to be caused by lock contention.
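The fixed-delay behaviour described in the first bullet can be sketched with the JDK scheduler directly. This is a minimal illustration, not Spark's actual code: the 100ms interval and the sleeping task are made-up stand-ins for the real schedule interval and a lock-contended `schedule` call.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

val scheduler = Executors.newSingleThreadScheduledExecutor()
val runs = new AtomicInteger(0)

val schedule = new Runnable {
  def run(): Unit = {
    runs.incrementAndGet()
    // Simulate a run slowed down by lock contention. With
    // scheduleWithFixedDelay the next run starts 100ms after this one
    // *finishes*, so a slow run pushes later runs back instead of letting
    // them fire back-to-back, as scheduleAtFixedRate would.
    Thread.sleep(50)
  }
}

scheduler.scheduleWithFixedDelay(schedule, 0, 100, TimeUnit.MILLISECONDS)
Thread.sleep(450)
scheduler.shutdown()
scheduler.awaitTermination(1, TimeUnit.SECONDS)
println(runs.get)
```

With `scheduleAtFixedRate` the scheduler would instead try to maintain the nominal period, queueing up delayed invocations so that they run immediately one after another once the lock frees up.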

How was this patch tested?

Existing tests, and manual tests under yarn-client mode.

@petermaxlee (Contributor)

cc @vanzin and @kayousterhout

@rxin (Contributor) commented Aug 19, 2016

Can you put a more descriptive title for the change?

@angolon angolon changed the title [SPARK-16533][CORE] [SPARK-16533][CORE] resolve deadlocking in driver when executors die Aug 19, 2016
@angolon (Contributor, Author) commented Aug 19, 2016

Done, sorry!

@vanzin (Contributor) commented Aug 23, 2016

ok to test

@SparkQA commented Aug 23, 2016

Test build #64292 has finished for PR 14710 at commit 920274a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) commented Aug 24, 2016

@angolon you need to fix your code to get tests passing.

@SparkQA commented Aug 25, 2016

Test build #64394 has finished for PR 14710 at commit 380291b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@angolon (Contributor, Author) commented Aug 25, 2016

Hrmm... SparkContextSuite passes all tests for me locally. Any idea what might be happening here?

@vanzin (Contributor) commented Aug 25, 2016

retest this please

@SparkQA commented Aug 25, 2016

Test build #64429 has finished for PR 14710 at commit 380291b.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) commented Aug 25, 2016

wow a core dump in the build. retest this please

case Success(b) => context.reply(b)
case Failure(ie: InterruptedException) => // Cancelled
case Failure(NonFatal(t)) => context.sendFailure(t)
}(askAndReplyExecutionContext)
Review comment (Contributor):

Do you need askAndReplyExecutionContext anymore? It seems now all the heavy lifting is being done in the RPC thread pool, and the andThen code could just use ThreadUtils.sameThreadExecutionContext since it doesn't do much.
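The suggestion above can be sketched with plain Scala Futures. The hand-rolled `sameThread` context below is an illustrative stand-in for Spark's `ThreadUtils.sameThreadExecutionContext`, and the workload is made up:

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Success

// Runs callbacks on the calling thread; a minimal stand-in for
// ThreadUtils.sameThreadExecutionContext.
val sameThread: ExecutionContext = new ExecutionContext {
  def execute(runnable: Runnable): Unit = runnable.run()
  def reportFailure(cause: Throwable): Unit = throw cause
}

// The heavy lifting happens on the global pool...
val heavy: Future[Int] = Future { (1 to 100).sum }

// ...so the cheap andThen callback needs no pool of its own and can run
// on whichever thread completed the future.
val replied = heavy.andThen { case Success(v) => println(s"reply: $v") }(sameThread)

val result = Await.result(replied, 5.seconds)
```

The point of the review comment is exactly this: once the expensive work runs on the RPC thread pool, a dedicated execution context for the trivial reply callback is wasted machinery.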

@vanzin (Contributor) commented Aug 26, 2016

Looks ok, a couple of minor suggestions that from my understanding should work now. I guess this is the next best thing without making all of these APIs properly asynchronous.

pinging @zsxwing also in case he wants to take a look.

@angolon (Contributor, Author) commented Aug 26, 2016

Thanks for the feedback, @vanzin - all good points. I'll fix them up.

@SparkQA commented Aug 26, 2016

Test build #64442 has finished for PR 14710 at commit 380291b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


// requests to master should fail immediately
assert(ci.client.requestTotalExecutors(3) === false)
whenReady(ci.client.requestTotalExecutors(3), timeout(0.seconds)) { success =>
Review comment by @zsxwing (Member), Aug 26, 2016:

nit: don't use a 0 timeout. It assumes whenReady runs the command first and then checks the timeout, but that could change in the future.

@zsxwing (Member) commented Aug 26, 2016

Looks pretty good overall.

@dhruve (Contributor) commented Aug 29, 2016

@angolon - Kindly resolve the conflicts and update the PR.

this.localityAwareTasks = localityAwareTasks
this.hostToLocalTaskCount = hostToLocalTaskCount

numPendingExecutors =
Review comment (Contributor):

I'll look at this more tomorrow, but what happens if the ask does fail and we have now incremented numPendingExecutors? That issue was there before, but now that we are doing ask instead of askWithRetry it might show up more often.

Review comment (Contributor):

This is a longer discussion (and something I'd like to address thoroughly at some point when I find time), but askWithRetry is actually pretty useless with the new RPC implementation, and I'd say even harmful. An ask with a larger timeout has a much better chance of succeeding, and is cheaper than askWithRetry.

So I don't think that the change makes the particular situation you point out more common at all.

Review comment (Contributor):

I guess I'll have to go look at the new implementation; can you clarify why ask would be better?

Review comment (Contributor):

Note I would still like to know what happens if it occurs, as it could have just been a bug before. If it's harmless then I'm ok with it.

Review comment (Contributor):

In this particular case, it's not that ask would be better, it's just that it would be no worse. With the new RPC code, the only time askWithRetry will actually retry, barring bugs in the RPC handlers, is when a timeout occurs, since the RPC layer does not drop messages. So an ask with a longer timeout actually has a better chance of succeeding, since with askWithRetry the remote end will receive and process the first message before the retries, even if the sender has given up on it.

As for the bug you mention, yes it exists, but it also existed before.
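The queue-then-await pattern this thread argues for can be sketched with plain Futures. The `ask` below is a hypothetical stand-in for `RpcEndpointRef.ask`, and the lock and message are illustrative:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val lock = new Object

// Stand-in for RpcEndpointRef.ask: enqueues the message and returns a
// Future immediately, instead of blocking the way askWithRetry does.
def ask(message: String): Future[Boolean] = Future {
  // The handler may need the same lock; because the caller releases it
  // before awaiting, this cannot deadlock.
  lock.synchronized { message.nonEmpty }
}

val reply: Future[Boolean] = lock.synchronized {
  // Hold the lock only long enough to enqueue the message.
  ask("RequestExecutors")
}

// Block for the response only after the lock has been released.
val ok = Await.result(reply, 10.seconds)
```

Had the caller blocked on the reply while still inside `synchronized`, a handler that needs the same lock would deadlock; that is the shape of the original SPARK-16533 bug.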

@SparkQA commented Aug 30, 2016

Test build #64613 has finished for PR 14710 at commit 3eb34fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

_ => Future.successful(false)
}

adjustTotalExecutors.flatMap(killExecutors)(ThreadUtils.sameThread)
Review comment (Contributor):

Please correct me if I'm wrong, as I'm not that familiar with Future's flatMap, but isn't this going to run doRequestTotalExecutors, then once that comes back, apply the result to killExecutors? Which I think means killExecutors is called outside of the synchronized block, after we await the result of doRequestTotalExecutors?

Review comment (Contributor):

I'm pretty sure you're correct, but at the same time I don't think there's a requirement that doKillExecutors needs to be called from a synchronized block. Current implementations just send RPC messages, which is probably better done outside the synchronized block anyway.

Review comment (Contributor, Author):

When I originally started working on this I thought I wouldn't be able to avoid blocking on that call within the synchronized block. However, my (admittedly novice) understanding of the code aligns with what @vanzin said: because all it does is send the kill message, there's no need to synchronize over it.

Review comment (Contributor):

Thanks, I was mostly just trying to make sure I understood correctly. I'm not worried about the RPC call outside of the synchronized block because, as you say, it's best done outside since it's safe to call it multi-threaded. It was more to make sure other data structures weren't modified outside the synchronized block. In this case all it's accessing is the local executorsToKill, so it doesn't matter.
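The chaining under discussion can be sketched as follows. `adjustTotalExecutors` and `killExecutors` here are simplified stand-ins for the real methods, and `sameThread` mimics Spark's `ThreadUtils.sameThread`:

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Same-thread execution context, standing in for ThreadUtils.sameThread.
val sameThread: ExecutionContext = new ExecutionContext {
  def execute(runnable: Runnable): Unit = runnable.run()
  def reportFailure(cause: Throwable): Unit = throw cause
}

// Simplified stand-ins for the two asynchronous steps being chained.
def adjustTotalExecutors(): Future[Boolean] = Future { true }
def killExecutors(adjusted: Boolean): Future[Boolean] =
  if (adjusted) Future { true } else Future.successful(false)

// killExecutors starts only after adjustTotalExecutors completes; the
// glue code is trivial, so it runs on the completing thread itself
// rather than being dispatched to a pool.
val killed = adjustTotalExecutors().flatMap(killExecutors)(sameThread)

val result = Await.result(killed, 10.seconds)
```

As the thread notes, the second step indeed runs outside any `synchronized` block the caller held when building the chain, which is fine here because it only sends an RPC message.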

@vanzin (Contributor) commented Aug 30, 2016

LGTM.

@angolon (Contributor, Author) commented Aug 31, 2016

Thanks @vanzin. I'm on mobile at the moment - I'll take care of your nit when I get back to my desk in a couple of hours.

@SparkQA commented Aug 31, 2016

Test build #64692 has finished for PR 14710 at commit 5a2f30f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member) commented Aug 31, 2016

retest this please

@SparkQA commented Aug 31, 2016

Test build #64697 has finished for PR 14710 at commit 5a2f30f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) commented Aug 31, 2016

LGTM. Running again to make sure mesos tests run, since the build's now properly running them.

retest this please

@vanzin (Contributor) commented Aug 31, 2016

retest this please

@vanzin (Contributor) commented Sep 1, 2016

@angolon there's a conflict now that needs to be resolved...

@SparkQA commented Sep 1, 2016

Test build #64751 has finished for PR 14710 at commit 0772e81.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AlterViewAsCommand(

@angolon (Contributor, Author) commented Sep 1, 2016

...sigh

@angolon (Contributor, Author) commented Sep 1, 2016

retest this please

@tgravescs (Contributor)

Jenkins, test this please

@SparkQA commented Sep 1, 2016

Test build #64781 has finished for PR 14710 at commit 0772e81.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AlterViewAsCommand(

@vanzin (Contributor) commented Sep 1, 2016

LGTM, merging to master and will try 2.0.

@vanzin (Contributor) commented Sep 1, 2016

Didn't merge cleanly into 2.0, please open a separate PR if you want it in 2.0.1.

@asfgit asfgit closed this in a0aac4b Sep 1, 2016
@zsxwing (Member) commented Sep 1, 2016

FYI. This one breaks Scala 2.10. @angolon @vanzin can anyone fix it, please?

@vanzin (Contributor) commented Sep 1, 2016

Let me take a look.

@vanzin (Contributor) commented Sep 1, 2016

@zsxwing can you be more specific? Compiles fine for me. Is it a specific test?

@zsxwing (Member) commented Sep 1, 2016

@vanzin (Contributor) commented Sep 1, 2016

ah, mesos. didn't have that enabled.

@vanzin (Contributor) commented Sep 1, 2016

#14925

@angolon angolon deleted the SPARK-16533 branch September 2, 2016 03:37
angolon added a commit to angolon/spark that referenced this pull request Sep 2, 2016
This pull request reverts the changes made as a part of apache#14605, which simply side-steps the deadlock issue. Instead, I propose the following approach:
* Use `scheduleWithFixedDelay` when calling `ExecutorAllocationManager.schedule` for scheduling executor requests. The intent of this is that if invocations are delayed beyond the default schedule interval on account of lock contention, then we avoid a situation where calls to `schedule` are made back-to-back, potentially releasing and then immediately reacquiring these locks - further exacerbating contention.
* Replace a number of calls to `askWithRetry` with `ask` inside the message handling code in `CoarseGrainedSchedulerBackend` and its ilk. This allows us to queue messages with the relevant endpoints, release whatever locks we might be holding, and then block whilst awaiting the response. This change is made at the cost of being able to retry should sending the message fail, as retrying outside of the lock could easily cause race conditions if other conflicting messages have been sent whilst awaiting a response. I believe this to be the lesser of two evils, as in many cases these RPC calls are to process-local components, so failures are more likely to be deterministic, and timeouts are more likely to be caused by lock contention.

Existing tests, and manual tests under yarn-client mode.

Author: Angus Gerry <angolon@gmail.com>

Closes apache#14710 from angolon/SPARK-16533.
asfgit pushed a commit that referenced this pull request Sep 7, 2016
## What changes were proposed in this pull request?
Backport changes from #14710 and #14925 to 2.0

Author: Marcelo Vanzin <vanzin@cloudera.com>
Author: Angus Gerry <angolon@gmail.com>

Closes #14933 from angolon/SPARK-16533-2.0.