# [SPARK-16533][CORE] resolve deadlocking in driver when executors die #14710
## Conversation
…ing messages" This reverts commit ea0bf91.
…llocatorManager.schedule to ease contention on locks.
cc @vanzin and @kayousterhout

Can you put a more descriptive title for the change?

Done, sorry!

ok to test

Test build #64292 has finished for PR 14710 at commit

@angolon you need to fix your code to get tests passing.

Test build #64394 has finished for PR 14710 at commit

Hrmm... SparkContextSuite passes all tests for me locally. Any idea what might be happening here?

retest this please

Test build #64429 has finished for PR 14710 at commit

wow a core dump in the build. retest this please
```scala
case Success(b) => context.reply(b)
case Failure(ie: InterruptedException) => // Cancelled
case Failure(NonFatal(t)) => context.sendFailure(t)
}(askAndReplyExecutionContext)
```
Do you need `askAndReplyExecutionContext` anymore? It seems all the heavy lifting is now being done in the RPC thread pool, and the `andThen` code could just use `ThreadUtils.sameThreadExecutionContext` since it doesn't do much.
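For illustration, a minimal sketch of that suggestion, assuming a same-thread execution context with those semantics; the `ReplyContext` trait and `relayResult` wrapper here are hypothetical stand-ins for the surrounding RPC handler code, not Spark's actual classes:

```scala
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}
import scala.util.control.NonFatal

// Stand-in for a same-thread execution context; the real code would use
// Spark's ThreadUtils helper rather than defining its own.
object SameThread extends ExecutionContext {
  override def execute(runnable: Runnable): Unit = runnable.run()
  override def reportFailure(cause: Throwable): Unit = throw cause
}

// Hypothetical handle mirroring the reply/sendFailure calls in the snippet above.
trait ReplyContext {
  def reply(response: Any): Unit
  def sendFailure(e: Throwable): Unit
}

// The completion callback only forwards the result, so it can run on whichever
// thread completes the future instead of a dedicated ask-and-reply pool.
def relayResult(result: Future[Boolean], context: ReplyContext): Future[Boolean] =
  result.andThen {
    case Success(b) => context.reply(b)
    case Failure(_: InterruptedException) => // cancelled; nothing to report
    case Failure(NonFatal(t)) => context.sendFailure(t)
  }(SameThread)
```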
Looks ok, a couple of minor suggestions that from my understanding should work now. I guess this is the next best thing without making all of these APIs properly asynchronous. Pinging @zsxwing also in case he wants to take a look.

Thanks for the feedback, @vanzin - all good points. I'll fix them up.

Test build #64442 has finished for PR 14710 at commit
```scala
// requests to master should fail immediately
assert(ci.client.requestTotalExecutors(3) === false)
whenReady(ci.client.requestTotalExecutors(3), timeout(0.seconds)) { success =>
```
nit: don't use a 0 timeout. It assumes `whenReady` runs the command first and then checks the timeout, but that could change in the future.
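As a sketch only, assuming ScalaTest's `ScalaFutures` and that `requestTotalExecutors` now returns a `Future[Boolean]` after this change, the fix is just to give `whenReady` a small but non-zero budget instead of relying on it evaluating the future before it checks the clock (the `assertRequestIsRejected` wrapper is hypothetical):

```scala
import scala.concurrent.Future
import org.scalatest.concurrent.ScalaFutures._
import org.scalatest.time.{Seconds, Span}

// The request is expected to come back quickly with `false`, so a short
// non-zero timeout is enough and avoids depending on whenReady's ordering.
def assertRequestIsRejected(requestTotalExecutors: Int => Future[Boolean]): Unit =
  whenReady(requestTotalExecutors(3), timeout(Span(10, Seconds))) { success =>
    assert(!success)
  }
```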
Looks pretty good overall.

@angolon - Kindly resolve the conflicts and update the PR.
```scala
this.localityAwareTasks = localityAwareTasks
this.hostToLocalTaskCount = hostToLocalTaskCount

numPendingExecutors =
```
I'll look at this more tomorrow, but what happens if the ask does fail and we have now incremented `numPendingExecutors`? That issue was there before, but now that we are doing `ask` instead of `askWithRetry` it might show up more often.
This is a longer discussion (and something I'd like to address thoroughly at some point when I find time), but `askWithRetry` is actually pretty useless with the new RPC implementation, and I'd say even harmful. An `ask` with a larger timeout has a much better chance of succeeding, and is cheaper than `askWithRetry`.

So I don't think that the change makes the particular situation you point out more common at all.
I guess I'll have to go look at the new implementation. Can you clarify why `ask` would be better?
Note I would still like to know what happens if it occurs, as it could have just been a bug before. If it's harmless then I'm OK with it.
In this particular case, it's not that `ask` would be better, it's just that it would be no worse. With the new RPC code, the only time `askWithRetry` will actually retry, barring bugs in the RPC handlers, is when a timeout occurs, since the RPC layer does not drop messages. So an `ask` with a longer timeout actually has a better chance of succeeding, since with `askWithRetry` the remote end will receive and process the first message before the retries, even if the sender has given up on it.

As for the bug you mention: yes, it exists, but it also existed before.
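Put as a sketch, the suggestion amounts to spending the whole time budget on one attempt rather than splitting it across retries. The `Endpoint` trait below is hypothetical and only mirrors the general ask-returns-a-Future shape being discussed; it is not Spark's `RpcEndpointRef` API:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.FiniteDuration

// Hypothetical endpoint handle, used only for illustration.
trait Endpoint {
  def ask[T](message: Any, timeout: FiniteDuration): Future[T]
}

// Because the RPC layer does not drop messages, a retry mostly re-sends while
// the receiver is still working on the first copy. Giving a single ask the
// full budget is therefore no worse, and usually cheaper, than several
// shorter attempts.
def requestOnce(endpoint: Endpoint, message: Any, budget: FiniteDuration): Boolean = {
  val reply = endpoint.ask[Boolean](message, budget) // enqueue while holding no locks
  Await.result(reply, budget)                        // then block outside any lock
}
```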
Test build #64613 has finished for PR 14710 at commit
```scala
_ => Future.successful(false)
}

adjustTotalExecutors.flatMap(killExecutors)(ThreadUtils.sameThread)
```
Please correct me if I'm wrong, as I'm not that familiar with `Future.flatMap`, but isn't this going to run `doRequestTotalExecutors`, and then, once that comes back, apply the result to `killExecutors`? I think that means `killExecutors` is called outside of the synchronized block, after we await the result of `doRequestTotalExecutors`?
I'm pretty sure you're correct, but at the same time I don't think there's a requirement that `doKillExecutors` needs to be called from a synchronized block. Current implementations just send RPC messages, which is probably better done outside the synchronized block anyway.
When I originally started working on this I thought I wouldn't be able to avoid blocking on that call within the synchronized block. However, my (admittedly novice) understanding of the code aligns with what @vanzin said: because all it does is send the kill message, there's no need to synchronize over it.
Thanks, I was mostly just trying to make sure I understood correctly. I'm not worried about the RPC call happening outside of the synchronized block because, as you say, it's best done outside, since it's safe to call from multiple threads. It was more to make sure other data structures weren't modified outside the synchronized block. In this case all it's accessing is the local `executorsToKill`, so it doesn't matter.
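A condensed sketch of the shape being discussed: the names follow the conversation (`doRequestTotalExecutors`, `doKillExecutors`, `executorsToKill`), but the class, its bookkeeping, and the execution context are stand-ins, not the real `CoarseGrainedSchedulerBackend`:

```scala
import scala.concurrent.{ExecutionContext, Future}

// Stand-ins for the real backend hooks; in Spark these just send RPC messages.
class BackendSketch(
    doRequestTotalExecutors: Int => Future[Boolean],
    doKillExecutors: Seq[String] => Future[Boolean])(implicit ec: ExecutionContext) {

  private var requestedTotalExecutors = 0

  def killExecutors(ids: Seq[String]): Future[Boolean] = synchronized {
    // All shared-state bookkeeping stays inside the synchronized block.
    val executorsToKill = ids.distinct
    requestedTotalExecutors -= executorsToKill.size
    val adjustTotalExecutors = doRequestTotalExecutors(requestedTotalExecutors)

    // flatMap only *registers* the continuation here; doKillExecutors runs
    // once the adjust future completes, after this block has been exited.
    // That is safe because it only touches the local executorsToKill.
    adjustTotalExecutors.flatMap { adjusted =>
      if (adjusted) doKillExecutors(executorsToKill)
      else Future.successful(false)
    }
  }
}
```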
LGTM.

Thanks @vanzin. I'm on mobile at the moment - I'll take care of your nit when I get back to my desk in a couple of hours.

Test build #64692 has finished for PR 14710 at commit

retest this please

Test build #64697 has finished for PR 14710 at commit

LGTM. Running again to make sure mesos tests run, since the build's now properly running them. retest this please

retest this please

@angolon there's a conflict now that needs to be resolved...

Test build #64751 has finished for PR 14710 at commit

...sigh

retest this please

Jenkins, test this please

Test build #64781 has finished for PR 14710 at commit

LGTM, merging to master and will try 2.0.

Didn't merge cleanly into 2.0, please open a separate PR if you want it in 2.0.1.

Let me take a look.

@zsxwing can you be more specific? Compiles fine for me. Is it a specific test?

ah, mesos. didn't have that enabled.
This change was backported to 2.0 in #14933 ("Backport changes from #14710 and #14925 to 2.0"). Author: Marcelo Vanzin <vanzin@cloudera.com>. Author: Angus Gerry <angolon@gmail.com>. Closes #14933 from angolon/SPARK-16533-2.0.
## What changes were proposed in this pull request?

This pull request reverts the changes made as a part of #14605, which simply side-steps the deadlock issue. Instead, I propose the following approach:
* Use `scheduleWithFixedDelay` when calling `ExecutorAllocationManager.schedule` for scheduling executor requests. The intent of this is that if invocations are delayed beyond the default schedule interval on account of lock contention, then we avoid a situation where calls to `schedule` are made back-to-back, potentially releasing and then immediately reacquiring these locks - further exacerbating contention (see the sketch after this list).
* Replace a number of calls to `askWithRetry` with `ask` inside of the message handling code in `CoarseGrainedSchedulerBackend` and its ilk. This allows us to queue messages with the relevant endpoints, release whatever locks we might be holding, and then block whilst awaiting the response. This change is made at the cost of being able to retry should sending the message fail, as retrying outside of the lock could easily cause race conditions if other conflicting messages have been sent whilst awaiting a response. I believe this to be the lesser of two evils, as in many cases these RPC calls are to process-local components, and so failures are more likely to be deterministic, and timeouts are more likely to be caused by lock contention.
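The sketch referenced in the first bullet, using a plain `ScheduledExecutorService` and a made-up interval (the real `ExecutorAllocationManager` has its own scheduling constant, and `schedule()` below is an empty stand-in):

```scala
import java.util.concurrent.{Executors, TimeUnit}

object ScheduleSketch {
  // Interval value is illustrative only.
  private val intervalMs = 100L
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  // Stand-in for ExecutorAllocationManager.schedule(), which may block on
  // contended locks and therefore overrun the interval.
  private def schedule(): Unit = ()

  def start(): Unit = {
    // scheduleWithFixedDelay measures the delay from the *end* of one run to
    // the start of the next, so an invocation held up by lock contention is
    // not immediately followed by another back-to-back run the way it could
    // be with scheduleAtFixedRate.
    scheduler.scheduleWithFixedDelay(
      new Runnable { override def run(): Unit = schedule() },
      intervalMs, intervalMs, TimeUnit.MILLISECONDS)
  }
}
```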
## How was this patch tested?

Existing tests, and manual tests under yarn-client mode.