Skip to content

Conversation

@angolon
Copy link
Contributor

@angolon angolon commented Sep 2, 2016

What changes were proposed in this pull request?

Backport changes from #14710 and #14925 to 2.0

angolon and others added 2 commits September 2, 2016 13:36
This pull request reverts the changes made as a part of apache#14605, which simply side-steps the deadlock issue. Instead, I propose the following approach:
* Use `scheduleWithFixedDelay` when calling `ExecutorAllocationManager.schedule` for scheduling executor requests. The intent of this is that if invocations are delayed beyond the default schedule interval on account of lock contention, then we avoid a situation where calls to `schedule` are made back-to-back, potentially releasing and then immediately reacquiring these locks - further exacerbating contention.
* Replace a number of calls to `askWithRetry` with `ask` inside of message handling code in `CoarseGrainedSchedulerBackend` and its ilk. This allows us queue messages with the relevant endpoints, release whatever locks we might be holding, and then block whilst awaiting the response. This change is made at the cost of being able to retry should sending the message fail, as retrying outside of the lock could easily cause race conditions if other conflicting messages have been sent whilst awaiting a response. I believe this to be the lesser of two evils, as in many cases these RPC calls are to process local components, and so failures are more likely to be deterministic, and timeouts are more likely to be caused by lock contention.

Existing tests, and manual tests under yarn-client mode.

Author: Angus Gerry <angolon@gmail.com>

Closes apache#14710 from angolon/SPARK-16533.
No idea why it was failing (the needed import was there), but
this makes things work.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#14925 from vanzin/SPARK-16533.
@tgravescs
Copy link
Contributor

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Sep 2, 2016

Test build #64855 has finished for PR 14933 at commit de488ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor

vanzin commented Sep 7, 2016

@angolon can you close the PR after I merge? (That doesn't happen automatically for branches.)

Merging to 2.0.

asfgit pushed a commit that referenced this pull request Sep 7, 2016
## What changes were proposed in this pull request?
Backport changes from #14710 and #14925 to 2.0

Author: Marcelo Vanzin <vanzin@cloudera.com>
Author: Angus Gerry <angolon@gmail.com>

Closes #14933 from angolon/SPARK-16533-2.0.
@angolon
Copy link
Contributor Author

angolon commented Sep 8, 2016

Done. Thanks again @vanzin 😄

@angolon angolon closed this Sep 8, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants