[SPARK-17667][YARN][WIP] Make locking fine grained in YarnAllocator#enqueueGetLossReasonRequest #15267
Conversation
ok to test
Test build #66015 has finished for PR 15267 at commit
```scala
// returned in one AM-RM communication. So query RPC will be later than this completed
// container process.
releasedExecutorLossReasons.put(eid, exitReason)
if (!executorsKilledByDriver.contains(eid)) {
```
There is a race here: if the executor went down for a reason other than our kill after we called kill, we won't get the right loss reason. That is, killExecutor just adds it to a list to be removed later, so if it happened to die for another reason we would miss out on that. I don't think this is a big deal, but we could put in a check here to compare the reason.
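One possible shape for that check, sketched against the snippet above; `killedAtOurRequest` is a hypothetical helper, and `exitStatus`/`exitReason` are assumed to come from the surrounding completed-container handling, so this is only an illustration, not part of the patch:

```scala
import org.apache.hadoop.yarn.api.records.ContainerExitStatus

// Hypothetical check: even if the driver asked to kill this executor, keep the
// YARN-reported reason when the exit status says the container died for some
// other cause (preemption, lost node, memory kill, ...).
def killedAtOurRequest(exitStatus: Int): Boolean =
  exitStatus == ContainerExitStatus.KILLED_BY_APPMASTER

if (!executorsKilledByDriver.contains(eid) || !killedAtOurRequest(exitStatus)) {
  releasedExecutorLossReasons.put(eid, exitReason)
}
```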
SPARK-17365 should help with this situation; have you tried that? Overall the approach seems OK to me. It's an easy shortcut to avoid asking when we already know. There is the possible race condition, but it doesn't seem like a very big issue: the executor should have been idle, and if it happens to die for some other reason after we requested to kill it, we probably don't care.
It looks like you should be able to remove it after the context.reply(ExecutorKilled) returns. killExecutors should be called before the loss reason is requested, and if there is a race there it's just going to go to YARN to get the reason, which goes back to the same issue, but I assume that will be rare. We shouldn't be asking for the loss reason more than once.
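A minimal sketch of the ordering being suggested here, assuming the patch's executorsKilledByDriver set and the existing RpcCallContext-based request handling; the exact surrounding method is hypothetical:

```scala
// Reply first, then drop the entry, so the entry is never removed from the set
// before the kill acknowledgement has actually been sent.
if (executorsKilledByDriver.contains(eid)) {
  context.reply(ExecutorKilled)     // the driver asked for this kill; answer locally
  executorsKilledByDriver -= eid    // remove only after the reply, as suggested above
} else {
  // Fall back to the existing lock-protected path that queries YARN for the reason.
}
```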
Thanks for the review @tgravescs! I'll update the patch shortly. We are also trying out SPARK-17365.
@ashwinshankar77 do you plan on updating this and removing the "WIP"?
Hi @ashwinshankar77, if you are not currently able to work on this further, maybe it should be closed for now. It seems to have been inactive for a few months.
I just hit this problem in Spark 2.1. The RPC threads are all blocked by the lock.
## What changes were proposed in this pull request?

This PR proposes to close stale PRs. By "stale" here I mean PRs that have review comments from reviewers but whose authors have looked inactive, without any answer to them, for more than a month. I left some comments roughly a week ago to ping the authors, and they still look inactive in the PRs below. The list includes some PRs suggested to be closed and a PR against another branch, which seems obviously inappropriate. Given the comments in the last three PRs below, they are probably worth being taken over by anyone who is interested.

Closes apache#7963
Closes apache#8374
Closes apache#11192
Closes apache#11374
Closes apache#11692
Closes apache#12243
Closes apache#12583
Closes apache#12620
Closes apache#12675
Closes apache#12697
Closes apache#12800
Closes apache#13715
Closes apache#14266
Closes apache#15053
Closes apache#15159
Closes apache#15209
Closes apache#15264
Closes apache#15267
Closes apache#15871
Closes apache#15861
Closes apache#16319
Closes apache#16324
Closes apache#16890
Closes apache#12398
Closes apache#12933
Closes apache#14517

## How was this patch tested?

N/A

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#16937 from HyukjinKwon/stale-prs-close.
What changes were proposed in this pull request?
Following up on the discussion in SPARK-15725, one of the reasons for the AM hanging with dynamic allocation (DA) is the way locking is done in YarnAllocator. We noticed that when executors go down during the shrink phase of DA, the AM gets locked up. On taking a thread dump, we see threads trying to get the loss reason via YarnAllocator#enqueueGetLossReasonRequest, all BLOCKED waiting for the lock acquired by the allocate call. This gets worse when the number of executors going down is in the thousands, and I've seen the AM hang on the order of minutes. This patch makes the locking a little more fine grained by remembering the executors that were killed via the AM, and then serving GetExecutorLossReason requests with that information.
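To make the description concrete, here is a rough sketch of the bookkeeping it implies. Apart from the executorsKilledByDriver name, which appears in the diff above, the declarations and method body are assumptions about how the patch might look, not the actual change:

```scala
import java.util.Collections
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

// Executors the driver has asked to kill. A concurrent set lets RPC threads
// consult it without taking the allocator lock held by allocate().
private val executorsKilledByDriver =
  Collections.newSetFromMap(new ConcurrentHashMap[String, java.lang.Boolean]()).asScala

def killExecutor(executorId: String): Unit = {
  // Remember that this executor is going away at the driver's request ...
  executorsKilledByDriver += executorId
  // ... then release the container through the existing AM-RM client path.
}
```

GetExecutorLossReason requests for executors in this set can then be answered with ExecutorKilled directly, along the lines of the handler sketched in the review thread above, instead of blocking behind allocate().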
This patch is not final. I want input on how I can go about removing executors from the executorsKilledByDriver set. Also, if there is a better way to solve this, I would be happy to make the changes.
How was this patch tested?
This was tested in our cluster by manually scaling the number of executors up to thousands and then shrinking them down to something small, and making sure that we don't see BLOCKED threads stuck at YarnAllocator#enqueueGetLossReasonRequest.