
Conversation

@ashwinshankar77

What changes were proposed in this pull request?

Following up on the discussion in SPARK-15725, one of the reasons for the AM hanging with dynamic allocation (DA) is the way locking is done in YarnAllocator. We noticed that when executors go down during the shrink phase of DA, the AM gets locked up. On taking a thread dump, we see threads trying to get the loss reason via YarnAllocator#enqueueGetLossReasonRequest, and they are all BLOCKED waiting for the lock held by the allocate call. This gets worse when the number of executors going down is in the thousands, and I've seen the AM hang on the order of minutes. This patch makes the locking a little more fine-grained by remembering the executors that were killed via the AM, and then serving the GetExecutorLossReason requests with that information.

This patch is not final. I want input on how to go about removing executors from the executorsKilledByDriver set. Also, if there is a better way to solve this, I would be happy to make the changes.
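
For readers skimming the idea, here is a minimal sketch of the approach described above. It is not the actual patch: the name executorsKilledByDriver comes from the description, but the class, method signatures, and the string loss reason are simplified assumptions.

```scala
import scala.collection.mutable

// Minimal sketch, not the real YarnAllocator: remember which executors the
// AM itself asked YARN to kill, and answer loss-reason queries from that set
// in a short critical section instead of waiting on the lock that the
// long-running allocate() call holds.
class KilledExecutorTracker {
  private val executorsKilledByDriver = mutable.HashSet[String]()

  // Called from the kill path when the driver shrinks the executor count.
  def markKilledByDriver(executorId: String): Unit = synchronized {
    executorsKilledByDriver += executorId
  }

  // Called from the GetExecutorLossReason path. Returns a reason immediately
  // if we already know the driver killed this executor; None means the caller
  // should fall back to the existing (lock-protected) YARN lookup.
  def lossReasonIfKilledByDriver(executorId: String): Option[String] = synchronized {
    if (executorsKilledByDriver.contains(executorId)) Some("Executor killed by driver")
    else None
  }
}
```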

How was this patch tested?

This was tested in our cluster by manually scaling the number of executors up to thousands and then shrinking them down to something small, and we made sure that we no longer see BLOCKED threads stuck at YarnAllocator#enqueueGetLossReasonRequest.
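
For context, a rough sketch of the kind of dynamic-allocation configuration involved in such a test; the values below are placeholders rather than the ones actually used on our cluster.

```scala
import org.apache.spark.SparkConf

// Placeholder values; the real test scaled to thousands of executors and
// relied on the idle timeout to trigger the shrink phase.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true") // external shuffle service, needed for DA on YARN
  .set("spark.dynamicAllocation.minExecutors", "10")
  .set("spark.dynamicAllocation.maxExecutors", "3000")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
```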

@vanzin
Contributor

vanzin commented Sep 28, 2016

ok to test

@SparkQA

SparkQA commented Sep 28, 2016

Test build #66015 has finished for PR 15267 at commit 859718c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Inline review comment on the following diff context:

```scala
// returned in one AM-RM communication. So query RPC will be later than this completed
// container process.
releasedExecutorLossReasons.put(eid, exitReason)
if (!executorsKilledByDriver.contains(eid)) {
```

Contributor

There is a race here: if the executor went down for a reason other than our kill after we called kill, we won't get the right loss reason. That is, killExecutor just adds it to a list to be removed later, so if it happened to die for another reason we would miss out on that. I don't think this is a big deal, but we could put in a check here to compare reasons.
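
A hedged sketch of the kind of check suggested here; the names and the exit-status constant below are assumptions, not the actual YarnAllocator code.

```scala
// Hypothetical check: only short-circuit with "killed by driver" when the
// container's exit status is consistent with an AM-requested kill; otherwise
// report the real exit reason so a genuine failure is not masked.
object LossReasonCheck {
  // Assumed to mirror YARN's ContainerExitStatus.KILLED_BY_APPMASTER.
  val KilledByAppMaster = -105

  def lossReasonFor(eid: String,
                    exitStatus: Int,
                    exitReason: String,
                    killedByDriver: Set[String]): String = {
    if (killedByDriver.contains(eid) && exitStatus == KilledByAppMaster) {
      "Executor killed by driver"
    } else {
      exitReason // the executor died for some other reason; keep that reason
    }
  }
}
```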

@tgravescs
Contributor

SPARK-17365 should help with this situation; have you tried that?

Overall the approach seems OK to me. It's an easy shortcut to avoid asking when we already know. There is the possible race condition, but it doesn't seem like a very big issue, since the executor should have been idle, and if it happens to die for some other reason after we requested to kill it, we probably don't care.

@tgravescs
Contributor

It looks like you should be able to remove it after the context.reply(ExecutorKilled) returns. killExecutors should be called before the loss reason is requested, and if there is a race there it's just going to go to YARN to get the reason, which goes back to the same issue, but I assume that will be rare. We shouldn't be asking for the loss reason more than once.
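
A small sketch of what that cleanup might look like. context.reply and ExecutorKilled exist in the real codebase, but the handler shape and the stand-in types below are simplified assumptions.

```scala
import scala.collection.mutable

// Simplified stand-ins, just to keep the sketch self-contained.
trait RpcCallContext { def reply(response: Any): Unit }
case object ExecutorKilled

class LossReasonResponder(executorsKilledByDriver: mutable.Set[String]) {
  // Answer from the local set when possible, and forget the entry once the
  // reply has been sent, as suggested above; otherwise fall back to asking
  // YARN (the rare race case).
  def handleGetLossReason(eid: String, context: RpcCallContext)
                         (askYarn: String => Unit): Unit = {
    if (executorsKilledByDriver.contains(eid)) {
      context.reply(ExecutorKilled)
      executorsKilledByDriver -= eid
    } else {
      askYarn(eid)
    }
  }
}
```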

@ashwinshankar77
Author

Thanks for the review @tgravescs! I'll update the patch shortly. We are also trying out SPARK-17365.

@vanzin
Contributor

vanzin commented Dec 2, 2016

@ashwinshankar77 do you plan on updating this and removing the "WIP"?

@HyukjinKwon
Member

Hi @ashwinshankar77, if you are not currently able to work on this further, maybe it should be closed for now. It has been inactive for a few months.

@deshanxiao
Contributor

I just ran into this problem on Spark 2.1. The RPC threads are all blocked by the lock.

zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
## What changes were proposed in this pull request?

This PR proposes to close stale PRs.

What I mean by "stale" here is that there are review comments from reviewers but the author has been inactive, without any answer to them, for more than a month.

I left some comments roughly a week ago to ping the authors, and they still look inactive in the PRs below.

The list below includes some PRs suggested to be closed and a PR against another branch, which seems obviously inappropriate.

Given the comments in the last three PRs below, they are probably worth being taken over by anyone who is interested.

Closes apache#7963
Closes apache#8374
Closes apache#11192
Closes apache#11374
Closes apache#11692
Closes apache#12243
Closes apache#12583
Closes apache#12620
Closes apache#12675
Closes apache#12697
Closes apache#12800
Closes apache#13715
Closes apache#14266
Closes apache#15053
Closes apache#15159
Closes apache#15209
Closes apache#15264
Closes apache#15267
Closes apache#15871
Closes apache#15861
Closes apache#16319
Closes apache#16324
Closes apache#16890

Closes apache#12398
Closes apache#12933
Closes apache#14517

## How was this patch tested?

N/A

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#16937 from HyukjinKwon/stale-prs-close.