[SPARK-17667][YARN][WIP] Make locking fine grained in YarnAllocator#enqueueGetLossReasonRequest #15267
Conversation
ok to test
Test build #66015 has finished for PR 15267 at commit
```scala
// returned in one AM-RM communication. So query RPC will be later than this completed
// container process.
releasedExecutorLossReasons.put(eid, exitReason)
if (!executorsKilledByDriver.contains(eid)) {
```
There is a race here: if the executor went down for a reason other than our kill after we called kill, we won't get the right loss reason. That is, killExecutor just adds it to a list to be removed later, so if it happened to die for another reason we would miss out on that. I don't think this is a big deal, but we could put in a check here to compare the reason.
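One possible shape for that check, sketched against the snippet above; `killedAtOurRequest` is a hypothetical helper, and `exitStatus`/`exitReason` are assumed to come from the surrounding completed-container handling, so this is only an illustration, not part of the patch:

```scala
import org.apache.hadoop.yarn.api.records.ContainerExitStatus

// Hypothetical check: even if the driver asked to kill this executor, keep the
// YARN-reported reason when the exit status says the container died for some
// other cause (preemption, lost node, memory kill, ...).
def killedAtOurRequest(exitStatus: Int): Boolean =
  exitStatus == ContainerExitStatus.KILLED_BY_APPMASTER

if (!executorsKilledByDriver.contains(eid) || !killedAtOurRequest(exitStatus)) {
  releasedExecutorLossReasons.put(eid, exitReason)
}
```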
SPARK-17365 should help with this situation; have you tried that? Overall the approach seems OK to me. It's an easy shortcut to avoid asking when we already know. There is the possible race condition, but it doesn't seem like a very big issue: the executor should have been idle, and if it happens to die for some other reason after we requested to kill it, we probably don't care.
It looks like you should be able to remove it after the context.reply(ExecutorKilled) returns. killExecutors should be called before the loss reason is requested, and if there is a race there it's just going to go to YARN to get the reason, which goes back to the same issue, but I assume that will be rare. We shouldn't be asking for the loss reason more than once.
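A minimal sketch of the ordering being suggested here, assuming the patch's executorsKilledByDriver set and the existing RpcCallContext-based request handling; the exact surrounding method is hypothetical:

```scala
// Reply first, then drop the entry, so the entry is never removed from the set
// before the kill acknowledgement has actually been sent.
if (executorsKilledByDriver.contains(eid)) {
  context.reply(ExecutorKilled)     // the driver asked for this kill; answer locally
  executorsKilledByDriver -= eid    // remove only after the reply, as suggested above
} else {
  // Fall back to the existing lock-protected path that queries YARN for the reason.
}
```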
Thanks for the review @tgravescs! I'll update the patch shortly. We are also trying out SPARK-17365.
@ashwinshankar77 do you plan on updating this and removing the "WIP"?
Hi @ashwinshankar77, if you are not currently able to work on this further, maybe it should be closed for now. It seems to have been inactive for a few months.
I just hit this problem in Spark 2.1. The RPC threads are all blocked by the lock.
## What changes were proposed in this pull request?

This PR proposes to close stale PRs. By "stale" here I mean PRs that have review comments from reviewers but whose authors have looked inactive, without any answer to them, for more than a month. I left some comments roughly a week ago to ping the authors, and they still look inactive in the PRs below. The list includes some PRs suggested to be closed and a PR against another branch, which seems obviously inappropriate. Given the comments in the last three PRs below, they are probably worth being taken over by anyone who is interested.

Closes apache#7963
Closes apache#8374
Closes apache#11192
Closes apache#11374
Closes apache#11692
Closes apache#12243
Closes apache#12583
Closes apache#12620
Closes apache#12675
Closes apache#12697
Closes apache#12800
Closes apache#13715
Closes apache#14266
Closes apache#15053
Closes apache#15159
Closes apache#15209
Closes apache#15264
Closes apache#15267
Closes apache#15871
Closes apache#15861
Closes apache#16319
Closes apache#16324
Closes apache#16890
Closes apache#12398
Closes apache#12933
Closes apache#14517

## How was this patch tested?

N/A

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#16937 from HyukjinKwon/stale-prs-close.
What changes were proposed in this pull request?
Following up on the discussion in SPARK-15725, one of the reasons for the AM hanging with dynamic allocation (DA) is the way locking is done in YarnAllocator. We noticed that when executors go down during the shrink phase of DA, the AM gets locked up. On taking a thread dump, we see threads trying to get the loss reason via YarnAllocator#enqueueGetLossReasonRequest, all BLOCKED waiting for the lock acquired by the allocate call. This gets worse when the number of executors going down is in the thousands, and I've seen the AM hang on the order of minutes. This patch makes the locking a little more fine grained by remembering the executors that were killed via the AM, and then serving GetExecutorLossReason requests with that information.
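To make the description concrete, here is a rough sketch of the bookkeeping it implies. Apart from the executorsKilledByDriver name, which appears in the diff above, the declarations and method body are assumptions about how the patch might look, not the actual change:

```scala
import java.util.Collections
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

// Executors the driver has asked to kill. A concurrent set lets RPC threads
// consult it without taking the allocator lock held by allocate().
private val executorsKilledByDriver =
  Collections.newSetFromMap(new ConcurrentHashMap[String, java.lang.Boolean]()).asScala

def killExecutor(executorId: String): Unit = {
  // Remember that this executor is going away at the driver's request ...
  executorsKilledByDriver += executorId
  // ... then release the container through the existing AM-RM client path.
}
```

GetExecutorLossReason requests for executors in this set can then be answered with ExecutorKilled directly, along the lines of the handler sketched in the review thread above, instead of blocking behind allocate().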
This patch is not final. I want input on how I can go about removing executors from the executorsKilledByDriver set. Also, if there is a better way to solve this, I would be happy to make the changes.
How was this patch tested?
This was tested in our cluster by manually scaling the number of executors up to thousands and then shrinking them down to something small, and making sure that we don't see BLOCKED threads stuck at YarnAllocator#enqueueGetLossReasonRequest.