-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-11718][Yarn][Core]Fix explicitly killed executor dies silently issue #9684
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: here it would also be better to response a message, otherwise driver will pending on waiting the message until timeout. This may happens when issue a loss reason query, AM already processed this completed containers (since they run asynchronously).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, probably good to call context.sendFailure here.
|
Test build #45837 has finished for PR 9684 at commit
|
|
Jenkins, retest this please. |
|
Test build #45846 has finished for PR 9684 at commit
|
|
Test build #45848 has finished for PR 9684 at commit
|
Is there any reason why that solution wouldn't work? It feels unnecessary to ask for the loss reason of an executor you asked to be killed; it would also simplify the PR. |
|
Hi @vanzin , I have no strong inclination on either solution, I can change to the first solution. The problem here with first solution is that the executor loss reason is Besides, still I think there's a race condition between querying loss reason and process completed containers, since they're in two threads, here we assume processing completed containers should be later than querying loss reason, but I don't think we could guarantee this. Maybe we could receive Thanks a lot. |
That's true. It's probably unlikely that it would be hit (since the "executor disconnects, ask for loss reason" path is much quicker than the "nm sees process has died, nm notifies rm, spark am wakes up from sleep and sends heartbeat to rm and sees container has exited" path), But better be safe. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could just be:
} else {
executorsPendingToRemove.contains(executorId)
}
|
Just minor things, otherwise LGTM. |
|
Jenkins, retest this please. |
|
Test build #45948 has finished for PR 9684 at commit
|
|
Merging to master / 1.6. |
…y issue Currently if dynamic allocation is enabled, explicitly killing executor will not get response, so the executor metadata is wrong in driver side. Which will make dynamic allocation on Yarn fail to work. The problem is `disableExecutor` returns false for pending killing executors when `onDisconnect` is detected, so no further implementation is done. One solution is to bypass these explicitly killed executors to use `super.onDisconnect` to remove executor. This is simple. Another solution is still querying the loss reason for these explicitly kill executors. Since executor may get killed and informed in the same AM-RM communication, so current way of adding pending loss reason request is not worked (container complete is already processed), here we should store this loss reason for later query. Here this PR chooses solution 2. Please help to review. vanzin I think this part is changed by you previously, would you please help to review? Thanks a lot. Author: jerryshao <sshao@hortonworks.com> Closes #9684 from jerryshao/SPARK-11718. (cherry picked from commit 24477d2) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
Currently if dynamic allocation is enabled, explicitly killing executor will not get response, so the executor metadata is wrong in driver side. Which will make dynamic allocation on Yarn fail to work.
The problem is
disableExecutorreturns false for pending killing executors whenonDisconnectis detected, so no further implementation is done.One solution is to bypass these explicitly killed executors to use
super.onDisconnectto remove executor. This is simple.Another solution is still querying the loss reason for these explicitly kill executors. Since executor may get killed and informed in the same AM-RM communication, so current way of adding pending loss reason request is not worked (container complete is already processed), here we should store this loss reason for later query.
Here this PR chooses solution 2.
Please help to review. @vanzin I think this part is changed by you previously, would you please help to review? Thanks a lot.