
Conversation

@jerryshao
Contributor

Currently, if dynamic allocation is enabled, explicitly killing an executor does not get a response, so the executor metadata on the driver side becomes wrong, which makes dynamic allocation on YARN fail to work.

The problem is that disableExecutor returns false for executors pending to be killed when onDisconnect is detected, so nothing further is done.

One solution is to have these explicitly killed executors bypass the loss-reason handling and use super.onDisconnect to remove the executor. This is simple.

Another solution is to still query the loss reason for these explicitly killed executors. Since an executor may be killed and reported back in the same AM-RM communication, the current way of adding a pending loss reason request does not work (the container-complete event has already been processed), so here we should store the loss reason for a later query.

This PR chooses solution 2.
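As an illustration only, here is a minimal, self-contained sketch of what "store the loss reason for a later query" means; the names (LossReasonStore, onContainerCompleted, getLossReason) are made up for the sketch and are not the PR's actual code, which lives in Spark's YARN allocator side.

import scala.collection.mutable

// Hypothetical sketch only; names are illustrative and not the PR's actual code.
class LossReasonStore {
  // Reasons recorded when the completed container was processed before the driver
  // asked why the executor disconnected.
  private val storedLossReasons = mutable.HashMap.empty[String, String]

  // Queries that arrived before the completed container was processed; each entry
  // keeps the callbacks to answer once the reason is known.
  private val pendingLossReasonRequests =
    mutable.HashMap.empty[String, mutable.Buffer[String => Unit]]

  // AM side: a completed container was reported for this executor.
  def onContainerCompleted(executorId: String, reason: String): Unit = synchronized {
    pendingLossReasonRequests.remove(executorId) match {
      case Some(callbacks) => callbacks.foreach(cb => cb(reason)) // the query arrived first
      case None            => storedLossReasons(executorId) = reason // keep it for a later query
    }
  }

  // Driver side: ask why a disconnected executor was lost.
  def getLossReason(executorId: String, reply: String => Unit): Unit = synchronized {
    storedLossReasons.remove(executorId) match {
      case Some(reason) => reply(reason) // the completed container was already processed
      case None =>
        pendingLossReasonRequests.getOrElseUpdate(executorId, mutable.Buffer.empty) += reply
    }
  }
}

Either ordering of "driver asks for the loss reason" and "AM processes the completed container" ends with the driver getting an answer.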

Please help to review. @vanzin, I think this part was changed by you previously; would you please help to review? Thanks a lot.

@jerryshao
Contributor Author

Note: here it would also be better to reply with a message, otherwise the driver will keep waiting for it until the timeout. This may happen when a loss reason query is issued but the AM has already processed the completed containers (since they run asynchronously).

@vanzin
Contributor

yes, probably good to call context.sendFailure here.
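As an illustration of that suggestion, a standalone sketch with made-up names (the plain callbacks stand in for the reply/sendFailure calls on Spark's RpcCallContext): when no loss reason can be produced for the queried executor, failing the RPC explicitly keeps the driver's blocking ask from hanging until the RPC timeout.

import scala.collection.mutable

object LossReasonReplySketch {
  // If no loss reason can be produced for the queried executor (for example the
  // completed container was handled earlier and nothing was kept), fail the RPC
  // instead of staying silent.
  def handleGetLossReason(
      executorId: String,
      storedLossReasons: mutable.Map[String, String],
      reply: String => Unit,
      sendFailure: Throwable => Unit): Unit = {
    storedLossReasons.remove(executorId) match {
      case Some(reason) => reply(reason)
      case None =>
        sendFailure(new IllegalStateException(
          s"Loss reason for executor $executorId is not available"))
    }
  }
}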

jerryshao changed the title from "[SPARK-11718][Yarn][Core]Fix explicitly killing executor dies silently issue" to "[SPARK-11718][Yarn][Core]Fix explicitly killed executor dies silently issue" on Nov 13, 2015
@SparkQA

SparkQA commented Nov 13, 2015

Test build #45837 has finished for PR 9684 at commit 85d18d2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Nov 13, 2015

Test build #45846 has finished for PR 9684 at commit 85d18d2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 13, 2015

Test build #45848 has finished for PR 9684 at commit 0ae6613.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Nov 13, 2015

One solution is to have these explicitly killed executors bypass the loss-reason handling and use super.onDisconnect to remove the executor. This is simple.

Is there any reason why that solution wouldn't work? It feels unnecessary to ask for the loss reason of an executor you asked to be killed; it would also simplify the PR.

@jerryshao
Contributor Author

Hi @vanzin, I have no strong inclination toward either solution; I can change to the first one.

The problem with the first solution is that the executor loss reason is SlaveLost rather than ExecutorExited, so the user is not well informed; of course, I could change the loss reason to make it more reasonable.
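To make the difference concrete, a toy comparison (local stand-in classes; Spark's real SlaveLost and ExecutorExited loss reasons live in org.apache.spark.scheduler, and the exit code below is only an example):

// Toy stand-ins for Spark's loss-reason classes, for illustration only.
sealed trait LossReason { def message: String }
case class SlaveLost(message: String) extends LossReason
case class ExecutorExited(exitCode: Int, exitCausedByApp: Boolean, message: String) extends LossReason

object LossReasonComparison {
  // Solution 1 (plain onDisconnected handling) can only report a generic reason.
  val generic: LossReason = SlaveLost("Executor disconnected")

  // Solution 2 keeps the YARN container-complete details, so the driver can report
  // what actually happened to the explicitly killed container.
  val specific: LossReason = ExecutorExited(
    exitCode = -105, // e.g. YARN's KILLED_BY_APPMASTER; value shown only for illustration
    exitCausedByApp = false,
    message = "Container killed at the request of the driver")
}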

Besides, I still think there is a race condition between querying the loss reason and processing completed containers, since they run in two threads. Here we assume that processing completed containers happens later than querying the loss reason, but I don't think we can guarantee that; we may receive onDisconnected later than the completed-container processing. I think that is possible, and in that case we would fail to get the loss reason. So in this PR I changed it to store the loss reason when the completed-container processing happens before the query, unless we can accept failing to get the loss reason. What's your opinion?

Thanks a lot.

@vanzin
Contributor

vanzin commented Nov 14, 2015

Besides, I still think there is a race condition between querying the loss reason and processing completed containers,

That's true. It's probably unlikely to be hit (since the "executor disconnects, ask for loss reason" path is much quicker than the "NM sees the process has died, NM notifies the RM, the Spark AM wakes up from sleep, sends a heartbeat to the RM, and sees the container has exited" path), but better to be safe.

@vanzin
Contributor

this could just be:

} else {
  executorsPendingToRemove.contains(executorId)
}
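For context, a hedged sketch of how that expression might sit inside the surrounding check; apart from executorsPendingToRemove.contains(executorId), the field and method names here are assumptions for illustration, not the actual CoarseGrainedSchedulerBackend code.

import scala.collection.mutable

class DisableExecutorSketch {
  private val liveExecutors = mutable.HashSet.empty[String]
  private val executorsPendingLossReason = mutable.HashSet.empty[String]
  private val executorsPendingToRemove = mutable.HashSet.empty[String]

  // Returns true when the caller should go on to ask for the executor's loss reason.
  def disableExecutor(executorId: String): Boolean = synchronized {
    if (liveExecutors.remove(executorId)) {
      // A live executor disconnected: a loss reason is expected for it.
      executorsPendingLossReason += executorId
      true
    } else {
      // Already gone; still ask for a loss reason if the kill was explicitly requested.
      executorsPendingToRemove.contains(executorId)
    }
  }
}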

@vanzin
Contributor

vanzin commented Nov 14, 2015

Just minor things, otherwise LGTM.

@jerryshao
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Nov 15, 2015

Test build #45948 has finished for PR 9684 at commit 5347734.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Nov 16, 2015

Merging to master / 1.6.

asfgit pushed a commit that referenced this pull request Nov 16, 2015
[SPARK-11718][Yarn][Core]Fix explicitly killed executor dies silently issue

Currently, if dynamic allocation is enabled, explicitly killing an executor does not get a response, so the executor metadata on the driver side becomes wrong, which makes dynamic allocation on YARN fail to work.

The problem is that `disableExecutor` returns false for executors pending to be killed when `onDisconnect` is detected, so nothing further is done.

One solution is to have these explicitly killed executors bypass the loss-reason handling and use `super.onDisconnect` to remove the executor. This is simple.

Another solution is to still query the loss reason for these explicitly killed executors. Since an executor may be killed and reported back in the same AM-RM communication, the current way of adding a pending loss reason request does not work (the container-complete event has already been processed), so here we should store the loss reason for a later query.

This PR chooses solution 2.

Please help to review. vanzin I think this part was changed by you previously; would you please help to review? Thanks a lot.

Author: jerryshao <sshao@hortonworks.com>

Closes #9684 from jerryshao/SPARK-11718.

(cherry picked from commit 24477d2)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
asfgit closed this in 24477d2 on Nov 16, 2015