
Conversation

@jerryshao
Contributor

Currently, if dynamic allocation is enabled, explicitly killing an executor does not get a response, so the executor metadata on the driver side becomes wrong, which makes dynamic allocation on YARN fail to work.

The problem is that disableExecutor returns false for executors pending to be killed when onDisconnect is detected, so nothing further is done.

One solution is to have these explicitly killed executors bypass the loss-reason handling and use super.onDisconnect to remove the executor. This is simple.

Another solution is to still query the loss reason for these explicitly killed executors. Since an executor may be killed and reported back in the same AM-RM communication, the current way of adding a pending loss reason request does not work (the container-complete event has already been processed), so here we should store the loss reason for a later query.

This PR chooses solution 2.
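As an illustration only, here is a minimal, self-contained sketch of what "store the loss reason for a later query" means; the names (LossReasonStore, onContainerCompleted, getLossReason) are made up for the sketch and are not the PR's actual code, which lives in Spark's YARN allocator side.

import scala.collection.mutable

// Hypothetical sketch only; names are illustrative and not the PR's actual code.
class LossReasonStore {
  // Reasons recorded when the completed container was processed before the driver
  // asked why the executor disconnected.
  private val storedLossReasons = mutable.HashMap.empty[String, String]

  // Queries that arrived before the completed container was processed; each entry
  // keeps the callbacks to answer once the reason is known.
  private val pendingLossReasonRequests =
    mutable.HashMap.empty[String, mutable.Buffer[String => Unit]]

  // AM side: a completed container was reported for this executor.
  def onContainerCompleted(executorId: String, reason: String): Unit = synchronized {
    pendingLossReasonRequests.remove(executorId) match {
      case Some(callbacks) => callbacks.foreach(cb => cb(reason)) // the query arrived first
      case None            => storedLossReasons(executorId) = reason // keep it for a later query
    }
  }

  // Driver side: ask why a disconnected executor was lost.
  def getLossReason(executorId: String, reply: String => Unit): Unit = synchronized {
    storedLossReasons.remove(executorId) match {
      case Some(reason) => reply(reason) // the completed container was already processed
      case None =>
        pendingLossReasonRequests.getOrElseUpdate(executorId, mutable.Buffer.empty) += reply
    }
  }
}

Either ordering of "driver asks for the loss reason" and "AM processes the completed container" ends with the driver getting an answer.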

Please help to review. @vanzin, I think this part was changed by you previously; would you please help to review? Thanks a lot.

@jerryshao
Contributor Author

Note: here it would also be better to reply with a message, otherwise the driver will keep waiting for it until the timeout. This may happen when a loss reason query is issued but the AM has already processed the completed containers (since they run asynchronously).

@vanzin
Contributor

yes, probably good to call context.sendFailure here.
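As an illustration of that suggestion, a standalone sketch with made-up names (the plain callbacks stand in for the reply/sendFailure calls on Spark's RpcCallContext): when no loss reason can be produced for the queried executor, failing the RPC explicitly keeps the driver's blocking ask from hanging until the RPC timeout.

import scala.collection.mutable

object LossReasonReplySketch {
  // If no loss reason can be produced for the queried executor (for example the
  // completed container was handled earlier and nothing was kept), fail the RPC
  // instead of staying silent.
  def handleGetLossReason(
      executorId: String,
      storedLossReasons: mutable.Map[String, String],
      reply: String => Unit,
      sendFailure: Throwable => Unit): Unit = {
    storedLossReasons.remove(executorId) match {
      case Some(reason) => reply(reason)
      case None =>
        sendFailure(new IllegalStateException(
          s"Loss reason for executor $executorId is not available"))
    }
  }
}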

jerryshao changed the title from "[SPARK-11718][Yarn][Core]Fix explicitly killing executor dies silently issue" to "[SPARK-11718][Yarn][Core]Fix explicitly killed executor dies silently issue" on Nov 13, 2015
@SparkQA

SparkQA commented Nov 13, 2015

Test build #45837 has finished for PR 9684 at commit 85d18d2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Nov 13, 2015

Test build #45846 has finished for PR 9684 at commit 85d18d2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 13, 2015

Test build #45848 has finished for PR 9684 at commit 0ae6613.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Nov 13, 2015

One solution is to have these explicitly killed executors bypass the loss-reason handling and use super.onDisconnect to remove the executor. This is simple.

Is there any reason why that solution wouldn't work? It feels unnecessary to ask for the loss reason of an executor you asked to be killed; it would also simplify the PR.

@jerryshao
Contributor Author

Hi @vanzin, I have no strong inclination toward either solution; I can change to the first one.

The problem with the first solution is that the executor loss reason is SlaveLost rather than ExecutorExited, so the user is not well informed; of course, I could change the loss reason to make it more reasonable.
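To make the difference concrete, a toy comparison (local stand-in classes; Spark's real SlaveLost and ExecutorExited loss reasons live in org.apache.spark.scheduler, and the exit code below is only an example):

// Toy stand-ins for Spark's loss-reason classes, for illustration only.
sealed trait LossReason { def message: String }
case class SlaveLost(message: String) extends LossReason
case class ExecutorExited(exitCode: Int, exitCausedByApp: Boolean, message: String) extends LossReason

object LossReasonComparison {
  // Solution 1 (plain onDisconnected handling) can only report a generic reason.
  val generic: LossReason = SlaveLost("Executor disconnected")

  // Solution 2 keeps the YARN container-complete details, so the driver can report
  // what actually happened to the explicitly killed container.
  val specific: LossReason = ExecutorExited(
    exitCode = -105, // e.g. YARN's KILLED_BY_APPMASTER; value shown only for illustration
    exitCausedByApp = false,
    message = "Container killed at the request of the driver")
}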

Besides, I still think there is a race condition between querying the loss reason and processing completed containers, since they run in two threads. Here we assume that processing completed containers happens later than querying the loss reason, but I don't think we can guarantee that; we may receive onDisconnected later than the completed-container processing. I think that is possible, and in that case we would fail to get the loss reason. So in this PR I changed it to store the loss reason when the completed-container processing happens before the query, unless we can accept failing to get the loss reason. What's your opinion?

Thanks a lot.

@vanzin
Contributor

vanzin commented Nov 14, 2015

Besides, I still think there is a race condition between querying the loss reason and processing completed containers,

That's true. It's probably unlikely to be hit (since the "executor disconnects, ask for loss reason" path is much quicker than the "NM sees the process has died, NM notifies the RM, the Spark AM wakes up from sleep, sends a heartbeat to the RM, and sees the container has exited" path), but better to be safe.

@vanzin
Contributor

this could just be:

} else {
  executorsPendingToRemove.contains(executorId)
}
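For context, a hedged sketch of how that expression might sit inside the surrounding check; apart from executorsPendingToRemove.contains(executorId), the field and method names here are assumptions for illustration, not the actual CoarseGrainedSchedulerBackend code.

import scala.collection.mutable

class DisableExecutorSketch {
  private val liveExecutors = mutable.HashSet.empty[String]
  private val executorsPendingLossReason = mutable.HashSet.empty[String]
  private val executorsPendingToRemove = mutable.HashSet.empty[String]

  // Returns true when the caller should go on to ask for the executor's loss reason.
  def disableExecutor(executorId: String): Boolean = synchronized {
    if (liveExecutors.remove(executorId)) {
      // A live executor disconnected: a loss reason is expected for it.
      executorsPendingLossReason += executorId
      true
    } else {
      // Already gone; still ask for a loss reason if the kill was explicitly requested.
      executorsPendingToRemove.contains(executorId)
    }
  }
}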

@vanzin
Contributor

vanzin commented Nov 14, 2015

Just minor things, otherwise LGTM.

@jerryshao
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Nov 15, 2015

Test build #45948 has finished for PR 9684 at commit 5347734.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Nov 16, 2015

Merging to master / 1.6.

asfgit pushed a commit that referenced this pull request Nov 16, 2015
[SPARK-11718][Yarn][Core]Fix explicitly killed executor dies silently issue

Currently, if dynamic allocation is enabled, explicitly killing an executor does not get a response, so the executor metadata on the driver side becomes wrong, which makes dynamic allocation on YARN fail to work.

The problem is that `disableExecutor` returns false for executors pending to be killed when `onDisconnect` is detected, so nothing further is done.

One solution is to have these explicitly killed executors bypass the loss-reason handling and use `super.onDisconnect` to remove the executor. This is simple.

Another solution is to still query the loss reason for these explicitly killed executors. Since an executor may be killed and reported back in the same AM-RM communication, the current way of adding a pending loss reason request does not work (the container-complete event has already been processed), so here we should store the loss reason for a later query.

This PR chooses solution 2.

Please help to review. vanzin I think this part was changed by you previously; would you please help to review? Thanks a lot.

Author: jerryshao <sshao@hortonworks.com>

Closes #9684 from jerryshao/SPARK-11718.

(cherry picked from commit 24477d2)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
asfgit closed this in 24477d2 on Nov 16, 2015