[SPARK-18761][branch-2.0] Introduce "task reaper" to oversee task killing in executors #16358

JoshRosen · 2016-12-20T20:43:07Z

Branch-2.0 backport of #16189; original description follows:

What changes were proposed in this pull request?

Spark's current task cancellation / task killing mechanism is "best effort" because some tasks may not be interruptible or may not respond to their "killed" flags being set. If a significant fraction of a cluster's task slots are occupied by tasks that have been marked as killed but remain running, new jobs and tasks can be starved of the resources held by these zombie tasks.

This patch aims to address this problem by adding a "task reaper" mechanism to executors. At a high level, task killing now launches a new thread which attempts to kill the task, then watches the task and periodically checks whether it has been killed. The TaskReaper will periodically re-attempt to call TaskRunner.kill() and will log warnings if the task keeps running. I modified TaskRunner to rename its thread at the start of the task, allowing TaskReaper to take a thread dump and filter it in order to log stack traces from the exact task thread that we are waiting to finish. If the task has not stopped after a configurable timeout, the TaskReaper will throw an exception to trigger executor JVM death, thereby forcibly freeing any resources consumed by the zombie tasks.
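
To make the kill / poll / dump / give-up loop above concrete, here is a minimal standalone sketch in plain Java. It is not the code from this patch: `SketchTask` and `SketchReaper` are hypothetical stand-ins for TaskRunner and TaskReaper, the flag-polling task body is an assumption, and the sketch simply throws `IllegalStateException` on timeout rather than taking down a JVM.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for a running task; not Spark's TaskRunner. */
final class SketchTask implements Runnable {
    private final String threadName;
    private final boolean cooperative;
    private final AtomicBoolean killed = new AtomicBoolean(false);
    private final AtomicBoolean finished = new AtomicBoolean(false);

    SketchTask(String threadName, boolean cooperative) {
        this.threadName = threadName;
        this.cooperative = cooperative;
    }

    @Override
    public void run() {
        // Rename the thread up front so a reaper can find it in a thread dump.
        Thread.currentThread().setName(threadName);
        try {
            // A cooperative task stops once its killed flag is set;
            // a non-cooperative ("zombie") task ignores the flag and interrupts.
            while (!(cooperative && killed.get())) {
                try {
                    Thread.sleep(5);
                } catch (InterruptedException ignored) {
                    // deliberately ignored to model an uninterruptible task
                }
            }
        } finally {
            finished.set(true);
        }
    }

    void kill() { killed.set(true); }
    boolean isFinished() { return finished.get(); }
    String name() { return threadName; }
}

/** Hypothetical stand-in for the reaper loop; not Spark's TaskReaper. */
final class SketchReaper {
    /** Takes a full thread dump and keeps only the frames of the named thread. */
    static String stackOf(String threadName) {
        StringBuilder sb = new StringBuilder();
        ThreadInfo[] dump = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : dump) {
            if (info != null && info.getThreadName().equals(threadName)) {
                for (StackTraceElement frame : info.getStackTrace()) {
                    sb.append("    at ").append(frame).append('\n');
                }
            }
        }
        return sb.toString();
    }

    /**
     * Re-issues kill() every pollMs until the task finishes.
     * Returns the number of kill attempts, or throws once timeoutMs has elapsed.
     */
    static int reap(SketchTask task, long pollMs, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        int attempts = 0;
        while (true) {
            task.kill();
            attempts++;
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("reaper interrupted", e);
            }
            if (task.isFinished()) {
                return attempts;
            }
            System.err.println("Killed task " + task.name() + " is still running:\n"
                + stackOf(task.name()));
            if (System.nanoTime() - deadline >= 0) {
                throw new IllegalStateException(
                    "Task " + task.name() + " did not stop within " + timeoutMs + " ms");
            }
        }
    }
}
```

In this sketch the caller decides what to do with the timeout exception. The patch description says the real TaskReaper lets its exception trigger executor JVM death, which is what actually frees the zombie task's resources.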

This feature is flagged off by default and is controlled by four new configurations under the spark.task.reaper.* namespace. See the updated configuration.md doc for details.
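
As a rough illustration of turning the feature on, the sketch below builds the four settings and renders them as `--conf key=value` arguments for `spark-submit`. The key names are taken from Spark's published configuration documentation rather than from this PR text, so check the `configuration.md` shipped with your branch. The values are illustrative only, and `ReaperConfSketch` and its methods are hypothetical helpers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; not part of Spark. */
final class ReaperConfSketch {
    /** Illustrative values only; the feature is off unless enabled is set to true. */
    static Map<String, String> reaperSettings() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.task.reaper.enabled", "true");        // opt in to task reaping
        conf.put("spark.task.reaper.pollingInterval", "10s"); // how often to re-check a killed task
        conf.put("spark.task.reaper.threadDump", "true");     // log stack traces while waiting
        conf.put("spark.task.reaper.killTimeout", "120s");    // give up and kill the JVM after this
        return conf;
    }

    /** Renders settings as alternating "--conf", "key=value" arguments. */
    static List<String> toSubmitArgs(Map<String, String> conf) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            out.add("--conf");
            out.add(e.getKey() + "=" + e.getValue());
        }
        return out;
    }
}
```

The same key/value pairs could equally go into `spark-defaults.conf` or be set on a `SparkConf` before the context is created.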

How was this patch tested?

Tested via a new test case in JobCancellationSuite, plus manual testing.

…n executors

Author: Josh Rosen <joshrosen@databricks.com>

Closes apache#16189 from JoshRosen/cancellation.

SparkQA · 2016-12-20T23:08:14Z

Test build #70419 has finished for PR 16358 at commit 3b7a24e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yhuai · 2016-12-20T23:55:28Z

LGTM. Merging to branch-2.0

…ling in executors

Branch-2.0 backport of #16189.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #16358 from JoshRosen/cancellation-branch-2.0.

JoshRosen mentioned this pull request Dec 20, 2016

[SPARK-18761][CORE] Introduce "task reaper" to oversee task killing in executors #16189

Closed

JoshRosen closed this Dec 21, 2016

JoshRosen deleted the cancellation-branch-2.0 branch December 21, 2016 00:01
