
Conversation

ericl (Contributor) commented Mar 4, 2017

What changes were proposed in this pull request?

This refactors the task kill path to allow specifying a reason for the task kill. The reason is propagated opaquely through events, and will show up in the UI automatically as `(N killed: $reason)` and `TaskKilled: $reason`. Without this change, there is no way to surface this feedback to the user through the UI.

Currently used reasons are "stage cancelled", "another attempt succeeded", and "killed via SparkContext.killTask". The user can also specify a custom reason through SparkContext.killTask.
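For illustration, a minimal sketch of the user-facing call, using the killTaskAttempt name and boolean result that the API converged on by the end of this review (the task ID and reason string here are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: assumes killTaskAttempt(taskId, interruptThread, reason): Boolean,
// the shape this API settled on during review.
object KillDemo extends App {
  val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("kill-demo"))

  val taskId = 42L // hypothetical ID of a running task attempt, e.g. read off the UI
  val wasKilled: Boolean =
    sc.killTaskAttempt(taskId, interruptThread = true, reason = "straggling on a bad node")
  // The reason then shows up in the UI as "TaskKilled: straggling on a bad node".

  sc.stop()
}
```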

cc @rxin

In the stage overview UI the reasons are summarized:
![1](https://cloud.githubusercontent.com/assets/14922/23929209/a83b2862-08e1-11e7-8b3e-ae1967bbe2e5.png)

Within the stage UI you can see individual task kill reasons:
![2](https://cloud.githubusercontent.com/assets/14922/23929200/9a798692-08e1-11e7-8697-72b27ad8a287.png)

How was this patch tested?

Existing tests, tried killing some stages in the UI and verified the messages are as expected.

SparkQA commented Mar 4, 2017

Test build #73912 has finished for PR 17166 at commit e9178b6.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class TaskKilled(reason: String, override val shouldRetry: Boolean) extends TaskFailedReason
  • case class KillTask(

SparkQA commented Mar 5, 2017

Test build #73913 has finished for PR 17166 at commit ba7cbd0.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 5, 2017

Test build #73916 has finished for PR 17166 at commit 1a716aa.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class TaskKilled(reason: String, override val shouldRetry: Boolean) extends TaskFailedReason
  • case class KillTask(

Commits pushed: "fix compile", "really fix compile"
SparkQA commented Mar 5, 2017

Test build #73917 has finished for PR 17166 at commit 91b8aef.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class TaskKilled(reason: String, override val shouldRetry: Boolean) extends TaskFailedReason
  • case class KillTask(

mridulm (Contributor) commented Mar 5, 2017

What is the rationale for this change? Is it to propagate the task kill reason to the UI?
The only use I see is the one line at https://github.com/apache/spark/pull/17166/files#diff-b8adb646ef90f616c34eb5c98d1ebd16R357.
Or did I miss some other use for this?

ericl (Author) commented Mar 5, 2017

Yes (updated the PR description) -- without this change, there is no way to provide this feedback to the user through the UI.

This also lets you choose whether the task will be rescheduled.

mridulm (Contributor) commented Mar 5, 2017

If I did not miss it, there is no way for the user to provide this information currently, right?
Or is that coming in a subsequent PR?

ericl (Author) commented Mar 5, 2017 via email

mridulm (Contributor) commented Mar 6, 2017

In that case, let us make it a more complete PR, with the proposed API changes included, so that we can evaluate the merits of the change in total.

ericl (Author) commented Mar 7, 2017

Added killTask(id: TaskId, reason: String) to SparkContext and a corresponding test.

cc @JoshRosen for the API changes. As discussed offline, it's very hard to preserve binary compatibility here since we have to move from a case object to a case class to add a reason.

* @param shouldRetry Whether the scheduler should retry the task.
*/
def killTask(
taskId: Long, executorId: String, interruptThread: Boolean, reason: String,
Contributor:

Please follow the convention defined in the "Indentation" section at http://spark.apache.org/contributing.html for long parameter lists. This happens in a bunch of methods in this PR.

ericl (Author):

fixed

 @DeveloperApi
-case object TaskKilled extends TaskFailedReason {
-  override def toErrorString: String = "TaskKilled (killed intentionally)"
+case class TaskKilled(reason: String, override val shouldRetry: Boolean) extends TaskFailedReason {
JoshRosen (Contributor), Mar 7, 2017:

I think it's a little weird for shouldRetry to be part of this interface: whether a killed task should be retried is already handled by existing scheduling logic and it's a little confusing to be setting it on a per-task basis rather than, say, having different events which are used to distinguish between speculative tasks being killed and jobs being canceled/aborted. Basically, it's confusing to me because this isn't the source-of-truth on whether we can/will re-try and its purpose here is therefore a bit unclear to me.

As discussed offline, I think that we may be able to update the logic in TaskSchedulerImpl.handleFailedTask (the only place which reads this field) in order to eliminate the need for this field in this class.

ericl (Author), Mar 7, 2017:

@kayousterhout does removing this check seem safe to you? It looks like the only case taskState != TaskState.KILLED guards against here is cancelled speculative tasks. Since those are relatively rare, it seems ok to call revive offers in those cases unconditionally. Tasks from cancelled stages and jobs should still be dropped here by the remaining isZombie check.

Contributor:

I need to do some git blaming to be 100% sure this is OK...I'll take a look tomorrow

Contributor:

IIRC, KILLED was always used internally for killing tasks that did not need a retry, hence the check.
Good point @kayousterhout; we might need to revisit the use of KILLED to ensure this does not break now that developer-invoked kills might need a retry.

Contributor:

I looked at this more and this is fine. In any case it's not harmful to call reviveOffers -- just may result in wasted work.

Contributor:

@kayousterhout Are we making the change that killed tasks can/should be retried? If yes, this is a behavior change, and we need to do the same in TSM.handleFailedTask().

This is what I mentioned w.r.t. killed tasks not resulting in task resubmission.

Contributor:

@mridulm It's possible that the makeOffers() call causes a different job's tasks to be executed on the given executor. Fundamentally, the problem is that the killed task needs to be re-scheduled on a different executor, and the only way to guarantee that the task gets offered new/different executors is to do a full reviveOffers() call (which is why the code in question exists in the first place).

Contributor:

What if Eric changes TSM.handleFailedTask to return a boolean value indicating whether the failed task needs to be re-scheduled? Then we could use that to decide whether to call reviveOffers.

ericl (Author), Mar 23, 2017:

There might be a simple solution here to avoid extra overhead on speculative tasks. We just need to check if the task index has been marked as successful -- if so, we can skip calling reviveOffers(). How does this look?

-    if (!taskSetManager.isZombie) {
+    if (!taskSetManager.isZombie && !taskSetManager.someAttemptSucceeded(tid)) {
        reviveOffers()

Then in TaskSetManager,

+  def someAttemptSucceeded(tid: Long): Boolean = {
+    successful(taskInfos(tid).index)
+  }

Contributor:

Oh that sounds great to me @ericl and minimally invasive!

ericl (Author):

Great, I updated the PR to include this.

SparkQA commented Mar 7, 2017

Test build #74059 has finished for PR 17166 at commit 90d2c98.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class TaskKilled(reason: String) extends TaskFailedReason

SparkQA commented Mar 7, 2017

Test build #74053 has finished for PR 17166 at commit a58d391.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 7, 2017

Test build #74057 has finished for PR 17166 at commit 02d81b5.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

*
* @param taskId the task ID to kill
* @param reason the reason for killing the task, which should be a short string
*/
Contributor:

Hm, I don't think we should automatically retry just because a reason was provided. Perhaps this:

def killTask(taskId: Long, reason: String): Unit

def killTaskAndRetry(taskId: Long, reason: String): Unit

Contributor:

Same thing for the lower-level DAG scheduler API.

ericl (Author), Mar 7, 2017:

Well, it turns out there's no good reason not to retry. The task will eventually get retried anyway, unless the stage is cancelled. The previous code seems to be just a performance optimization that avoided calling reviveOffers for speculative task completions.

Contributor:

Ah, OK. For some reason I read it as: killTask(long) is kill without retry, and killTask(long, string) is kill with retry.

Contributor:

What about calling it "killAndRescheduleTask" then? Otherwise kill is a little misleading -- since where we use it elsewhere (to kill a stage) it implies no retry

Contributor:

I am unclear about the expectations from the API.

  • What is the expectation when a task is being killed?
    • Is it specifically for the task attempt being referenced, or all attempts of the task?
    • If the latter, do we discard already-succeeded tasks?
    • "killAndRescheduleTask" implies the task will be rescheduled, which might not occur if this was a speculative task (or it already completed): would be good to clarify.
  • Is this expected to be exposed via the UI?
    • How is it to be leveraged (if not via the UI)?
  • How does one get the taskId - via SparkListener.onTaskStart?
    • Any other means? (IIRC there is no other way to get at it, but good to clarify for API users.)

Contributor:

Given Mridul's points maybe killTaskAttempt is a better name? IMO specifying "attempt" in the name makes it sound less permanent than killTask (which to me sounds like it won't be retried)

ericl (Author):

> What is the expectation when a task is being killed? Is it specifically for the task being referenced, or all attempts of the task?

The current task attempt (which is uniquely identified by the task ID). I updated the docs as suggested here.

> "killAndRescheduleTask" implies it will be rescheduled, which might not occur if this was a speculative task (or already completed): would be good to clarify.

Went with killTaskAttempt.

> Is this expected to be exposed via the UI? How is it to be leveraged (if not via the UI)?

For now, you can look at the Spark UI, find the task ID, and call killTaskAttempt on it. It would be nice to have this as a button on the executor page in a follow-up. You can also have a listener that kills tasks, as suggested.
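As a sketch of that listener approach (the class name and kill policy are hypothetical; assumes the killTaskAttempt signature this PR merged):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskStart}

// Hypothetical policy: kill any second (or later) attempt of tasks in one stage.
class ExtraAttemptKiller(sc: SparkContext, watchedStageId: Int) extends SparkListener {
  override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
    val info = taskStart.taskInfo
    if (taskStart.stageId == watchedStageId && info.attemptNumber > 0) {
      // The reason surfaces in the UI as "TaskKilled: <reason>" per this PR.
      sc.killTaskAttempt(info.taskId, interruptThread = true,
        reason = "extra attempts disallowed for this stage")
    }
  }
}

// Usage: sc.addSparkListener(new ExtraAttemptKiller(sc, watchedStageId = 3))
```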

}

-  def killTask(taskId: Long, interruptThread: Boolean): Unit = {
+  def killTask(
Contributor:

This fits on one line?

  def killTask(taskId: Long, interruptThread: Boolean, reason: String): Unit = {

ericl (Author):

Fixed


/**
* Kill a given task. It will be retried.
*/
Contributor:

Similar to the public API, we should separate retry from reason...

case class LaunchTask(data: SerializableBuffer) extends CoarseGrainedClusterMessage

-case class KillTask(taskId: Long, executor: String, interruptThread: Boolean)
+case class KillTask(
Contributor:

this also seems to fit in one line?

  case class KillTask(taskId: Long, executor: String, interruptThread: Boolean, reason: String)

ericl (Author):

Fixed

SparkQA commented Mar 7, 2017

Test build #74073 has started for PR 17166 at commit f03de61.

SparkQA commented Mar 7, 2017

Test build #74065 has finished for PR 17166 at commit 170fa34.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class TaskKilled(reason: String) extends TaskFailedReason


private[spark] override def killTaskIfInterrupted(): Unit = {
  if (maybeKillReason.isDefined) {
    throw new TaskKilledException(maybeKillReason.get)
Contributor:

This is not thread safe. While technically we do not allow the kill reason to be reset to None right now, so this might be fine, it can lead to future issues.

Either make all access/updates to the kill reason synchronized, or capture maybeKillReason to a local variable and use that in both the if and the throw.
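For concreteness, a minimal sketch of the capture-to-a-local option, so the check and the throw observe one consistent snapshot of the volatile field:

```scala
// Sketch only, for the snippet above: one volatile read into a local means the
// isDefined check and the .get cannot observe different values of the field.
private[spark] override def killTaskIfInterrupted(): Unit = {
  val reason = maybeKillReason // single volatile read
  if (reason.isDefined) {
    throw new TaskKilledException(reason.get)
  }
}
```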

ericl (Author):

Done

 // exception will be caught by the catch block, leading to an incorrect ExceptionFailure
 // for the task.
-throw new TaskKilledException
+throw new TaskKilledException(maybeKillReason.get)
Contributor:

Same as above here - atomic use of maybeKillReason required.

ericl (Author):

Fixed

-// Whether the corresponding task has been killed.
-@volatile private var interrupted: Boolean = false
+// If defined, the corresponding task has been killed for the contained reason.
+@volatile private var maybeKillReason: Option[String] = None
Contributor:

nit: Overloading maybeKillReason to indicate interrupted status smells a bit; but might be ok for now.

ericl (Author):

Yeah, the reason here is to allow this to be set atomically.

Contributor:

How about calling this reasonIfKilled, here and elsewhere? (If you strongly prefer the existing name, fine to leave as-is -- I just slightly prefer making it more obvious that this and the fact that the task has been killed are tightly intertwined.)

In any case, can you expand the comment a bit to the one you used below: "If specified, this task has been killed and this option contains the reason."

ericl (Author):

Done

 // A flag to indicate whether the task is killed. This is used in case context is not yet
 // initialized when kill() is invoked.
-@volatile @transient private var _killed = false
+@volatile @transient private var _maybeKillReason: String = null
Contributor:

Any reason to make this a String and not an Option[String], like the other places where it is defined/used?

ericl (Author), Mar 21, 2017:

This one sometimes gets deserialized to null (it is @transient), so it seemed cleaner to use a bare string rather than have an Option that can be null.
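A small standalone illustration of that behavior (names are hypothetical): Java serialization skips @transient fields and restores them to the JVM default, null, without re-running the Scala initializer, so an Option-typed field comes back as null rather than None.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

class Holder extends Serializable {
  // Skipped during serialization; restored to the default value for references (null),
  // because field initializers do not re-run on readObject.
  @transient var maybeReason: Option[String] = None
}

object TransientOptionDemo extends App {
  val out = new ByteArrayOutputStream()
  new ObjectOutputStream(out).writeObject(new Holder)
  val in = new ObjectInputStream(new ByteArrayInputStream(out.toByteArray))
  val restored = in.readObject().asInstanceOf[Holder]
  println(restored.maybeReason == null) // prints: true (null, not None)
}
```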

Contributor:

Can you update the comment here?

ericl (Author):

Done

SparkContextSuite.taskSucceeded = true
}
}
assert(SparkContextSuite.taskSucceeded)
Contributor:

Both the listener and the task are setting taskSucceeded? That does not look right...
I am assuming we need one failure to be raised with the appropriate message, and one task success, to ensure listener success.
Additionally, re-execution of the task to indicate success of the task (though this aspect should be covered in some other test already).

-case object TaskKilled extends TaskFailedReason {
-  override def toErrorString: String = "TaskKilled (killed intentionally)"
+case class TaskKilled(reason: String) extends TaskFailedReason {
+  override def toErrorString: String = s"TaskKilled ($reason)"
Contributor:

That is unfortunate, but it looks like it can't be helped if we need this feature.
Probably something to keep in mind with future use of case objects!

Thanks for clarifying.

 logDebug("Exception thrown after task interruption", e)
-throw new TaskKilledException
+context.killTaskIfInterrupted()
+null // not reached
Contributor:

nit: It would be good if we could directly throw the exception here instead of relying on killTaskIfInterrupted to do the right thing (the task is already interrupted according to the case check).
Not only would that remove the unreachable null, it would also ensure that future changes to killTaskIfInterrupted, interrupt resets, etc. do not break this.

ericl (Author):

Done

Contributor:

@ericl Actually that is not correct.
Killed tasks were not candidates for resubmission on failure, and hence there is no need to revive offers when task kills are detected.

If they are to be made candidates, we need to make this expectation explicit elsewhere also, to be consistent.

SparkQA commented Mar 21, 2017

Test build #74993 has finished for PR 17166 at commit 884a3ad.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class TaskKilledException(val reason: String) extends RuntimeException

mridulm (Contributor) commented Mar 21, 2017

Hi @kayousterhout,
Can you take over reviewing this PR? I might be tied up with other things for the next couple of weeks, and I don't want @ericl's work to be blocked on me.

Thanks

ericl (Author) left a comment:

@mridulm addressed the comments around atomic access. Some of the changes are a bit awkward since we access that variable during exception handling, but maybe it's ok since we don't expect to hit this case.


ericl (Author):

There is no need, but reviving offers has no effect either way. Those tasks will not be resubmitted even if reviveOffers() is called (in fact, reviveOffers() is called periodically on a timer thread, so if this was an issue we should have already seen it).

SparkQA commented Mar 21, 2017

Test build #74999 has finished for PR 17166 at commit 203a900.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 21, 2017

Test build #75000 has finished for PR 17166 at commit 6e8593b.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 22, 2017

Test build #75005 has finished for PR 17166 at commit 5707715.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

kayousterhout (Contributor) left a comment:

This is looking good (and happy to take over the review @mridulm); just a few last minor clean up comments.


 case e: Exception if context.isInterrupted =>
   logDebug("Exception thrown after task interruption", e)
-  throw new TaskKilledException
+  throw new TaskKilledException(context.getKillReason().getOrElse("unknown reason"))
Contributor:

why do you need the getOrElse here? (since isInterrupted is true, shouldn't this always be defined?)

ericl (Author):

@mridulm pointed out that should the kill reason get reset to None by a concurrent thread, this would crash. However, it is true that this can't happen in the current implementation.

If you think it's clearer, we could throw an AssertionError in this case.

Contributor:

Hm ok if Mridul wants this then fine to leave as-is

Contributor:

@kayousterhout I actually had not considered this, but maybeKillReason is also used in Executor and other places; this was a nice catch by @ericl.


case _: InterruptedException if task.killed =>
  logInfo(s"Executor interrupted and killed $taskName (TID $taskId)")
  val killReason = task.maybeKillReason.getOrElse("unknown reason")
Contributor:

Can you change if task.killed to if task.maybeKillReason.isDefined, and then just do .get here? Then you could get rid of the task.killed variable and avoid the weird dependency between task.killed being set and task.maybeKillReason being defined.

def killed: Boolean = _maybeKillReason != null

/**
* If this task has been killed, contains the reason for the kill.
Contributor:

As above, can you make the comment "If specified, this task has been killed and this option contains the reason." (assuming that you get rid of the killed variable)

ericl (Author):

Done

}

override def killTaskAttempt(taskId: Long, interruptThread: Boolean, reason: String): Unit = {
logInfo(s"Killing task ($reason): $taskId")
Contributor:

super nit but can you make this s"Killing task $taskId ($reason)"? This is somewhat more consistent with task-level logging elsewhere

ericl (Author):

Done

override def killTaskAttempt(taskId: Long, interruptThread: Boolean, reason: String): Unit = {
  logInfo(s"Killing task ($reason): $taskId")
  val execId = taskIdToExecutorId.getOrElse(
    taskId, throw new IllegalArgumentException("Task not found: " + taskId))
Contributor:

Similarly, how about s"Cannot kill task $taskId because no task with that ID was found."?

Contributor:

Also, it's kind of ugly that this throws an exception (it seems like it could be an unhappy surprise to the user that their SparkContext threw an exception / died). How about instead changing the killTaskAttempt calls to return a boolean that's true if the task was successfully killed (and then returning false here)?

ericl (Author):

Done

Contributor:

Without this change, the job could hang: if just one task was left, and that task got killed, I don't think reviveOffers would ever be called.

@mridulm I'm not that concerned about the extra calls to reviveOffers. In the worst case, if every task in a job is speculated (which of course can't actually happen), this leads to 2x the number of calls to reviveOffers -- so it still doesn't change the asymptotic time complexity even in the worst case.

There are already a bunch of cases where we're pretty conservative with reviveOffers, in the sense that we call it even though we might not need to (e.g., when an executor dies, even if there aren't any tasks that need to be run; or every time there are speculative tasks available to run, even if there aren't any resources to run them on), so this change is in keeping with that pattern.

kayousterhout (Contributor), Mar 22, 2017:

Also, I spent a while making sure that everything is ok in TSM.handleFailedTask @mridulm, and all the code there seems to handle resubmission automatically (it just didn't happen previously, when we used TaskKilled for speculative tasks, because we have a check not to re-run tasks if one copy succeeded already)

ericl (Author) left a comment:

Thanks for the comments, ptal.


// If this task has been killed before we deserialized it, let's quit now. Otherwise,
// continue executing the task.
if (killed) {
val killReason = reasonIfKilled
Contributor:

why re-name the variable here (instead of just using reasonIfKilled below)?

ericl (Author):

If we assign to a temporary, then there is no risk of seeing concurrent mutations of the value as we access it below (though, this cannot currently happen).

Contributor:

Ugh in retrospect I think TaskContext should have just clearly documented that an invariant of reasonIfKilled is that, once set, it won't be un-set, and then we'd avoid all of these corner cases. But not worth changing now.

  backend.killTask(taskId, execId.get, interruptThread, reason)
  true
} else {
  logInfo(s"Could not kill task $taskId because no task with that ID was found.")
Contributor:

logWarn?

ericl (Author):

Done

kayousterhout (Contributor):
LGTM. I'll merge once tests pass.

SparkQA commented Mar 23, 2017

Test build #75064 has finished for PR 17166 at commit a37c09b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 23, 2017

Test build #75069 has finished for PR 17166 at commit 71b41b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

kayousterhout (Contributor):
LGTM -- this looks great. Thanks for coming up with a simple way to address @mridulm's feedback Eric!

mridulm (Contributor) left a comment:

LGTM, thanks for the changes @ericl !
This definitely looks much better to me now.


     reason: TaskFailedReason): Unit = synchronized {
   taskSetManager.handleFailedTask(tid, taskState, reason)
-  if (!taskSetManager.isZombie && taskState != TaskState.KILLED) {
+  if (!taskSetManager.isZombie && !taskSetManager.someAttemptSucceeded(tid)) {
Contributor:

This should do it IMO, thanks!

SparkQA commented Mar 23, 2017

Test build #75117 has finished for PR 17166 at commit 3ec3633.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 24, 2017

Test build #75130 has finished for PR 17166 at commit 8c4381f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

kayousterhout (Contributor):
I merged this to master. I realized that the PR description is still from an old version of the change, so I modified the commit message to add that this also adds the SparkContext.killTaskAttempt method. Thanks for all of the work here @ericl!

@asfgit asfgit closed this in 8e55804 Mar 24, 2017
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Aug 20, 2018
This commit adds a killTaskAttempt method to SparkContext, to allow users to
kill tasks so that they can be re-scheduled elsewhere.

This also refactors the task kill path to allow specifying a reason for the task kill. The reason is propagated opaquely through events, and will show up in the UI automatically as `(N killed: $reason)` and `TaskKilled: $reason`. Without this change, there is no way to provide the user feedback through the UI.

Currently used reasons are "stage cancelled", "another attempt succeeded", and "killed via SparkContext.killTask". The user can also specify a custom reason through `SparkContext.killTask`.

cc rxin

In the stage overview UI the reasons are summarized:
![1](https://cloud.githubusercontent.com/assets/14922/23929209/a83b2862-08e1-11e7-8b3e-ae1967bbe2e5.png)

Within the stage UI you can see individual task kill reasons:
![2](https://cloud.githubusercontent.com/assets/14922/23929200/9a798692-08e1-11e7-8697-72b27ad8a287.png)

Existing tests, tried killing some stages in the UI and verified the messages are as expected.

Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekl@google.com>

Closes apache#17166 from ericl/kill-reason.