[SPARK-8425][CORE] Application Level Blacklisting #14079
Conversation
This PR makes the following changes:
1. Create a new BlacklistTracker and a BlacklistStrategy interface to support more complex use cases for the blacklist mechanism.
2. Make the YARN allocator aware of node blacklist information.
3. Implement three strategies for convenience; users can also define their own:
   - SingleTaskStrategy: retains the default behavior before this change.
   - AdvanceSingleTaskStrategy: enhances SingleTaskStrategy with stage-level node blacklisting.
   - ExecutorAndNodeStrategy: lets different task sets share blacklist information.
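For readers skimming the thread, here is a rough, hypothetical sketch of what a pluggable strategy interface along the lines of this description could look like; the trait and method names below are illustrative only and are not taken from the patch.

```scala
// Hypothetical sketch only -- names are illustrative, not from the actual patch.
trait BlacklistStrategy {
  // Record a task failure for the given stage on the given executor/host.
  def onTaskFailure(stageId: Int, indexInTaskSet: Int, executorId: String, host: String): Unit

  // Executors that should not be offered tasks right now.
  def blacklistedExecutors: Set[String]

  // Nodes that should not be offered tasks (also useful to pass to the YARN allocator).
  def blacklistedNodes: Set[String]

  // Drop entries whose blacklist timeout has expired.
  def expireTimedOutEntries(nowMs: Long): Unit
}
```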
|
@kayousterhout @markhamstra @tgravescs @mwws I think this is finally ready for review. I have some minor updates left, but I wanted to get this in your hands now. The main remaining item is testing on a cluster (I would appreciate any input from you on this as well, Tom). One big implementation change I'd like to highlight: the BlacklistTracker no longer requires locks. Though it is accessed by multiple threads, it is (almost) always from some place in TaskSchedulerImpl, which already holds a lock on the taskScheduler. This also requires expiring executors while we're doing other work (rather than in a background thread) -- I chose to do it inside the call to ... The one exception to holding a lock on the taskScheduler is the YARN backend -- it needs the full set of blacklisted nodes, and it gets that without a lock on the task scheduler. But this was pretty easy to work around. I'll drop a few inline comments as well. |
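To illustrate the expiry-on-access idea (expiring stale entries while already doing scheduling work under the TaskSchedulerImpl lock, instead of in a background thread), here is a simplified, hypothetical sketch; the class and method names are made up for illustration and this is not the actual BlacklistTracker code.

```scala
import scala.collection.mutable

// Simplified, hypothetical sketch of expiry-on-access; not the real BlacklistTracker.
class SimpleBlacklist(timeoutMs: Long, clock: () => Long = () => System.currentTimeMillis()) {
  // executorId -> time at which its blacklist entry expires
  private val execToExpiryTime = mutable.HashMap[String, Long]()

  def blacklistExecutor(executorId: String): Unit = {
    execToExpiryTime(executorId) = clock() + timeoutMs
  }

  // Callers are assumed to already hold the TaskSchedulerImpl lock, so expiry can
  // happen inline here instead of in a separate background thread.
  def isExecutorBlacklisted(executorId: String): Boolean = {
    expireTimedOut()
    execToExpiryTime.contains(executorId)
  }

  private def expireTimedOut(): Unit = {
    val now = clock()
    execToExpiryTime.retain { case (_, expiry) => expiry > now }
  }
}
```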
|
Test build #61873 has finished for PR 14079 at commit
|
} else {
  blacklistTracker.taskSetFailed(manager.taskSet.stageId)
  logInfo(s"Removed TaskSet ${manager.taskSet.id}, since it failed, from pool" +
    s" ${manager.parent.name}")
Changing the log msg is unrelated to blacklisting, but this msg had always annoyed / confused me, so I thought it was worth updating since I needed success anyway.
|
I took another look at having BlacklistTracker just be an Option, rather than having a NoopBlacklist. After some other cleanup, I decided it made more sense to go back to the Option, but it's in one commit, so it's easy to go either way: a34e9ae |
|
Test build #61931 has finished for PR 14079 at commit
|
    indexInTaskSet: Int): Boolean = {
  // intentionally avoiding .getOrElse(..., new HashMap()) to avoid lots of object
  // creation, since this method gets called a *lot*
  stageIdToExecToFailures.get(stageId) match {
Wonder if something like this isn't easier to follow:
stageIdToExecToFailures.get(stageId)
  .flatMap(_.get(executorId))
  .map(_.failuresByTask.contains(indexInTaskSet))
  .getOrElse(false)
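For anyone comparing the two forms side by side, here is a self-contained sketch with stand-in types (`ExecFailures` and the surrounding map are simplified stand-ins, not the PR's actual classes); the first method mirrors the match-based version in the diff, the second the Option-chaining suggestion above.

```scala
import scala.collection.mutable.HashMap

object BlacklistLookupSketch {
  // Simplified stand-in for the per-executor failure record in the PR.
  case class ExecFailures(failuresByTask: HashMap[Int, Int])

  val stageIdToExecToFailures = HashMap[Int, HashMap[String, ExecFailures]]()

  // Match-based form: the PR's inline comment notes this avoids the object creation of
  // .getOrElse(..., new HashMap()) on a hot code path.
  def isBlacklistedWithMatch(stageId: Int, executorId: String, indexInTaskSet: Int): Boolean = {
    stageIdToExecToFailures.get(stageId) match {
      case Some(execToFailures) =>
        execToFailures.get(executorId) match {
          case Some(failures) => failures.failuresByTask.contains(indexInTaskSet)
          case None => false
        }
      case None => false
    }
  }

  // Equivalent Option-chaining form suggested in the review.
  def isBlacklistedWithChain(stageId: Int, executorId: String, indexInTaskSet: Int): Boolean = {
    stageIdToExecToFailures.get(stageId)
      .flatMap(_.get(executorId))
      .map(_.failuresByTask.contains(indexInTaskSet))
      .getOrElse(false)
  }
}
```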
|
thanks for the review @kayousterhout. I also added a test case to BlacklistTrackerSuite, "task failure timeout works as expected for long-running tasksets", to cover your point about long-running task sets. |
val firstTaskAttempts = taskScheduler.resourceOffers(offers).flatten
firstTaskAttempts.foreach { task => logInfo(s"scheduled $task on ${task.executorId}") }
assert(firstTaskAttempts.isEmpty)
assert(firstTaskAttempts.size === 1)
Can you also check that the executor ID is executor4?
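If it helps, the extra check could look roughly like this (ScalaTest-style sketch, assuming the offer ends up scheduling exactly one attempt):

```scala
// Sketch of the additional assertion; assumes exactly one task attempt was scheduled.
assert(firstTaskAttempts.size === 1)
assert(firstTaskAttempts.head.executorId === "executor4")
```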
|
Test build #70102 has finished for PR 14079 at commit
|
|
Test build #70120 has finished for PR 14079 at commit
|
|
Jenkins, retest this please |
|
Test build #70122 has finished for PR 14079 at commit
|
There is a small race in SchedulerIntegrationSuite. The test assumes that the task scheduler thread processing the last task will finish before the DAGScheduler processes the task event and notifies the job waiter, but that is not 100% guaranteed. I ran the test locally a bunch of times and it never failed, though admittedly it had never failed locally for me before either. However, I am nearly 100% certain this is what caused the failure of one Jenkins build: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68694/consoleFull (which is long gone now, sorry -- I fixed it as part of apache#14079 initially). Author: Imran Rashid <irashid@cloudera.com> Closes apache#16270 from squito/sched_integ_flakiness.
|
LGTM!!!!! 🎉 🎉 🎉 🎉 🎉 Nice work on this -- this will be awesome to have in. |
|
Great! I've been working on some additional changes on top of this: UI representation for blacklisted executors (SPARK-16654), and implicit killing of blacklisted executors (SPARK-16554). I'll be sending pull requests for those soon after this is merged. |
|
Test build #70194 has finished for PR 14079 at commit
|
|
thanks @kayousterhout ! appreciate all the time you've spent helping out on this issue. merged to master |
|
@squito Scala 2.10 is broken: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-sbt-scala-2.10/3394/console Could you submit a PR to fix it? Thanks! |
## What changes were proposed in this pull request?

This builds upon the blacklisting introduced in SPARK-17675 to add blacklisting of executors and nodes for an entire Spark application. Resources are blacklisted based on tasks that fail, in tasksets that eventually complete successfully; they are automatically returned to the pool of active resources based on a timeout. Full details are available in a design doc attached to the jira.

## How was this patch tested?

Added unit tests, ran them via Jenkins, also ran a handful of them in a loop to check for flakiness. The added tests include:
- verifying BlacklistTracker works correctly
- verifying TaskSchedulerImpl interacts with BlacklistTracker correctly (via a mock BlacklistTracker)
- an integration test for the entire scheduler with blacklisting in a few different scenarios

Author: Imran Rashid <irashid@cloudera.com>
Author: mwws <wei.mao@intel.com>

Closes apache#14079 from squito/blacklist-SPARK-8425.
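For anyone wanting to try this out, a hedged configuration sketch follows; the `spark.blacklist.*` keys below follow the settings documented for releases that include this feature, but treat the exact names and values as illustrative here rather than authoritative.

```scala
import org.apache.spark.SparkConf

// Illustrative configuration sketch; key names follow the documented spark.blacklist.* settings.
val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")
  // How long an executor or node stays blacklisted before returning to the active pool.
  .set("spark.blacklist.timeout", "1h")
  // Application-level thresholds of the kind this change introduces.
  .set("spark.blacklist.application.maxFailedTasksPerExecutor", "2")
  .set("spark.blacklist.application.maxFailedExecutorsPerNode", "2")
```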