Conversation

@squito (Contributor) commented Jul 6, 2016

What changes were proposed in this pull request?

This builds upon the blacklisting introduced in SPARK-17675 to add blacklisting of executors and nodes for an entire Spark application. Resources are blacklisted based on tasks that fail, in tasksets that eventually complete successfully; they are automatically returned to the pool of active resources based on a timeout. Full details are available in a design doc attached to the jira.
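
For reference, a minimal sketch of turning this on via configuration (a sketch only -- treat the exact property keys and values as assumptions; the design doc attached to the jira is authoritative for names and defaults):

    // Sketch: enabling application-level blacklisting (property keys assumed).
    val conf = new org.apache.spark.SparkConf()
      .setAppName("blacklist-example")
      // turn blacklisting on (off by default)
      .set("spark.blacklist.enabled", "true")
      // how long a blacklisted executor/node stays out of the active pool
      // before it is automatically returned to it
      .set("spark.blacklist.timeout", "1h")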

How was this patch tested?

Added unit tests, ran them via Jenkins, also ran a handful of them in a loop to check for flakiness.

The added tests include:

  • verifying BlacklistTracker works correctly
  • verifying TaskSchedulerImpl interacts with BlacklistTracker correctly (via a mock BlacklistTracker)
  • an integration test for the entire scheduler with blacklisting in a few different scenarios

wei-mao-intel and others added 2 commits July 6, 2016 16:52

1. Create a new BlacklistTracker and BlacklistStrategy interface to support complex use cases for the blacklist mechanism.
2. Make the YARN allocator aware of node blacklist information.
3. Implement three strategies for convenience; users can also define their own:
   - SingleTaskStrategy: retains the default behavior from before this change.
   - AdvanceSingleTaskStrategy: enhances SingleTaskStrategy by supporting stage-level node blacklists.
   - ExecutorAndNodeStrategy: different task sets can share blacklist information.
@squito (Contributor, Author) commented Jul 6, 2016

@kayousterhout @markhamstra @tgravescs @mwws I think this is finally ready for review. I have some minor updates left, but I wanted to get this in your hands now. The main remaining item is testing on a cluster (I would appreciate any input from you on this as well, Tom).

One big change in implementation I'd like to highlight: the BlacklistTracker no longer requires locks. Though it's accessed by multiple threads, it's (almost) always from some place in TaskSchedulerImpl, which already holds a lock on the task scheduler. This also requires expiring executors while we're doing other work (rather than in a background thread) -- I chose to do it inside the call to taskScheduler.resourceOffers.
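
A rough sketch of the expire-on-access pattern described above (class, field, and method names are illustrative, not the actual patch):

    import scala.collection.mutable

    // Illustrative sketch only -- names are hypothetical.
    class BlacklistSketch {
      // executor id -> time (ms) at which its blacklisting expires
      private val executorIdToExpiryTime = new mutable.HashMap[String, Long]()

      // Called from TaskSchedulerImpl.resourceOffers, which already holds the
      // scheduler lock, so this class needs no synchronization of its own.
      def expireExecutorsInBlacklist(now: Long): Unit = {
        val expired = executorIdToExpiryTime.collect {
          case (exec, expiry) if expiry < now => exec
        }
        expired.foreach(executorIdToExpiryTime.remove)
      }
    }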

The one exception to holding the task scheduler lock is the YARN backend -- it needs the full set of blacklisted nodes, and it reads that without a lock on the task scheduler. But this was pretty easy to work around.

I'll drop a few inline comments as well.

@SparkQA commented Jul 7, 2016

Test build #61873 has finished for PR 14079 at commit 5bfe941.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

} else {
  blacklistTracker.taskSetFailed(manager.taskSet.stageId)
  logInfo(s"Removed TaskSet ${manager.taskSet.id}, since it failed, from pool" +
    s" ${manager.parent.name}")
@squito (Contributor, Author) commented:

Changing the log msg is unrelated to blacklisting, but this msg had always annoyed / confused me, so I thought it was worth updating since I needed the success / failure distinction here anyway.

@squito (Contributor, Author) commented Jul 7, 2016

I took another look at having BlacklistTracker just be an Option, rather than having a NoopBlacklist. After some other cleanup, I decided it made more sense to go back to the Option, but it's all in one commit, so it's easy to go either way: a34e9ae

@SparkQA commented Jul 7, 2016

Test build #61931 has finished for PR 14079 at commit cf58374.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

    indexInTaskSet: Int): Boolean = {
  // intentionally avoiding .getOrElse(..., new HashMap()) to avoid lots of object
  // creation, since this method gets called a *lot*
  stageIdToExecToFailures.get(stageId) match {
A reviewer (Contributor) commented:

Wonder if something like this isn't easier to follow:

    stageIdToExecToFailures.get(stageId)
      .flatMap(_.get(executorId))
      .map(_.failuresByTask.contains(indexInTaskSet))
      .getOrElse(false)
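
For contrast, the match-based version in the diff presumably expands to something like this (reconstructed from the visible fragment, not copied from the patch):

    stageIdToExecToFailures.get(stageId) match {
      case Some(execToFailures) =>
        execToFailures.get(executorId) match {
          case Some(failureStatus) =>
            failureStatus.failuresByTask.contains(indexInTaskSet)
          case None => false
        }
      case None => false
    }

Either form avoids the `new HashMap()` allocation the inline comment warns about; the trade-off is the verbosity of nested matches versus the closure allocations of the chained combinators.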

@squito (Contributor, Author) commented Dec 13, 2016

Thanks for the review @kayousterhout. I also added a test case to BlacklistTrackerSuite, "task failure timeout works as expected for long-running tasksets", to cover your point about long-running tasksets.
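
A sketch of the scenario such a test would exercise (the helpers and method names here are assumptions, not necessarily the suite's actual API):

    // Hypothetical sketch: a failure recorded early in a long-running taskset
    // should still expire after the configured timeout, even though the
    // taskset has not finished yet.
    clock.setTime(0L)
    recordTaskFailure("executor1")            // hypothetical helper
    clock.advance(blacklistTimeoutMs + 1)     // the taskset is still running
    blacklistTracker.applyBlacklistTimeout()  // expiry happens during other work
    assert(!blacklistTracker.isExecutorBlacklisted("executor1"))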

     val firstTaskAttempts = taskScheduler.resourceOffers(offers).flatten
     firstTaskAttempts.foreach { task => logInfo(s"scheduled $task on ${task.executorId}") }
-    assert(firstTaskAttempts.isEmpty)
+    assert(firstTaskAttempts.size === 1)
A reviewer (Contributor) commented:

Can you also check that the executor ID is executor4?
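
For example, a sketch of that check (using the executorId field this test already logs):

    assert(firstTaskAttempts.head.executorId === "executor4")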

@SparkQA commented Dec 14, 2016

Test build #70102 has finished for PR 14079 at commit c422dd4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 14, 2016
Test build #70120 has finished for PR 14079 at commit c95462f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito (Contributor, Author) commented Dec 14, 2016

Jenkins, retest this please

@SparkQA commented Dec 14, 2016

Test build #70122 has finished for PR 14079 at commit c95462f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ghost pushed a commit to dbtsai/spark that referenced this pull request Dec 14, 2016

There is a small race in SchedulerIntegrationSuite. The test assumes that the task scheduler thread processing the last task will finish before the DAGScheduler processes the task event and notifies the job waiter, but that is not 100% guaranteed.

I ran the test locally a bunch of times and it never failed, though admittedly it never failed locally for me before either. However, I am nearly 100% certain this is what caused the failure of one Jenkins build: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68694/consoleFull (which is long gone now, sorry -- I fixed it as part of apache#14079 initially)

Author: Imran Rashid <irashid@cloudera.com>

Closes apache#16270 from squito/sched_integ_flakiness.
@kayousterhout (Contributor) commented:

LGTM!!!!! 🎉 🎉 🎉 🎉 🎉

Nice work on this -- this will be awesome to have in.

@jsoltren commented:

Great!

I've been working on some additional changes on top of this: UI representation for blacklisted executors (SPARK-16654), and implicit killing of blacklisted executors (SPARK-16554). I'll be sending pull requests for those soon after this is merged.

@SparkQA commented Dec 15, 2016

Test build #70194 has finished for PR 14079 at commit f249b00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class EventTimeStats(var max: Long, var min: Long, var sum: Long, var count: Long)
  • class EventTimeStatsAccum(protected var currentStats: EventTimeStats = EventTimeStats.zero)

@asfgit asfgit closed this in 93cdb8a Dec 15, 2016
@squito (Contributor, Author) commented Dec 15, 2016

Thanks @kayousterhout! I appreciate all the time you've spent helping out on this issue.

Merged to master.

@zsxwing (Member) commented Dec 15, 2016

robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016

Closes apache#16270 from squito/sched_integ_flakiness.
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016

Author: Imran Rashid <irashid@cloudera.com>
Author: mwws <wei.mao@intel.com>

Closes apache#14079 from squito/blacklist-SPARK-8425.
@squito (Contributor, Author) commented Dec 15, 2016

Oops, thanks for letting me know @zsxwing, I just submitted #16298.

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017

Closes apache#16270 from squito/sched_integ_flakiness.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017

Closes apache#14079 from squito/blacklist-SPARK-8425.
yoonlee95 pushed a commit to yoonlee95/spark that referenced this pull request Aug 17, 2017

Closes apache#14079 from squito/blacklist-SPARK-8425.