[SPARK-8425][CORE] Application Level Blacklisting #14079
Changes from all commits
@@ -17,10 +17,274 @@
```scala
package org.apache.spark.scheduler

import java.util.concurrent.atomic.AtomicReference

import scala.collection.mutable.{ArrayBuffer, HashMap, HashSet}

import org.apache.spark.SparkConf
import org.apache.spark.internal.Logging
import org.apache.spark.internal.config
import org.apache.spark.util.{Clock, SystemClock, Utils}

/**
 * BlacklistTracker is designed to track problematic executors and nodes. It supports blacklisting
 * executors and nodes across an entire application (with a periodic expiry). TaskSetManagers add
 * additional blacklisting of executors and nodes for individual tasks and stages which works in
 * concert with the blacklisting here.
 *
 * The tracker needs to deal with a variety of workloads, e.g.:
 *
 *  * bad user code -- this may lead to many task failures, but that should not count against
 *    individual executors
 *  * many small stages -- this may prevent a bad executor from having many failures within one
 *    stage, but still many failures over the entire application
 *  * "flaky" executors -- they don't fail every task, but are still faulty enough to merit
 *    blacklisting
 *
 * See the design doc on SPARK-8425 for a more in-depth discussion.
 *
 * THREADING: As with most helpers of TaskSchedulerImpl, this is not thread-safe. Though it is
 * called by multiple threads, callers must already have a lock on the TaskSchedulerImpl. The
 * one exception is [[nodeBlacklist()]], which can be called without holding a lock.
 */
private[scheduler] class BlacklistTracker (
    conf: SparkConf,
    clock: Clock = new SystemClock()) extends Logging {

  BlacklistTracker.validateBlacklistConfs(conf)
  private val MAX_FAILURES_PER_EXEC = conf.get(config.MAX_FAILURES_PER_EXEC)
  private val MAX_FAILED_EXEC_PER_NODE = conf.get(config.MAX_FAILED_EXEC_PER_NODE)
  val BLACKLIST_TIMEOUT_MILLIS = BlacklistTracker.getBlacklistTimeout(conf)

  /**
   * A map from executorId to information on task failures. Tracks the time of each task failure,
   * so that we can avoid blacklisting executors due to failures that are very far apart. We do
   * not actively remove entries as soon as tasks hit their timeouts, to avoid the time it would
   * take to do so. But it will not grow too large, because as soon as an executor gets too many
   * failures, we blacklist the executor and remove its entry here.
   */
  private val executorIdToFailureList = new HashMap[String, ExecutorFailureList]()
  val executorIdToBlacklistStatus = new HashMap[String, BlacklistedExecutor]()
  val nodeIdToBlacklistExpiryTime = new HashMap[String, Long]()
  /**
   * An immutable copy of the set of nodes that are currently blacklisted. Kept in an
   * AtomicReference to make [[nodeBlacklist()]] thread-safe.
   */
  private val _nodeBlacklist = new AtomicReference[Set[String]](Set())
  /**
   * Time when the next blacklist will expire. Used as a shortcut to avoid iterating over all
   * entries in the blacklist when none will have expired.
   */
  var nextExpiryTime: Long = Long.MaxValue
  /**
   * Mapping from nodes to all of the executors that have been blacklisted on that node. We do
   * *not* remove entries when executors are removed from Spark, so we can track when we get
   * multiple successive blacklisted executors on one node. Nonetheless, it will not grow too
   * large, because there cannot be many blacklisted executors on one node before we stop
   * requesting more executors on that node, and we clean up the list of blacklisted executors
   * once an executor has been blacklisted for BLACKLIST_TIMEOUT_MILLIS.
   */
  val nodeToBlacklistedExecs = new HashMap[String, HashSet[String]]()

  /**
   * Un-blacklists executors and nodes that have been blacklisted for at least
   * BLACKLIST_TIMEOUT_MILLIS.
   */
  def applyBlacklistTimeout(): Unit = {
```
Contributor: Can you add a docstring here? "Un-blacklists executors and nodes that have been blacklisted for at least BLACKLIST_TIMEOUT"?
```scala
    val now = clock.getTimeMillis()
    // Quickly check if we've got anything to expire from the blacklist -- if not, avoid doing
    // any work.
    if (now > nextExpiryTime) {
      // Apply the timeout to blacklisted nodes and executors.
      val execsToUnblacklist = executorIdToBlacklistStatus.filter(_._2.expiryTime < now).keys
      if (execsToUnblacklist.nonEmpty) {
        // Un-blacklist any executors that have been blacklisted longer than the blacklist timeout.
        logInfo(s"Removing executors $execsToUnblacklist from blacklist because the blacklist " +
          s"for those executors has timed out")
        execsToUnblacklist.foreach { exec =>
          val status = executorIdToBlacklistStatus.remove(exec).get
          val failedExecsOnNode = nodeToBlacklistedExecs(status.node)
          failedExecsOnNode.remove(exec)
          if (failedExecsOnNode.isEmpty) {
            nodeToBlacklistedExecs.remove(status.node)
          }
        }
      }
      val nodesToUnblacklist = nodeIdToBlacklistExpiryTime.filter(_._2 < now).keys
      if (nodesToUnblacklist.nonEmpty) {
        // Un-blacklist any nodes that have been blacklisted longer than the blacklist timeout.
        logInfo(s"Removing nodes $nodesToUnblacklist from blacklist because the blacklist " +
          s"has timed out")
        nodeIdToBlacklistExpiryTime --= nodesToUnblacklist
        _nodeBlacklist.set(nodeIdToBlacklistExpiryTime.keySet.toSet)
      }
      updateNextExpiryTime()
    }
  }
```
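As an editorial aside on how this method is meant to be driven: a minimal sketch of a driver-side caller, under the assumption (not shown in this diff) that the scheduler invokes `applyBlacklistTimeout()` before each round of resource offers. The class and method names in the sketch are illustrative only, not part of the PR.

```scala
// Hypothetical caller: run the expiry pass before deciding where to schedule, so
// executors/nodes whose blacklist entries have timed out become usable again.
private[scheduler] class SchedulerLoopSketch(blacklistTracker: BlacklistTracker) {

  // offers are (executorId, host) pairs in this sketch
  def filterOffers(offers: Seq[(String, String)]): Seq[(String, String)] = {
    // Drop expired blacklist entries first (cheap when nothing has expired,
    // thanks to the nextExpiryTime shortcut above).
    blacklistTracker.applyBlacklistTimeout()
    // Then skip offers from currently blacklisted executors or nodes.
    offers.filterNot { case (executorId, host) =>
      blacklistTracker.isExecutorBlacklisted(executorId) ||
        blacklistTracker.isNodeBlacklisted(host)
    }
  }
}
```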
```scala
  private def updateNextExpiryTime(): Unit = {
    val execMinExpiry = if (executorIdToBlacklistStatus.nonEmpty) {
      executorIdToBlacklistStatus.map{_._2.expiryTime}.min
    } else {
      Long.MaxValue
    }
    val nodeMinExpiry = if (nodeIdToBlacklistExpiryTime.nonEmpty) {
      nodeIdToBlacklistExpiryTime.values.min
    } else {
      Long.MaxValue
    }
    nextExpiryTime = math.min(execMinExpiry, nodeMinExpiry)
  }

  def updateBlacklistForSuccessfulTaskSet(
      stageId: Int,
      stageAttemptId: Int,
      failuresByExec: HashMap[String, ExecutorFailuresInTaskSet]): Unit = {
    // If any tasks failed, we count them towards the overall failure count for the executor at
    // this point.
    val now = clock.getTimeMillis()
```
Contributor: What happens for super long task sets (e.g., where the duration of the task set is longer than the blacklist timeout)? In that case, we could be adding things to the blacklist that have already expired -- is that handled correctly?

Contributor (author): Well, I think it depends what you mean by "handled correctly". We use the time the taskset completes, so it's OK if the failures happened long ago when the taskset started; we still count those failures in the app blacklist, so later failures can trickle in and push us over the limit.

OTOH, this also means that if we were already close to the limit on failures for the application when this taskset started, then a really long-running taskset will fail to push us over the limit -- by the time the latest task set finishes, we've expired the old failures, so we only get failures from the new taskset. So if your taskset time is longer than the blacklist timeout, you're unlikely to ever get application-level blacklisting. Clearly this is not great, but it's not that bad. After all, even if it were app-level blacklisted, we'd still hit the timeout and remove the bad resources from the blacklist, so we'd need to rediscover them in future blacklists. One of the main reasons for the app-level blacklist is to avoid lots of failures when the tasksets are short. If you really want an application-level blacklist which is useful across really long tasksets, then you've got to crank up your timeout.

We could change this slightly by first updating the application-level blacklist, and then expiring failures past the timeout. But to me that behavior seems much less intuitive, for a pretty questionable gain. Does that make sense? What do you think?

Contributor (author): Oh gosh, I lied completely about the way the time handling works; sorry, that is somewhat embarrassing. But the overall comment about what happens with long tasksets still applies -- it's not so clear what the right thing to do is in that case. If you really want to handle long tasksets and have app-level blacklisting, you need to increase that timeout.

Contributor: Here's what I was concerned about: suppose BLACKLIST_TIMEOUT_MILLIS is 5 and MAX_FAILURES_PER_EXEC is 2. In a long-running task set, task set 0, a task fails on executor A at time 8, but the blacklist tracker doesn't find out about this until much later, when the task set finishes (for the sake of example, time 100). In the meantime task set 1 runs, has a task that fails on executor A at time 98, and then completes shortly thereafter at time 99. At this point, there have been two failures on executor A: one at time 8 and one at time 98. These are so far apart that they shouldn't cause A to be blacklisted. But it looks like when task set 0 finishes, we'll still add the entry at time 8 to ExecutorFailureList, and then hit MAX_FAILURES_PER_EXEC and blacklist executor A. This seems overly aggressive (i.e., it seems like long-running task sets can "unfairly" get executors blacklisted that actually had very spread-out failures, potentially far in the past). It looks like this could be fixed by swapping lines 150 and 151 (with a comment that this is to handle long task sets)? I think this is what you were saying seems confusing, but I think it is necessary to avoid blacklisting behavior that seems inconsistent with the timeout. Let me know if I'm misinterpreting the behavior here!

Contributor (author): Yeah, I think you are right; thanks for walking me through this example. I can't think of any behavior which is clearly "right" for long-running task sets, but I think what you are proposing is clearer. The other situation I was worried about is when the failures occur at the beginning of the taskset, but by the time the taskset finishes we're already past the timeout -- you'd never get app-level blacklisting. But thinking about this again, I think that is the best thing to do. One other alternative would be to completely ignore the time that individual tasks fail, and instead use the time the tasksets complete for the timeout. But I think that would be overall more confusing.
```scala
    failuresByExec.foreach { case (exec, failuresInTaskSet) =>
      val appFailuresOnExecutor =
        executorIdToFailureList.getOrElseUpdate(exec, new ExecutorFailureList)
      appFailuresOnExecutor.addFailures(stageId, stageAttemptId, failuresInTaskSet)
      appFailuresOnExecutor.dropFailuresWithTimeoutBefore(now)
      val newTotal = appFailuresOnExecutor.numUniqueTaskFailures

      val expiryTimeForNewBlacklists = now + BLACKLIST_TIMEOUT_MILLIS
      // If this pushes the total number of failures over the threshold, blacklist the executor.
      // If it's already blacklisted, we avoid "re-blacklisting" (which can happen if there were
      // other tasks already running in another taskset when it got blacklisted), because it makes
      // some of the logic around expiry times a little more confusing. But it also wouldn't be a
      // problem to re-blacklist, with a later expiry time.
      if (newTotal >= MAX_FAILURES_PER_EXEC && !executorIdToBlacklistStatus.contains(exec)) {
        logInfo(s"Blacklisting executor id: $exec because it has $newTotal" +
          s" task failures in successful task sets")
        val node = failuresInTaskSet.node
        executorIdToBlacklistStatus.put(exec, BlacklistedExecutor(node, expiryTimeForNewBlacklists))
        updateNextExpiryTime()

        // In addition to blacklisting the executor, we also update the data for failures on the
        // node, and potentially put the entire node into a blacklist as well.
        val blacklistedExecsOnNode = nodeToBlacklistedExecs.getOrElseUpdate(node, HashSet[String]())
        blacklistedExecsOnNode += exec
        // If the node is already in the blacklist, we avoid adding it again with a later expiry
        // time.
        if (blacklistedExecsOnNode.size >= MAX_FAILED_EXEC_PER_NODE &&
            !nodeIdToBlacklistExpiryTime.contains(node)) {
          logInfo(s"Blacklisting node $node because it has ${blacklistedExecsOnNode.size} " +
            s"executors blacklisted: ${blacklistedExecsOnNode}")
          nodeIdToBlacklistExpiryTime.put(node, expiryTimeForNewBlacklists)
          _nodeBlacklist.set(nodeIdToBlacklistExpiryTime.keySet.toSet)
        }
      }
    }
  }

  def isExecutorBlacklisted(executorId: String): Boolean = {
    executorIdToBlacklistStatus.contains(executorId)
  }

  /**
   * Get the full set of nodes that are blacklisted. Unlike other methods in this class, this *IS*
   * thread-safe -- no lock required on a taskScheduler.
   */
  def nodeBlacklist(): Set[String] = {
    _nodeBlacklist.get()
  }

  def isNodeBlacklisted(node: String): Boolean = {
    nodeIdToBlacklistExpiryTime.contains(node)
  }

  def handleRemovedExecutor(executorId: String): Unit = {
    // We intentionally do not clean up executors that are already blacklisted in
    // nodeToBlacklistedExecs, so that if another executor on the same node gets blacklisted, we
    // can blacklist the entire node. We also can't clean up executorIdToBlacklistStatus, so we
    // can eventually remove the executor after the timeout. Despite not clearing those structures
    // here, we don't expect they will grow too big, since you won't get too many executors on one
    // node, and the timeout will clear it up periodically in any case.
    executorIdToFailureList -= executorId
  }

  /**
   * Tracks all failures for one executor (that have not passed the timeout).
   *
   * In general we actually expect this to be extremely small, since it won't contain more than
   * the maximum number of task failures before an executor is blacklisted (default 2).
   */
  private[scheduler] final class ExecutorFailureList extends Logging {

    private case class TaskId(stage: Int, stageAttempt: Int, taskIndex: Int)

    /**
     * All failures on this executor in successful task sets.
     */
    private var failuresAndExpiryTimes = ArrayBuffer[(TaskId, Long)]()
    /**
     * As an optimization, we track the min expiry time over all entries in failuresAndExpiryTimes
     * so it's quick to tell if there are any failures with expiry before the current time.
     */
    private var minExpiryTime = Long.MaxValue

    def addFailures(
        stage: Int,
        stageAttempt: Int,
        failuresInTaskSet: ExecutorFailuresInTaskSet): Unit = {
      failuresInTaskSet.taskToFailureCountAndFailureTime.foreach {
        case (taskIdx, (_, failureTime)) =>
          val expiryTime = failureTime + BLACKLIST_TIMEOUT_MILLIS
          failuresAndExpiryTimes += ((TaskId(stage, stageAttempt, taskIdx), expiryTime))
          if (expiryTime < minExpiryTime) {
            minExpiryTime = expiryTime
          }
      }
    }

    /**
     * The number of unique tasks that failed on this executor. Only counts failures within the
     * timeout, and in successful tasksets.
     */
    def numUniqueTaskFailures: Int = failuresAndExpiryTimes.size

    def isEmpty: Boolean = failuresAndExpiryTimes.isEmpty

    /**
     * Apply the timeout to individual tasks. This is to prevent one-off failures that are very
     * spread out in time (and likely have nothing to do with problems on the executor) from
     * triggering blacklisting. However, note that we do *not* remove executors and nodes from
     * the blacklist as we expire individual task failures -- each has its own timeout. E.g.,
     * suppose:
     *  * timeout = 10, maxFailuresPerExec = 2
     *  * Task 1 fails on exec 1 at time 0
     *  * Task 2 fails on exec 1 at time 5
     * --> exec 1 is blacklisted from time 5 - 15.
     * This is to simplify the implementation, as well as keep the behavior easier to understand
     * for the end user.
     */
    def dropFailuresWithTimeoutBefore(dropBefore: Long): Unit = {
      if (minExpiryTime < dropBefore) {
        var newMinExpiry = Long.MaxValue
        val newFailures = new ArrayBuffer[(TaskId, Long)]
        failuresAndExpiryTimes.foreach { case (task, expiryTime) =>
          if (expiryTime >= dropBefore) {
            newFailures += ((task, expiryTime))
            if (expiryTime < newMinExpiry) {
              newMinExpiry = expiryTime
            }
          }
        }
        failuresAndExpiryTimes = newFailures
        minExpiryTime = newMinExpiry
      }
    }

    override def toString(): String = {
      s"failures = $failuresAndExpiryTimes"
    }
  }

}

private[scheduler] object BlacklistTracker extends Logging {
```
@@ -80,7 +344,9 @@ private[scheduler] object BlacklistTracker extends Logging {

```scala
      config.MAX_TASK_ATTEMPTS_PER_EXECUTOR,
      config.MAX_TASK_ATTEMPTS_PER_NODE,
      config.MAX_FAILURES_PER_EXEC_STAGE,
      config.MAX_FAILED_EXEC_PER_NODE_STAGE,
      config.MAX_FAILURES_PER_EXEC,
      config.MAX_FAILED_EXEC_PER_NODE
    ).foreach { config =>
      val v = conf.get(config)
      if (v <= 0) {
```
@@ -112,3 +378,5 @@ private[scheduler] object BlacklistTracker extends Logging {

```scala
    }
  }
}

private final case class BlacklistedExecutor(node: String, expiryTime: Long)
```
Contributor: How about BlacklistStatus? "Executor" is confusing, since the class doesn't contain anything about an executor.

Contributor: Move this class inside BlacklistTracker? Or I wonder if it would be better to rename the map to executorIdToNodeAndExpiryTime and then just put a 2-item tuple in the map rather than this simple data structure (don't have strong feelings though, if you prefer the class).

Contributor (author): Good point, moved this and ...

Contributor (author): Ah, actually there is a minor reason why I can't put it inside BlacklistTracker -- I expose ...
Contributor comment on the _nodeBlacklist doc: Can you change this to have the first sentence say "An immutable copy of the set of nodes that are currently blacklisted (i.e., of the keys in nodeIdToBlacklistExpiryTime). Kept..."? (I keep forgetting why this is necessary)
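On that parenthetical ("I keep forgetting why this is necessary"): the point of the AtomicReference is that `nodeBlacklist()` can be read from threads that do not hold the TaskSchedulerImpl lock. A minimal sketch of that pattern follows; the class and method names are illustrative, and this is not code from the PR.

```scala
import java.util.concurrent.atomic.AtomicReference
import scala.collection.mutable

// Sketch of the snapshot pattern: writers mutate the map while holding a lock,
// then publish an immutable copy; readers on other threads only touch the
// AtomicReference, so they never need that lock.
class NodeBlacklistSnapshot {
  private val nodeIdToExpiry = new mutable.HashMap[String, Long]()
  private val snapshot = new AtomicReference[Set[String]](Set.empty[String])

  def addNode(node: String, expiry: Long): Unit = synchronized {
    nodeIdToExpiry.put(node, expiry)
    snapshot.set(nodeIdToExpiry.keySet.toSet)  // publish an immutable copy
  }

  def currentBlacklist(): Set[String] = snapshot.get()  // lock-free read
}
```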