-
Notifications
You must be signed in to change notification settings - Fork 118
handle failed executor event #602
base: branch-2.2-kubernetes
Are you sure you want to change the base?
handle failed executor event #602
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A general question: how does Yarn handle this case, i.e., of executors that fail to register?
@@ -54,6 +54,8 @@ private[spark] class KubernetesClusterSchedulerBackend( | |||
private val RUNNING_EXECUTOR_PODS_LOCK = new Object | |||
// Indexed by executor IDs and guarded by RUNNING_EXECUTOR_PODS_LOCK. | |||
private val runningExecutorsToPods = new mutable.HashMap[String, Pod] | |||
// executors names with failed status and guarded by RUNNING_EXECUTOR_PODS_LOCK. | |||
private val failedExecutors = new mutable.HashSet[String] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/executors
/Executors
.
val (executorId, pod) = allocateNewExecutorPod(nodeToLocalTaskCount) | ||
runningExecutorsToPods.put(executorId, pod) | ||
runningPodsToExecutors.put(pod.getMetadata.getName, executorId) | ||
logInfo( | ||
s"Requesting a new executor, total executors is now ${runningExecutorsToPods.size}") | ||
s"Requesting a new executor $executorId, total executors is now " + | ||
s"${runningExecutorSize()}(${failedExecutors.size} failed)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empty space before (
.
// e.g. if we expect to create 2 executor but every executor failed, | ||
// after create 1002 pod, we're not going to create more | ||
def runningExecutorSize(): Int = runningExecutorsToPods.size - | ||
math.min(failedExecutors.size, 1000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 1000? What about totalExpectedExecutors
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
totalExpectedExecutors
sounds a good idea to limit the executor numbers.
}.getOrElse(logWarning(s"Unable to remove pod for unknown executor $executorId")) | ||
} | ||
} | ||
|
||
// e.g. if we expect to create 2 executor but every executor failed, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Javadoc needs a bit of rewording to explain what the return value represents and how it is calculated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about this
// It represent current created executors exclude failed one.
// To avoid to create too many failed executor,
// we limit the accounting size of failed executors to totalExpectedExecutors
// so after create 2*totalExpectedExecutors executors,
// we stop create more even if all of them failed
BTW: this fix should also be upstreamed. Can you file a PR against upstream apache/master? |
Not so familiar with spark-on-yarn, I think if allocateResponse.getCompletedContainersStatuses can return this kind of executors, then in yarn mode, it can handle this scenario just like registered but failed executor.
Will do this after this is merged |
d7906c4
to
1e22fab
Compare
Signed-off-by: forrestchen <forrestchen@tencent.com>
1e22fab
to
cf35cda
Compare
Signed-off-by: forrestchen forrestchen@tencent.com
see #600
(Please fill in changes proposed in this fix)
How was this patch tested?
manual test.