handle failed executor event #602

ChenLingPeng · 2018-01-11T08:28:42Z

Signed-off-by: forrestchen <forrestchen@tencent.com>

see #600

(Please fill in changes proposed in this fix)

How was this patch tested?

manual test.

liyinan926

A general question: how does Yarn handle this case, i.e., of executors that fail to register?

liyinan926 · 2018-01-11T17:30:49Z

...rc/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala

@@ -54,6 +54,8 @@ private[spark] class KubernetesClusterSchedulerBackend(
  private val RUNNING_EXECUTOR_PODS_LOCK = new Object
  // Indexed by executor IDs and guarded by RUNNING_EXECUTOR_PODS_LOCK.
  private val runningExecutorsToPods = new mutable.HashMap[String, Pod]
+  // executors names with failed status and guarded by RUNNING_EXECUTOR_PODS_LOCK.
+  private val failedExecutors = new mutable.HashSet[String]


s/executors/Executors.

liyinan926 · 2018-01-11T17:31:13Z

...rc/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala

            val (executorId, pod) = allocateNewExecutorPod(nodeToLocalTaskCount)
            runningExecutorsToPods.put(executorId, pod)
            runningPodsToExecutors.put(pod.getMetadata.getName, executorId)
            logInfo(
-              s"Requesting a new executor, total executors is now ${runningExecutorsToPods.size}")
+              s"Requesting a new executor $executorId, total executors is now " +
+                s"${runningExecutorSize()}(${failedExecutors.size} failed)")


Missing space before (.

liyinan926 · 2018-01-11T19:46:34Z

...rc/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala

+    // e.g. if we expect to create 2 executor but every executor failed,
+    // after create 1002 pod, we're not going to create more
+    def runningExecutorSize(): Int = runningExecutorsToPods.size -
+      math.min(failedExecutors.size, 1000)


Why 1000? What about totalExpectedExecutors?

Using totalExpectedExecutors sounds like a good way to limit the number of executors.

liyinan926 · 2018-01-11T19:47:12Z

...rc/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala

        }.getOrElse(logWarning(s"Unable to remove pod for unknown executor $executorId"))
      }
    }
+
+    // e.g. if we expect to create 2 executor but every executor failed,


Javadoc needs a bit of rewording to explain what the return value represents and how it is calculated.

How about this:

// Returns the number of executors created so far, excluding failed ones.
// To avoid creating too many failed executors, the number of failed
// executors counted is capped at totalExpectedExecutors, so after
// creating 2 * totalExpectedExecutors executors we stop creating more,
// even if all of them failed.
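The capped accounting discussed in this thread can be sketched in isolation as follows. This is only an illustration, not the code in the PR: the class name ExecutorAccounting and the methods recordCreated, recordFailed, and shouldRequestMore are hypothetical, pods are represented by plain strings, and it assumes (as the original formula implies) that failed executors stay in runningExecutorsToPods.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the scheduler backend's bookkeeping.
// Assumption: failed executors remain in runningExecutorsToPods.
final class ExecutorAccounting(totalExpectedExecutors: Int) {
  private val lock = new Object
  // Indexed by executor ID and guarded by lock; pod names stand in for Pod objects.
  private val runningExecutorsToPods = mutable.HashMap.empty[String, String]
  // IDs of executors with failed status, guarded by lock.
  private val failedExecutors = mutable.HashSet.empty[String]

  def recordCreated(executorId: String, podName: String): Unit = lock.synchronized {
    runningExecutorsToPods.put(executorId, podName)
  }

  def recordFailed(executorId: String): Unit = lock.synchronized {
    failedExecutors += executorId
  }

  // Created executors minus failed ones, where the failed count is capped at
  // totalExpectedExecutors so that repeated failures cannot trigger unbounded
  // pod creation.
  def runningExecutorSize(): Int = lock.synchronized {
    runningExecutorsToPods.size - math.min(failedExecutors.size, totalExpectedExecutors)
  }

  // Hypothetical allocation check: request more only while below the target.
  def shouldRequestMore(): Boolean = runningExecutorSize() < totalExpectedExecutors
}
```

With totalExpectedExecutors = 2 and every executor failing, this sketch stops requesting after four pods (2 * totalExpectedExecutors), which is the bound described in the proposed comment.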

liyinan926 · 2018-01-11T21:58:54Z

BTW: this fix should also be upstreamed. Can you file a PR against upstream apache/master?

ChenLingPeng · 2018-01-12T07:27:19Z

> A general question: how does Yarn handle this case, i.e., of executors that fail to register?

I'm not very familiar with Spark on YARN. I think that if allocateResponse.getCompletedContainersStatuses returns this kind of executor, then YARN mode can handle this scenario just like an executor that registered and then failed.

> this fix should also be upstreamed

I will do this after this PR is merged.

Signed-off-by: forrestchen <forrestchen@tencent.com>

liyinan926 reviewed Jan 11, 2018


ChenLingPeng force-pushed the handle-executor-failed branch from d7906c4 to 1e22fab Compare January 12, 2018 07:47

handle failed executor event

cf35cda

Signed-off-by: forrestchen <forrestchen@tencent.com>

ChenLingPeng force-pushed the handle-executor-failed branch from 1e22fab to cf35cda Compare January 12, 2018 08:11
