This repository has been archived by the owner on Jan 9, 2020. It is now read-only.

handle failed executor event #602

Open
ChenLingPeng wants to merge 1 commit into base: branch-2.2-kubernetes

Conversation

ChenLingPeng (Author):

Signed-off-by: forrestchen <forrestchen@tencent.com>

see #600

(Please fill in changes proposed in this fix)

How was this patch tested?

manual test.

liyinan926 (Member) left a comment:


A general question: how does YARN handle this case, i.e., executors that fail to register?

@@ -54,6 +54,8 @@ private[spark] class KubernetesClusterSchedulerBackend(
   private val RUNNING_EXECUTOR_PODS_LOCK = new Object
   // Indexed by executor IDs and guarded by RUNNING_EXECUTOR_PODS_LOCK.
   private val runningExecutorsToPods = new mutable.HashMap[String, Pod]
+  // executors names with failed status and guarded by RUNNING_EXECUTOR_PODS_LOCK.
+  private val failedExecutors = new mutable.HashSet[String]
liyinan926 (Member):

s/executors/Executors.

   val (executorId, pod) = allocateNewExecutorPod(nodeToLocalTaskCount)
   runningExecutorsToPods.put(executorId, pod)
   runningPodsToExecutors.put(pod.getMetadata.getName, executorId)
   logInfo(
-    s"Requesting a new executor, total executors is now ${runningExecutorsToPods.size}")
+    s"Requesting a new executor $executorId, total executors is now " +
+      s"${runningExecutorSize()}(${failedExecutors.size} failed)")
liyinan926 (Member):

Empty space before (.
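
The nit presumably asks for a space before the opening parenthesis, so the log reads e.g. "3 (1 failed)" rather than "3(1 failed)"; the statement would then look like:

    logInfo(
      s"Requesting a new executor $executorId, total executors is now " +
        s"${runningExecutorSize()} (${failedExecutors.size} failed)")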

+  // e.g. if we expect to create 2 executor but every executor failed,
+  // after create 1002 pod, we're not going to create more
+  def runningExecutorSize(): Int = runningExecutorsToPods.size -
+    math.min(failedExecutors.size, 1000)
liyinan926 (Member):

Why 1000? What about totalExpectedExecutors?

ChenLingPeng (Author) commented on Jan 12, 2018:

totalExpectedExecutors sounds like a good idea for limiting the number of executors.
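
A minimal sketch of what the suggested change might look like, assuming totalExpectedExecutors is the AtomicInteger the backend already keeps for the requested executor count (an illustration, not the code in this PR):

    // Executors created so far, excluding those known to have failed. The failed
    // count used here is capped at totalExpectedExecutors, so allocation stops after
    // roughly 2 * totalExpectedExecutors created pods even if every pod fails.
    def runningExecutorSize(): Int =
      runningExecutorsToPods.size - math.min(failedExecutors.size, totalExpectedExecutors.get())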

}.getOrElse(logWarning(s"Unable to remove pod for unknown executor $executorId"))
}
}

// e.g. if we expect to create 2 executor but every executor failed,
liyinan926 (Member):

Javadoc needs a bit of rewording to explain what the return value represents and how it is calculated.

ChenLingPeng (Author):

How about this:

    // It represents the number of currently created executors, excluding failed ones.
    // To avoid creating too many failed executors,
    // we limit the number of failed executors counted here to totalExpectedExecutors,
    // so after creating 2 * totalExpectedExecutors executors
    // we stop creating more, even if all of them have failed.
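
Concretely, with totalExpectedExecutors = 2 and every pod failing (mirroring the 1002-pod example above), allocation keeps going while runningExecutorsToPods.size - math.min(failedExecutors.size, 2) < 2, so it stops after at most 2 * totalExpectedExecutors = 4 created pods instead of 1002.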

liyinan926 (Member):

BTW: this fix should also be upstreamed. Can you file a PR against upstream apache/master?

ChenLingPeng (Author):

A general question: how does YARN handle this case, i.e., executors that fail to register?

I'm not that familiar with Spark on YARN, but if allocateResponse.getCompletedContainersStatuses can return this kind of executor, then YARN mode can handle this scenario just like a registered-but-failed executor.

this fix should also be upstreamed

Will do this after this is merged
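
For the Kubernetes side of the question above, a rough sketch of how failed executor pods could be recorded in failedExecutors from the backend's pod watcher; the class name and wiring are hypothetical, written against the fabric8 Watcher API rather than taken from this PR:

    import scala.collection.mutable

    import io.fabric8.kubernetes.api.model.Pod
    import io.fabric8.kubernetes.client.{KubernetesClientException, Watcher}

    // Records executor pods that reach the "Failed" phase so that runningExecutorSize()
    // no longer counts them when deciding whether to allocate more pods.
    private class FailedExecutorWatcher(
        lock: AnyRef,
        failedExecutors: mutable.HashSet[String]) extends Watcher[Pod] {

      override def eventReceived(action: Watcher.Action, pod: Pod): Unit = {
        // Assumes the set keys on pod names, as runningPodsToExecutors does.
        if (pod.getStatus.getPhase == "Failed") {
          lock.synchronized {
            failedExecutors += pod.getMetadata.getName
          }
        }
      }

      override def onClose(cause: KubernetesClientException): Unit = {
        // Nothing to clean up in this sketch.
      }
    }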

Signed-off-by: forrestchen <forrestchen@tencent.com>