This repository has been archived by the owner on Jan 9, 2020. It is now read-only.

Add Recovery Logic for Failed Pod #624

Open: 1 commit to merge into base: branch-2.2-kubernetes


@duyanghao duyanghao commented Mar 14, 2018

Signed-off-by: duyanghao <1294057873@qq.com>

What changes were proposed in this pull request?

Add recovery logic for failed executor pods and fix the MEM_EXCEEDED_EXIT_CODE constant.

How was this patch tested?

Manual testing shows that a failed pod is recovered successfully. The test steps were:

  1. Make an executor pod fail (here, by failing to register itself).
  2. The driver discovers the failed pod.
  3. The driver allocates a new executor pod.
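
The steps above amount to a reconcile loop: compare the pods that are still usable against the requested executor count and ask for the difference. Below is a minimal, language-neutral sketch in Python. It is not the Scala code from this patch. `ExecutorPod` and `plan_recovery` are hypothetical names, and treating only `Pending` and `Running` pods as usable is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutorPod:
    """Hypothetical, simplified view of an executor pod."""
    name: str
    phase: str  # Kubernetes pod phase: Pending, Running, Succeeded, Failed, Unknown


def plan_recovery(pods, target_instances):
    """Return (names of failed pods, number of replacement pods to request).

    Assumption: only Pending and Running pods count toward the target.
    """
    failed = [p.name for p in pods if p.phase == "Failed"]
    usable = sum(1 for p in pods if p.phase in ("Pending", "Running"))
    replacements = max(0, target_instances - usable)
    return failed, replacements


pods = [
    ExecutorPod("exec-1", "Running"),
    ExecutorPod("exec-2", "Running"),
    ExecutorPod("exec-3", "Failed"),
]
failed, needed = plan_recovery(pods, target_instances=3)
print(failed, needed)  # ['exec-3'] 1
```

A real driver would also need to avoid requesting the same replacement twice while the new pod is still being created. This sketch handles that only by counting `Pending` pods as usable.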

Configuration used for the test run, followed by the resulting pod listing:

spark.executor.instances=5

# kubectl get pods -n=xxx -a -o wide|grep spark-debug-sar-test8
spark-debug-sar-test8           1/1       Completed     0          3m        192.168.25.92    x.x.x.x
spark-debug-sar-test8-exec-1    1/1       Completed     0          3m        192.168.25.94    x.x.x.x
spark-debug-sar-test8-exec-2    1/1       Completed     0          3m        192.168.25.93    x.x.x.x
spark-debug-sar-test8-exec-3    0/1       Error         0          3m        192.168.11.31    x.x.x.x
spark-debug-sar-test8-exec-4    0/1       Error         0          3m        192.168.11.37    x.x.x.x
spark-debug-sar-test8-exec-5    0/1       Error         0          3m        192.168.11.44    x.x.x.x
spark-debug-sar-test8-exec-6    1/1       Completed     0          48s       192.168.25.99    x.x.x.x
spark-debug-sar-test8-exec-7    1/1       Completed     0          48s       192.168.25.95    x.x.x.x
spark-debug-sar-test8-exec-8    1/1       Completed     0          48s       192.168.25.97    x.x.x.x
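
To check a listing like this without counting by eye, the executor rows can be tallied by the STATUS column. The sketch below uses a hypothetical helper, `count_by_status`. It assumes the third whitespace-separated field is the status and that executor pod names contain `-exec-`, as in the output above. Applied to the full listing, it gives 5 `Completed` and 3 `Error` executors, which matches `spark.executor.instances=5`.

```python
from collections import Counter


def count_by_status(listing: str) -> Counter:
    """Tally executor pods by the STATUS column of `kubectl get pods` output.

    Assumptions: fields are whitespace-separated, the status is the third
    field, and executor pod names contain "-exec-" (the driver pod does not).
    """
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 3 and "-exec-" in fields[0]:
            counts[fields[2]] += 1
    return counts


sample = """
spark-debug-sar-test8           1/1       Completed     0          3m        192.168.25.92    x.x.x.x
spark-debug-sar-test8-exec-1    1/1       Completed     0          3m        192.168.25.94    x.x.x.x
spark-debug-sar-test8-exec-3    0/1       Error         0          3m        192.168.11.31    x.x.x.x
spark-debug-sar-test8-exec-6    1/1       Completed     0          48s       192.168.25.99    x.x.x.x
"""
print(dict(count_by_status(sample)))  # {'Completed': 2, 'Error': 1}
```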
