-
Notifications
You must be signed in to change notification settings - Fork 218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support exitCode restartPolicy #537
Comments
/kind feature |
Can you clarify what is the expected behavior? I wonder if we should have an API that is closer to Job's PodFailurePolicy instead https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy |
If we add exitCode, we might be further from this goal kubeflow/training-operator#1718 |
Probably, he expects a similar behavior to other training controllers (e.g., tfjob).
In the future, it would be good to support PodFailurePolicy once we introduce batch/v1 Job to workers. However, it might be worth supporting a similar logic to other CustomJobs in the short term. WDYT? |
I guess we'll need a new API version anyways. Since the launcher already uses Job, it might be worth translating the behavior of ExitCode into a PodFailurePolicy (I believe it should be possible), instead of manually watching the pods. |
I don't think we need a new API version since we can support // Get the exit code of the container.
var exitCode int32 = 0xbeef // magic number
for _, status := range pod.Status.ContainerStatuses {
state := status.State
if status.Name == r.GetDefaultContainerName() && state.Terminated != nil {
exitCode = state.Terminated.ExitCode
logger.Infof("Pod: %v.%v exited with code %v", pod.Namespace, pod.Name, exitCode)
r.GetRecorder().Eventf(job, corev1.EventTypeNormal, "ExitedWithCode", "Pod: %v.%v exited with code %v", pod.Namespace, pod.Name, exitCode)
}
}
// Check if the pod is retryable.
if spec.RestartPolicy == commonv1.RestartPolicyExitCode {
if pod.Status.Phase == corev1.PodFailed && trainutil.IsRetryableExitCode(exitCode) {
failedPodsCount.Inc()
logger.Infof("Need to restart the pod: %v.%v", pod.Namespace, pod.Name)
if err = r.DeletePod(ctx, pod.Namespace, pod.Name); err != nil {
return err
}
}
} If we support a similarly logic to other CustomJobs, we don't provide a configurable ExitCode similar to PodFailurePolicy. // RestartPolicyExitCode policy means that user should add exit code by themselves,
// The job operator will check these exit codes to
// determine the behavior when an error occurs:
// - 1-127: permanent error, do not restart.
// - 128-255: retryable error, will restart the pod.
RestartPolicyExitCode RestartPolicy = "ExitCode" |
I agree. |
Sorry, what I meant is that we might need a new apiversion once we plan to migrate everything to batch/Job. |
Ah, I see. I agree. We need a new API version if we migrate workers to batch/Jobs. In the short term, I would like to support the WDYT? |
sounds good |
/assign |
MPIJob doesn't support restartPolicy=ExitCode
ref: kubeflow/training-operator#1768
The text was updated successfully, but these errors were encountered: