Handle pod failures for all policies #189

georgkaleido · 2022-04-08T12:30:33Z

If a pod is in the Failed phase, we have to create a new one.
Until now it was assumed that the pod would restart because of the RestartPolicy set at the pod level.
This doesn't work if the pod fails for a system reason.
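
To make the intended behavior concrete, here is a minimal sketch of the decision described above. It is not the code from this PR. The types `PodPhase` and `RestartPolicy` and the function `needsRecreate` are hypothetical stand-ins for the Kubernetes and kubeflow/common types, and the mapping of policies to behavior (Always and OnFailure recreate the pod, ExitCode recreates it only for a retryable exit code, Never does not) is an assumption based on the PR title and the linked training-operator PR.

```go
package main

import "fmt"

// PodPhase is a simplified, hypothetical stand-in for the Kubernetes pod phase.
type PodPhase string

const (
	PodRunning PodPhase = "Running"
	PodFailed  PodPhase = "Failed"
)

// RestartPolicy is a simplified, hypothetical stand-in for the job-level
// restart policy of a replica.
type RestartPolicy string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"
	RestartPolicyExitCode  RestartPolicy = "ExitCode"
)

// needsRecreate reports whether the controller should replace the pod itself
// instead of relying on the pod-level restart policy.
// Assumption: retryableExit says whether the exit code counts as retryable;
// it only matters for the ExitCode policy.
func needsRecreate(phase PodPhase, policy RestartPolicy, retryableExit bool) bool {
	// A pod that has not reached the Failed phase needs no replacement.
	if phase != PodFailed {
		return false
	}
	switch policy {
	case RestartPolicyAlways, RestartPolicyOnFailure:
		// A pod in the Failed phase is not restarted in place,
		// so the controller has to create a new one.
		return true
	case RestartPolicyExitCode:
		return retryableExit
	default:
		// Never (and any unknown policy): leave the failed pod alone.
		return false
	}
}

func main() {
	policies := []RestartPolicy{
		RestartPolicyAlways,
		RestartPolicyOnFailure,
		RestartPolicyNever,
		RestartPolicyExitCode,
	}
	for _, p := range policies {
		fmt.Printf("policy=%s failed pod recreated: %v\n", p, needsRecreate(PodFailed, p, false))
	}
}
```

In a real controller the "recreate" step would probably mean deleting the failed pod and letting the next reconcile create a replacement, but that detail is also an assumption here.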

google-cla · 2022-04-08T12:30:38Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

For more information, open the CLA check for this pull request.

Fixes kubeflow#1570. Together with kubeflow/common#189. There can be pod-level failures caused by the system, which would previously cause the entire job to fail under all policies except ExitCode.

johnugeorge · 2022-06-03T17:17:26Z

Can you do a rebase?

georgkaleido · 2022-06-09T09:56:54Z

@johnugeorge Done

johnugeorge · 2022-06-09T11:03:47Z

@georgkaleido Can you fix the golint errors?

georgkaleido · 2022-06-09T12:14:34Z

@johnugeorge done

terrytangyuan

Thanks!

/lgtm

google-oss-prow · 2022-06-09T12:42:24Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [terrytangyuan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

If a pod is in the Failed phase, we have to create a new one. Until now it was assumed that the pod would restart because of the RestartPolicy set at the pod level. This doesn't work if the pod fails for a system reason. (cherry picked from commit 8f0ddb5)

google-oss-prow bot requested a review from terrytangyuan April 8, 2022 12:30

google-oss-prow bot added the size/S label Apr 8, 2022

georgkaleido mentioned this pull request Apr 8, 2022

Restart job on failure for Always,OnFailure Policy kubeflow/training-operator#1572

Merged

1 task

georgkaleido force-pushed the pod_policy branch from 5a42410 to bea646c Compare June 9, 2022 09:56

Handle pod failures for all policies

8f411a6

If a pod is in the Failed phase, we have to create a new one. Until now it was assumed that the pod would restart because of the RestartPolicy set at the pod level. This doesn't work if the pod fails for a system reason.

georgkaleido force-pushed the pod_policy branch from bea646c to 8f411a6 Compare June 9, 2022 12:04

terrytangyuan approved these changes Jun 9, 2022

google-oss-prow bot assigned terrytangyuan Jun 9, 2022

google-oss-prow bot added the lgtm label Jun 9, 2022

google-oss-prow bot added the approved label Jun 9, 2022

google-oss-prow bot merged commit 8f0ddb5 into kubeflow:master Jun 9, 2022

yoanisgil mentioned this pull request Jun 9, 2022

Handle pod failures for all policies #195

Closed

abin-thomas-by mentioned this pull request Aug 10, 2022

handle all restart policies kubeflow/training-operator#1649

Merged

1 task
