Restart job on failure for Always,OnFailure,ExitCode Policy #1605

Closed
wants to merge 1 commit into from

@yoanisgil commented Jun 8, 2022

What this PR does / why we need it:

Pod-level failures caused by the system previously caused the entire job to fail under every restart policy except ExitCode. See also #1570.

Works together with kubeflow/common#195
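
The intended behavior can be sketched as a small decision function. This is only an illustration of the idea in the PR title, not the code in this PR: the `RestartPolicy` type and its constants are local stand-ins, `shouldRestartOnPodFailure` is a hypothetical helper, and the sketch assumes that `Never` is the only policy under which a system-caused pod failure should still fail the job. It also ignores the exit-code inspection that an ExitCode policy would normally perform.

```go
package main

import "fmt"

// RestartPolicy is a local stand-in for the restart policy names
// mentioned in the PR title. It is defined here for illustration only.
type RestartPolicy string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"
	RestartPolicyExitCode  RestartPolicy = "ExitCode"
)

// shouldRestartOnPodFailure is a hypothetical helper. It reports whether
// a pod-level failure caused by the system should lead to a restart
// rather than marking the whole job as failed.
// Assumption: Always, OnFailure, and ExitCode all restart; any other
// policy (including Never) does not.
func shouldRestartOnPodFailure(policy RestartPolicy) bool {
	switch policy {
	case RestartPolicyAlways, RestartPolicyOnFailure, RestartPolicyExitCode:
		return true
	default:
		return false
	}
}

func main() {
	policies := []RestartPolicy{
		RestartPolicyAlways,
		RestartPolicyOnFailure,
		RestartPolicyExitCode,
		RestartPolicyNever,
	}
	for _, p := range policies {
		fmt.Printf("%s: restart=%v\n", p, shouldRestartOnPodFailure(p))
	}
}
```

Keeping the decision in a single function like this would also make it straightforward to share across operators, which is the question raised later in the review.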

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

#1570

Checklist:

  • Docs included if any changes are user facing

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: yoanisgil
To complete the pull request process, please assign gaocegege after the PR has been reviewed.
You can assign the PR to them by writing /assign @gaocegege in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@johnugeorge
Member

@yoanisgil do we need to apply this to the other operators as well?

/cc @gaocegege
/cc @zw0610

@google-oss-prow google-oss-prow bot requested a review from gaocegege June 8, 2022 20:20
@coveralls

coveralls commented Jun 8, 2022

Pull Request Test Coverage Report for Build 2464048060

  • 1 of 2 (50.0%) changed or added relevant lines in 2 files are covered.
  • 8 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.1%) to 36.914%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/controller.v1/pytorch/pytorchjob_controller.go | 0 | 1 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| pkg/controller.v1/mpi/mpijob_controller.go | 8 | 77.24% |

Totals:

  • Change from base Build 2445064020: -0.1%
  • Covered Lines: 2306
  • Relevant Lines: 6247

💛 - Coveralls

Member

@gaocegege gaocegege left a comment

I think it should be added to MPIJob, too.

WDYT @zw0610

@gaocegege
Member

Thanks for your contribution! 🎉 👍

The PR itself LGTM

@yoanisgil
Author

@gaocegege happy to add it to MPIJob as well, but I'd need some instructions on how to test it.

@johnugeorge
Member

johnugeorge commented Jun 9, 2022

Can you add a test case as well?

@gaocegege Why isn't this applicable to all frameworks?

@johnugeorge
Member

johnugeorge commented Jun 9, 2022

Since the original PR is ready, shall we close this one?

#1572

@yoanisgil
Author

Yup.

@yoanisgil yoanisgil closed this Jun 9, 2022
4 participants