Add warn event and directly return without creating pods for job validation failure #1564

Closed


cheimu
Member

@cheimu cheimu commented Mar 24, 2022

What this PR does / why we need it:
Currently, every job controller in training-operator validates the job spec, but when validation fails it only logs an error and still goes on to create pods. This causes two problems:

  1. Users don't know the spec is invalid, because the pods are still running (for example, a TFJob with the wrong container name).
  2. Creating or running pods reserves node resources even though the pods don't behave correctly, which wastes resources.
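
A minimal sketch of the intended control flow (emit a warning event, then return before any pod is created). This is not the PR's actual diff: `JobSpec`, `Recorder`, `memRecorder`, `validate`, `reconcile`, the `FailedValidation` reason, and the `expectedContainerName` value are all hypothetical stand-ins chosen for illustration, and the real controllers would use the Kubernetes event recorder rather than this in-memory one.

```go
package main

import (
	"errors"
	"fmt"
)

// Container and JobSpec are simplified, hypothetical stand-ins for the
// real job API types; they carry only what the sketch needs.
type Container struct {
	Name string
}

type JobSpec struct {
	Name       string
	Containers []Container
}

// Recorder is a minimal stand-in for an event recorder (assumption:
// the real controllers would emit Kubernetes events instead).
type Recorder interface {
	Warn(jobName, reason, message string)
}

// memRecorder keeps events in memory so the example can be inspected.
type memRecorder struct {
	events []string
}

func (r *memRecorder) Warn(jobName, reason, message string) {
	r.events = append(r.events, fmt.Sprintf("Warning %s %s: %s", reason, jobName, message))
}

// expectedContainerName is an assumed value used only for this sketch.
const expectedContainerName = "tensorflow"

// validate rejects specs that lack a container with the expected name.
func validate(spec JobSpec) error {
	if len(spec.Containers) == 0 {
		return errors.New("job spec defines no containers")
	}
	for _, c := range spec.Containers {
		if c.Name == expectedContainerName {
			return nil
		}
	}
	return fmt.Errorf("no container named %q found", expectedContainerName)
}

// reconcile returns the number of pods it would create. On validation
// failure it records a warning event and returns before creating any pod.
func reconcile(spec JobSpec, rec Recorder) int {
	if err := validate(spec); err != nil {
		rec.Warn(spec.Name, "FailedValidation", err.Error())
		return 0
	}
	return 1 // the sketch pretends one pod per valid job
}

func main() {
	rec := &memRecorder{}
	bad := JobSpec{Name: "bad-job", Containers: []Container{{Name: "tf"}}}
	good := JobSpec{Name: "good-job", Containers: []Container{{Name: expectedContainerName}}}

	fmt.Println("pods created for bad-job:", reconcile(bad, rec))
	fmt.Println("pods created for good-job:", reconcile(good, rec))
	for _, e := range rec.events {
		fmt.Println(e)
	}
}
```

The point of returning early is that the user sees the warning on the job itself, and no node resources are reserved for pods that could never behave correctly.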

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #1563

Checklist:

  • Docs included if any changes are user facing

@aws-kf-ci-bot
Contributor

Hi @cheimu. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@coveralls

coveralls commented Mar 24, 2022

Pull Request Test Coverage Report for Build 2039519742

  • 5 of 9 (55.56%) changed or added relevant lines in 5 files are covered.
  • 4 unchanged lines in 3 files lost coverage.
  • Overall coverage increased (+0.2%) to 37.061%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/controller.v1/mxnet/mxjob_controller.go | 0 | 2 | 0.0% |
| pkg/controller.v1/xgboost/xgboostjob_controller.go | 0 | 2 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| pkg/controller.v1/mxnet/mxjob_controller.go | 1 | 0% |
| pkg/controller.v1/xgboost/xgboostjob_controller.go | 1 | 0% |
| pkg/controller.v1/mpi/mpijob_controller.go | 2 | 75.66% |

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 2014597889 | 0.2% |
| Covered Lines | 2290 |
| Relevant Lines | 6179 |

💛 - Coveralls

@gaocegege
Member

/ok-to-test

/assign @zw0610

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cheimu
To complete the pull request process, please ask for approval from zw0610 after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zw0610
Member

zw0610 commented Mar 25, 2022

Generally LGTM. It would be better if test cases are included as well.

@cheimu
Member Author

cheimu commented Mar 25, 2022

> Generally LGTM. It would be better if test cases are included as well.

Oh yeah, you are right, let me add tests

@cheimu
Member Author

cheimu commented Mar 25, 2022

/retest
The kubeflow-training-operator-presubmit error is:
kubernetes.client.exceptions.ApiException: (401)
Reason: Unauthorized
This looks like a flaky test.

@google-oss-prow google-oss-prow bot added size/L and removed size/S labels Mar 25, 2022
@cheimu
Member Author

cheimu commented Mar 25, 2022

> Generally LGTM. It would be better if test cases are included as well.

Done

@cheimu
Member Author

cheimu commented Mar 25, 2022

> /retest
> kubeflow-training-operator-presubmit error is:
> kubernetes.client.exceptions.ApiException: (401)
> Reason: Unauthorized
> Seems like flaky

/retest

same reason.

@cheimu
Member Author

cheimu commented Mar 25, 2022

/retest


@aws-kf-ci-bot
Contributor

@cheimu: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| kubeflow-training-operator-presubmit | bcbf9ba | link | /test kubeflow-training-operator-presubmit |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

zw0610 and others added 7 commits May 12, 2022 11:20
Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>
* Deprecate training-operator presubmit on optional-test-infra

This PR serves as sub-PR to deprecate training-operator presubmit on optional-test-infra.

* Update config file

Update workflow format
* Adding latest image tag

* Update manifests with latest image tag

* Adding integration tests

* Change trigger type
Currently certain operations, like tailing logs from the Python SDK,
fail against the latest version of the operator due to a label mismatch;
fix that.

Closes kubeflow#1587.
* Adding latest image tag

* Update manifests with latest image tag

* Update k8s dependencies to v0.24.1

* Update manifests

* Update k8s matrix for integration tests

* Update k8s matrix for unit tests

* Fix k8s versions

* Fix version

* Add scripts in separate file

* Fix Makefile

* Cleanup Makefile

* Addressing review comments
@zw0610
Member

zw0610 commented Jun 6, 2022

/retest

@zw0610
Member

zw0610 commented Jun 6, 2022

@johnugeorge Could you advise how to handle stalled pull requests because of the failed ci like this one after the migration to GHA?

@johnugeorge
Member

it should trigger new GHA workflow on a rebase

@cheimu Can you rebase?

@google-oss-prow google-oss-prow bot added size/XXL and removed size/L labels Jun 7, 2022
@cheimu
Member Author

cheimu commented Jun 7, 2022

:( sad for myself for a second. I'll open a new pr and close this one later...

@johnugeorge
Member

@cheimu can you do it at the earliest as we are planning to have feature release by end of next week?

@cheimu
Member Author

cheimu commented Jun 13, 2022

> @cheimu can you do it at the earliest as we are planning to have feature release by end of next week?

Hi @johnugeorge , I'm afraid I can't make it this week... I'll try my best though

@johnugeorge
Member

Closing it in favour of #1704
/close

@google-oss-prow

@johnugeorge: Closed this PR.

In response to this:

> Closing it in favour of #1704
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@google-oss-prow google-oss-prow bot closed this Jan 21, 2023
Successfully merging this pull request may close these issues.

Should we add a check for TFJob's container name?