Inconsistent handling when validation of a job's spec fails #1704
Comments
Thanks for reporting. Yes, we should not continue if validation fails. Also, recording a warning event is a great idea. Can you fix this?
Yes, I think the MPI controller is doing it correctly.
@terrytangyuan Since an error is returned when validation fails in MPI, the reconcile function will be called again. Ref: #1705 (comment)
Maybe this issue is not complete yet.
@tenzen-y: Reopened this issue.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen
In the code referenced below, the controllers behave differently when validation of a job's spec fails. The MPIJob controller returns an error, so it does not go on to create the corresponding pods/services and tries again after some time. The PyTorchJob and TFJob controllers only print an error log and then continue, which may cause unexpected results later.
I think we need to discuss what exactly we should do when validation of a job's spec fails, and then apply that to all Jobs. In my opinion, the controller should not continue after validation fails, and we should not only print an error log but also record a warning event, so that users can find out why their Job is blocked through kubectl describe XXJob. This refers to point 4 of #1703.
training-operator/pkg/controller.v1/mpi/mpijob_controller.go
Lines 135 to 138 in 82af677
training-operator/pkg/controller.v1/pytorch/pytorchjob_controller.go
Lines 133 to 135 in 82af677
training-operator/pkg/controller.v1/tensorflow/tfjob_controller.go
Lines 158 to 160 in 82af677