Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent implementation about when the validation of job's spec failed #1704

Open
HeGaoYuan opened this issue Dec 22, 2022 · 8 comments · Fixed by #1705
Open

Inconsistent implementation about when the validation of job's spec failed #1704

HeGaoYuan opened this issue Dec 22, 2022 · 8 comments · Fixed by #1705

Comments

@HeGaoYuan
Copy link
Contributor

In below codes, when validate the job's spec failed, the process is different. The MPIJob will return an err, so the MPIJob will not continue to creating corresponding pods/services, it will try again after some time. The PytorchJob/TFJob will just print an error log then continue, but it maybe cause unexpected results in the future.

I think we need to discuss what exactly we should do when we validate job's spec failed then we apply it to all Jobs. In my opinion, it should not continue after validating job's spec failed, and we not only to print error log, but also need to record a warning event so that users can know why their Job is blocking through kubectl describe XXJob.

Referring to point4 of #1703

if err = kubeflowv1.ValidateV1MpiJobSpec(&mpijob.Spec); err != nil {
logger.Info(err.Error(), "MPIJob failed validation", req.NamespacedName.String())
return ctrl.Result{}, err
}

if err = kubeflowv1.ValidateV1PyTorchJobSpec(&pytorchjob.Spec); err != nil {
logger.Info(err.Error(), "PyTorchJob failed validation", req.NamespacedName.String())
}

if err = kubeflowv1.ValidateV1TFJobSpec(&tfjob.Spec); err != nil {
logger.Info(err.Error(), "TFJob failed validation", req.NamespacedName.String())
}

@johnugeorge
Copy link
Member

Thanks for reporting.

Yes. we should not continue if validation fails. Also, recording a warning event is a great idea. Can you fix this?

@johnugeorge
Copy link
Member

/cc @gaocegege @terrytangyuan

HeGaoYuan pushed a commit to HeGaoYuan/training-operator that referenced this issue Dec 23, 2022
HeGaoYuan added a commit to HeGaoYuan/training-operator that referenced this issue Dec 23, 2022
@terrytangyuan
Copy link
Member

Yes, I think MPI controller is doing it correctly.

@johnugeorge
Copy link
Member

johnugeorge commented Dec 23, 2022

@terrytangyuan Since error is returned when Validation fails in MPI, reconcile function will be called again. Ref: #1705 (comment)

HeGaoYuan added a commit to HeGaoYuan/training-operator that referenced this issue Jan 25, 2023
google-oss-prow bot pushed a commit that referenced this issue Jan 25, 2023
* fix #1704

* use commonutil.JobFailedValidationReason replace of JobFailedValidation
@tenzen-y
Copy link
Member

tenzen-y commented Jan 25, 2023

Maybe this issue does not complete.
/reopen

@google-oss-prow google-oss-prow bot reopened this Jan 25, 2023
@google-oss-prow
Copy link

@tenzen-y: Reopened this issue.

In response to this:

It looks like not to complete this issue.
/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-actions
Copy link

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@tenzen-y
Copy link
Member

/lifecycle frozen

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants