Inconsistent implementation about when the validation of job's spec failed #1704

HeGaoYuan · 2022-12-22T14:30:49Z

In the code below, the controllers handle a failed validation of the job's spec differently. The MPIJob controller returns an error, so it does not go on to create the corresponding pods/services and tries again after some time. The PyTorchJob and TFJob controllers only print an error log and then continue, which may cause unexpected results later.

I think we need to discuss exactly what we should do when validation of a job's spec fails, and then apply that to all Jobs. In my opinion, the controller should not continue after validation fails, and we should not only print an error log but also record a warning event, so that users can find out why their Job is blocked through kubectl describe XXJob.

Referring to point 4 of #1703

training-operator/pkg/controller.v1/mpi/mpijob_controller.go

Lines 135 to 138 in 82af677

    
    if err = kubeflowv1.ValidateV1MpiJobSpec(&mpijob.Spec); err != nil {
    	logger.Info(err.Error(), "MPIJob failed validation", req.NamespacedName.String())
    	return ctrl.Result{}, err
    }

training-operator/pkg/controller.v1/pytorch/pytorchjob_controller.go

Lines 133 to 135 in 82af677

    
    if err = kubeflowv1.ValidateV1PyTorchJobSpec(&pytorchjob.Spec); err != nil {
    	logger.Info(err.Error(), "PyTorchJob failed validation", req.NamespacedName.String())
    }

training-operator/pkg/controller.v1/tensorflow/tfjob_controller.go

Lines 158 to 160 in 82af677

    
    if err = kubeflowv1.ValidateV1TFJobSpec(&tfjob.Spec); err != nil {
    	logger.Info(err.Error(), "TFJob failed validation", req.NamespacedName.String())
    }
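The behavior proposed above (stop reconciling when validation fails, and record a warning event in addition to the log line) could be sketched roughly as follows. This is a standalone illustration using only the standard library: Job, eventRecorder, validateSpec, reconcile, and the reason string are hypothetical stand-ins, not the training-operator or controller-runtime APIs.

```go
package main

import (
	"errors"
	"fmt"
)

// Job is a hypothetical stand-in for MPIJob/PyTorchJob/TFJob.
type Job struct {
	Name     string
	Replicas int
}

// eventRecorder is a hypothetical stand-in for a Kubernetes event recorder.
// It only collects warning events in memory so they can be inspected.
type eventRecorder struct {
	events []string
}

// Warning records a warning event for the named job.
func (r *eventRecorder) Warning(job, reason, msg string) {
	r.events = append(r.events, fmt.Sprintf("Warning %s %s: %s", reason, job, msg))
}

// validateSpec is a hypothetical validator: it rejects jobs without replicas.
func validateSpec(j Job) error {
	if j.Replicas <= 0 {
		return errors.New("spec must define at least one replica")
	}
	return nil
}

// reconcile reports whether it went on to create pods/services.
// On a validation failure it records a warning event and returns early,
// which is the same handling for every job kind.
func reconcile(j Job, rec *eventRecorder) (created bool, err error) {
	if err := validateSpec(j); err != nil {
		// Illustrative reason string, assumed for this sketch.
		rec.Warning(j.Name, "JobFailedValidation", err.Error())
		return false, err
	}
	// Pod and service creation would happen here.
	return true, nil
}

func main() {
	rec := &eventRecorder{}
	created, err := reconcile(Job{Name: "bad-job"}, rec)
	fmt.Println(created, err)
	fmt.Println(rec.events)
}
```

With a recorder like this wired to real Kubernetes events, the validation message would show up in the job's event list instead of only in the operator log.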


johnugeorge · 2022-12-22T17:47:31Z

Thanks for reporting.

Yes, we should not continue if validation fails. Also, recording a warning event is a great idea. Can you fix this?

johnugeorge · 2022-12-22T17:47:42Z

/cc @gaocegege @terrytangyuan

terrytangyuan · 2022-12-23T13:51:01Z

Yes, I think the MPI controller is doing it correctly.

johnugeorge · 2022-12-23T14:31:27Z

@terrytangyuan Since an error is returned when validation fails in MPI, the reconcile function will be called again. Ref: #1705 (comment)


tenzen-y · 2023-01-25T18:15:02Z

This issue may not be fully resolved yet.
/reopen

google-oss-prow · 2023-01-25T18:15:06Z

@tenzen-y: Reopened this issue.

In response to this:

It looks like this issue is not complete.
/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

github-actions · 2023-08-24T15:02:11Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tenzen-y · 2023-09-13T07:57:42Z

/lifecycle frozen

HeGaoYuan pushed a commit to HeGaoYuan/training-operator that referenced this issue Dec 23, 2022

fix kubeflow#1704

89e6968

HeGaoYuan mentioned this issue Dec 23, 2022

fix https://github.com/kubeflow/training-operator/issues/1704 #1705

Merged

1 task

HeGaoYuan added a commit to HeGaoYuan/training-operator that referenced this issue Dec 23, 2022

fix kubeflow#1704

250def7

johnugeorge mentioned this issue Jan 21, 2023

Add warn event and directly return without creating pods for job validation failure #1564

Closed

1 task

HeGaoYuan added a commit to HeGaoYuan/training-operator that referenced this issue Jan 25, 2023

fix kubeflow#1704

35a160d

google-oss-prow bot closed this as completed in #1705 Jan 25, 2023

google-oss-prow bot pushed a commit that referenced this issue Jan 25, 2023

fix #1704 (#1705)

d0fb5c0

* fix #1704
* use commonutil.JobFailedValidationReason in place of JobFailedValidation

google-oss-prow bot reopened this Jan 25, 2023

github-actions bot added the lifecycle/stale label Aug 24, 2023

google-oss-prow bot added lifecycle/frozen and removed lifecycle/stale labels Sep 13, 2023
