
Introduce batch/v1 Job with Indexed completion mode #1718

Open · tenzen-y opened this issue Jan 10, 2023 · 25 comments

@tenzen-y (Member) commented Jan 10, 2023

/kind discussion

We often end up reimplementing features that already exist in batch/v1 Job (e.g., kubeflow/common#196) because the training operator creates a block of Pod + Service for each rank, rather than a batch/v1 Job + Service, once a custom Job resource (e.g., TFJob) is created.

IIUC, the training operator is designed this way because its core architecture was created before the Indexed Job and Pod failure policy features were released.

So I would like to propose an architecture in which the training operator creates a batch/v1 Job with Indexed completion mode plus a Service, instead of a Pod plus a Service per rank.

Introducing batch/v1 Job eliminates the need to implement and maintain features that duplicate it, and makes it easy to pick up new batch/v1 Job features.
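As a rough illustration (not an actual implementation; the image, label key, and other names are hypothetical), the operator could create something like the following per replica type: an Indexed batch/v1 Job plus a headless Service for pod-to-pod communication.

```go
// Hedged sketch: one Indexed Job + headless Service per replica type,
// instead of one Pod + Service per rank. Names and values are illustrative.
package sketch

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

func workerJobAndService(name, namespace string, workers int32) (*batchv1.Job, *corev1.Service) {
	labels := map[string]string{"training.kubeflow.org/job-name": name} // hypothetical label key
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name + "-worker", Namespace: namespace},
		Spec: batchv1.JobSpec{
			CompletionMode: ptr.To(batchv1.IndexedCompletion),
			Completions:    ptr.To(workers),
			Parallelism:    ptr.To(workers),
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					// Indexed Jobs give each pod a stable hostname ("<job-name>-<index>");
					// the subdomain makes those hostnames resolvable via the headless Service.
					Subdomain:     name + "-worker",
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "trainer",
						Image: "example.com/trainer:latest", // hypothetical image
					}},
				},
			},
		},
	}
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: name + "-worker", Namespace: namespace},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone, // headless
			Selector:  labels,
		},
	}
	return job, svc
}
```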

/cc @kubeflow/wg-training-leads

@johnugeorge (Member)

Interesting. This would need a major rewrite?

/cc @gaocegege @zw0610 @terrytangyuan

@tenzen-y (Member, Author) commented Jan 12, 2023

> This would need a major rewrite?

Probably, yes. By replacing Pods with a batch/v1 Indexed Job, we can use the "Job with Pod-to-Pod Communication" pattern.

Also, we could probably remove most of the kubeflow/common code and rely on batch/v1 Job logic.

For example, we can replace ReplicaSpec.RestartPolicy with PodFailurePolicy:

https://github.com/kubeflow/common/blob/34276e9d2ffa39f5922479bff87dc5ed5ed94cfb/pkg/apis/common/v1/types.go#L79-L83

https://github.com/kubernetes/kubernetes/blob/c9ed04762f94a319d7b1fb718dc345491a32bea6/pkg/apis/batch/types.go#L220-L229

This means that after the replacement, we no longer need to maintain our own logic for deciding whether to restart pods.
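For instance, a hedged sketch (not a concrete mapping proposal; exit codes and values are illustrative) of how a restart-on-retryable-failure rule could be expressed with podFailurePolicy instead of our own restart logic:

```go
// Hedged sketch: "retry on certain exit codes, ignore disruptions" expressed
// as a batch/v1 podFailurePolicy. Exit codes and backoff are illustrative.
// Note: the pod template's restartPolicy must be Never when podFailurePolicy is set.
package sketch

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/utils/ptr"
)

func examplePodFailurePolicy() batchv1.JobSpec {
	return batchv1.JobSpec{
		BackoffLimit: ptr.To[int32](6),
		PodFailurePolicy: &batchv1.PodFailurePolicy{
			Rules: []batchv1.PodFailurePolicyRule{
				{
					// Count retryable exit codes against backoffLimit (i.e., recreate the pod).
					Action: batchv1.PodFailurePolicyActionCount,
					OnExitCodes: &batchv1.PodFailurePolicyOnExitCodesRequirement{
						Operator: batchv1.PodFailurePolicyOnExitCodesOpIn,
						Values:   []int32{1, 134, 137}, // hypothetical retryable codes
					},
				},
				{
					// Do not count pod disruptions (preemption, eviction) as failures.
					Action: batchv1.PodFailurePolicyActionIgnore,
					OnPodConditions: []batchv1.PodFailurePolicyOnPodConditionsPattern{
						{Type: corev1.DisruptionTarget, Status: corev1.ConditionTrue},
					},
				},
			},
		},
	}
}
```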

@gaocegege (Member)

/cc @alculquicondor

@tenzen-y (Member, Author)

IIUC, the Indexed Job feature was aimed at tf-operator, mpi-operator, and more.

kubernetes/kubernetes#99497 (comment)

@zw0610 (Member) commented Jan 12, 2023

> Introducing batch/v1 Job eliminates the need to implement and maintain features that duplicate it, and makes it easy to pick up new batch/v1 Job features.

Since the major benefit is reducing duplicated development work on some features, do we have a list of features that are not implemented in the training operator but would come for free once batch/v1 Job is adopted?

@tenzen-y (Member, Author)

> Introducing batch/v1 Job eliminates the need to implement and maintain features that duplicate it, and makes it easy to pick up new batch/v1 Job features.
>
> Since the major benefit is reducing duplicated development work on some features, do we have a list of features that are not implemented in the training operator but would come for free once batch/v1 Job is adopted?

Good point. I haven't created such a list yet. So far, I have found Job suspension (the suspend field) and pod disruption conditions.

I'll create a list and share it with you in this issue.

@alculquicondor

+1
This has been discussed before in #1303.

If you could provide a list of any missing functionality in the Job API, we could add those to the roadmap. We made a lot of progress with failure policies, but IIUC, there's also a need for some form of success policy?

Also, @ahg-g is working on a proposal for a multi-pod-template API that he's going to present at the Batch Working Group meeting on Feb 2nd: https://docs.google.com/document/d/1XOeUN-K0aKmJJNq7H07r74n-mGgSFyiEDQ3ecwsGhec/edit#heading=h.ukbaidczvy3r

@ahg-g commented Jan 12, 2023

+1

The benefit is not just deduping code, but also helping to defragment the ecosystem. While I do understand the benefit of having dedicated APIs for MPI, TF training etc., it is important that they build on a common API that we can use for job-level scheduling and autoscaling.

As @alculquicondor mentioned, I am working on a proposal that I will make public next week and will be discussed in Kubernetes batch working group. I am also happy to schedule a time and discuss it with the kubeflow community, can you please let me know how/where I can put this topic on the meeting agenda?

@tenzen-y (Member, Author) commented Jan 12, 2023

> If you could provide a list of any missing functionality in the Job API, we could add those to the roadmap. We made a lot of progress with failure policies, but IIUC, there's also a need for some form of success policy?

@alculquicondor
I think introducing a success policy would be useful for the training operator, since we currently maintain that logic ourselves in the tensorflow-controller:

if rtype == kubeflowv1.TFJobReplicaTypeWorker {
	// Leave a succeeded condition for the following two cases:
	// 1. If default success policy is used and worker 0 has completed.
	// 2. If `SuccessPolicyAllWorkers` success policy is used and all workers are succeeded.
	if expected == 0 || (worker0Completed && *tfJob.Spec.SuccessPolicy != kubeflowv1.SuccessPolicyAllWorkers) {
		msg := fmt.Sprintf("TFJob %s/%s successfully completed.",
			tfJob.Namespace, tfJob.Name)
		r.recorder.Event(tfJob, corev1.EventTypeNormal, tfJobSucceededReason, msg)
		if jobStatus.CompletionTime == nil {
			now := metav1.Now()
			jobStatus.CompletionTime = &now
		}
		err := commonutil.UpdateJobConditions(jobStatus,
			commonv1.JobSucceeded, tfJobSucceededReason, msg)
		if err != nil {
			commonutil.LoggerForJob(tfJob).Infof("Append tfjob condition error: %v", err)
			return err
		}
		trainingoperatorcommon.SuccessfulJobsCounterInc(tfJob.Namespace, kubeflowv1.TFJobFrameworkName)
	} else if running > 0 {
		// Some workers are still running, leave a running condition.
		msg := fmt.Sprintf("TFJob %s/%s is running.",
			tfJob.Namespace, tfJob.Name)
		err := commonutil.UpdateJobConditions(jobStatus, commonv1.JobRunning, tfJobRunningReason, msg)
		if err != nil {
			commonutil.LoggerForJob(tfJob).Infof("Append tfjob condition error: %v", err)
			return err
		}
	}
}

Maybe, tensorflow-controller or training-controller can be one of the use cases to introduce the success policy to batch/v1 Job.
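For context, a hedged sketch of how the default "worker 0 completed" rule could be expressed declaratively, assuming a Job-level success policy like the successPolicy field that was later added to batch/v1 (it did not exist at the time of this discussion):

```go
// Hedged sketch: declare the Job succeeded once index 0 (worker 0) succeeds,
// instead of keeping this condition logic inside the tensorflow-controller.
// Assumes a batch/v1 success policy API (added to Kubernetes later).
package sketch

import (
	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/utils/ptr"
)

func workerJobSpecWithSuccessPolicy(workers int32) batchv1.JobSpec {
	return batchv1.JobSpec{
		CompletionMode: ptr.To(batchv1.IndexedCompletion),
		Completions:    ptr.To(workers),
		Parallelism:    ptr.To(workers),
		SuccessPolicy: &batchv1.SuccessPolicy{
			Rules: []batchv1.SuccessPolicyRule{
				{SucceededIndexes: ptr.To("0")},
			},
		},
	}
}
```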

> Also, @ahg-g is working on a proposal for a multi-pod-template API that he's going to present at the Batch Working Group meeting on Feb 2nd: https://docs.google.com/document/d/1XOeUN-K0aKmJJNq7H07r74n-mGgSFyiEDQ3ecwsGhec/edit#heading=h.ukbaidczvy3r

Thanks for sharing. I'm interested in the multi-pod-template API, since we can consider using it after we introduce batch/v1 Job to the training operator.

Are there KEPs for the multi-pod-template API in k/enhancements?

@tenzen-y (Member, Author) commented Jan 12, 2023

> The benefit is not just deduping code, but also helping to defragment the ecosystem. While I do understand the benefit of having dedicated APIs for MPI, TF training etc., it is important that they build on a common API that we can use for job-level scheduling and autoscaling.

@ahg-g Yes, exactly. I think so too.

> As @alculquicondor mentioned, I am working on a proposal that I will make public next week and will be discussed in Kubernetes batch working group. I am also happy to schedule a time and discuss it with the kubeflow community, can you please let me know how/where I can put this topic on the meeting agenda?

We have bi-weekly community meetings for WG Training, and the meeting notes are at https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit#.

I rarely attend the meetings myself, but you can share the multi-pod-template API with the WG Training leads.

@ahg-g commented Jan 12, 2023

> Are there KEPs for the multi-pod-template API in k/enhancements?

Not yet. As I mentioned above, I will share a Google doc next week; it is easier to discuss such a significant proposal in a Google doc before we move to a KEP. Note that the plan is to initially host the API under the Kueue project so we can iterate quickly on the API, with the goal of eventually upstreaming it.

@tenzen-y (Member, Author)

> Are there KEPs for the multi-pod-template API in k/enhancements?
>
> Not yet. As I mentioned above, I will share a Google doc next week; it is easier to discuss such a significant proposal in a Google doc before we move to a KEP. Note that the plan is to initially host the API under the Kueue project so we can iterate quickly on the API, with the goal of eventually upstreaming it.

I see. Thanks for letting me know.

@tenzen-y (Member, Author)

I will work on this issue after the Kubeflow v1.7 feature freeze, since that date is coming up. Then I will share a table in this issue mapping training-operator Job features to batch/v1 Job features.

If this issue turns out to require significant API changes, I will submit a proposal to this repository.

Also, I will work on the actual implementation after #1714 is done.

@ahg-g commented Jan 13, 2023

/cc @richardsliu

@tenzen-y (Member, Author)

Maybe we need to wait for the Indexed Jobs with an unset completions parameter feature (Elastic Indexed Jobs) to support Elastic PyTorchJob.

ref: kubernetes/enhancements#3715
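For reference, a hedged sketch of what a resize could look like once that feature is available; the Elastic Indexed Jobs change is expected to allow mutating spec.completions when it is kept equal to spec.parallelism. Function and variable names are illustrative, not training-operator code:

```go
// Hedged sketch: scale an elastic Indexed Job by updating parallelism and
// completions together (kept equal), as the Elastic Indexed Jobs feature allows.
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/utils/ptr"
)

func resizeIndexedJob(ctx context.Context, c kubernetes.Interface, ns, name string, size int32) error {
	job, err := c.BatchV1().Jobs(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	job.Spec.Parallelism = ptr.To(size)
	job.Spec.Completions = ptr.To(size)
	_, err = c.BatchV1().Jobs(ns).Update(ctx, job, metav1.UpdateOptions{})
	return err
}
```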

@alculquicondor

Is elastic Pytorch the only training job that supports resizing?

Does it matter which workers get removed?

@tenzen-y (Member, Author)

> Is elastic Pytorch the only training job that supports resizing?

IIUC, we support only one master replica for PyTorchJob, so yes:

if rType == PyTorchJobReplicaTypeMaster {
	if value.Replicas != nil && int(*value.Replicas) != 1 {
		return fmt.Errorf("PyTorchJobSpec is not valid: There must be only 1 master replica")
	}
}

> Does it matter which workers get removed?

Maybe it does not matter which worker is deleted, since Elastic PyTorchJob uses a local elastic agent.

@gaocegege @zw0610 If I misunderstand Elastic PyTorchJob, could you correct me?

@tenzen-y (Member, Author) commented Jan 17, 2023

Also, we may be able to use the Indexed Jobs with an unset completions parameter feature for MPIJob v1 with Horovod.

@tenzen-y (Member, Author)

Elastic Indexed Jobs are supposed to graduate to beta in K8s 1.27, so we can work on this once we stop supporting K8s 1.26 (maybe next year?).

@alculquicondor

I agree

@tenzen-y (Member, Author) commented May 4, 2023

We may be able to introduce JobSet instead of batch/v1 Job, although I think we need to wait for the JobSet API to reach beta.

https://github.com/kubernetes-sigs/jobset

@tenzen-y (Member, Author) commented May 7, 2023

As a first step, migrating to batch/v1 Job might be better; after that, we can migrate to JobSet, since migrating directly to JobSet would have too much impact on the training operator.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@tenzen-y (Member, Author)

/lifecycle frozen

@tenzen-y (Member, Author)

/assign
We're starting the discussion based on my enhancement proposal.
