Support suspend semantics for MPIJob #511

mimowo · 2023-01-27T15:49:59Z

It solves: #504

mimowo · 2023-01-27T16:00:40Z

@alculquicondor @tenzen-y WIP but ready for early feedback (would be good as this is my first PR in this repo). PTAL.

alculquicondor

FYI, the common repo is on its way to disappearing. I would suggest copying the RunPolicy struct here and adding the field.

pkg/apis/kubeflow/v2beta1/types.go

pkg/controller/mpi_job_controller.go

alculquicondor · 2023-01-27T16:29:47Z

pkg/controller/mpi_job_controller.go

+	if c.gangSchedulerName != "" {
+		if err := c.deletePodGroups(mpiJob); err != nil {
+			return err
+		}
+	}


@tenzen-y are you familiar with the volcano integration?

I wonder if we need to remove the pod group on suspension. Does it matter?

Do podgroups allocate any node resources, as pods do, or are they just API objects (like services)?

I do not delete services and other API objects on suspension. I also don't delete the launcher job, just suspend it.

AFAIK, a podgroup is just a declaration that some pods should be treated as a unit. But a podgroup doesn't create other objects.

The scheduling.volcano.sh/v1beta1 PodGroup also has queueing logic. So we might need to delete PodGroup to re-queue Pods.

Maybe, we should create E2E with the volcano in a separate PR.

probably someone with volcano experience should do it.

I looked at the volcano code a little bit on my own, but could not conclude if it may create the pods under some scenarios.

Thus, for now, I delete the PodGroups as well, to be on the safe side. WDYT?

+1 for the idea of an e2e test with podgroups. I guess we could create a follow-up issue and ask for people willing to do it.

Thus, for now, I delete the PodGroups as well, to be on the safe side. WDYT?

That sounds good to me.
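The outcome of this thread, together with the worker pod cleanup mentioned elsewhere in this PR, can be summarized in a small sketch. This is an illustration only, not the operator's code: `suspendPlan` and `planForSuspend` are hypothetical names, and the real controller acts on Kubernetes API objects rather than plain structs.

```go
package main

import "fmt"

// suspendPlan is a hypothetical summary of what happens to each kind of
// object when an MPIJob is suspended, as described in this thread.
type suspendPlan struct {
	DeleteWorkerPods bool
	DeletePodGroups  bool
	SuspendLauncher  bool
	DeleteServices   bool
}

// planForSuspend returns the actions for a given suspend flag.
// PodGroups are only relevant when a gang scheduler is configured.
func planForSuspend(suspended bool, gangSchedulerName string) suspendPlan {
	if !suspended {
		return suspendPlan{}
	}
	return suspendPlan{
		DeleteWorkerPods: true,                    // worker pods are cleaned up
		DeletePodGroups:  gangSchedulerName != "", // deleted to be on the safe side
		SuspendLauncher:  true,                    // the launcher Job is suspended, not deleted
		DeleteServices:   false,                   // services and other API objects stay
	}
}

func main() {
	fmt.Printf("%+v\n", planForSuspend(true, "volcano"))
	fmt.Printf("%+v\n", planForSuspend(true, ""))
}
```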

alculquicondor · 2023-01-27T16:30:19Z

pkg/controller/mpi_job_controller.go

-		if err != nil {
-			return err
+		if !isMPIJobSuspended(mpiJob) {
+			worker, err = c.getOrCreateWorker(mpiJob)


also create the podgroup conditionally?

Now I create all resources for as long as the MPIJob is suspended. This is the common workflow for Kueue, where the Job is created suspended. In some cases, when the job never gets unsuspended (for whatever reason), we can save on creating the objects.

For now I reverted the change (sorry for the back and forth). This is just to minimize the diff, as the unit tests currently verify that the objects are created. Let me know what you think.

alculquicondor · 2023-01-27T16:31:08Z

pkg/controller/mpi_job_controller.go

 		}
 		return nil
 	}

 	// first set StartTime.
-	if mpiJob.Status.StartTime == nil {
+	if mpiJob.Status.StartTime == nil && !isMPIJobSuspended(mpiJob) {


if the job was suspended, we should reset the StartTime. Double check how we do it in the job controller.

Done - I've added the JobSuspended condition and replicated the semantics around StartTime.

However, the .spec.runPolicy.activeDeadlineSeconds is actually respected for MPIJobs via batch.Job:

mpi-operator/pkg/controller/mpi_job_controller.go, line 1320 in 382da78:

	ActiveDeadlineSeconds: mpiJob.Spec.RunPolicy.ActiveDeadlineSeconds,

This means that the changes aren't strictly required to enforce the timeout. WDYT?

Yay for using Job!
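A minimal sketch of the StartTime semantics discussed in this thread, under the assumption that they follow what the comments describe: StartTime is not set while the job is suspended, and it is reset when the job is resumed. The `jobStatus` type and `syncStartTime` function are hypothetical and only illustrate the idea.

```go
package main

import (
	"fmt"
	"time"
)

// jobStatus is a hypothetical, simplified status. Suspended stands in
// for a JobSuspended condition being true.
type jobStatus struct {
	StartTime *time.Time
	Suspended bool
}

// at is a small helper that builds a fixed timestamp.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

// syncStartTime applies the assumed semantics for one sync.
func syncStartTime(st *jobStatus, specSuspended bool, now time.Time) {
	switch {
	case specSuspended:
		// Suspended: record the condition, never start the clock.
		st.Suspended = true
	case st.Suspended:
		// Resumed: flip the condition and reset StartTime, so time
		// spent suspended does not count toward any deadline.
		st.Suspended = false
		t := now
		st.StartTime = &t
	case st.StartTime == nil:
		// First sync of a job that was never suspended.
		t := now
		st.StartTime = &t
	}
}

func main() {
	st := &jobStatus{}
	syncStartTime(st, true, at(100))
	fmt.Println(st.StartTime == nil) // still unset while suspended
	syncStartTime(st, false, at(200))
	fmt.Println(st.StartTime.Unix())
}
```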

mimowo · 2023-01-27T16:55:22Z

FYI, the common repo is on its way to disappearing. I would suggest copying the RunPolicy struct here and adding the field.

I see, but common is reused by other subprojects too, right? So we would also need to copy the contents of common into these repos. Sounds like a lot of work; maybe simple, but the diffs will be big and one needs to be careful, so I'm not sure we want to block the suspend work on that. Also, is this effort already planned or in progress, @alculquicondor @tenzen-y?

tenzen-y · 2023-01-27T17:21:12Z

FYI, the common repo is on its way to disappearing. I would suggest copying the RunPolicy struct here and adding the field.

I see, but common is reused by other subprojects too, right? So we would also need to copy the contents of common into these repos. Sounds like a lot of work; maybe simple, but the diffs will be big and one needs to be careful, so I'm not sure we want to block the suspend work on that. Also, is this effort already planned or in progress, @alculquicondor @tenzen-y?

Yes, that's right. We are using the common repo in training-operator. However, we are planning to consolidate the common code into the training-operator repo.

kubeflow/trainer#1714

tenzen-y · 2023-01-27T17:29:44Z

FYI, the common repo is on its way to disappearing. I would suggest copying the RunPolicy struct here and adding the field.

@alculquicondor I agree with adding a suspend member to RunPolicy. However, can we copy RunPolicy in a separate PR? I think copying RunPolicy to this repo is outside the scope of this PR.

mimowo · 2023-01-27T17:38:50Z

@alculquicondor I agree with adding a suspend member to RunPolicy. However, can we copy RunPolicy in a separate PR? I think copying RunPolicy to this repo is outside the scope of this PR.

What about the other constants, like the ones defining conditions? I guess we could have a PR to just copy RunPolicy to mpi-operator to unblock this work, but keep the dependency on common@0.4.6 for the condition constants. Then, we can extend the set of MPIJob conditions with JobSuspended just in the mpi-operator. If this sounds good I can open a preparatory PR just to copy RunPolicy.

tenzen-y · 2023-01-27T17:53:55Z

@alculquicondor I agree with adding a suspend member to RunPolicy. However, can we copy RunPolicy in a separate PR? I think copying RunPolicy to this repo is outside the scope of this PR.

What about the other constants, like the ones defining conditions? I guess we could have a PR to just copy RunPolicy to mpi-operator to unblock this work, but keep the dependency on common@0.4.6 for the condition constants. Then, we can extend the set of MPIJob conditions with JobSuspended just in the mpi-operator. If this sounds good I can open a preparatory PR just to copy RunPolicy.

Sounds good to me. Although, let me know what other members think.

cc @alculquicondor @terrytangyuan

alculquicondor · 2023-01-27T18:16:44Z

sgtm

pkg/controller/mpi_job_controller.go

terrytangyuan · 2023-01-29T02:11:39Z

Sounds good

mimowo · 2023-01-30T13:26:37Z

@tenzen-y @alculquicondor I've opened the preparatory PR here: #513. Please review.

mimowo · 2023-01-31T10:24:15Z

@terrytangyuan Please approve CI

alculquicondor

Can you add a unit test?

Also, I'm not convinced of the value of an E2E test over unit+integration. Do you have a particular justification?

alculquicondor · 2023-02-01T19:58:10Z

manifests/base/crd.yaml

@@ -94,7 +96,7 @@ spec:
                  properties:
                    type:
                      type: string
-                      enum: ["Created", "Running", "Restarting", "Succeeded", "Failed"]
+                      enum: ["Created", "Running", "Restarting", "Succeeded", "Suspended", "Failed"]


Was this autogenerated?
If so, I'm curious to know how it worked.

Maybe this? https://github.com/kubeflow/common/blob/9ec55d141f90faaf52fd6df271e987e5a6781945/pkg/apis/common/v1/types.go#L112

We should probably use it in Kueue, where applicable

Nope, I updated it manually.

Uhm... then make sure that make generate doesn't override it.

it doesn't, checked

@tenzen-y is this expected?
Or is this related to your other PR?

We can leave this to #510

Yes, this is a known issue, and this will be fixed by #510.

Yeah, but it is needed for the e2e test (which I believe is worth adding).

Yeah, keep the manual change, and #510 should automate it.

alculquicondor · 2023-02-01T20:13:04Z

pkg/controller/mpi_job_controller.go

 		}
 		return nil
 	}

 	// first set StartTime.
-	if mpiJob.Status.StartTime == nil {
+	if mpiJob.Status.StartTime == nil && !isMPIJobSuspended(mpiJob) {


Yay for using Job!

alculquicondor · 2023-02-01T20:18:46Z

pkg/controller/mpi_job_controller.go

@@ -905,6 +936,19 @@ func (c *MPIJobController) updateMPIJobStatus(mpiJob *kubeflow.MPIJob, launcher
 	if err != nil {
 		return fmt.Errorf("checking launcher pods running: %w", err)
 	}
+	if isMPIJobSuspended(mpiJob) {
+		// it is suspended now
+		if updateMPIJobConditions(mpiJob, kubeflow.JobSuspended, v1.ConditionTrue, "MPIJobSuspended", "MPIJob suspended") {


Suggested change

-		if updateMPIJobConditions(mpiJob, kubeflow.JobSuspended, v1.ConditionTrue, "MPIJobSuspended", "MPIJob suspended") {
+		if updateMPIJobConditions(mpiJob, kubeflow.JobSuspended, v1.ConditionTrue, "Suspended", "MPIJob suspended") {

I don't think we need the redundancy.

Yeah, I thought so too, but this seems to be a convention here. For now, I stick to the convention but added the reason to the list.
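The diff above uses the boolean result of the condition update to decide whether anything changed. A minimal sketch of that pattern follows. It is not the operator's `updateMPIJobConditions`: the `condition` type and `setCondition` function are hypothetical and much simpler than the real API types.

```go
package main

import "fmt"

// condition is a hypothetical, simplified job condition.
type condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// setCondition inserts or updates the condition of the given type and
// reports whether the list actually changed, so the caller can skip
// events and status updates when nothing is new.
func setCondition(conds []condition, c condition) ([]condition, bool) {
	for i := range conds {
		if conds[i].Type != c.Type {
			continue
		}
		if conds[i] == c {
			return conds, false // identical condition already present
		}
		conds[i] = c
		return conds, true
	}
	return append(conds, c), true
}

func main() {
	suspended := condition{"Suspended", "True", "MPIJobSuspended", "MPIJob suspended"}
	conds, changed := setCondition(nil, suspended)
	fmt.Println(len(conds), changed)
	conds, changed = setCondition(conds, suspended)
	fmt.Println(len(conds), changed)
}
```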

alculquicondor · 2023-02-01T20:23:05Z

pkg/controller/mpi_job_controller.go

@@ -1304,7 +1351,7 @@ func (c *MPIJobController) newWorker(mpiJob *kubeflow.MPIJob, index int) *corev1
 }

 func (c *MPIJobController) newLauncherJob(mpiJob *kubeflow.MPIJob) *batchv1.Job {
-	return &batchv1.Job{
+	job := &batchv1.Job{


Food for thought:
We will probably need some kind of Job annotation that tells kueue that this Job is already queued as part of a higher level object (MPIJob in this case), so that we simply ignore it.

Good point. IIUC, this affects only Kueue configurations with ManageJobsWithoutQueueName=true, which is non-default. We could have an annotation, yes, or simply have Kueue not manage any Job objects that have an OwnerReference to another object managed by Kueue.

or simply have Kueue not manage any Job objects that have an OwnerReference to another object managed by Kueue.

The dependency could be indirect. And we don't want to spend a GET call to obtain such information.

I see, but still we should strive to keep the integration interface as small as possible - every extension to the surface of the interface will be multiplied by the number of projects, but maybe a new annotation is not that bad. Also, maybe we can have a hybrid approach.
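To make the two options in this thread concrete, here is a rough sketch comparing an annotation check with a direct owner check. Everything in it is hypothetical: the annotation key, the `job` and `ownerRef` types, and both functions are invented for illustration and are not Kueue's API.

```go
package main

import "fmt"

// ownerRef and job are hypothetical, simplified stand-ins for the
// Kubernetes owner reference and Job metadata.
type ownerRef struct {
	Kind       string
	Controller bool
}

type job struct {
	Annotations map[string]string
	Owners      []ownerRef
}

// parentManagedAnnotation is an invented key, used only in this sketch.
const parentManagedAnnotation = "example.com/queued-by-parent"

// skipByAnnotation needs no extra lookups: the parent's controller
// stamps the child Job when it creates it.
func skipByAnnotation(j job) bool {
	_, ok := j.Annotations[parentManagedAnnotation]
	return ok
}

// skipByOwner only sees direct owners. An indirect dependency (owner of
// an owner) would need an additional GET, which is the concern raised above.
func skipByOwner(j job, managedKinds map[string]bool) bool {
	for _, o := range j.Owners {
		if o.Controller && managedKinds[o.Kind] {
			return true
		}
	}
	return false
}

func main() {
	launcher := job{Owners: []ownerRef{{Kind: "MPIJob", Controller: true}}}
	fmt.Println(skipByOwner(launcher, map[string]bool{"MPIJob": true}))
	fmt.Println(skipByAnnotation(launcher))
}
```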

mimowo · 2023-02-02T09:46:58Z

Can you add a unit test?

Done, ended up adding 3 actually: for creating a suspended MPIJob, suspending a running one, and resuming.
To write the unit test for resuming, I had to refactor the code a little to inject a fake clock in tests.
Also, the test for suspending a running MPIJob revealed that I was requiring two syncs: one to clean up the worker pods and one to update the MPIJob status. Now I do both steps in one sync.
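A sketch of the fake clock idea mentioned above. Kubernetes projects commonly use the k8s.io/utils/clock package for this; the version below is a hand-rolled, dependency-free illustration, and the `clock`, `fakeClock`, and `controller` names are hypothetical rather than taken from the PR.

```go
package main

import (
	"fmt"
	"time"
)

// clock is a minimal interface the controller depends on instead of
// calling time.Now directly.
type clock interface {
	Now() time.Time
}

// realClock is what production code would inject.
type realClock struct{}

func (realClock) Now() time.Time { return time.Now() }

// fakeClock returns a fixed time that tests advance explicitly.
type fakeClock struct{ t time.Time }

func newFakeClock(sec int64) *fakeClock { return &fakeClock{t: time.Unix(sec, 0)} }

func (f *fakeClock) Now() time.Time { return f.t }

func (f *fakeClock) StepSeconds(n int64) { f.t = f.t.Add(time.Duration(n) * time.Second) }

// controller is a hypothetical controller with an injected clock.
type controller struct{ clock clock }

// stampStart returns the time the controller would record as StartTime.
func (c *controller) stampStart() time.Time { return c.clock.Now() }

func main() {
	fc := newFakeClock(1000)
	c := &controller{clock: fc}
	first := c.stampStart()
	fc.StepSeconds(5)
	second := c.stampStart()
	fmt.Println(second.Sub(first))
}
```

With the clock injected, a resume test can assert the exact StartTime instead of comparing against a moving wall clock.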

Also, I'm not convinced of the value of an E2E test over unit+integration. Do you have a particular justification?

I can think of two reasons:

- The test abstracts away the implementation details, so it is better at documenting what the feature is about. For example, it abstracts away when the service and other auxiliary objects are created. Having such a test therefore lets us refactor with confidence that the feature works, even before the unit or integration tests are adjusted.
- It makes us more confident about race conditions. For example, suspending or resuming an MPIJob triggers suspending or resuming the launcher Job, which happens asynchronously. We don't really test the interaction between the launcher Job and the MPIJob at other layers of testing.

alculquicondor · 2023-02-02T16:52:28Z

manifests/base/crd.yaml

@@ -94,7 +96,7 @@ spec:
                  properties:
                    type:
                      type: string
-                      enum: ["Created", "Running", "Restarting", "Succeeded", "Failed"]
+                      enum: ["Created", "Running", "Restarting", "Succeeded", "Suspended", "Failed"]


We can leave this to #510

alculquicondor · 2023-02-02T16:55:11Z

pkg/controller/mpi_job_controller_test.go

+			updateMPIJobConditions(mpiJobCopy, kubeflow.JobSuspended, v1.ConditionTrue, mpiJobSuspendedReason, "MPIJob suspended")
+			msg = fmt.Sprintf("MPIJob %s/%s is suspended.", mpiJob.Namespace, mpiJob.Name)
+			updateMPIJobConditions(mpiJobCopy, common.JobRunning, v1.ConditionFalse, mpiJobSuspendedReason, msg)
+			f.expectUpdateMPIJobStatusAction(mpiJobCopy)


expect zero workers

The code above checks that already:

	mpiJobCopy.Status.ReplicaStatuses = map[common.ReplicaType]*common.ReplicaStatus{
		common.ReplicaType(kubeflow.MPIReplicaTypeLauncher): {},
		common.ReplicaType(kubeflow.MPIReplicaTypeWorker):   {},
	}

It does not specify Active, meaning it checks the value is 0.

The workers were never created in this scenario, so I cannot assert on delete actions.

but you can assert that no pods were created

the status is not the same as pods being created, necessarily

but if there were some actions to create a pod, and we didn't expect them, the test would fail (IIUC). For example, when I comment out expecting the status update, the test fails as follows:
1 unexpected actions: [{ActionImpl:{Namespace:default Verb:update Resource:kubeflow.org/v2beta1, Resource=mpijobs

oh I see, so it's implicitly checked
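The implicit check described here can be illustrated with a small sketch: a test fixture lists the actions it expects, and any recorded action outside that list fails the test. The `action` type and `unexpectedActions` function are hypothetical; the fixture in this repo may compare actions differently.

```go
package main

import "fmt"

// action is a hypothetical, simplified record of one client call.
type action struct {
	Verb     string
	Resource string
}

// unexpectedActions walks the recorded actions in order and returns
// every one that does not match the next expected action.
func unexpectedActions(expected, actual []action) []action {
	var extra []action
	next := 0
	for _, a := range actual {
		if next < len(expected) && expected[next] == a {
			next++
			continue
		}
		extra = append(extra, a)
	}
	return extra
}

func main() {
	expected := []action{{"update", "mpijobs"}}
	// A stray pod creation would show up even though no test asserts
	// "zero pods created" explicitly.
	actual := []action{{"create", "pods"}, {"update", "mpijobs"}}
	fmt.Println(unexpectedActions(expected, actual))
}
```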

alculquicondor · 2023-02-02T16:59:00Z

test/e2e/mpi_job_test.go

@@ -211,12 +231,31 @@ var _ = ginkgo.Describe("MPIJob", func() {
 	})
 })

+func resumeJob(mpiJob *kubeflow.MPIJob) *kubeflow.MPIJob {


For follow up: accept contexts in all these functions

Will open a PR directly after this; it doesn't seem to require an issue for later.

Similarly, going to open a PR to copy the MPIJob conditions from common.

@mimowo I opened the PR to copy the MPIJob conditions to this repo in #514 since I faced issues caused by the MPIJob conditions in #510.
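For the follow-up about contexts, a minimal sketch of what a context-accepting test helper could look like. This is an assumption about the general shape only: `waitUntil`, `demoPoll`, and `demoCanceled` are hypothetical names and not the helpers in this repo's e2e tests.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitUntil polls cond until it reports done, returns an error, or the
// context is canceled or times out.
func waitUntil(ctx context.Context, interval time.Duration, cond func(context.Context) (bool, error)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		done, err := cond(ctx)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// demoPoll succeeds once cond has been called target times.
func demoPoll(target int) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	attempts := 0
	err := waitUntil(ctx, time.Millisecond, func(context.Context) (bool, error) {
		attempts++
		return attempts >= target, nil
	})
	return attempts, err
}

// demoCanceled shows that a canceled context stops the wait.
func demoCanceled() error {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return waitUntil(ctx, time.Hour, func(context.Context) (bool, error) { return false, nil })
}

func main() {
	n, err := demoPoll(3)
	fmt.Println(n, err)
	fmt.Println(demoCanceled())
}
```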

alculquicondor · 2023-02-02T17:51:29Z

/lgtm
/assign @terrytangyuan

terrytangyuan

Thanks!

/lgtm
/approve

google-oss-prow · 2023-02-03T14:45:14Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [terrytangyuan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

terrytangyuan · 2023-02-03T14:47:49Z

The other PR got merged first so this one will need to resolve conflicts :-)

# Conflicts:
#	pkg/apis/kubeflow/v2beta1/types.go
#	pkg/controller/mpi_job_controller.go
#	pkg/controller/mpi_job_controller_status.go
#	pkg/controller/mpi_job_controller_test.go
#	test/integration/mpi_job_controller_test.go

- add unit tests for creating suspended, suspending and resuming
- use fake clock for unit tests
- do not return from the syncHandler after worker pods cleanup on suspend - this allows to continue with the MPIJob update in the same sync

# Conflicts:
#	pkg/controller/mpi_job_controller.go

terrytangyuan · 2023-02-03T15:29:37Z

/lgtm

alculquicondor · 2023-02-03T15:42:34Z

Still lgtm

@alculquicondor

Native Kubernetes Jobs have a suspend flag that allows to temporarily suspend a Job execution and resume it later, or start Jobs in a suspended state and have a custom controller, such as Kueue, decide later when to start them. So adding it to RayJob spec for consistency. Moreover, some frameworks like Kubeflow are adding it, so it becomes a standard functionality. An example implementation for MPIJob: kubeflow/mpi-operator#511

Implementation details

If a RayJob is created with a spec.suspend == true, then RayCluster instance (with corresponding Kubernetes resources) is not created and the Ray job is not submitted to the cluster. The JobDeploymentStatus is set to Suspended and the corresponding event is issued. The RayJob remains in this state until somebody unsuspends the job.

If suspend flips from true to false, then the RayJob controller immediately creates a RayCluster instance and submits the job.

If suspend flips from false to true while Job is running, then the RayJob controller tries to gracefully stop the job and deletes the RayCluster instance (with underlying Kubernetes resources). The JobDeploymentStatus is set to Suspended; JobStatus is set to STOPPED and the corresponding event is issued.

Edge case: suspend flag is ignored if a RayJob is submitted against an existing RayCluster instance (matched with ClusterSelector) since we can't delete a RayCluster created by somebody else.

No Kueue-specific code leaked to Kuberay implementation

Contributors from Kueue/Kubernetes cc'ed: @alculquicondor @mwielgus

google-oss-prow bot added do-not-merge/work-in-progress size/L labels Jan 27, 2023

google-oss-prow bot requested review from alculquicondor and zw0610 January 27, 2023 15:50

mimowo mentioned this pull request Jan 27, 2023

Add suspend semantics #504

Closed

alculquicondor reviewed Jan 27, 2023

mimowo force-pushed the mpijob-add-susped branch 2 times, most recently from 77e56c5 to ed0bcd7 Compare January 27, 2023 17:47

tenzen-y reviewed Jan 27, 2023

mimowo force-pushed the mpijob-add-susped branch from ed0bcd7 to da1e019 Compare January 30, 2023 12:41

mimowo mentioned this pull request Jan 30, 2023

Use local copy of RunPolicy by MPI-operator #513

Merged

mimowo force-pushed the mpijob-add-susped branch 2 times, most recently from aaf12e6 to 7634c74 Compare January 30, 2023 17:47

mimowo force-pushed the mpijob-add-susped branch from 7634c74 to fd68512 Compare January 31, 2023 10:56

google-oss-prow bot added size/XL and removed size/L labels Jan 31, 2023

mimowo force-pushed the mpijob-add-susped branch 4 times, most recently from 13c54ea to 4e96e9a Compare January 31, 2023 16:10

google-oss-prow bot added the lgtm label Feb 1, 2023

alculquicondor reviewed Feb 1, 2023

google-oss-prow bot added size/XL and removed lgtm size/L labels Feb 2, 2023

mimowo force-pushed the mpijob-add-susped branch from 580b061 to 7357382 Compare February 2, 2023 09:25

alculquicondor reviewed Feb 2, 2023

google-oss-prow bot assigned terrytangyuan Feb 2, 2023

google-oss-prow bot added the lgtm label Feb 2, 2023

alculquicondor mentioned this pull request Feb 2, 2023

Use local copy of JobStatus by mpi-operator #514

Merged

terrytangyuan approved these changes Feb 3, 2023

google-oss-prow bot added the approved label Feb 3, 2023

Implement Suspend semantics for MPIJob

ca89791

mimowo force-pushed the mpijob-add-susped branch from 7357382 to f600aa4 Compare February 3, 2023 15:09

google-oss-prow bot removed the lgtm label Feb 3, 2023

mimowo force-pushed the mpijob-add-susped branch from f600aa4 to 11e368e Compare February 3, 2023 15:19

google-oss-prow bot added the lgtm label Feb 3, 2023

google-oss-prow bot merged commit 92e491e into kubeflow:master Feb 3, 2023

mimowo mentioned this pull request Feb 3, 2023

Pass context to the utility methods in e2e tests #516

Merged

oginskis mentioned this pull request Feb 23, 2023

[Feature] Support suspend in RayJob ray-project/kuberay#926

Merged

mimowo deleted the mpijob-add-susped branch March 18, 2023 18:57
