Implement suspend semantics #1859

tenzen-y · 2023-07-11T12:34:43Z

What this PR does / why we need it:
I implemented suspend semantics for PyTorchJob, like those of batch/job and MPIJob v2beta1. These semantics let an external controller stop the operator from creating pods. For example, this is useful for adapting Kubeflow TrainingJob to a job queueing system.

When runPolicy.suspend is true, the training operator removes the following resources regardless of runPolicy.cleanPodPolicy:

Pods
Services
HorizontalPodAutoscalers
PodGroups (for volcano / scheduler-plugins)
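The rule above can be illustrated with a minimal Go sketch. The `RunPolicy` struct here is a simplified stand-in, and `isSuspended` and `resourcesToDelete` are hypothetical helpers written for illustration; none of this is the operator's actual code.

```go
package main

import "fmt"

// RunPolicy is a simplified stand-in for the operator's run policy (assumption).
type RunPolicy struct {
	CleanPodPolicy string // for example "None", "Running", or "All"
	Suspend        *bool  // nil is treated as "not suspended"
}

// isSuspended reports whether the policy asks for the job to be suspended.
func isSuspended(p RunPolicy) bool {
	return p.Suspend != nil && *p.Suspend
}

// resourcesToDelete lists the resource kinds removed while a job is suspended.
// CleanPodPolicy is deliberately not consulted in the suspended case.
func resourcesToDelete(p RunPolicy) []string {
	if !isSuspended(p) {
		return nil
	}
	return []string{"Pods", "Services", "HorizontalPodAutoscalers", "PodGroups"}
}

func main() {
	suspend := true
	p := RunPolicy{CleanPodPolicy: "None", Suspend: &suspend}
	fmt.Println(resourcesToDelete(p))
}
```

Even with `CleanPodPolicy` set to "None", the sketch returns all four resource kinds once `Suspend` is true, which mirrors the "regardless of runPolicy.cleanPodPolicy" wording.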

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Part-of #1519
Related to: #1853

Checklist:

Docs included if any changes are user facing

coveralls · 2023-07-11T12:50:12Z

Pull Request Test Coverage Report for Build 5520115950

60 of 169 (35.5%) changed or added relevant lines in 15 files are covered.
2 unchanged lines in 1 file lost coverage.
Overall coverage increased (+0.5%) to 33.498%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
pkg/controller.v1/common/pod.go	0	1	0.0%
pkg/controller.v1/mpi/mpijob_controller.go	4	5	80.0%
pkg/controller.v1/pytorch/hpa.go	13	15	86.67%
pkg/apis/kubeflow.org/v1/zz_generated.deepcopy.go	0	5	0.0%
pkg/controller.v1/xgboost/xgboostjob_controller.go	0	5	0.0%
pkg/controller.v1/pytorch/pytorchjob_controller.go	10	16	62.5%
pkg/apis/kubeflow.org/v1/openapi_generated.go	0	7	0.0%
pkg/controller.v1/mxnet/mxjob_controller.go	0	10	0.0%
pkg/controller.v1/paddlepaddle/paddlepaddle_controller.go	4	14	28.57%
pkg/reconciler.v1/common/job.go	0	11	0.0%

Files with Coverage Reduction	New Missed Lines	%
pkg/controller.v1/mpi/mpijob_controller.go	2	79.36%

Totals
Change from base Build 5511536982:	0.5%
Covered Lines:	3257
Relevant Lines:	9723

💛 - Coveralls

coveralls · 2023-07-11T12:50:12Z

Pull Request Test Coverage Report for Build 5613234102

65 of 176 (36.93%) changed or added relevant lines in 15 files are covered.
11 unchanged lines in 2 files lost coverage.
Overall coverage increased (+0.3%) to 33.538%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
pkg/controller.v1/common/pod.go	0	1	0.0%
pkg/controller.v1/mpi/mpijob_controller.go	4	5	80.0%
pkg/controller.v1/pytorch/hpa.go	13	14	92.86%
pkg/apis/kubeflow.org/v1/zz_generated.deepcopy.go	0	5	0.0%
pkg/controller.v1/xgboost/xgboostjob_controller.go	0	5	0.0%
pkg/controller.v1/pytorch/pytorchjob_controller.go	8	14	57.14%
pkg/apis/kubeflow.org/v1/openapi_generated.go	0	7	0.0%
pkg/controller.v1/mxnet/mxjob_controller.go	0	10	0.0%
pkg/controller.v1/paddlepaddle/paddlepaddle_controller.go	4	14	28.57%
pkg/reconciler.v1/common/job.go	0	11	0.0%

Files with Coverage Reduction	New Missed Lines	%
pkg/controller.v1/mpi/mpijob_controller.go	2	79.73%
pkg/controller.v1/paddlepaddle/paddlepaddle_controller.go	9	55.87%

Totals
Change from base Build 5545906377:	0.3%
Covered Lines:	3280
Relevant Lines:	9780

💛 - Coveralls

tenzen-y · 2023-07-11T12:51:40Z

/assign @johnugeorge

cc: @alculquicondor @mimowo This PR is the first step to support suspend semantics in the kubeflow/training-operator.

tenzen-y · 2023-07-11T12:53:03Z

pkg/controller.v1/common/job.go

Here is the core logic for suspend semantics.

alculquicondor · 2023-07-11T12:56:35Z

cc @trasc

pkg/controller.v1/common/job.go

trasc · 2023-07-11T14:52:47Z

pkg/util/status.go

+}
+
+func IsSuspend(status apiv1.JobStatus) bool {
+	return hasCondition(status, apiv1.JobSuspended)


The name hasCondition is now misleading, and it looks to be doing the same thing as apimachinery/pkg/api/meta.IsStatusConditionTrue

Unfortunately, our condition.status is typed corev1.ConditionStatus.

training-operator/pkg/apis/kubeflow.org/v1/common_types.go

Line 111 in a3a2972

Status v1.ConditionStatus `json:"status"`

So apimachinery/pkg/api/meta.IsStatusConditionTrue doesn't work :(

you could still rename the function to isConditionTrue

Ah, I misunderstood @trasc's comment.
I will replace all functions with IsStatusConditionTrue. Thanks!
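A helper along the lines discussed in this thread might look like the following sketch. All types are local stand-ins that mirror the shape of the operator's condition types (an assumption), and `isStatusConditionTrue` is a hypothetical name, not the function that was merged.

```go
package main

import "fmt"

// ConditionStatus stands in for corev1.ConditionStatus (assumption).
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

// JobConditionType and JobCondition stand in for the operator's API types.
type JobConditionType string

const JobSuspended JobConditionType = "Suspended"

type JobCondition struct {
	Type   JobConditionType
	Status ConditionStatus
}

type JobStatus struct {
	Conditions []JobCondition
}

// isStatusConditionTrue returns true only if a condition of the given type
// exists and its status is True. A missing condition counts as false.
func isStatusConditionTrue(status JobStatus, condType JobConditionType) bool {
	for _, c := range status.Conditions {
		if c.Type == condType {
			return c.Status == ConditionTrue
		}
	}
	return false
}

func main() {
	st := JobStatus{Conditions: []JobCondition{{Type: JobSuspended, Status: ConditionTrue}}}
	fmt.Println(isStatusConditionTrue(st, JobSuspended))
}
```

The name says what the function checks: the condition must be present and True, not merely present.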

pkg/util/status.go

tenzen-y · 2023-07-13T10:08:27Z

/hold for @alculquicondor's comment.

mimowo

Generally lgtm, but I would prefer to make it more similar to Job & MPIJob.

pkg/controller.v1/pytorch/hpa.go

pkg/util/status_test.go

pkg/controller.v1/common/job.go

mimowo · 2023-07-14T15:37:15Z

pkg/controller.v1/common/job.go

+			return err
+		}
+		for rType := range jobStatus.ReplicaStatuses {
+			jobStatus.ReplicaStatuses[rType].Active = 0


I guess this would be set anyway by the other code once the replica pods are cleaned up. This is the approach we take in MPIJob and Job. I would like us to apply the same approach here.

We don't have any other code that resets the Active field; replicaStatus[*].Active is reset only by the following code:

training-operator/pkg/controller.v1/common/job.go

Lines 143 to 148 in 72f2512

	if commonutil.IsSucceeded(jobStatus) {
		for rtype := range jobStatus.ReplicaStatuses {
			jobStatus.ReplicaStatuses[rtype].Succeeded += jobStatus.ReplicaStatuses[rtype].Active
			jobStatus.ReplicaStatuses[rtype].Active = 0
		}
	}

So we need to reset the Active field here if the job is suspended.

However, since I think we should reset the Active field when cleaning up replica pods, I will do that refactoring in follow-ups.

is there any code that sets Active to non zero?

is there any code that sets Active to non zero?

Yes, here:

training-operator/pkg/controller.v1/common/pod.go

Line 372 in 72f2512

updateJobReplicaStatuses(jobStatus, rType, pod)

->

training-operator/pkg/controller.v1/common/status.go

Line 15 in 72f2512

func updateJobReplicaStatuses(jobStatus *apiv1.JobStatus, rtype apiv1.ReplicaType, pod *corev1.Pod) {

->

training-operator/pkg/core/status.go

Lines 34 to 50 in 72f2512

	func UpdateJobReplicaStatuses(jobStatus *apiv1.JobStatus, rtype apiv1.ReplicaType, pod *corev1.Pod) {
		switch pod.Status.Phase {
		case corev1.PodRunning:
			if pod.DeletionTimestamp != nil {
				// when node is not ready, the pod will be in terminating state.
				// Count deleted Pods as failures to account for orphan Pods that
				// never have a chance to reach the Failed phase.
				jobStatus.ReplicaStatuses[rtype].Failed++
			} else {
				jobStatus.ReplicaStatuses[rtype].Active++
			}
		case corev1.PodSucceeded:
			jobStatus.ReplicaStatuses[rtype].Succeeded++
		case corev1.PodFailed:
			jobStatus.ReplicaStatuses[rtype].Failed++
		}
	}

so those functions wouldn't be called in the next reconcile, essentially resetting the number of active pods?

so those functions wouldn't be called in the next reconcile

Yes. If the job is suspended, JobController never calls ReconcilePods(): https://github.com/tenzen-y/training-operator/blob/ce7259ecfaacbd529b6b1095dd6b632517dac0d0/pkg/controller.v1/common/job.go#L147-L173

Instead of setting counts manually, why can't it be derived from the status of all pods? Since we have already cleaned up, active pods will be zero. We can do this refactoring separately as well.

We don't have functions that decrease the Active count, only functions that reset it. So I guess we should refactor ReconcileJob(). However, the refactor will affect the Succeeded and Failed conditions, too. So I would like to work on it in another PR.
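The alternative raised in this thread, deriving the counts from the pods that currently exist instead of resetting them by hand, could be sketched as below. The `Pod`, `PodPhase`, and `ReplicaStatus` types are simplified stand-ins and `countReplicaStatus` is a hypothetical helper; this is not the refactoring that was later implemented.

```go
package main

import "fmt"

// PodPhase and Pod are simplified stand-ins for the corev1 types (assumption).
type PodPhase string

const (
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

type Pod struct {
	Phase       PodPhase
	Terminating bool // stands in for a non-nil DeletionTimestamp
}

// ReplicaStatus mirrors the three counters discussed in the thread.
type ReplicaStatus struct {
	Active, Succeeded, Failed int32
}

// countReplicaStatus recomputes all counters from scratch on every call,
// so an empty pod list naturally yields Active == 0.
func countReplicaStatus(pods []Pod) ReplicaStatus {
	var rs ReplicaStatus
	for _, p := range pods {
		switch p.Phase {
		case PodRunning:
			if p.Terminating {
				rs.Failed++ // running but being deleted counts as failed
			} else {
				rs.Active++
			}
		case PodSucceeded:
			rs.Succeeded++
		case PodFailed:
			rs.Failed++
		}
	}
	return rs
}

func main() {
	// After suspension removes every pod, nothing is left to count.
	fmt.Printf("%+v\n", countReplicaStatus(nil))
}
```

Because the counters start from zero on each call, no explicit reset is needed once the pods are gone. As noted above, the same recount would also change how Succeeded and Failed are derived.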

mimowo · 2023-07-14T15:40:46Z

pkg/controller.v1/common/job.go

+		}
+		jc.Recorder.Event(runtimeObject, corev1.EventTypeNormal, commonutil.NewReason(jobKind, commonutil.JobSuspendedReason), msg)
+		if !reflect.DeepEqual(*oldStatus, jobStatus) {
+			return jc.Controller.UpdateJobStatusInApiServer(job, &jobStatus)


Would it be possible to let the reconcile function continue, so that other status fields are updated, such as Active (mentioned above)? I think it is preferable not to update here but let other fields be updated too, so that we can update as much as possible in a single reconciliation run.

I think it is preferable not to update here but let other fields be updated too, so that we can update as much as possible in a single reconciliation run.

That makes sense. As I said above, we need to refactor ReconcilePods:

training-operator/pkg/controller.v1/common/pod.go

Line 268 in 72f2512

func (jc *JobController) ReconcilePods(

So I will address your suggestion in follow-ups.

pkg/controller.v1/pytorch/pytorchjob_controller.go

tenzen-y · 2023-07-16T17:19:49Z

@mimowo I updated this PR. PTAL.

pkg/util/status.go

pkg/controller.v1/common/job.go

pkg/controller.v1/pytorch/pytorchjob_controller_test.go

tenzen-y · 2023-07-18T06:41:17Z

I have rebased.

google-oss-prow · 2023-07-18T12:03:20Z

@mimowo: changing LGTM is restricted to collaborators

In response to this:

/lgtm
/assign @alculquicondor

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

alculquicondor · 2023-07-18T19:03:55Z

Is the title accurate? It says PyTorchJob, but I see API updates in every CRD.

tenzen-y · 2023-07-18T19:12:23Z

Is the title accurate? It says PyTorchJob, but I see API updates in every CRD.

Sure.

alculquicondor

LGTM overall
I just have side questions and a nit

alculquicondor · 2023-07-18T19:17:40Z

pkg/controller.v1/common/job.go

 			continue
 		}
-		if err := jc.PodControl.DeletePod(pod.Namespace, pod.Name, job.(runtime.Object)); err != nil {
+		if err := jc.PodControl.DeletePod(pod.Namespace, pod.Name, runtimeObject); err != nil {
 			return err
 		}
 		// Pod and service have the same name, thus the service could be deleted using pod's name.


side question: why is there a service per pod? That sounds like unnecessary load.

IIUC, ml framework configs need different FQDN for each pod.

For example, tensorflow ClusterSpec: https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec

wouldn't a single headless Service allow that? similar to this https://kubernetes.io/docs/tasks/job/job-with-pod-to-pod-communication/

Ah, right.
Actually, using a single headless service was planned, although the issue is closed :(

#1030
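To illustrate the single headless Service idea: in Kubernetes, a pod whose spec.hostname is set and whose spec.subdomain matches the name of a headless Service in the same namespace gets its own DNS name of the form hostname.subdomain.namespace.svc.cluster-domain. The sketch below only builds such names. `podFQDN` and the example job and namespace names are hypothetical, and "cluster.local" is assumed as the cluster domain.

```go
package main

import "fmt"

// podFQDN builds the DNS name a pod receives when its hostname is set and its
// subdomain matches a headless Service in the same namespace.
func podFQDN(hostname, subdomain, namespace, clusterDomain string) string {
	return fmt.Sprintf("%s.%s.%s.svc.%s", hostname, subdomain, namespace, clusterDomain)
}

func main() {
	// One headless Service ("example-job", hypothetical) shared by all replicas.
	for i := 0; i < 3; i++ {
		host := fmt.Sprintf("example-job-worker-%d", i)
		fmt.Println(podFQDN(host, "example-job", "default", "cluster.local"))
	}
}
```

Each replica still gets a distinct name for framework configs such as a TensorFlow ClusterSpec, without one Service object per pod.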

alculquicondor · 2023-07-18T19:21:34Z

pkg/controller.v1/common/job.go

+			return err
+		}
+		for rType := range jobStatus.ReplicaStatuses {
+			jobStatus.ReplicaStatuses[rType].Active = 0


is there any code that sets Active to non zero?

alculquicondor · 2023-07-18T19:27:28Z

pkg/util/status.go

+}
+
+func IsSuspend(status apiv1.JobStatus) bool {
+	return hasCondition(status, apiv1.JobSuspended)


you could still rename the function to isConditionTrue

alculquicondor · 2023-07-19T14:24:39Z

LGTM

tenzen-y · 2023-07-19T14:42:48Z

Thanks everyone!

/hold cancel
/assign @johnugeorge

pkg/controller.v1/pytorch/hpa.go

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

tenzen-y · 2023-07-20T16:00:46Z

@johnugeorge I addressed your comments and squashed commits into one. PTAL.

pkg/controller.v1/common/job.go

johnugeorge · 2023-07-20T20:24:39Z

Thanks for this awesome feature!
/lgtm
/approve

google-oss-prow · 2023-07-20T20:24:48Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [johnugeorge]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow bot added the do-not-merge/work-in-progress label Jul 11, 2023

tenzen-y marked this pull request as ready for review July 11, 2023 12:34

google-oss-prow bot removed the do-not-merge/work-in-progress label Jul 11, 2023

google-oss-prow bot requested review from jinchihe and kuizhiqing July 11, 2023 12:34

google-oss-prow bot added the size/XL label Jul 11, 2023

tenzen-y force-pushed the support-suspend-semantics-for-pytorchjob branch from 383e6b7 to 31154ff Compare July 11, 2023 12:46

google-oss-prow bot assigned johnugeorge Jul 11, 2023


trasc reviewed Jul 11, 2023


tenzen-y force-pushed the support-suspend-semantics-for-pytorchjob branch from 42b1cc9 to 1ab6390 Compare July 11, 2023 15:32

johnugeorge reviewed Jul 12, 2023


pkg/util/status.go (outdated)

tenzen-y force-pushed the support-suspend-semantics-for-pytorchjob branch from 1ab6390 to c1af716 Compare July 13, 2023 10:07

google-oss-prow bot added the do-not-merge/hold label Jul 13, 2023

mimowo reviewed Jul 14, 2023


mimowo reviewed Jul 17, 2023


google-oss-prow bot added size/XXL size/XL and removed size/XL size/XXL labels Jul 17, 2023

tenzen-y force-pushed the support-suspend-semantics-for-pytorchjob branch from ffb736e to 1ed3e8e Compare July 18, 2023 06:41

google-oss-prow bot added size/XXL and removed size/XL labels Jul 18, 2023

google-oss-prow bot assigned alculquicondor Jul 18, 2023

tenzen-y mentioned this pull request Jul 18, 2023

Support kubeflow.org/pytorchjob kubernetes-sigs/kueue#995

Merged

tenzen-y changed the title ~~Implement suspend semantics to PyTorchJob~~ Implement suspend semantics Jul 18, 2023

alculquicondor reviewed Jul 18, 2023


tenzen-y force-pushed the support-suspend-semantics-for-pytorchjob branch from 77ec73e to 4c28296 Compare July 19, 2023 05:53

google-oss-prow bot removed the do-not-merge/hold label Jul 19, 2023

johnugeorge reviewed Jul 20, 2023


pkg/controller.v1/pytorch/hpa.go (outdated)

Implement suspend semantics

e4bf325

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

tenzen-y force-pushed the support-suspend-semantics-for-pytorchjob branch from 4c28296 to e4bf325 Compare July 20, 2023 15:59

johnugeorge reviewed Jul 20, 2023


pkg/controller.v1/common/job.go (resolved)

google-oss-prow bot added the lgtm label Jul 20, 2023

google-oss-prow bot added the approved label Jul 20, 2023

google-oss-prow bot merged commit 64e39f2 into kubeflow:master Jul 20, 2023

tenzen-y deleted the support-suspend-semantics-for-pytorchjob branch July 20, 2023 20:57

tenzen-y mentioned this pull request Aug 1, 2023

Implement integration test for MPIJob v1 related to suspend semantics #1875

Merged

1 task

johnugeorge mentioned this pull request Aug 5, 2023

[Release] Training operator 1.7.0 release #1809

Closed

8 tasks

tenzen-y mentioned this pull request Aug 7, 2023

Support queue-related logic with kube-queue #1519

Closed

