feat(pytorch): Add elastic proposal #522

gaocegege · 2021-08-16T12:15:12Z

Signed-off-by: cegao <cegao@tencent.com>

google-oss-robot · 2021-08-16T12:15:19Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gaocegege
To complete the pull request process, please assign james-jwu after the PR has been reviewed.
You can assign the PR to them by writing /assign @james-jwu in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gaocegege · 2021-08-18T07:54:44Z

/cc @kubeflow/wg-training-leads @alculquicondor @zw0610

alculquicondor · 2021-08-18T13:26:47Z

proposals/pytorch-elastic-proposal.md

+}
+```
+
+Two fields are added in `common.ReplicaSpec`: `minReplicas` and `maxReplicas`. They act as MIN_SIZE and MAX_SIZE in the elastic example above.


Do these fields make sense for every other operator?

Personally, I think so. But we should discuss it further /cc @kubeflow/wg-training-leads

alculquicondor · 2021-08-18T13:33:03Z

proposals/pytorch-elastic-proposal.md

+
+### Environment Variables
+
+`SetPodEnv` in `pkg/controller.v1/pytorch/pytorch.go` should be changed. There is no need to set `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT` if TorchElastic is used. `KUBEFLOW_RDZV_HOST`, `KUBEFLOW_RDZV_PORT`, `KUBEFLOW_MIN_SIZE` and `KUBEFLOW_MAX_SIZE` should be set instead.


What determines whether a PyTorchJob is elastic? Is it whether `minReplicas` and `maxReplicas` are non-nil?

How does the operator choose between elastic and non-elastic execution?

Option 1. Introduce a new field to indicate that it is an elastic job.
Option 2. Compare min and max so the controller can tell implicitly.

On the operator side, I think it will just reconcile deltas, adding or removing pods, which is already part of the logic. But we do need some changes to honor the new fields.

https://github.com/kubeflow/common/blob/2f3f636f16ef4cedb12543a96ac1412da98bbca5/pkg/reconciler.v1/common/pod.go#L139-L144

Yes, I think so. Prefer the latter.
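
A minimal sketch of Option 2, assuming the rule is "elastic when both bounds are set". The function names `is_elastic` and `env_names` are hypothetical, and the thread does not settle whether `min == max` should count as elastic; the variable names themselves are the ones listed in the proposal text quoted above.

```python
from typing import List, Optional


def is_elastic(min_replicas: Optional[int], max_replicas: Optional[int]) -> bool:
    """Assumed rule: a job is elastic only when both bounds are set."""
    if min_replicas is None or max_replicas is None:
        return False
    if min_replicas < 1 or max_replicas < min_replicas:
        raise ValueError("expected 1 <= minReplicas <= maxReplicas")
    return True


def env_names(elastic: bool) -> List[str]:
    """Names of the environment variables the controller would inject."""
    if elastic:
        return [
            "KUBEFLOW_RDZV_HOST",
            "KUBEFLOW_RDZV_PORT",
            "KUBEFLOW_MIN_SIZE",
            "KUBEFLOW_MAX_SIZE",
        ]
    # Non-elastic jobs keep the existing static variables.
    return ["RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"]
```

With this rule, existing jobs that set neither field keep the current behavior, which is the point of detecting elasticity implicitly instead of adding a new flag.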

alculquicondor · 2021-08-18T13:34:14Z

proposals/pytorch-elastic-proposal.md

+
+### Reconciliation
+
+`JobController.ReconcilePods` should be refactored. Now the pods are returned by `GetPodSlices`. For example, if `spec.Replicas` is 3, the PodSlices may look like: `[[0],[1],[2]]`. It is not expected when elastic training is enabled.


What should GetPodSlices return for elastic jobs instead?

How does the controller decide how many pods to create?

@alculquicondor I assume the controller just provides the elastic ability. Should a different control loop, such as an autoscaler, make the decision?

I will elaborate on it in the proposal.

If it happens in a different control loop, will the PyTorch controller create a number of pods equal to `replicas`?

Also, if it happens in a separate control loop, why do we have to make minReplicas and maxReplicas part of the ReplicaSpec?

They could just be part of the HPA object.
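
To make the "reconcile deltas" idea concrete, here is a rough sketch, assuming a separate control loop picks the replica count and the job controller only adds or removes pods to match it. `plan_pods` and `clamp_replicas` are hypothetical helpers, not functions from kubeflow/common.

```python
from typing import Iterable, List, Tuple


def clamp_replicas(requested: int, min_replicas: int, max_replicas: int) -> int:
    """Keep an autoscaler's requested size inside [min_replicas, max_replicas]."""
    return max(min_replicas, min(requested, max_replicas))


def plan_pods(existing_indices: Iterable[int], replicas: int) -> Tuple[List[int], List[int]]:
    """Return (indices to create, indices to delete) for the desired replica count."""
    existing = set(existing_indices)
    wanted = set(range(replicas))
    to_create = sorted(wanted - existing)
    to_delete = sorted(existing - wanted)
    return to_create, to_delete
```

For example, `plan_pods([0, 1, 2], 5)` asks for pods 3 and 4 to be created, while `plan_pods([0, 1, 2, 3], 2)` asks for pods 2 and 3 to be deleted.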

alculquicondor · 2021-08-18T13:35:19Z

proposals/pytorch-elastic-proposal.md

+      value: "${pytorchjob.spec.replicas[worker].minReplicas}"
+    - name: KUBEFLOW_MAX_SIZE
+      value: "${pytorchjob.spec.replicas[worker].maxReplicas}"
+    command: "python -m torch.distributed.run --rdzv_backend=c10d --rdzv_endpoint=$KUBEFLOW_RDZV_HOST:$KUBEFLOW_RDZV_PORT --nnodes=$KUBEFLOW_MIN_SIZE:$KUBEFLOW_MAX_SIZE --nproc_per_node=1 xxx.py"


Could the operator set a default command and then users can use the args to append more arguments and the python file?

I think users may want to set their own entrypoint in the command, so it may be better to keep the command here. WDYT?

PyTorch 1.10 introduced torchrun, so you may need to be flexible to accommodate <1.10 versions that use python -m torch.distributed.run and >=1.10 with torchrun.

Yeah, we now use the built-in `PET_*` environment variables for this, so I think we no longer have the problem.
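
If a default entrypoint were ever needed, the version split mentioned above could be handled roughly as follows. This is only a sketch: `parse_version` and `launcher_argv` are hypothetical names, and it assumes the PyTorch version of the image is known to whoever builds the command.

```python
from typing import List, Tuple


def parse_version(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '1.9.0' or '1.10.0+cu113'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def launcher_argv(torch_version: str) -> List[str]:
    """Pick the launcher prefix: torchrun from 1.10 on, the module form before."""
    if parse_version(torch_version) >= (1, 10):
        return ["torchrun"]
    return ["python", "-m", "torch.distributed.run"]
```

Comparing integer tuples matters here: as plain strings, "1.10" sorts before "1.9", which would pick the wrong launcher.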

alculquicondor · 2021-08-18T13:38:44Z

proposals/pytorch-elastic-proposal.md

+}
+```
+
+### Autoscaler Integration


Is this part of the "alternatives considered"?

I think we need to support it, but I am not sure how it affects the API design. As you know, no built-in Kubernetes resource has minReplicas and maxReplicas except the Autoscaler.

I was wondering whether we should put them in the PyTorchJob CRD, which is why I placed it under the alternatives considered.

Jeffwan · 2021-08-19T06:34:53Z

Nice proposal. I will leave some comments tomorrow.

johnugeorge · 2021-08-19T08:49:46Z

proposals/pytorch-elastic-proposal.md

+
+### Environment Variables
+
+`SetPodEnv` in `pkg/controller.v1/pytorch/pytorch.go` should be changed. There is no need to set `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT` if TorchElastic is used. `KUBEFLOW_RDZV_HOST`, `KUBEFLOW_RDZV_PORT`, `KUBEFLOW_MIN_SIZE` and `KUBEFLOW_MAX_SIZE` Should be set instead.


How does the operator choose between elastic and non-elastic execution?

andreyvelich · 2021-08-19T18:46:10Z

proposals/pytorch-elastic-proposal.md

+
+## Limitations
+
+- KUBEFLOW_RDZV_PORT will be open for every pod even though workers except worker-0 do not use it.


Can we add and expose this port only for the worker-0 pod?

Maybe we can. Do you mean we should handle it with a custom condition loop?

Yes, for example, if we don't want to have additional ports.
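
One way the condition could look, as a sketch only: `container_ports` and the port name `rdzv` are hypothetical, and the dictionaries merely mimic the shape of a Kubernetes container port entry.

```python
from typing import Dict, List, Union

PortSpec = Dict[str, Union[str, int]]


def container_ports(pod_index: int, rdzv_port: int) -> List[PortSpec]:
    """Expose the rendezvous port on worker-0 only; other workers get no extra port."""
    if pod_index == 0:
        return [{"name": "rdzv", "containerPort": rdzv_port}]
    return []
```

This keeps the rendezvous endpoint reachable on worker-0 while the remaining workers stay free of an unused open port.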

Jeffwan · 2021-08-19T23:48:14Z

proposals/pytorch-elastic-proposal.md

+          containers: 
+            - name: pytorch
+              image: <image>
+              command: "python -m torch.distributed.run --rdzv_backend=c10d --rdzv_endpoint=$KUBEFLOW_RDZV_HOST:$KUBEFLOW_RDZV_PORT --nnodes=$KUBEFLOW_MIN_SIZE:$KUBEFLOW_MAX_SIZE --nproc_per_node=1 xxx.py"


Let's make sure the rendezvous backend is the user's choice. We don't want to manage this part, right?

I think so. There are three backends now: static, c10d, and etcd. Users can also implement their own backend, such as Redis, and specify it manually. If they use c10d, we can set the rdzv endpoint for them. If they use etcd, they can set the endpoint themselves.

This part may need correction, since pytorch-elastic should be able to read these environment variables directly without them being specified as launch arguments.
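
A sketch of how the controller might fill in the environment while leaving the backend choice to the user. It assumes the launcher reads `PET_<ARGUMENT>` variables (for example `PET_NNODES` for `--nnodes`); the exact names should be checked against the installed PyTorch version, and `rendezvous_env` is a hypothetical helper.

```python
from typing import Dict, Optional


def rendezvous_env(
    backend: str,
    min_size: int,
    max_size: int,
    host: Optional[str] = None,
    port: Optional[int] = None,
) -> Dict[str, str]:
    """Build launcher environment variables; set the endpoint only for c10d."""
    env = {
        "PET_RDZV_BACKEND": backend,
        "PET_NNODES": f"{min_size}:{max_size}",
    }
    if backend == "c10d":
        if host is None or port is None:
            raise ValueError("c10d needs a host and port for the endpoint")
        env["PET_RDZV_ENDPOINT"] = f"{host}:{port}"
    # For etcd or a custom backend, the user supplies the endpoint themselves.
    return env
```

With variables like these set on the pod, the user's command would not need to repeat `--rdzv_endpoint` or `--nnodes` as launch arguments.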

Jeffwan · 2021-08-20T06:12:10Z

proposals/pytorch-elastic-proposal.md

+
+[TorchElastic operator](https://github.com/pytorch/elastic/blob/master/kubernetes/api/v1alpha1/elasticjob_types.go) implemented by @jeffwan puts the new fields under `PyTorchJobSpec`.
+
+Personally, prefer keeping it in `common.ReplicaSpec` since other Jobs may also need it.


Agree on this direction

Jeffwan · 2021-08-20T06:12:45Z

proposals/pytorch-elastic-proposal.md

+Personally, prefer keeping it in `common.ReplicaSpec` since other Jobs may also need it.
+
+```diff
+// PyTorchJobSpec is a desired state description of the PyTorchJob.


On the API side, let's take this into consideration: we want to support user-specified pods to scale in, as an optional field, to give more granular control. See kubeflow/mpi-operator#410.

terrytangyuan · 2021-08-20T18:12:20Z

proposals/pytorch-elastic-proposal.md

+
+## Abstract
+
+[TorchElastic](https://pytorch.org/docs/1.9.0/distributed.elastic.html), which was open sourced over a year ago in the pytorch/elastic GitHub repository, is a runner and coordinator for PyTorch worker processes. It has been part of PyTorch core since 1.9.0. This proposal is to support this feature with the help of PyTorchJob.


What's involved to support additional frameworks with elastic capabilities?

We do not currently have a unified CRD to support all frameworks, so I think we can support different frameworks in different CRDs. This PR is for PyTorchJob.

WDYT?

Sounds good. We just need to highlight what features are available for which job kinds in the docs then.

gaocegege · 2021-10-27T02:59:26Z

We may also have this problem: pytorch/pytorch#65992

We should consider it in this proposal.

gaocegege · 2021-11-04T08:34:11Z

/cc @qiankunli

google-oss-prow · 2021-11-04T08:34:12Z

@gaocegege: GitHub didn't allow me to request PR reviews from the following users: qiankunli.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @qiankunli

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

gaocegege · 2021-11-10T09:09:02Z

I updated the details about HPA. BTW, I think we should upgrade the PyTorchJob APIVersion to v2beta1 since it is a breaking change.

alculquicondor · 2021-11-10T14:23:31Z

What is breaking about it? If users don't set elastic configurations, wouldn't it work as before?

gaocegege · 2021-11-11T02:43:43Z

I double-checked the CRD definition. We can keep compatibility with v1. common.JobStatus is changed but it does not block v1 if we upgrade the kubeflow/common version. Thus we can do it in v1.

```diff
 type ReplicaStatus struct {
+	// LabelSelector is the selector for the replica.
+	LabelSelector *metav1.LabelSelector `json:"labelSelector,omitempty"`
 	// The number of actively running pods.
 	Active int32 `json:"active,omitempty"`
 	// The number of pods which reached phase Succeeded.
 	Succeeded int32 `json:"succeeded,omitempty"`
 	// The number of pods which reached phase Failed.
 	Failed int32 `json:"failed,omitempty"`
 }
```

alculquicondor · 2021-11-11T14:16:09Z

Right, API changes are breaking only if you remove or rename a field. I think the proposed changes look backwards compatible.
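
The compatibility argument can be illustrated with a small decoding sketch. It is written in Python for brevity, so the `ReplicaStatus` dataclass and `decode` helper below are only a hypothetical mirror of the Go struct: a payload written before the change simply lacks the new optional field and still decodes.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReplicaStatus:
    active: int = 0
    succeeded: int = 0
    failed: int = 0
    # New optional field; absent in payloads produced before the change.
    label_selector: Optional[dict] = None


def decode(payload: str) -> ReplicaStatus:
    """Decode a status payload, tolerating the missing optional field."""
    raw = json.loads(payload)
    return ReplicaStatus(
        active=raw.get("active", 0),
        succeeded=raw.get("succeeded", 0),
        failed=raw.get("failed", 0),
        label_selector=raw.get("labelSelector"),
    )
```

Removing or renaming `active`, by contrast, would silently zero the value for existing payloads, which is the breaking case described above.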

gaocegege · 2021-11-19T03:11:01Z

/cc @kubeflow/wg-training-leads

Could you please have another look?

The PR kubeflow/training-operator#1453 is ready. Coveralls coverage increased (+7.1%) to 15.252%, and PyTorch-related test coverage increased from 0% to 80%.

Jeffwan · 2021-11-29T15:09:20Z

The proposal looks good to me overall. One thing I am not clear about is the metrics HPA part. I feel it is better to decouple it from the job controller, but I understand users may need a simple solution for easy onboarding. A global optimizer is something I am looking for; we can discuss it later. It looks like the PR is merged, so let's merge this one as well.

/lgtm

theadactyl · 2021-12-15T19:43:50Z

@kubeflow/wg-training-leads could we get an "/approved" for this?

terrytangyuan · 2021-12-15T19:48:52Z

/approve

google-oss-prow · 2021-12-15T19:53:25Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gaocegege, terrytangyuan
To complete the pull request process, please assign james-jwu after the PR has been reviewed.
You can assign the PR to them by writing /assign @james-jwu in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

terrytangyuan · 2022-01-26T21:56:40Z

/assign @james-jwu

stale · 2022-04-27T20:37:29Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

terrytangyuan · 2022-04-27T22:30:11Z

/assign @theadactyl

stale · 2022-09-21T05:17:05Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

feat(pytorch): Add elastic proposal

8392419

Signed-off-by: cegao <cegao@tencent.com>

google-cla bot added the cla: yes label Aug 16, 2021

google-oss-robot requested a review from Bobgy August 16, 2021 12:15

google-oss-robot requested a review from theadactyl August 16, 2021 12:15

google-oss-robot added the size/L label Aug 16, 2021

gaocegege added 2 commits August 17, 2021 17:46

feat: Update

e884159

Signed-off-by: cegao <cegao@tencent.com>

Update

dc058c7

Signed-off-by: cegao <cegao@tencent.com>

gaocegege changed the title ~~feat(pytorch): Add elastic proposal WIP~~ feat(pytorch): Add elastic proposal Aug 18, 2021

google-oss-robot requested review from alculquicondor and zw0610 August 18, 2021 07:54

alculquicondor reviewed Aug 18, 2021

fix: Update author info

b944007

Signed-off-by: cegao <cegao@tencent.com>

johnugeorge reviewed Aug 19, 2021

andreyvelich reviewed Aug 19, 2021

Jeffwan reviewed Aug 19, 2021

gaocegege and others added 3 commits August 20, 2021 09:44

Update proposals/pytorch-elastic-proposal.md

2515409

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

Update proposals/pytorch-elastic-proposal.md

5b2e17b

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

Update proposals/pytorch-elastic-proposal.md

939d649

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

zw0610 reviewed Aug 20, 2021

chore: Add zw as an author

e1aeef9

Signed-off-by: cegao <cegao@tencent.com>

Jeffwan reviewed Aug 20, 2021

terrytangyuan reviewed Aug 20, 2021

gaocegege marked this pull request as draft October 25, 2021 02:32

google-oss-robot added the do-not-merge/work-in-progress label Oct 25, 2021

Jeffwan mentioned this pull request Oct 27, 2021

add MinReplicas and MaxReplicas field kubeflow/common#171

Open

feat: Add elastic policy in CRD

99f2b80

Signed-off-by: cegao <cegao@tencent.com>

gaocegege marked this pull request as ready for review November 4, 2021 08:28

google-oss-prow bot removed the do-not-merge/work-in-progress label Nov 4, 2021

alculquicondor reviewed Nov 4, 2021

fix: Add HPA

b564524

Signed-off-by: cegao <cegao@tencent.com>

gaocegege mentioned this pull request Nov 12, 2021

[request] Do we have plan to merge Kubernetes part to kubeflow/pytorch-operator? pytorch/elastic#117

Open

gaocegege mentioned this pull request Nov 23, 2021

feat: Add LabelSelector kubeflow/common#177

Merged

google-oss-prow bot assigned Jeffwan Nov 29, 2021

google-oss-prow bot added the lgtm label Nov 29, 2021

terrytangyuan approved these changes Dec 15, 2021

jbottum mentioned this pull request Jan 26, 2022

Training Operator WG and Kubeflow 1.5 release kubeflow/manifests#2105

Closed

google-oss-prow bot assigned james-jwu Jan 26, 2022

stale bot added the lifecycle/stale label Apr 27, 2022

google-oss-prow bot assigned theadactyl Apr 27, 2022

stale bot removed the lifecycle/stale label Apr 27, 2022

stale bot added the lifecycle/stale label Sep 21, 2022

