Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add paddle operator proposal to kubeflow community. #502

Merged
merged 1 commit into from
Jun 21, 2021
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
236 changes: 236 additions & 0 deletions proposals/paddle-operator-proposal.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,236 @@
# Paddle Operator Proposal

## Motivation

[PaddlePaddle](https://github.com/paddlepaddle) is a widely used machine learning framework in China. However there is no easy way to launch PaddlePaddle training jobs on Kubernetes. By providing a CRD and a custom controller, we can make PaddlePaddle distributed training simple for end users. For more information about PaddlePaddle, please check our [website](https://www.paddlepaddle.org.cn/).

## Goals

Kubeflow user should be able to run training using PaddlePaddle easily on Kubernetes. This will be implemented by using Kubernetes operator. An end user can run training jobs with PaddlePaddle on local or cloud.

The proposal defines the followings:

* Provide a Custom Resource Definition (CRD) for defining PaddlePaddle training job, currently supports running two distributed tasks, ParameterServer (PS) and Collective.
* Implement a controller to manage the CRD, create dependent resources, and reconcile to the desired states.
* The script for operator and controller deployment.
* Several distributed PaddlePaddle training examples.

## Non-Goals

For the model serving part, it will not be included in the paddle-operator.

## API (CRD and resulting objects)

```
deploy
|-- examples
| |-- resnet.yaml
| |-- wide_and_deep.yaml
| |-- wide_and_deep_podip.yaml
| |-- wide_and_deep_service.yaml
| `-- wide_and_deep_volcano.yaml
|-- v1
| |-- crd.yaml
| `-- operator.yaml
`-- v1beta1
|-- crd.yaml
`-- operator.yaml
```

### Custom Resource Definition
The custom resource definition yaml example is as following:

```yaml
apiVersion: batch.paddlepaddle.org/v1
kind: PaddleJob
metadata:
name: wide-ande-deep
spec:
withGloo: 1
intranet: PodIP
cleanPodPolicy: OnCompletion
worker:
replicas: 2
template:
spec:
containers:
- name: paddle
image: registry.baidubce.com/paddle-operator/demo-wide-and-deep:v1
ps:
replicas: 2
template:
spec:
containers:
- name: paddle
image: registry.baidubce.com/paddle-operator/demo-wide-and-deep:v1
```

* The optional configuration of withGloo is 0 not enabled, 1 only starts the worker side, 2 starts all (worker and server), it is recommended to set 1;

* The cleanPodPolicy can be optionally configured as Always/Never/OnFailure/OnCompletion, which indicates whether to delete the pod when the task is terminated (failed or successful). It is recommended to Never during debugging and OnCompletion during production;

* The intranet can be optionally configured as Service/PodIP, which means the communication method between pods. The user does not need to configure it, and PodIP is used by default;

* The content of ps and worker is podTemplateSpec. Users can add more content according to the Kubernetes specification, such as GPU configuration.

We also provide PaddlePaddle collective mode with GPU.

```
apiVersion: batch.paddlepaddle.org/v1
kind: PaddleJob
metadata:
name: resnet
spec:
cleanPodPolicy: Never
worker:
replicas: 2
template:
spec:
containers:
- name: paddle
image: registry.baidubce.com/paddle-operator/demo-resnet:v1
command:
- python
args:
- "-m"
- "paddle.distributed.launch"
- "train_fleet.py"
volumeMounts:
- mountPath: /dev/shm
name: dshm
resources:
limits:
nvidia.com/gpu: 1
volumes:
- name: dshm
emptyDir:
medium: Memory
```

* Here you need to add shared memory to mount to prevent cache errors;

* This example uses the built-in data set. After the program is started, it will be downloaded. Depending on the network environment, it may wait a long time.


### Resulting Master
```yaml
apiVersion: v1
kind: Service
metadata:
name: wide-ande-deep-service-ps-0
namespace: paddle-system
ownerReferences:
- apiVersion: batch.paddlepaddle.org/v1
blockOwnerDeletion: true
controller: true
kind: PaddleJob
name: wide-ande-deep-service
uid: 8f432e67-8cda-482c-b147-91f9d4400067
resourceVersion: "9513616"
selfLink: /api/v1/namespaces/paddle-system/services/wide-ande-deep-service-ps-0
uid: e274db1e-ee7f-4b6d-bc0c-034c32f4b7a1
spec:
clusterIP: None
ports:
- port: 2379
protocol: TCP
targetPort: 2379
selector:
paddle-res-name: wide-ande-deep-service-ps-0
sessionAffinity: None
type: ClusterIP
```

```yaml
apiVersion: v1
kind: Pod
metadata:
name: wide-ande-deep-ps-0
namespace: paddle-system
ownerReferences:
- apiVersion: batch.paddlepaddle.org/v1
blockOwnerDeletion: true
controller: true
kind: PaddleJob
name: wide-ande-deep
uid: f206587f-5dee-46f5-9399-e835bde02487
resourceVersion: "9506900"
selfLink: /api/v1/namespaces/paddle-system/pods/wide-ande-deep-ps-0
uid: 36b27c8f-9712-474b-b21b-dd6b54aaef29
spec:
containers:
- env:
- name: POD_IP
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: status.podIP
- name: PADDLE_TRAINER_ID
value: "0"
- name: TRAINING_ROLE
value: PSERVER
- name: PADDLE_TRAINING_ROLE
value: PSERVER
envFrom:
- configMapRef:
name: wide-ande-deep
image: registry.baidubce.com/paddle-operator/demo-wide-and-deep:v1
```

### Resulting Worker
```yaml
apiVersion: v1
kind: Pod
metadata:
name: wide-ande-deep-worker-0
namespace: paddle-system
ownerReferences:
- apiVersion: batch.paddlepaddle.org/v1
blockOwnerDeletion: true
controller: true
kind: PaddleJob
name: wide-ande-deep
uid: f206587f-5dee-46f5-9399-e835bde02487
resourceVersion: "9507629"
selfLink: /api/v1/namespaces/paddle-system/pods/wide-ande-deep-worker-0
uid: e8534fe6-7c2e-4849-9a99-ffdcd5df76bb
spec:
containers:
- env:
- name: POD_IP
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: status.podIP
- name: PADDLE_TRAINER_ID
value: "0"
- name: TRAINING_ROLE
value: TRAINER
- name: PADDLE_TRAINING_ROLE
value: TRAINER
envFrom:
- configMapRef:
name: wide-ande-deep
image: registry.baidubce.com/paddle-operator/demo-wide-and-deep:v1
```

The worker spec generates a pod. Currently worker will communicate to the master through the master's service name, we'll use a service registry for service discovery.

## Design

Here are some original design docs for paddle-perator on Kubernetes.

* Paddle Operator Architecture on Kubernetes, please check out [design-arch](https://github.com/PaddleFlow/paddle-operator/blob/main/docs/design-arch.md)
* Paddle training job instance fault tolerant, please check out [design-fault-tolerant](https://github.com/PaddleFlow/paddle-operator/blob/main/docs/design-fault-tolerant.md)
* Co-scheduling training job to prevent job instances from resource deadlock, please check out [design-coschedule](https://github.com/PaddleFlow/paddle-operator/blob/main/docs/design-coschedule.md)
Comment on lines +223 to +225
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All link expired

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we've just merged the new refactoring, I'll fix it soon.



## Alternatives Considered

One option is to add PaddlePaddle support to the existing tf-operator, but the parameters and operations between two frameworks are quite different. Combining them may make the user experience unnecessarily complicated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. What about the implementation of the operator? Could you leverage https://github.com/kubeflow/common's interface that's used for other operators?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are using kube-builder as the skeleton for the paddle-operator, part of the kubeflow common components are not necessary for our project, like job controller etc. We'll see whether we can leverage on kubeflow common for advanced features.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your reply, we are aware of your concern, technically, we are building our operator with the newest version of kubebuilder which take care of almost all the staff like Informer/Indexer/clienset etc. It leaves out the Reconcile function to be implemented, it also provides CRUD operations directly on resources with context, which means that the kubeflow/common may not necessary in this circumstances.


## Current Status

We recently refactored the paddle-operator for better performance and code readability. And we'll merge the dev branch back into main branch soon, so our real code branch is `dev` at this moment.