
k8s dist train for en #9789

Merged: 4 commits, Apr 11, 2018

Conversation

Yancey1989 (Contributor):

Fixed #8914

TBD
We introduced how to create a PaddlePaddle Job with a single node on Kuberentes in the
previous document.
In this article, we will introduce how to craete a PaddlePaddle job with multiple nodes
Contributor:

craete -> create


Before creating a training job, the users need to deploy the Python scripts and
training data which have already been sliced on the precast path in the distributed file
system(We can use the different type of Kuberentes Volumes to mount different distributed
Contributor:

"the users need to deploy the Python scripts and training data which have already been sliced on the precast path in the distributed file system" is a little confusing to me, because "sliced" and "on" sound like they go together, but they do not (from what I guess). How about "the users need to slice the training data and deploy the Python scripts along with it into the distributed file system"?
also I don't know why the path of data should be precast or predefined?

Contributor Author:

Thanks @putcn. There should be only one distributed file system used in the k8s cluster, and all the users share it; we need to precast the root path for each user, for example:

/pfs/home/example@email.com/xxx

Before creating a training job, the users need to deploy the Python scripts and
training data which have already been sliced on the precast path in the distributed file
system(We can use the different type of Kuberentes Volumes to mount different distributed
file system). Before start training, The program would copy the training data into the
Contributor:

"file system" -> "file systems"

Before creating a training job, the users need to deploy the Python scripts and
training data which have already been sliced on the precast path in the distributed file
system(We can use the different type of Kuberentes Volumes to mount different distributed
file system). Before start training, The program would copy the training data into the
Contributor:

"Before start training" -> "Before training starts"
"would copy' -> "will copy"

![PaddlePaddle on Kubernetes Architecture](src/k8s-paddle-arch.png)

The above figure describes a distributed training architecture which contains 3 nodes, each
Pod would mount a folder of the distributed file system to save training data and models
Contributor:

"would mount" -> "mounts"


The above figure describes a distributed training architecture which contains 3 nodes, each
Pod would mount a folder of the distributed file system to save training data and models
by Kubernetes Volume. Kubernetes created 3 Pod for this training phase and scheduled these on
Contributor:

Pod -> Pods

The above figure describes a distributed training architecture which contains 3 nodes, each
Pod would mount a folder of the distributed file system to save training data and models
by Kubernetes Volume. Kubernetes created 3 Pod for this training phase and scheduled these on
3 nodes, each Pod has a PaddlePaddle container. After the containers have been created,
Contributor:

have been -> are

Pod would mount a folder of the distributed file system to save training data and models
by Kubernetes Volume. Kubernetes created 3 Pod for this training phase and scheduled these on
3 nodes, each Pod has a PaddlePaddle container. After the containers have been created,
PaddlePaddle would start up the communication between PServer and Trainer and read training
Contributor:

would start -> starts

PaddlePaddle would start up the communication between PServer and Trainer and read training
data for this training job.

As the description above, we can start up a PaddlePaddle distributed training job on a ready
Contributor:

ready Kubernetes cluster -> Kubernetes-ready cluster

data for this training job.

As the description above, we can start up a PaddlePaddle distributed training job on a ready
Kubernetes cluster as the following steps:
Contributor:

as -> with


### Build a Docker Image

PaddlePaddle Docker Image needs to support the runtime environment of `Paddle PServer` and
Contributor:

Training docker image needs to package the paddle pserver and paddle trainer runtimes, as well as two more processes before we can kick off the training:

  1. Copying the training data into the container.
  2. Generating the initialization arguments for the paddle pserver and paddle trainer processes.

- Copy the training data into the container.
- Generate the start arguments of `Paddle PServer` and `Paddle Training` process.
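The two steps above can be sketched as a tiny container entry point. This is only an illustrative sketch: the helper name, directory arguments, trainer command, and `--train_data_dir` flag are all assumptions, not taken from the actual PaddlePaddle image.

```python
import shutil


def prepare_and_launch(shared_data_dir, local_data_dir, trainer_cmd="paddle_trainer"):
    # Step 1 (assumes the data was pre-sliced onto a mounted volume):
    # copy the training data from the shared mount into the container.
    shutil.copytree(shared_data_dir, local_data_dir, dirs_exist_ok=True)
    # Step 2: generate the start arguments for the training process.
    # The flag name below is illustrative, not PaddlePaddle's real CLI.
    return [trainer_cmd, f"--train_data_dir={local_data_dir}"]
```

In the real image the returned argument list would then be handed to the process launcher; here it is returned so the two responsibilities stay visible.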

Because of the official Docker Image `paddlepaddle/paddle:latest` has already included the
Contributor:

Since the paddlepaddle official docker image already has the runtimes we need, we'll take it as the base image and pack some additional scripts for the processes mentioned above to build our training image. For more details, please refer to the following link:

docker push [YOUR_REPO]/paddle:mypaddle
```

**[NOTE]**, in the above command arguments, `[YOUR_REPO]` representative your Docker repository,
Contributor:

representative -> represents

```

**[NOTE]**, in the above command arguments, `[YOUR_REPO]` representative your Docker repository,
you need to use your repository instead of it. We will use `[YOUR_REPO]/paddle:mypaddle` to
Contributor:

use your repository instead of it -> replace it with your repository name

### Prepare Training Data

We can download and split the training job by creating a Kubernetes Job, or custom your image
by editing [k8s_train](./src/k8s_train/README.md).
Contributor:

./src/k8s_train/README.md -> ./src/k8s_train/

by editing [k8s_train](./src/k8s_train/README.md).

Before creating a Job, we need to bind a [persistenVolumeClaim](https://kubernetes.io/docs/user-guide/persistent-volumes) by the different type of
the different distributed file system, the generated dataset would be saved on this volume.
Contributor:

the different distributed file system -> distributed file system

restartPolicy: Never
```

If success, you can see some information like this:
Contributor:

should we add "kubectl create -f XXX" here?

restartPolicy: Never
```

If success, you can see some information like this:
Contributor:

If success -> If created successfully

```

The `paddle-cluster-job` above is the job name for this training job; we need 3
PaddlePaddle training node and save the split training data on `paddle-cluster-job` path,
@putcn (Contributor), Apr 9, 2018:

training node -> training nodes
on -> in


The `paddle-cluster-job` above is the job name for this training job; we need 3
PaddlePaddle training node and save the split training data on `paddle-cluster-job` path,
the folder `0`, `1` and `2` representative the `training_id` on each node, `quick_start` folder is used to store training data, `output` folder is used to store the models and logs.
Contributor:

representative -> represents
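The folder layout described in the excerpt above (a root named after the job, one `training_id` folder per node, `quick_start` for the sliced data, `output` for models and logs) can be sketched as a small path-building helper. The mount root `/mnt` and the helper name are illustrative assumptions:

```python
import os


def trainer_dirs(mount_root, job_name, trainer_id):
    # One folder per trainer, named after its training_id, holding the
    # sliced data (`quick_start`) and the models/logs (`output`).
    base = os.path.join(mount_root, job_name, str(trainer_id))
    return {"data": os.path.join(base, "quick_start"),
            "output": os.path.join(base, "output")}


# the three training nodes from the example
layout = [trainer_dirs("/mnt", "paddle-cluster-job", i) for i in range(3)]
```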


### Create a Job

Kubernetes allow users to create an object with YAML files, and we can use a command-line tool
Contributor:

an object -> objects


In the above YAML file:
- `metadata.name`, The job name.
- `parallelism`, The Kubernetes Job would create `parallelism` Pods at the same time.
Contributor:

The Kubernetes Job would create -> whether the Kubernetes Job would create

In the above YAML file:
- `metadata.name`, The job name.
- `parallelism`, The Kubernetes Job would create `parallelism` Pods at the same time.
- `completions`, The Job would become the success status only the number of successful Pod(the exit code is 0)
Contributor:

only -> only when

```

Upon successful creation, Kubernetes would create 3 Pods as PaddlePaddle training node,
, pull the Docker image and begin to train.
Contributor:

extra ","


### Checkout the Output

At the process of training, we can check the logs and the output models, such as we store
@putcn (Contributor), Apr 9, 2018:

such as we store the output on output folder -> which is stored in the output folder


### Using Environment Variables

Usually we use the environment varialbes to configurate the PaddlePaddle Job which running on
Contributor:

which running on -> which runs in


Usually we use the environment varialbes to configurate the PaddlePaddle Job which running on
Kubernetes, `start_paddle.py` provides a start up script to convert the environment variable
to the start up argument of PaddlePaddle process:
Contributor:

argument -> arguments
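The conversion the excerpt describes (Pod environment variables into PaddlePaddle start-up arguments) can be sketched as a lookup table. The variable names and flag names below are illustrative assumptions, not the ones `start_paddle.py` actually uses:

```python
def env_to_start_args(env):
    # Hypothetical mapping from Pod environment variables to start-up
    # arguments; only variables present in the environment are emitted.
    mapping = {"PADDLE_PORT": "--port",
               "TRAINERS": "--trainers",
               "TRAINER_ID": "--trainer_id"}
    return [f"{flag}={env[var]}" for var, flag in mapping.items() if var in env]


args = env_to_start_args({"PADDLE_PORT": "7164", "TRAINERS": "3"})
```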


### Communication between Pods

At the begin of `start_paddle.py`, it would initialize and parse the arguments.
@putcn (Contributor), Apr 9, 2018:

would initialize -> initializes
parse -> parses

idMap = getIdMap(podlist)
```

**NOTE**: `getPodList()` would fetch all the pod in the current namespace, if some Pods are running, may cause some error. We will use [statfulesets](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets) instead of
@putcn (Contributor), Apr 9, 2018:

would fetch -> fetches
pod -> pods

if some Pods are running, may cause some error. -> if some pods are already running, it may cause some error.

**NOTE**: `getPodList()` would fetch all the pod in the current namespace, if some Pods are running, may cause some error. We will use [statfulesets](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets) instead of
Kubernetes Pod or Replicaset in the future.

For the implement of `getIdMap(podlist)`, this function would fetch each IP address of
Contributor:

For the implement of getIdMap(podlist), this function would fetch each IP address -> the function getIdMap(podlist) fetches the IP addresses
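The idea behind `getIdMap(podlist)` can be sketched in a few lines: collect each Pod's IP address, sort them so every Pod independently computes the same ordering, and assign a trainer id by position. The sort-by-IP rule and the helper name are assumptions for illustration:

```python
def get_id_map(pod_ips):
    # Sort the Pod IPs so that every Pod derives the same ordering on its
    # own, then map each IP to a trainer id by its position in the order.
    return {ip: i for i, ip in enumerate(sorted(pod_ips))}


id_map = get_id_map(["10.0.0.3", "10.0.0.1", "10.0.0.2"])
# each Pod can now look up its own trainer id by its IP address
```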


### Create Job

The main goal of `startPaddle` is generating the arguments of `Paddle PServer` and `Paddle Trainer` processes. Such as `Paddle Trainer`, we parse the environment variable and then get
Contributor:

Such as Paddle Trainer -> take Paddle Trainer as an example

@putcn (Contributor) commented Apr 9, 2018:

Great job @Yancey1989, I left some comments, please check.

@Yancey1989 (Contributor Author):

Thanks for @putcn's review! Updated per the comments.

@putcn (Contributor) left a review:

LGTM, thanks!

@Yancey1989 Yancey1989 merged commit 1d88ebe into PaddlePaddle:develop Apr 11, 2018
@Yancey1989 Yancey1989 deleted the k8s_dist_doc_en branch April 11, 2018 01:54