k8s dist train for en #9789

Yancey1989 · 2018-04-09T07:58:42Z

putcn · 2018-04-09T20:13:07Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

-TBD
+We introduced how to create a PaddlePaddle Job with a single node on Kuberentes in the
+previous document.
+In this article, we will introduce how to craete a PaddlePaddle job with multiple nodes


craete -> create

putcn · 2018-04-09T20:23:08Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+
+Before creating a training job, the users need to deploy the Python scripts and
+training data which have already been sliced on the precast path in the distributed file
+system(We can use the different type of Kuberentes Volumes to mount different distributed


"the users need to deploy the Python scripts and training data which have already been sliced on the precast path in the distributed file system" is a little confusing to me, because "sliced" and "on" sound like they belong together, but (from what I can guess) they do not. How about "the users need to slice the training data and deploy the Python scripts along with it into the distributed file system"?
Also, I don't know why the path of the data should be precast or predefined.

Thanks @putcn. There should be only one distributed file system in the k8s cluster, shared by all users, so we need to precast the root path for each user, for example:

/pfs/home/example@email.com/xxx
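
A minimal sketch of how such a per-user root could be derived. The helper name `user_root` is hypothetical, and the `/pfs/home` prefix is taken only from the example path above; the real mount point depends on the cluster.

```python
import posixpath

# Assumed mount point, copied from the example path in this thread.
PFS_HOME = "/pfs/home"


def user_root(email):
    """Return the predefined root directory for one user (hypothetical helper)."""
    return posixpath.join(PFS_HOME, email)
```

Every job submitted by the same user would then read and write under the directory this returns.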

putcn · 2018-04-09T20:23:48Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+Before creating a training job, the users need to deploy the Python scripts and
+training data which have already been sliced on the precast path in the distributed file
+system(We can use the different type of Kuberentes Volumes to mount different distributed
+file system). Before start training, The program would copy the training data into the


"file system" -> "file systems"

putcn · 2018-04-09T20:29:27Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+Before creating a training job, the users need to deploy the Python scripts and
+training data which have already been sliced on the precast path in the distributed file
+system(We can use the different type of Kuberentes Volumes to mount different distributed
+file system). Before start training, The program would copy the training data into the


"Before start training" -> "Before training starts"
"would copy" -> "will copy"

putcn · 2018-04-09T20:31:08Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+![PaddlePaddle on Kubernetes Architecture](src/k8s-paddle-arch.png)
+
+The above figure describes a distributed training architecture which contains 3 nodes, each 
+Pod would mount a folder of the distributed file system to save training data and models


"would mount" -> "mounts"

putcn · 2018-04-09T20:31:59Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+
+The above figure describes a distributed training architecture which contains 3 nodes, each 
+Pod would mount a folder of the distributed file system to save training data and models
+by Kubernetes Volume. Kubernetes created 3 Pod for this training phase and scheduled these on


Pod -> Pods

putcn · 2018-04-09T20:33:23Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+The above figure describes a distributed training architecture which contains 3 nodes, each 
+Pod would mount a folder of the distributed file system to save training data and models
+by Kubernetes Volume. Kubernetes created 3 Pod for this training phase and scheduled these on
+3 nodes, each Pod has a PaddlePaddle container. After the containers have been created,


have been -> are

putcn · 2018-04-09T20:34:10Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+Pod would mount a folder of the distributed file system to save training data and models
+by Kubernetes Volume. Kubernetes created 3 Pod for this training phase and scheduled these on
+3 nodes, each Pod has a PaddlePaddle container. After the containers have been created,
+PaddlePaddle would start up the communication between PServer and Trainer and read training


would start -> starts

putcn · 2018-04-09T20:35:29Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+PaddlePaddle would start up the communication between PServer and Trainer and read training
+data for this training job.
+
+As the description above, we can start up a PaddlePaddle distributed training job on a ready


ready Kubernetes cluster -> Kubernetes ready cluster

putcn · 2018-04-09T20:35:50Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+data for this training job.
+
+As the description above, we can start up a PaddlePaddle distributed training job on a ready
+Kubernetes cluster as the following steps:


putcn · 2018-04-09T20:57:00Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+
+### Build a Docker Image
+
+PaddlePaddle Docker Image needs to support the runtime environment of `Paddle PServer` and


The training Docker image needs to package the paddle pserver and paddle trainer runtimes, as well as two more processes that run before we can kick off the training:

- Copying the training data into the container.
- Generating the initialization arguments for the paddle pserver and paddle training processes.

putcn · 2018-04-09T21:00:03Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+- Copy the training data into the container.
+- Generate the start arguments of `Paddle PServer` and `Paddle Training` process.
+
+Because of the official Docker Image `paddlepaddle/paddle:latest` has already included the


Since the official PaddlePaddle Docker image already has the runtimes we need, we'll take it as the base image and pack some additional scripts for the processes mentioned above to build our training image. For more detail, please see the following link:

putcn · 2018-04-09T22:05:30Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+docker push  [YOUR_REPO]/paddle:mypaddle
+```
+
+**[NOTE]**, in the above command arguments, `[YOUR_REPO]` representative your Docker repository,


representative -> represents

putcn · 2018-04-09T22:06:25Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+```
+
+**[NOTE]**, in the above command arguments, `[YOUR_REPO]` representative your Docker repository,
+you need to use your repository instead of it. We will use `[YOUR_REPO]/paddle:mypaddle` to


use your repository instead of it -> replace it with your repository name

putcn · 2018-04-09T22:08:56Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+### Prepare Training Data
+
+We can download and split the training job by creating a Kubernetes Job, or custom your image
+by editing [k8s_train](./src/k8s_train/README.md).


./src/k8s_train/README.md -> ./src/k8s_train/

putcn · 2018-04-09T22:34:26Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+by editing [k8s_train](./src/k8s_train/README.md).
+
+Before creating a Job, we need to bind a [persistenVolumeClaim](https://kubernetes.io/docs/user-guide/persistent-volumes) by the different type of
+the different distributed file system, the generated dataset would be saved on this volume.


the different distributed file system -> distributed file system

putcn · 2018-04-09T22:35:58Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+      restartPolicy: Never
+```
+
+If success, you can see some information like this:


Should we add "kubectl create -f XXX" here?

putcn · 2018-04-09T22:36:31Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+      restartPolicy: Never
+```
+
+If success, you can see some information like this:


If success -> If created successfully

putcn · 2018-04-09T22:37:43Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+```
+
+The `paddle-cluster-job` above is the job name for this training job; we need 3
+PaddlePaddle training node and save the split training data on `paddle-cluster-job` path,


training node -> training nodes
on -> in

putcn · 2018-04-09T22:38:29Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+
+The `paddle-cluster-job` above is the job name for this training job; we need 3
+PaddlePaddle training node and save the split training data on `paddle-cluster-job` path,
+the folder `0`, `1` and `2` representative the `training_id` on each node, `quick_start` folder is used to store training data, `output` folder is used to store the models and logs.


representative -> represents

putcn · 2018-04-09T22:49:58Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+
+### Create a Job
+
+Kubernetes allow users to create an object with YAML files, and we can use a command-line tool


an object -> objects

putcn · 2018-04-09T22:51:20Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+
+In the above YAML file:
+- `metadata.name`, The job name.
+- `parallelism`, The Kubernetes Job would create `parallelism` Pods at the same time.


The Kubernetes Job would create -> whether the Kubernetes Job would create

putcn · 2018-04-09T22:51:57Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+In the above YAML file:
+- `metadata.name`, The job name.
+- `parallelism`, The Kubernetes Job would create `parallelism` Pods at the same time.
+- `completions`, The Job would become the success status only the number of successful Pod(the exit code is 0)


only -> only when
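
To illustrate the `completions` semantics under discussion, here is a minimal sketch. `job_succeeded` is a hypothetical helper written for this comment, not Kubernetes code: the Job counts as successful only when the number of Pods that exited with code 0 reaches `completions`.

```python
def job_succeeded(exit_codes, completions):
    """Return True once enough Pods have exited with code 0.

    exit_codes: exit codes of the Pods that have finished so far.
    completions: value of the Job's `completions` field.
    """
    successful = sum(1 for code in exit_codes if code == 0)
    return successful >= completions
```

With `completions: 3`, two clean exits and one failure are not enough to mark the Job as successful.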

putcn · 2018-04-09T23:07:43Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+```
+
+Upon successful creation, Kubernetes would create 3 Pods as PaddlePaddle training node,
+, pull the Docker image and begin to train.


putcn · 2018-04-09T23:08:27Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+
+### Checkout the Output
+
+At the process of training, we can check the logs and the output models, such as we store


such as we store the output on output folder -> which is stored in the output folder

putcn · 2018-04-09T23:10:19Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+
+### Using Environment Variables
+
+Usually we use the environment varialbes to configurate the PaddlePaddle Job which running on


which running on -> which runs in

putcn · 2018-04-09T23:10:50Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+
+Usually we use the environment varialbes to configurate the PaddlePaddle Job which running on
+Kubernetes, `start_paddle.py` provides a start up script to convert the environment variable
+to the start up argument of PaddlePaddle process:


argument -> arguments
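
For readers of this thread, a rough sketch of what "converting environment variables to start-up arguments" can look like. The variable names and flags below are placeholders invented for illustration; the names actually used by `start_paddle.py` are not shown in this diff hunk.

```python
import os

# Hypothetical mapping from environment variable names to flag names.
ENV_TO_FLAG = {
    "EXAMPLE_TRAINERS": "--example_trainers",
    "EXAMPLE_PORT": "--example_port",
}


def env_to_args(env=None):
    """Build "--flag=value" strings for every mapped variable that is set."""
    env = os.environ if env is None else env
    return [
        f"{flag}={env[name]}"
        for name, flag in ENV_TO_FLAG.items()
        if name in env
    ]
```

Passing a plain dict instead of `os.environ` keeps the function easy to test outside a Pod.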

putcn · 2018-04-09T23:11:10Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+
+### Communication between Pods
+
+At the begin of `start_paddle.py`, it would initialize and parse the arguments.


would initialize -> initializes
parse -> parses

putcn · 2018-04-09T23:12:04Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+    idMap = getIdMap(podlist)
+```
+
+**NOTE**: `getPodList()` would fetch all the pod in the current namespace, if some Pods are running, may cause some error. We will use [statfulesets](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets) instead of


would fetch -> fetches
pod -> pods

if some Pods are running, may cause some error. -> if some pods are already running, it may cause some error.

putcn · 2018-04-09T23:16:25Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+**NOTE**: `getPodList()` would fetch all the pod in the current namespace, if some Pods are running, may cause some error. We will use [statfulesets](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets) instead of
+Kubernetes Pod or Replicaset in the future.
+
+For the implement of `getIdMap(podlist)`, this function would fetch each IP address of


For the implement of getIdMap(podlist), this function would fetch each IP address -> the function getIdMap(podlist) fetches the IP addresses
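
As context for this sentence, a minimal sketch of a function in this spirit. It assumes the pod list has the Kubernetes API JSON shape (`items[].status.podIP`) and that IDs are assigned by sorted IP order; both are assumptions here, and the name `get_id_map` is deliberately different from the real `getIdMap` in `start_paddle.py`, whose body is not shown in this hunk.

```python
def get_id_map(podlist):
    """Map each Pod IP to a stable integer ID (illustrative sketch).

    IPs are sorted as strings so that every Pod computes the same mapping
    independently, without any coordination between Pods.
    """
    ips = sorted(pod["status"]["podIP"] for pod in podlist["items"])
    return {ip: index for index, ip in enumerate(ips)}
```

Each Pod can then look up its own IP in the returned mapping to obtain its ID.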

putcn · 2018-04-09T23:17:28Z

doc/v2/howto/cluster/multi_cluster/k8s_distributed_en.md

+
+### Create Job
+
+The main goal of `startPaddle` is generating the arguments of `Paddle PServer` and `Paddle Trainer` processes. Such as `Paddle Trainer`, we parse the environment variable and then get


Such as Paddle Trainer -> take Paddle Trainer as an example

putcn · 2018-04-09T23:18:31Z

Great job @Yancey1989, I left some comments, please check.

Yancey1989 · 2018-04-10T11:50:48Z

Thanks for the review, @putcn! Updated according to the comments.

putcn

LGTM, thanks!

k8s dist train for en

d05071f

Yancey1989 requested review from putcn and typhoonzero April 9, 2018 07:58

shanyi15 added the translation label Apr 9, 2018

putcn reviewed Apr 9, 2018


update by comments

adaa9c5

Yancey1989 added 2 commits April 10, 2018 19:52

update by comments

4a22349

edit the title

875d48d

putcn approved these changes Apr 10, 2018


Yancey1989 merged commit 1d88ebe into PaddlePaddle:develop Apr 11, 2018

Yancey1989 deleted the k8s_dist_doc_en branch April 11, 2018 01:54


