k8s dist train for en #9789
TBD
We introduced how to create a PaddlePaddle Job with a single node on Kuberentes in the
previous document.
In this article, we will introduce how to craete a PaddlePaddle job with multiple nodes
craete -> create

Before creating a training job, the users need to deploy the Python scripts and
training data which have already been sliced on the precast path in the distributed file
system(We can use the different type of Kuberentes Volumes to mount different distributed
"the users need to deploy the Python scripts and training data which have already been sliced on the precast path in the distributed file system" is a little confusing to me, because "sliced" and "on" sound like they go together, but they do not (from what I can guess). How about "the users need to slice the training data and deploy the Python scripts along with it into the distributed file system"?
Also, I don't know why the path of the data should be precast or predefined.
Thanks @putcn. There should be only one distributed file system used in the k8s cluster, shared by all users, so we need to precast (predefine) the root path for each user, for example:
/pfs/home/example@email.com/xxx
Before creating a training job, the users need to deploy the Python scripts and
training data which have already been sliced on the precast path in the distributed file
system(We can use the different type of Kuberentes Volumes to mount different distributed
file system). Before start training, The program would copy the training data into the
"file system" -> "file systems"
"Before start training" -> "Before training starts"
"would copy' -> "will copy"
 | ||
|
||
The above figure describes a distributed training architecture which contains 3 nodes, each | ||
Pod would mount a folder of the distributed file system to save training data and models |
"would mount" -> "mounts"

The above figure describes a distributed training architecture which contains 3 nodes, each
Pod would mount a folder of the distributed file system to save training data and models
by Kubernetes Volume. Kubernetes created 3 Pod for this training phase and scheduled these on
Pod -> Pods
The above figure describes a distributed training architecture which contains 3 nodes, each
Pod would mount a folder of the distributed file system to save training data and models
by Kubernetes Volume. Kubernetes created 3 Pod for this training phase and scheduled these on
3 nodes, each Pod has a PaddlePaddle container. After the containers have been created,
have been -> are
Pod would mount a folder of the distributed file system to save training data and models
by Kubernetes Volume. Kubernetes created 3 Pod for this training phase and scheduled these on
3 nodes, each Pod has a PaddlePaddle container. After the containers have been created,
PaddlePaddle would start up the communication between PServer and Trainer and read training
would start -> starts
PaddlePaddle would start up the communication between PServer and Trainer and read training
data for this training job.

As the description above, we can start up a PaddlePaddle distributed training job on a ready
ready Kubernetes cluster -> Kubernetes ready cluster
data for this training job.

As the description above, we can start up a PaddlePaddle distributed training job on a ready
Kubernetes cluster as the following steps:
as -> with

### Build a Docker Image

PaddlePaddle Docker Image needs to support the runtime environment of `Paddle PServer` and
Training docker image needs to package the `paddle pserver` and `paddle trainer` runtimes, as well as two more processes before we can kick off the training:
- Copying the training data into the container.
- Generating the initialization arguments for the `paddle pserver` and `paddle training` processes.
- Copy the training data into the container.
- Generate the start arguments of `Paddle PServer` and `Paddle Training` process.

Because of the official Docker Image `paddlepaddle/paddle:latest` has already included the
Since the official PaddlePaddle Docker image already has the runtimes we need, we'll take it as the base image and pack some additional scripts for the processes mentioned above to build our training image. For more detail, please see the following link:
docker push [YOUR_REPO]/paddle:mypaddle
```

**[NOTE]**, in the above command arguments, `[YOUR_REPO]` representative your Docker repository,
representative -> represents
```

**[NOTE]**, in the above command arguments, `[YOUR_REPO]` representative your Docker repository,
you need to use your repository instead of it. We will use `[YOUR_REPO]/paddle:mypaddle` to
use your repository instead of it -> replace it with your repository name
### Prepare Training Data

We can download and split the training job by creating a Kubernetes Job, or custom your image
by editing [k8s_train](./src/k8s_train/README.md).
./src/k8s_train/README.md -> ./src/k8s_train/
by editing [k8s_train](./src/k8s_train/README.md).

Before creating a Job, we need to bind a [persistenVolumeClaim](https://kubernetes.io/docs/user-guide/persistent-volumes) by the different type of
the different distributed file system, the generated dataset would be saved on this volume.
the different distributed file system -> distributed file system
restartPolicy: Never
```

If success, you can see some information like this:
Should we add `kubectl create -f XXX` here?
If success -> If created successfully
```

The `paddle-cluster-job` above is the job name for this training job; we need 3
PaddlePaddle training node and save the split training data on `paddle-cluster-job` path,
training node -> training nodes
on -> in

The `paddle-cluster-job` above is the job name for this training job; we need 3
PaddlePaddle training node and save the split training data on `paddle-cluster-job` path,
the folder `0`, `1` and `2` representative the `training_id` on each node, `quick_start` folder is used to store training data, `output` folder is used to store the models and logs.
representative -> represents

### Create a Job

Kubernetes allow users to create an object with YAML files, and we can use a command-line tool
an object -> objects

In the above YAML file:
- `metadata.name`, The job name.
- `parallelism`, The Kubernetes Job would create `parallelism` Pods at the same time.
The Kubernetes Job would create -> whether the Kubernetes Job would create
In the above YAML file:
- `metadata.name`, The job name.
- `parallelism`, The Kubernetes Job would create `parallelism` Pods at the same time.
- `completions`, The Job would become the success status only the number of successful Pod(the exit code is 0)
only -> only when
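As a rough illustration of how `completions` relates to Pod exit codes, the sketch below models the success rule in plain Python. It does not call the Kubernetes API; the `job_succeeded` helper and the sample `job_spec` values are hypothetical and only mirror the fields discussed in the bullets above.

```python
# Hypothetical fragment of a Job spec, written as a dict for illustration only.
job_spec = {"parallelism": 3, "completions": 3}


def job_succeeded(exit_codes, completions):
    """Return True once at least `completions` Pods have exited with code 0."""
    successful = sum(1 for code in exit_codes if code == 0)
    return successful >= completions


# Two of three Pods exited with 0, so the Job has not yet succeeded.
print(job_succeeded([0, 0, 1], job_spec["completions"]))
# All three Pods exited with 0, so the Job counts as successful.
print(job_succeeded([0, 0, 0], job_spec["completions"]))
```

For a three-node training job, setting both fields to 3 means all three Pods start together and every one of them has to finish cleanly.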
```

Upon successful creation, Kubernetes would create 3 Pods as PaddlePaddle training node,
, pull the Docker image and begin to train.
extra ","

### Checkout the Output

At the process of training, we can check the logs and the output models, such as we store
such as we store the output on `output` folder -> which is stored in the `output` folder

### Using Environment Variables

Usually we use the environment varialbes to configurate the PaddlePaddle Job which running on
which running on -> which runs in

Usually we use the environment varialbes to configurate the PaddlePaddle Job which running on
Kubernetes, `start_paddle.py` provides a start up script to convert the environment variable
to the start up argument of PaddlePaddle process:
argument -> arguments

### Communication between Pods

At the begin of `start_paddle.py`, it would initialize and parse the arguments.
would initialize -> initializes
parse -> parses
idMap = getIdMap(podlist)
```

**NOTE**: `getPodList()` would fetch all the pod in the current namespace, if some Pods are running, may cause some error. We will use [statfulesets](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets) instead of
would fetch -> fetches
pod -> pods
if some Pods are running, may cause some error. -> if some pods are already running, it may cause some error.
**NOTE**: `getPodList()` would fetch all the pod in the current namespace, if some Pods are running, may cause some error. We will use [statfulesets](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets) instead of
Kubernetes Pod or Replicaset in the future.

For the implement of `getIdMap(podlist)`, this function would fetch each IP address of
For the implement of `getIdMap(podlist)`, this function would fetch each IP address -> the function `getIdMap(podlist)` fetches IP addresses
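A minimal sketch of the idea behind `getIdMap(podlist)`, assuming ids are assigned by sorting the Pod IP addresses and numbering them in order. The sorting rule, the `get_id_map` name, and the plain list-of-strings input are assumptions made for illustration; the function under review takes the Pod list fetched by `getPodList()`.

```python
import ipaddress


def get_id_map(pod_ips):
    """Map each Pod IP to an integer id (assumed scheme: sort, then enumerate)."""
    # Compare as IP addresses, not strings, so 10.0.0.10 sorts after 10.0.0.9.
    ordered = sorted(set(pod_ips), key=ipaddress.ip_address)
    return {ip: index for index, ip in enumerate(ordered)}


# Every Pod computes the same map from the same IP list, whatever the input order.
id_map = get_id_map(["10.0.0.10", "10.0.0.2", "10.0.0.9"])
print(id_map)
```

Because the result depends only on the set of IPs, each Pod can derive its own id locally without extra coordination, as long as all Pods see the same list.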

### Create Job

The main goal of `startPaddle` is generating the arguments of `Paddle PServer` and `Paddle Trainer` processes. Such as `Paddle Trainer`, we parse the environment variable and then get
Such as `Paddle Trainer` -> take `Paddle Trainer` as an example
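To make the environment-variable step concrete, here is a small sketch of turning environment variables into start-up arguments. The variable names (`EXAMPLE_TRAINER_ID`, `EXAMPLE_NUM_PASSES`), the flag names, and the `build_trainer_args` helper are all hypothetical; they are not taken from `start_paddle.py`.

```python
import os

# Hypothetical mapping from environment variable names to trainer flags.
ENV_TO_FLAG = {
    "EXAMPLE_TRAINER_ID": "trainer_id",
    "EXAMPLE_NUM_PASSES": "num_passes",
}


def build_trainer_args(env=None):
    """Turn the mapped environment variables into --flag=value arguments."""
    env = os.environ if env is None else env
    args = []
    # Iterate in sorted order so the generated command line is deterministic.
    for var, flag in sorted(ENV_TO_FLAG.items()):
        if var in env:
            args.append("--{}={}".format(flag, env[var]))
    return args


print(build_trainer_args({"EXAMPLE_TRAINER_ID": "2", "EXAMPLE_NUM_PASSES": "10"}))
```

Passing a dict instead of reading `os.environ` directly keeps the helper easy to test outside a Pod.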
Great job @Yancey1989, I left some comments, please check.
Thanks for @putcn's review! Updated according to the comments.
LGTM, thanks!
Fixed #8914