PaddlePaddle Job

Run PaddlePaddle distributed training jobs on a Kubernetes cluster.

Usage

Prepare Training Data

You can materialize a distributed dataset from a reader function. For example:

def dataset_from_reader(filename, reader):
    # Serialize every sample produced by the reader as one CSV line,
    # so the file can later be sharded across trainers line by line.
    with open(filename, "w") as f:
        for batch_id, batch_data in enumerate(reader()):
            batch_data_str = [str(d) for d in batch_data]
            f.write(",".join(batch_data_str))
            f.write("\n")

A complete example for the imikolov dataset is here.
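
As a hedged sketch, assuming the classic paddle.dataset.imikolov module (build_dict and train) is available, the helper above could dump the imikolov training set to a CSV file like this:

import paddle.dataset.imikolov as imikolov

# Assumption: imikolov.train(word_dict, n) returns a reader whose
# samples are tuples of word ids, which dataset_from_reader writes
# out one CSV line per sample.
word_dict = imikolov.build_dict()
dataset_from_reader("train.csv", imikolov.train(word_dict, 5))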

Submit PaddleJob with Python Code

If you haven't configured kubectl yet, please follow the tutorial first.

  • Fetch runtime information (see the sketch below):

    • trainer id: the unique id of each trainer; read the current trainer id from the environment variable TRAINER_ID
    • trainer count: the number of trainer processes; read it from the environment variable TRAINERS
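
    A minimal sketch, assuming the two environment variables above are injected into each trainer container by the PaddleJob runtime:

    import os

    # Defaults are only for running outside of Kubernetes.
    trainer_id = int(os.getenv("TRAINER_ID", "0"))
    trainers = int(os.getenv("TRAINERS", "1"))
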
  • Dist Reader Interface

    You can implement a dist_reader to read data while the trainer is running on Kubernetes. An example implementation of a dist reader creator:

    def dist_reader(filename, trainers, trainer_id):
        def dist_reader_creator():
            # Shard the file round-robin: each trainer yields only the
            # lines whose (1-based) index maps to its trainer id.
            with open(filename) as f:
                for cnt, line in enumerate(f, 1):
                    if cnt % trainers == trainer_id:
                        csv_data = [int(cell) for cell in line.split(",")]
                        yield tuple(csv_data)
        return dist_reader_creator

    NOTE: You can read files from the CephFS mount under the directory /data/...
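
    A hypothetical usage, combining the runtime variables from the sketch above with dist_reader (the dataset path is illustrative):

    # Each trainer iterates only over its own shard of the CSV file.
    reader = dist_reader("/data/word2vec/train.csv", trainers, trainer_id)
    for sample in reader():
        print(sample)  # one tuple of ints per CSV line
        break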

  • Create a PaddleJob instance

    import paddle.job as job

    paddle_job = job.PaddleJob(
        runtime_image="yancey1989/paddle-job",
        job_name="paddle-job",
        cpu_nums=3,
        trainer_package="/example/word2vec",
        entry_point="python train.py",
        cephfs_volume=job.CephFSVolume(
            monitors_addr="172.19.32.166:6789"
        ))
  • Call job.dist_train to submit the PaddleJob

    job.dist_train(
        trainer=dist_trainer(),
        paddle_job=paddle_job)

    • trainer is a trainer function that returns a trainer creator, for example (a fuller hedged sketch follows below):

      def dist_trainer():
          def trainer_creator():
              ...
          return trainer_creator
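
    A hedged sketch of what trainer_creator might do, wiring the dist_reader above into a training loop; paddle.batch and the elided network/training code are assumptions about the PaddlePaddle v2 API of the time, not part of this repository:

      import os
      import paddle.v2 as paddle  # assumption: classic v2 API

      def dist_trainer():
          def trainer_creator():
              trainer_id = int(os.getenv("TRAINER_ID", "0"))
              trainers = int(os.getenv("TRAINERS", "1"))
              # Shard the dataset across trainers, then batch it.
              reader = dist_reader("/data/word2vec/train.csv",
                                   trainers, trainer_id)
              batched = paddle.batch(reader, batch_size=32)
              # ... build the network, create a trainer, and call
              # trainer.train(reader=batched, ...)
          return trainer_creator
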
  • Build the Runtime Docker Image on a Base Docker Image

    You can build a runtime Docker image with the tool ./tools/build_docker.sh, for example:

    ./tools/build_docker.sh <src_trainer_package> <dest_trainer_package> <base Docker image> <runtime Docker image>

    • src_trainer_package: the trainer package directory on your host.
    • dest_trainer_package: an absolute path inside the image; src_trainer_package is copied to this path in the image's filesystem.
    • base Docker image: usually the PaddlePaddle production Docker image, which includes the paddle binaries and Python packages; you can also specify any image hosted on a Docker registry that you have access to.
    • runtime Docker image: your trainer package files are packaged into the runtime Docker image on top of the base Docker image. Example:

    ./tools/build_docker.sh ./example/ /example paddlepaddle/paddle yancey1989/paddle-job
  • Push the Runtime Docker Image

    You can push your runtime Docker image to a Docker registry server:

    docker push <runtime Docker image>

    Example:

    docker push yancey1989/paddle-job
  • Submit Distributed Job

    docker run --rm -it -v $HOME/.kube/config:/root/.kube/config <runtime image name> <entry point>

    Example:

    docker run --rm -it -v $HOME/.kube/config:/root/.kube/config yancey1989/paddle-job python /example/train.py

PaddlePaddle Job Configuration

PaddleJob parameters

  • Required Parameters

    parameter       type     explanation
    job_name        string   the unique name of the training job
    entry_point     string   the entry point that starts the trainer process
    memory          string   memory allocated to the job; a plain integer with one of the suffixes E, P, T, G, M, K
    cpu_nums        int      CPU count for the job
    runtime_image   string   runtime Docker image
  • Advanced Parameters

    parameter       type           default   explanation
    pservers        int            2         parameter server process count
    trainers        int            3         trainer process count
    gpu_nums        int            0         GPU count for the job
    cephfs_volume   CephFSVolume   None      CephFS volume configuration
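
A hedged sketch combining the required parameters above into a PaddleJob (the values are illustrative, not prescriptive):

    import paddle.job as job

    # All five required parameters; memory uses one of the suffixes
    # listed above (here 1G).
    paddle_job = job.PaddleJob(
        job_name="word2vec",
        entry_point="python /example/word2vec/train.py",
        memory="1G",
        cpu_nums=3,
        runtime_image="yancey1989/paddle-job")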

CephFSVolume parameters

  • Required Parameters

    parameter       type     explanation
    monitors_addr   string   the address of the Ceph cluster monitors

  • Advanced Parameters

    parameter     type     default         explanation
    user          string   admin           Ceph cluster user name
    secret_name   string   cephfs-secret   the name of the Kubernetes Secret holding the Ceph cluster secret
    mount_path    string   /data           CephFS mount path inside the Pod
    path          string   /               CephFS path to mount
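
A hedged sketch of a CephFSVolume with the defaults above spelled out (the monitors address is the one used earlier in this README):

    # Only monitors_addr is required; the rest show the defaults.
    cephfs = job.CephFSVolume(
        monitors_addr="172.19.32.166:6789",
        user="admin",
        secret_name="cephfs-secret",
        mount_path="/data",
        path="/")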
