PaddlePaddle Job

Run PaddlePaddle distributed training jobs on a Kubernetes cluster.

Usage

Prepare Training Data

You can materialize a distributed dataset from a reader function. For example:

def dataset_from_reader(filename, reader):
    # Serialize every sample produced by the reader as one CSV line,
    # so the file can later be sharded across trainers line by line.
    with open(filename, "w") as f:
        for batch_id, batch_data in enumerate(reader()):
            batch_data_str = [str(d) for d in batch_data]
            f.write(",".join(batch_data_str))
            f.write("\n")

A complete example for the imikolov dataset is here.
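
As a hedged sketch, assuming the classic paddle.dataset.imikolov module (build_dict and train) is available, the helper above could dump the imikolov training set to a CSV file like this:

import paddle.dataset.imikolov as imikolov

# Assumption: imikolov.train(word_dict, n) returns a reader whose
# samples are tuples of word ids, which dataset_from_reader writes
# out one CSV line per sample.
word_dict = imikolov.build_dict()
dataset_from_reader("train.csv", imikolov.train(word_dict, 5))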

Submit PaddleJob with Python Code

If you haven't configured kubectl yet, please follow the tutorial first.

  • Fetch runtime information (see the sketch below):

    • trainer id: the unique id of each trainer; read the current trainer id from the environment variable TRAINER_ID
    • trainer count: the number of trainer processes; read it from the environment variable TRAINERS
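
    A minimal sketch, assuming the two environment variables above are injected into each trainer container by the PaddleJob runtime:

    import os

    # Defaults are only for running outside of Kubernetes.
    trainer_id = int(os.getenv("TRAINER_ID", "0"))
    trainers = int(os.getenv("TRAINERS", "1"))
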
  • Dist Reader Interface

    You can implement a dist_reader to read data while the trainer is running on Kubernetes. An example implementation of a dist reader creator:

    def dist_reader(filename, trainers, trainer_id):
        def dist_reader_creator():
            # Shard the file round-robin: each trainer yields only the
            # lines whose (1-based) index maps to its trainer id.
            with open(filename) as f:
                for cnt, line in enumerate(f, 1):
                    if cnt % trainers == trainer_id:
                        csv_data = [int(cell) for cell in line.split(",")]
                        yield tuple(csv_data)
        return dist_reader_creator

    NOTE: You can read files from the CephFS mount under the directory /data/...
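
    A hypothetical usage, combining the runtime variables from the sketch above with dist_reader (the dataset path is illustrative):

    # Each trainer iterates only over its own shard of the CSV file.
    reader = dist_reader("/data/word2vec/train.csv", trainers, trainer_id)
    for sample in reader():
        print(sample)  # one tuple of ints per CSV line
        break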

  • Create a PaddleJob instance

    import paddle.job as job

    paddle_job = job.PaddleJob(
        runtime_image="yancey1989/paddle-job",
        job_name="paddle-job",
        cpu_nums=3,
        trainer_package="/example/word2vec",
        entry_point="python train.py",
        cephfs_volume=job.CephFSVolume(
            monitors_addr="172.19.32.166:6789"
        ))
  • Call job.dist_train to submit the PaddleJob

    job.dist_train(
        trainer=dist_trainer(),
        paddle_job=paddle_job)

    • trainer is a trainer function that returns a trainer creator, for example (a fuller hedged sketch follows below):

      def dist_trainer():
          def trainer_creator():
              ...
          return trainer_creator
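
    A hedged sketch of what trainer_creator might do, wiring the dist_reader above into a training loop; paddle.batch and the elided network/training code are assumptions about the PaddlePaddle v2 API of the time, not part of this repository:

      import os
      import paddle.v2 as paddle  # assumption: classic v2 API

      def dist_trainer():
          def trainer_creator():
              trainer_id = int(os.getenv("TRAINER_ID", "0"))
              trainers = int(os.getenv("TRAINERS", "1"))
              # Shard the dataset across trainers, then batch it.
              reader = dist_reader("/data/word2vec/train.csv",
                                   trainers, trainer_id)
              batched = paddle.batch(reader, batch_size=32)
              # ... build the network, create a trainer, and call
              # trainer.train(reader=batched, ...)
          return trainer_creator
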
  • Build the Runtime Docker Image on a Base Docker Image

    You can build a runtime Docker image with the tool ./tools/build_docker.sh, for example:

    ./tools/build_docker.sh <src_trainer_package> <dest_trainer_package> <base Docker image> <runtime Docker image>

    • src_trainer_package: the trainer package directory on your host.
    • dest_trainer_package: an absolute path inside the image; src_trainer_package is copied to this path in the image's filesystem.
    • base Docker image: usually the PaddlePaddle production Docker image, which includes the paddle binaries and Python packages; you can also specify any image hosted on a Docker registry that you have access to.
    • runtime Docker image: your trainer package files are packaged into the runtime Docker image on top of the base Docker image. Example:

    ./tools/build_docker.sh ./example/ /example paddlepaddle/paddle yancey1989/paddle-job
  • Push the Runtime Docker Image

    You can push your runtime Docker image to a Docker registry server:

    docker push <runtime Docker image>

    Example:

    docker push yancey1989/paddle-job
  • Submit Distributed Job

    docker run --rm -it -v $HOME/.kube/config:/root/.kube/config <runtime image name> <entry point>

    Example:

    docker run --rm -it -v $HOME/.kube/config:/root/.kube/config yancey1989/paddle-job python /example/train.py

PaddlePaddle Job Configuration

PaddleJob parameters

  • Required Parameters

    parameter       type     explanation
    job_name        string   the unique name of the training job
    entry_point     string   the entry point that starts the trainer process
    memory          string   memory allocated to the job; a plain integer with one of the suffixes E, P, T, G, M, K
    cpu_nums        int      CPU count for the job
    runtime_image   string   runtime Docker image
  • Advanced Parameters

    parameter       type           default   explanation
    pservers        int            2         parameter server process count
    trainers        int            3         trainer process count
    gpu_nums        int            0         GPU count for the job
    cephfs_volume   CephFSVolume   None      CephFS volume configuration
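
A hedged sketch combining the required parameters above into a PaddleJob (the values are illustrative, not prescriptive):

    import paddle.job as job

    # All five required parameters; memory uses one of the suffixes
    # listed above (here 1G).
    paddle_job = job.PaddleJob(
        job_name="word2vec",
        entry_point="python /example/word2vec/train.py",
        memory="1G",
        cpu_nums=3,
        runtime_image="yancey1989/paddle-job")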

CephFSVolume parameters

  • Required Parameters

    parameter       type     explanation
    monitors_addr   string   the address of the Ceph cluster monitors

  • Advanced Parameters

    parameter     type     default         explanation
    user          string   admin           Ceph cluster user name
    secret_name   string   cephfs-secret   the name of the Kubernetes Secret holding the Ceph cluster secret
    mount_path    string   /data           CephFS mount path inside the Pod
    path          string   /               CephFS path to mount
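
A hedged sketch of a CephFSVolume with the defaults above spelled out (the monitors address is the one used earlier in this README):

    # Only monitors_addr is required; the rest show the defaults.
    cephfs = job.CephFSVolume(
        monitors_addr="172.19.32.166:6789",
        user="admin",
        secret_name="cephfs-secret",
        mount_path="/data",
        path="/")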
