
pytorch job multiple node ddp #1713

Closed
Crazybean-lwb opened this issue Dec 27, 2022 · 14 comments

Comments

@Crazybean-lwb

I have used PyTorch DDP on multiple nodes. I have to run the same shell command on each node; the scripts are as follows:

# node1
>>> python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="10.198.189.10" \
    --master_port=22222 \
    mnmc_ddp_launch.py

# node 2
>>> python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr="10.198.189.10" \
    --master_port=22222 \
    mnmc_ddp_launch.py
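
(As an aside: newer PyTorch releases ship the same launcher as torchrun, which supersedes the now-deprecated torch.distributed.launch. A sketch of the node-0 command under that entry point, assuming the same addresses and flags:)

>>> torchrun \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="10.198.189.10" \
    --master_port=22222 \
    mnmc_ddp_launch.py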

I understand that the training-operator creates environment variables for DDP, like the following example:
[screenshot: the environment variables injected by the training-operator]

I have three nodes; my job YAML defines 1 master and 2 workers. My start command is as follows:

    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: repository/kubeflow/arrikto-playground/dimpo/ranzcr-dist:latest
              command: ["sh","-c",
                                  "python -m torch.distributed.launch 
                                   --nnodes=3 
                                   --nproc_per_node=8
                                   /home/jovyan/ddp/ddp-mul-gpu.py"
                                   ]
              imagePullPolicy: "Always"
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
                - mountPath: /home/jovyan
                  name: workspace-kaggle
              resources:
                limits:
                  memory: "10Gi"
                  cpu: "8"
                  nvidia.com/gpu: 8

A socket error occurs after some time:
[screenshot: socket error in the training logs]

I do not know how to define the job YAML correctly with multiple nodes and GPUs.

I have asked a similar question in another issue:
#1532 (comment)

@johnugeorge
Member

You can see this example - https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/simple.yaml

@Crazybean-lwb
Author

Crazybean-lwb commented Dec 28, 2022

> You can see this example - https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/simple.yaml

Yes, I tried that example in my first attempt, but it has no details about the usage of torch.distributed.launch.
I know how to run DDP with multiple workers and one GPU each from this example: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/mnist/v1/pytorch_job_mnist_nccl.yaml

@johnugeorge
Member

You don't need to set any variables like rank or master address; the controller will automatically set these for you. You can specify GPU resources and Kubernetes will schedule the pods onto the right nodes. If you need specific node affinity, you can add it to the PodSpec.
In https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/mnist/v1/pytorch_job_mnist_nccl.yaml, this spawns 1 master and 1 worker with 1 GPU each. If you need more workers, you can change the worker replicas: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/mnist/v1/pytorch_job_mnist_nccl.yaml#L23
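
As a rough illustration of that point, a minimal PyTorchJob sketch in the shape of the linked mnist NCCL example, with the worker replicas raised to 2. The image name and script path are placeholders, not taken from this thread, and note that no rank or master address is set anywhere:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-ddp-example          # placeholder name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: your-registry/your-ddp-image:latest   # placeholder image
              command: ["python", "/workspace/train.py"]   # placeholder script
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 2                    # scale the number of workers here
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: your-registry/your-ddp-image:latest   # placeholder image
              command: ["python", "/workspace/train.py"]   # placeholder script
              resources:
                limits:
                  nvidia.com/gpu: 1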

@Crazybean-lwb
Author

> You don't need to set any variables like rank or master address; the controller will automatically set these for you. You can specify GPU resources and Kubernetes will schedule the pods onto the right nodes. If you need specific node affinity, you can add it to the PodSpec. In https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/mnist/v1/pytorch_job_mnist_nccl.yaml, this spawns 1 master and 1 worker with 1 GPU each. If you need more workers, you can change the worker replicas: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/mnist/v1/pytorch_job_mnist_nccl.yaml#L23

Yes, I can add more workers via the worker replicas parameter. But if I set nvidia.com/gpu: 2 or more on the workers, only one GPU is actually used in each worker.

@johnugeorge
Member

That is not true. Each worker will have the same worker pod spec. Each worker will get the same number of GPUs that you have provided in the YAML.

@Crazybean-lwb
Author

> That is not true. Each worker will have the same worker pod spec. Each worker will get the same number of GPUs that you have provided in the YAML.

By "use" I mean the number of GPUs actually used by the training process.
For example: I set 2 GPUs for each worker, but only one GPU is running:
[screenshots: only one of the two GPUs shows utilization]

@johnugeorge
Member

Can you check the pod spec of the training job with kubectl get pods -o yaml? Do you see nvidia.com/gpu: 2 in the spec?
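
For a quick check without reading the whole YAML, a grep over the pod spec is enough (the pod name below is illustrative; list the actual pods with kubectl get pods):

# pod name is illustrative
kubectl get pod ddp-test-worker-0 -o yaml | grep "nvidia.com/gpu"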

@Crazybean-lwb
Author

Crazybean-lwb commented Jan 3, 2023

> kubectl get pods -o yaml

Yes, it does:

spec:
  containers:
  - command:
    - python
    - /home/jovyan/ddp.py
    env:
    - name: PYTHONUNBUFFERED
      value: "0"
    - name: MASTER_PORT
      value: "23456"
    - name: MASTER_ADDR
      value: ddp-test-master-0
    - name: WORLD_SIZE
      value: "2"
    - name: RANK
      value: "1"
    image: repository/kubeflow/arrikto-playground/dimpo/ranzcr-dist:latest
    imagePullPolicy: Always
    name: pytorch
    ports:
    - containerPort: 23456
      name: pytorchjob-port
      protocol: TCP
    resources:
      limits:
        cpu: "2"
        memory: 10Gi
        nvidia.com/gpu: "2"
      requests:
        cpu: "2"
        memory: 10Gi
        nvidia.com/gpu: "2"
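
For context, these injected variables (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK) are exactly what torch.distributed's env:// initialization reads, so launching the script with a plain python ddp.py starts a single process per pod, and a single process drives only one GPU. A minimal sketch of that single-process-per-pod pattern (illustrative, not the author's ddp.py):

import torch
import torch.distributed as dist

# init_method="env://" picks up MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK,
# which the training-operator injects into each pod.
dist.init_process_group(backend="nccl", init_method="env://")

# One process per pod means one rank per pod, so only one of the
# pod's GPUs is ever used, no matter how many were requested.
device = 0
torch.cuda.set_device(device)
model = torch.nn.Linear(10, 10).cuda(device)
ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[device])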

@johnugeorge
Member

Strange. How would the pod get scheduled on k8s if the requested resources were not met? Has the pod gone into the running state? Can you check whether this pod was scheduled on a different node?

See kubectl describe pod as well.

@zw0610
Member

zw0610 commented Jan 4, 2023

Maybe when using torch.distributed.launch to launch a distributed training job, the number of processes per worker needs to be specified via the command-line option (--nproc_per_node): https://github.com/pytorch/pytorch/blob/master/torch/distributed/launch.py#L33, which means replacing NUM_GPUS_YOU_HAVE with the number of GPUs on each worker Pod.

Meanwhile, as torch.distributed.launch came along much later than the training-operator (the PyTorch part), I think the compatibility check between the operator and the distributed launch module is indeed lacking.

@johnugeorge
Member

@zw0610 From the comment #1713 (comment), the command used is python /home/jovyan/ddp.py (not using launch). If I understand correctly, the issue raised is that multiple GPUs are not utilised even in our operator mode.

@cakeislife100
Contributor

cakeislife100 commented Jan 18, 2023

Hi @johnugeorge @Crazybean-lwb, did you find any workaround for this? I'm seeing the exact same issue with a similar setup (starting 2 workers, each with 2 GPUs, but each only using 1 GPU).

The code to replicate the issue is here (code, Dockerfile, and the PyTorchJob YAML): https://gist.github.com/cakeislife100/a7de3a480ed7c5c9005d04d21fd14069. It is adapted from a PyTorch DDP example.

I've attached a screenshot of only one of the GPUs being used, even though two are available.
[screenshot: only one of the two GPUs in use]

Is there something I'm missing here?

@eggiter

eggiter commented Feb 15, 2023

> Hi @johnugeorge @Crazybean-lwb, did you find any workaround for this? I'm seeing the exact same issue with a similar setup (starting 2 workers, each with 2 GPUs, but each only using 1 GPU).
>
> The code to replicate the issue is here (code, Dockerfile, and the PyTorchJob YAML): https://gist.github.com/cakeislife100/a7de3a480ed7c5c9005d04d21fd14069. It is adapted from a PyTorch DDP example.
>
> I've attached a screenshot of only one of the GPUs being used, even though two are available.
>
> Is there something I'm missing here?

  • WORLD_SIZE in the training-operator means the "number of instances", not the "total number of GPUs". You can use all the GPUs by executing: python -m torch.distributed.launch --nproc-per-node=NUM_GPUS_YOU_HAVE --nnodes=$WORLD_SIZE --node-rank=$RANK --master-addr=$MASTER_ADDR --master-port=$MASTER_PORT YOUR_TRAINING_SCRIPT.py

kubeflow/pytorch-operator#128 (comment)
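
(To make the arithmetic explicit: the launcher re-derives the DDP world size as nnodes x nproc_per_node, so with WORLD_SIZE=2 pods and 2 GPUs per pod it spawns 2 processes in each pod and the job ends up with 2 x 2 = 4 ranks, one per GPU.)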

@Crazybean-lwb
Author

Thank you for your reply; I solved my problem according to your solution:

python -m torch.distributed.launch --nnodes=$WORLD_SIZE --nproc_per_node=NUM_GPUS_YOU_HAVE --node_rank=$RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT YOUR_TRAINING_SCRIPT.py
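
For anyone landing here later, a sketch of how this command can be wired into the job spec. The GPU count of 2 per pod is illustrative (nproc_per_node should match the requested GPUs), the image and script path are copied from the earlier YAML in this thread, the operator injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK at runtime, and the same command would typically also go under the Master replica spec (the master gets RANK=0):

    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: repository/kubeflow/arrikto-playground/dimpo/ranzcr-dist:latest
              command: ["sh", "-c"]
              args:
                - >-
                  python -m torch.distributed.launch
                  --nnodes=$WORLD_SIZE
                  --nproc_per_node=2
                  --node_rank=$RANK
                  --master_addr=$MASTER_ADDR
                  --master_port=$MASTER_PORT
                  /home/jovyan/ddp/ddp-mul-gpu.py
              resources:
                limits:
                  nvidia.com/gpu: 2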
