
pytorch job multiple node ddp #1713

Closed
Crazybean-lwb opened this issue Dec 27, 2022 · 14 comments

Comments

@Crazybean-lwb

I have used PyTorch DDP on multiple nodes. I have to run the same shell command on each node; the scripts are as follows:

# node1
>>> python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="10.198.189.10" \
    --master_port=22222 \
    mnmc_ddp_launch.py

# node 2
>>> python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr="10.198.189.10" \
    --master_port=22222 \
    mnmc_ddp_launch.py
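
(As an aside: newer PyTorch releases ship the same launcher as torchrun, which supersedes the now-deprecated torch.distributed.launch. A sketch of the node-0 command under that entry point, assuming the same addresses and flags:)

>>> torchrun \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="10.198.189.10" \
    --master_port=22222 \
    mnmc_ddp_launch.py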

I understand that the training-operator creates environment variables for DDP, like the following example:
[screenshot: the environment variables injected by the training-operator]

I have three nodes; my job YAML defines 1 master and 2 workers. My start command is as follows:

    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: repository/kubeflow/arrikto-playground/dimpo/ranzcr-dist:latest
              command: ["sh","-c",
                                  "python -m torch.distributed.launch 
                                   --nnodes=3 
                                   --nproc_per_node=8
                                   /home/jovyan/ddp/ddp-mul-gpu.py"
                                   ]
              imagePullPolicy: "Always"
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
                - mountPath: /home/jovyan
                  name: workspace-kaggle
              resources:
                limits:
                  memory: "10Gi"
                  cpu: "8"
                  nvidia.com/gpu: 8

A socket error occurs after some time:
[screenshot: socket error in the training logs]

I do not know how to define the job YAML correctly with multiple nodes and GPUs.

I have asked a similar question in another issue:
#1532 (comment)

@johnugeorge
Member

You can see this example - https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/simple.yaml

@Crazybean-lwb
Author

Crazybean-lwb commented Dec 28, 2022

> You can see this example - https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/simple.yaml

Yes, I tried that example in my first attempt, but it has no details about the usage of torch.distributed.launch.
I know how to run DDP with multiple workers and one GPU each from this example: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/mnist/v1/pytorch_job_mnist_nccl.yaml

@johnugeorge
Member

You don't need to set any variables like rank or master address; the controller will automatically set these for you. You can specify GPU resources and Kubernetes will schedule the pods onto the right nodes. If you need specific node affinity, you can add it to the PodSpec.
In https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/mnist/v1/pytorch_job_mnist_nccl.yaml, this spawns 1 master and 1 worker with 1 GPU each. If you need more workers, you can change the worker replicas: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/mnist/v1/pytorch_job_mnist_nccl.yaml#L23
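
As a rough illustration of that point, a minimal PyTorchJob sketch in the shape of the linked mnist NCCL example, with the worker replicas raised to 2. The image name and script path are placeholders, not taken from this thread, and note that no rank or master address is set anywhere:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-ddp-example          # placeholder name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: your-registry/your-ddp-image:latest   # placeholder image
              command: ["python", "/workspace/train.py"]   # placeholder script
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 2                    # scale the number of workers here
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: your-registry/your-ddp-image:latest   # placeholder image
              command: ["python", "/workspace/train.py"]   # placeholder script
              resources:
                limits:
                  nvidia.com/gpu: 1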

@Crazybean-lwb
Author

> You don't need to set any variables like rank or master address; the controller will automatically set these for you. You can specify GPU resources and Kubernetes will schedule the pods onto the right nodes. If you need specific node affinity, you can add it to the PodSpec. In https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/mnist/v1/pytorch_job_mnist_nccl.yaml, this spawns 1 master and 1 worker with 1 GPU each. If you need more workers, you can change the worker replicas: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/mnist/v1/pytorch_job_mnist_nccl.yaml#L23

Yes, I can add more workers via the worker replicas parameter. But if I set nvidia.com/gpu: 2 or more on the workers, only one GPU is actually used in each worker.

@johnugeorge
Member

That is not true. Each worker will have the same worker pod spec. Each worker will get the same number of GPUs that you have provided in the YAML.

@Crazybean-lwb
Author

> That is not true. Each worker will have the same worker pod spec. Each worker will get the same number of GPUs that you have provided in the YAML.

By "use" I mean the number of GPUs actually used by the training process.
For example: I set 2 GPUs for each worker, but only one GPU is running:
[screenshots: only one of the two GPUs shows utilization]

@johnugeorge
Member

Can you check the pod spec of the training job with kubectl get pods -o yaml? Do you see nvidia.com/gpu: 2 in the spec?
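
For a quick check without reading the whole YAML, a grep over the pod spec is enough (the pod name below is illustrative; list the actual pods with kubectl get pods):

# pod name is illustrative
kubectl get pod ddp-test-worker-0 -o yaml | grep "nvidia.com/gpu"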

@Crazybean-lwb
Author

Crazybean-lwb commented Jan 3, 2023

> kubectl get pods -o yaml

Yes, it does:

spec:
  containers:
  - command:
    - python
    - /home/jovyan/ddp.py
    env:
    - name: PYTHONUNBUFFERED
      value: "0"
    - name: MASTER_PORT
      value: "23456"
    - name: MASTER_ADDR
      value: ddp-test-master-0
    - name: WORLD_SIZE
      value: "2"
    - name: RANK
      value: "1"
    image: repository/kubeflow/arrikto-playground/dimpo/ranzcr-dist:latest
    imagePullPolicy: Always
    name: pytorch
    ports:
    - containerPort: 23456
      name: pytorchjob-port
      protocol: TCP
    resources:
      limits:
        cpu: "2"
        memory: 10Gi
        nvidia.com/gpu: "2"
      requests:
        cpu: "2"
        memory: 10Gi
        nvidia.com/gpu: "2"
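
For context, these injected variables (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK) are exactly what torch.distributed's env:// initialization reads, so launching the script with a plain python ddp.py starts a single process per pod, and a single process drives only one GPU. A minimal sketch of that single-process-per-pod pattern (illustrative, not the author's ddp.py):

import torch
import torch.distributed as dist

# init_method="env://" picks up MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK,
# which the training-operator injects into each pod.
dist.init_process_group(backend="nccl", init_method="env://")

# One process per pod means one rank per pod, so only one of the
# pod's GPUs is ever used, no matter how many were requested.
device = 0
torch.cuda.set_device(device)
model = torch.nn.Linear(10, 10).cuda(device)
ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[device])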

@johnugeorge
Member

Strange. How would the pod get scheduled on k8s if the requested resources were not met? Has the pod gone into the running state? Can you check whether this pod was scheduled on a different node?

See kubectl describe pod as well.

@zw0610
Member

zw0610 commented Jan 4, 2023

Maybe when using torch.distributed.launch to launch a distributed training job, the number of processes per worker needs to be specified via the command-line option (--nproc_per_node): https://github.com/pytorch/pytorch/blob/master/torch/distributed/launch.py#L33, which means replacing NUM_GPUS_YOU_HAVE with the number of GPUs on each worker Pod.

Meanwhile, as torch.distributed.launch came along much later than the training-operator (the PyTorch part), I think the compatibility check between the operator and the distributed launch module is indeed lacking.

@johnugeorge
Member

@zw0610 From the comment #1713 (comment), the command used is python /home/jovyan/ddp.py (not using launch). If I understand correctly, the issue raised is that multiple GPUs are not utilised even in our operator mode.

@cakeislife100
Contributor

cakeislife100 commented Jan 18, 2023

Hi @johnugeorge @Crazybean-lwb, did you find any workaround for this? I'm seeing the exact same issue with a similar setup (starting 2 workers, each with 2 GPUs, but each only using 1 GPU).

The code to replicate the issue is here (code, Dockerfile, and the PyTorchJob YAML): https://gist.github.com/cakeislife100/a7de3a480ed7c5c9005d04d21fd14069. It is adapted from a PyTorch DDP example.

I've attached a screenshot of only one of the GPUs being used, even though two are available.
[screenshot: only one of the two GPUs in use]

Is there something I'm missing here?

@eggiter

eggiter commented Feb 15, 2023

> Hi @johnugeorge @Crazybean-lwb, did you find any workaround for this? I'm seeing the exact same issue with a similar setup (starting 2 workers, each with 2 GPUs, but each only using 1 GPU).
>
> The code to replicate the issue is here (code, Dockerfile, and the PyTorchJob YAML): https://gist.github.com/cakeislife100/a7de3a480ed7c5c9005d04d21fd14069. It is adapted from a PyTorch DDP example.
>
> I've attached a screenshot of only one of the GPUs being used, even though two are available.
>
> Is there something I'm missing here?

  • WORLD_SIZE in the training-operator means the "number of instances", not the "total number of GPUs". You can use all the GPUs by executing: python -m torch.distributed.launch --nproc-per-node=NUM_GPUS_YOU_HAVE --nnodes=$WORLD_SIZE --node-rank=$RANK --master-addr=$MASTER_ADDR --master-port=$MASTER_PORT YOUR_TRAINING_SCRIPT.py

kubeflow/pytorch-operator#128 (comment)
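
(To make the arithmetic explicit: the launcher re-derives the DDP world size as nnodes x nproc_per_node, so with WORLD_SIZE=2 pods and 2 GPUs per pod it spawns 2 processes in each pod and the job ends up with 2 x 2 = 4 ranks, one per GPU.)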

@Crazybean-lwb
Author

Thank you for your reply; I solved my problem according to your solution:

python -m torch.distributed.launch --nnodes=$WORLD_SIZE --nproc_per_node=NUM_GPUS_YOU_HAVE --node_rank=$RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT YOUR_TRAINING_SCRIPT.py
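
For anyone landing here later, a sketch of how this command can be wired into the job spec. The GPU count of 2 per pod is illustrative (nproc_per_node should match the requested GPUs), the image and script path are copied from the earlier YAML in this thread, the operator injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK at runtime, and the same command would typically also go under the Master replica spec (the master gets RANK=0):

    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: repository/kubeflow/arrikto-playground/dimpo/ranzcr-dist:latest
              command: ["sh", "-c"]
              args:
                - >-
                  python -m torch.distributed.launch
                  --nnodes=$WORLD_SIZE
                  --nproc_per_node=2
                  --node_rank=$RANK
                  --master_addr=$MASTER_ADDR
                  --master_port=$MASTER_PORT
                  /home/jovyan/ddp/ddp-mul-gpu.py
              resources:
                limits:
                  nvidia.com/gpu: 2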
