pytorch job multiple node ddp #1713
You can see this example: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/simple.yaml
Yes, I have already tried that example as my first attempt. There is no detail about the usage of …
You don't need to set any variables like rank or master address; the controller will automatically set these variables for you. You can specify GPU resources and Kubernetes will schedule the pods on the right nodes. If you need specific node affinity, you can add it to the PodSpec.
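For reference, here is a minimal sketch of such a spec, modeled on the linked simple.yaml; the job name, image, and script path are placeholders, not taken from this issue. Note that no RANK, MASTER_ADDR, MASTER_PORT, or WORLD_SIZE variables are set by hand, since the operator injects them:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: ddp-example                                  # placeholder name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch                            # container name expected by the operator
            image: your-registry/ddp-image:latest    # placeholder image
            command: ["python", "/workspace/ddp.py"] # placeholder script path
            resources:
              limits:
                nvidia.com/gpu: 2
    Worker:
      replicas: 2                                    # number of worker pods
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: your-registry/ddp-image:latest
            command: ["python", "/workspace/ddp.py"]
            resources:
              limits:
                nvidia.com/gpu: 2
```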
Yes, I can add more workers via params …
That is not true. Each worker will have the same pod spec. Each worker will have the same number of GPUs that you provided in the YAML.
Can you check the pod spec of the training jobs?
Yes, it does:

```yaml
spec:
  containers:
  - command:
    - python
    - /home/jovyan/ddp.py
    env:
    - name: PYTHONUNBUFFERED
      value: "0"
    - name: MASTER_PORT
      value: "23456"
    - name: MASTER_ADDR
      value: ddp-test-master-0
    - name: WORLD_SIZE
      value: "2"
    - name: RANK
      value: "1"
    image: repository/kubeflow/arrikto-playground/dimpo/ranzcr-dist:latest
    imagePullPolicy: Always
    name: pytorch
    ports:
    - containerPort: 23456
      name: pytorchjob-port
      protocol: TCP
    resources:
      limits:
        cpu: "2"
        memory: 10Gi
        nvidia.com/gpu: "2"
      requests:
        cpu: "2"
        memory: 10Gi
        nvidia.com/gpu: "2"
```
Strange. How would the pod get scheduled on Kubernetes if the requested resources were not met? Has the pod gone into the Running state? Can you check whether this pod was scheduled on a different node? See `kubectl describe pod` as well.
Maybe when using … Meanwhile, as …
@zw0610 From the comment #1713 (comment), the command used is …
Hi @johnugeorge @Crazybean-lwb, did you find any workaround for this? I'm seeing the exact same issue with a similar setup (starting 2 workers, each with 2 GPUs, but each is only using 1 GPU). The code to replicate the issue is here (code, Dockerfile, and the PyTorchJob YAML): https://gist.github.com/cakeislife100/a7de3a480ed7c5c9005d04d21fd14069. It is adapted from a PyTorch DDP example. I've attached a screenshot of only one of the GPUs being used, even though two are available. Is there something I'm missing here?
Thank you for your reply; I solved my problem by following your solution:

```shell
python -m torch.distributed.launch --nnodes=$WORLD_SIZE --nproc_per_node=NUM_GPUS_YOU_HAVE --node_rank=$RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT YOUR_TRAINING_SCRIPT.py
```
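For completeness, here is a sketch (not taken from this thread) of how that launch command could be wired into the PyTorchJob container command, assuming 2 GPUs per pod and a placeholder script path; the operator-injected variables are expanded by the shell at start-up:

```yaml
# Container command inside the Master/Worker pod template (sketch).
command:
- sh
- -c
- >-
  python -m torch.distributed.launch
  --nnodes=$WORLD_SIZE
  --nproc_per_node=2
  --node_rank=$RANK
  --master_addr=$MASTER_ADDR
  --master_port=$MASTER_PORT
  /workspace/ddp.py
```

torch.distributed.launch then spawns one process per GPU and sets RANK, LOCAL_RANK, and WORLD_SIZE for each child process, so every pod can use all of the GPUs it requested.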
I have used PyTorch DDP on multiple nodes before. I have to run the same shell command on each node; the script is as follows: …
I understand that training-operator creates the environment variables for DDP, like in the following example: …
I have three nodes, and my job YAML defines 1 master and 2 workers. My start command is as follows: …
A socket error occurs after some time.
I do not know how to define the job YAML correctly with multiple nodes and GPUs.
I have asked a similar question in another issue:
#1532 (comment)