How to set up DDP correctly if processes are created externally & CUDA_VISIBLE_DEVICES is set
#13736
Unanswered
yongsiang-fb asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment 4 replies
-
Do you mean each node has 1 GPU visible? If yes, in such a case, ...
-
Hi,

I am trying to get PyTorch Lightning to work within a certain cluster environment. In particular, the DDP processes are created externally, and additionally, `CUDA_VISIBLE_DEVICES` is set by the cluster manager so that only 1 device is visible, which is the device the process is supposed to use.

I found that in this situation, if I define a subclass of `ClusterEnvironment` where `local_rank` is set as the real local rank of the process, an exception is thrown because PyTorch Lightning attempts to access `self.parallel_devices[self.local_rank]`, but there is only 1 device present in `self.parallel_devices` because of `CUDA_VISIBLE_DEVICES`.

What would be the best approach to make it work? Should I implement my own strategy class to override the behavior of `self.parallel_devices[self.local_rank]`?

Thanks a lot!
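
To make the index mismatch concrete, here is a minimal sketch in plain Python that does not import Lightning. The helpers `count_visible_devices` and `pick_device_index` are hypothetical names used only for illustration, and the list `parallel_devices` merely stands in for the attribute of the same name mentioned above. The sketch assumes that, when the cluster manager exposes exactly one device per process, that device always sits at index 0 regardless of the process's real local rank.

```python
def count_visible_devices(cuda_visible_devices: str) -> int:
    """Count the entries in a CUDA_VISIBLE_DEVICES-style string (hypothetical helper)."""
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])


def pick_device_index(local_rank: int, num_visible: int) -> int:
    """Map a local rank to an index into the visible devices (hypothetical helper).

    Assumption: with exactly one visible device, that device belongs to this
    process, so the index is 0 whatever the real local rank is.
    """
    if num_visible < 1:
        raise ValueError("no visible devices")
    if num_visible == 1:
        return 0
    if not 0 <= local_rank < num_visible:
        raise IndexError(f"local_rank {local_rank} out of range for {num_visible} devices")
    return local_rank


if __name__ == "__main__":
    # Stand-in for the situation described above: one visible device,
    # but the process's real local rank on the node is 3.
    parallel_devices = ["cuda:0"]
    local_rank = 3

    try:
        parallel_devices[local_rank]  # same shape as parallel_devices[local_rank]
    except IndexError as exc:
        print("direct indexing fails:", exc)

    # The cluster manager exposed a single physical device to this process.
    index = pick_device_index(local_rank, count_visible_devices("5"))
    print("device selected with the remapped index:", parallel_devices[index])
```

This only illustrates the arithmetic behind the exception. Whether such a remapping should live in a custom strategy or elsewhere in Lightning is the open question here.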