🐛 Bug
Running DDP on a devgpu with 4 GPUs, using `--nproc_per_node=2` and `--nnodes=2`, does not work when the training script uses `LOCAL_RANK` to set the CUDA device:

```
torchx run dist.ddp -j 2x2
```
Module (check all that apply):
To Reproduce
See the description above; it easily reproduces with a training script:

```python
import os

import torch

if __name__ == "__main__":
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
```

Try running the above with:

```
torchx run dist.ddp -j 2x2 main.py
```
Expected behavior
The TorchX local scheduler should set `CUDA_VISIBLE_DEVICES=0,1` for the first two workers and `CUDA_VISIBLE_DEVICES=2,3` for the next two workers, so that `LOCAL_RANK` (which is 0 or 1 within each node replica) maps each worker onto a distinct physical GPU.
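For illustration, here is a minimal sketch of the expected device partitioning. This is not TorchX's actual implementation; the function name and signature are hypothetical. It just shows how a 2x2 job on a 4-GPU host should carve the visible devices into disjoint pairs per node replica:

```python
# Hypothetical sketch (not TorchX internals) of the expected
# CUDA_VISIBLE_DEVICES assignment for a `-j 2x2` job on a 4-GPU host:
# node replica 0 sees GPUs 0,1 and node replica 1 sees GPUs 2,3,
# so LOCAL_RANK 0/1 inside each replica hit distinct physical devices.

def visible_devices(replica: int, nproc_per_node: int) -> str:
    """GPUs the local scheduler should expose to one node replica."""
    start = replica * nproc_per_node
    return ",".join(str(d) for d in range(start, start + nproc_per_node))

for replica in range(2):
    print(f"replica {replica}: CUDA_VISIBLE_DEVICES={visible_devices(replica, 2)}")
# replica 0: CUDA_VISIBLE_DEVICES=0,1
# replica 1: CUDA_VISIBLE_DEVICES=2,3
```

With this partitioning, `torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))` is valid on every worker, since each replica only ever sees `nproc_per_node` devices.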
Environment
- torchx version (e.g. 0.1.0rc1):
- Python version:
- OS (e.g., Linux):
- How you installed torchx (conda, pip, source, docker):
- Docker image and tag (if using docker):
- Git commit (if installed from source):
- Execution environment (on-prem, AWS, GCP, Azure etc):
- Any other relevant information:
Additional context