This repo contains the PyTorch Distributed Deep Learning workshop contents, intended to run on the NVIDIA DLI platform. It simulates a two-host environment with 2 GPUs per host.
Use the Dockerfile to build a Docker image:
docker build --no-cache -t ptgtc .
Create a private Docker network so the two hosts can reach each other:
docker network create -d bridge --subnet 192.168.0.0/24 --gateway 192.168.0.1 backend
Launch two Docker containers (each simulating a host) using the image built above.
Node 1:
docker run -d --name node1 --network=backend -p 8000:8888 --shm-size=1g -e NVIDIA_VISIBLE_DEVICES=0,1 --runtime=nvidia ptgtc
Node 2:
docker run -d --name node2 --network=backend -p 9000:8888 --shm-size=1g -e NVIDIA_VISIBLE_DEVICES=2,3 --runtime=nvidia ptgtc
Once the containers are running, visit the content in your browser at localhost:8000 (node1) and localhost:9000 (node2).
Open a terminal window inside the JupyterLab browser window above and verify that the following commands run successfully:
ping node2
lsof -i -P -n (to list all open ports)
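Each container should also see exactly the two GPUs assigned to it through NVIDIA_VISIBLE_DEVICES; a quick way to confirm this from a Python prompt inside either container:

```python
import torch

# NVIDIA_VISIBLE_DEVICES restricts each container to two GPUs,
# so this should print 2 on both node1 and node2.
print(torch.cuda.device_count())
```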
To test distributed data parallel (DDP) training across the two hosts, follow the steps below:
- On node1, open two terminal windows
- In the first terminal window, run
export NCCL_DEBUG=info
- Then, in the same (first) terminal window, run
python ddp_tutorial.py 0 0
- In the second terminal window, run
python ddp_tutorial.py 1 1
- On node2, open two terminal windows
- In the first terminal window, run
export NCCL_DEBUG=info
- Then, in the same (first) terminal window, run
python ddp_tutorial.py 2 0
- In the second terminal window, run
python ddp_tutorial.py 3 1
The DDP training job should run to completion.
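For reference, the two arguments passed to ddp_tutorial.py above appear to be the global rank and the local GPU index (ranks 0 and 1 on node1, ranks 2 and 3 on node2). Below is a minimal sketch of what such a script could look like, assuming node1 acts as the rendezvous host, an arbitrarily chosen port (12355), and a toy model; the actual ddp_tutorial.py shipped with the workshop may differ.

```python
import os
import sys

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # Assumed argument order based on the run commands above:
    # argv[1] = global rank (0-3), argv[2] = local GPU index on this node (0-1).
    rank = int(sys.argv[1])
    local_rank = int(sys.argv[2])
    world_size = 4  # 2 nodes x 2 GPUs per node

    # Assumption: node1 is the rendezvous host; 12355 is an arbitrary free port.
    os.environ.setdefault("MASTER_ADDR", "node1")
    os.environ.setdefault("MASTER_PORT", "12355")

    # Bind this process to its GPU before initializing NCCL.
    torch.cuda.set_device(local_rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # Toy model and training loop, just to exercise DDP gradient all-reduce.
    model = nn.Linear(10, 10).to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for _ in range(5):
        optimizer.zero_grad()
        outputs = ddp_model(torch.randn(20, 10, device=local_rank))
        labels = torch.randn(20, 10, device=local_rank)
        loss_fn(outputs, labels).backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

With NCCL_DEBUG=info set, each process prints NCCL initialization details, which is useful for confirming that all four ranks found each other over the backend network.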