This repo contains the PyTorch Distributed Deep Learning workshop contents to run on the Nvidia DLI platform. It will simulate two host environment with 2 GPUs per host.
For the purposes of the GTC demo, you will need
docker-ce 18.03
docker-compose 1.24
- NVIDIA driver 418.67 on host
Use docker-compose build
to build the services node1
and node2
.
Use docker-compose up
to launch the services for development.
Use docker-compose up -d
to launch the services as daemons, and docker-compose down
to terminate the services.
Use Dockerfile
to build a docker image with docker build --no-cache -t ptgtc .
.
Create private docker network for network connection across the hosts
docker network create -d bridge --subnet 192.168.0.0/24 --gateway 192.168.0.1 backend
Launch two docker containers (each simulating a host) using the docker image created.
Node 1:
docker run -d --name node1 --network=backend -p 8000:8888 --shm-size=16g -e NVIDIA_VISIBLE_DEVICES=0,1 --runtime=nvidia ptgtc
Node 2:
docker run -d --name node2 --network=backend -p 9000:8888 --shm-size=16g -e NVIDIA_VISIBLE_DEVICES=2,3 --runtime=nvidia ptgtc
Once the containers are running, visit the content in your browser at localhost:8000
and localhost:9000
Open a terminal window inside the juperlab browser window above and verify following commands are running.
ping node2
lsof -i -P -n
to see the list of all the open ports
For testing the distributed data parallel across the two hosts, follow below sequence of steps:
- On node1, open two terminal windows
- From first terminal window run command
export NCCL_DEBUG=info
- From first terminal window run command
python ddp_tutorial.py 0 0
- From second terminal window run command
python ddp_tutorial.py 1 1
- On node2, open two terminal windows
- From first terminal window run command
export NCCL_DEBUG=info
- From first terminal window run command
python ddp_tutorial.py 2 0
- From second terminal window run command
python ddp_tutorial.py 3 1
The DDP training job should run and complete.