Update docker to Torch 2.1.0+CUDA11.8 to resolve multi-sampler issue (#377)
Resolves issue #199.

This change updates the Torch version from `torch==1.13` to `torch==2.1.0` in the Dockerfile. Torch versions later than `1.12` had a bug that prevented us from using `num_samplers` > 0; the bug is fixed in the PyTorch 2.1.0 release. We verified the fix through the experiments below; a sketch of the corresponding Dockerfile change follows the experiment results.

### Experiment setup

- Dataset: ogbn-mag (partitioned into 2)
- DGL versions: `1.0.4+cu117` and `1.1.1+cu113`
- Torch version: `2.1.0+cu118`

### Experiment 1: 1 trainer and 4 samplers

```
python3 -u /dgl/tools/launch.py \
    --workspace /graph-storm/python/graphstorm/run/gsgnn_lp \
    --num_trainers 1 \
    --num_servers 1 \
    --num_samplers 4 \
    --part_config /data/ogbn_mag_lp_2p/ogbn-mag.json \
    --ip_config /data/ip_list_p2.txt \
    --ssh_port 2222 \
    --graph_format csc,coo \
    "python3 /graph-storm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf /data/mag_2p_lp.yaml --node-feat-name paper:feat --no-validation true"
```

Output:

```
Epoch 00000 | Batch 000 | Train Loss: 13.5191 | Time: 3.2363
Epoch 00000 | Batch 020 | Train Loss: 3.2547 | Time: 0.4499
Epoch 00000 | Batch 040 | Train Loss: 2.0744 | Time: 0.5477
Epoch 00000 | Batch 060 | Train Loss: 1.6599 | Time: 0.5524
Epoch 00000 | Batch 080 | Train Loss: 1.4543 | Time: 0.4597
Epoch 00000 | Batch 100 | Train Loss: 1.2397 | Time: 0.4665
Epoch 00000 | Batch 120 | Train Loss: 1.0915 | Time: 0.4823
Epoch 00000 | Batch 140 | Train Loss: 0.9683 | Time: 0.4576
Epoch 00000 | Batch 160 | Train Loss: 0.8798 | Time: 0.5382
Epoch 00000 | Batch 180 | Train Loss: 0.7762 | Time: 0.5681
Epoch 00000 | Batch 200 | Train Loss: 0.7021 | Time: 0.4492
Epoch 00000 | Batch 220 | Train Loss: 0.6619 | Time: 0.4450
Epoch 00000 | Batch 240 | Train Loss: 0.6001 | Time: 0.4437
Epoch 00000 | Batch 260 | Train Loss: 0.5591 | Time: 0.4540
Epoch 00000 | Batch 280 | Train Loss: 0.5115 | Time: 0.3577
Epoch 0 take 134.6200098991394
```

### Experiment 2: 4 trainers and 4 samplers

```
python3 -u /dgl/tools/launch.py \
    --workspace /graph-storm/python/graphstorm/run/gsgnn_lp \
    --num_trainers 4 \
    --num_servers 1 \
    --num_samplers 4 \
    --part_config /data/ogbn_mag_lp_2p/ogbn-mag.json \
    --ip_config /data/ip_list_p2.txt \
    --ssh_port 2222 \
    --graph_format csc,coo \
    "python3 /graph-storm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf /data/mag_2p_lp.yaml --node-feat-name paper:feat --no-validation true"
```

Output:

```
Epoch 00000 | Batch 000 | Train Loss: 11.1130 | Time: 4.6957
Epoch 00000 | Batch 020 | Train Loss: 3.3098 | Time: 0.7897
Epoch 00000 | Batch 040 | Train Loss: 1.9996 | Time: 0.8633
Epoch 00000 | Batch 060 | Train Loss: 1.5202 | Time: 0.4229
Epoch 0 take 56.44491267204285
successfully save the model to /data/ogbn-map-lp/model/epoch-0
Time on save model 5.461951017379761
```
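For reference, the Dockerfile change is roughly the following. This is a sketch, not the repository's actual Dockerfile: the install command and the `--index-url`/CUDA tag are assumptions based on the `2.1.0+cu118` build used in the experiments above.

```
# Hypothetical Dockerfile fragment illustrating the version bump.
# Before: RUN pip3 install torch==1.13
# After: install the CUDA 11.8 build of Torch 2.1.0, which fixes the
# multiprocessing-sampler bug and allows num_samplers > 0.
RUN pip3 install torch==2.1.0+cu118 --index-url https://download.pytorch.org/whl/cu118
```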
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.