Update docker to Torch 2.1.0+CUDA11.8 to resolve multi-sampler issue (#377)
Resolves issue #199.

This change updates the Torch version from `torch==1.13` to `torch==2.1.0` in the Dockerfile. Torch versions later than `1.12` had a bug that prevented us from using `num_samplers` > 0; the bug is fixed in the PyTorch 2.1.0 release. We verified the fix through the experiments below; a sketch of the corresponding Dockerfile change follows the experiment results.

### Experiment setup

- Dataset: ogbn-mag (partitioned into 2)
- DGL versions: `1.0.4+cu117` and `1.1.1+cu113`
- Torch version: `2.1.0+cu118`

### Experiment 1: 1 trainer and 4 samplers

```
python3 -u /dgl/tools/launch.py \
    --workspace /graph-storm/python/graphstorm/run/gsgnn_lp \
    --num_trainers 1 \
    --num_servers 1 \
    --num_samplers 4 \
    --part_config /data/ogbn_mag_lp_2p/ogbn-mag.json \
    --ip_config /data/ip_list_p2.txt \
    --ssh_port 2222 \
    --graph_format csc,coo \
    "python3 /graph-storm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf /data/mag_2p_lp.yaml --node-feat-name paper:feat --no-validation true"
```

Output:

```
Epoch 00000 | Batch 000 | Train Loss: 13.5191 | Time: 3.2363
Epoch 00000 | Batch 020 | Train Loss: 3.2547 | Time: 0.4499
Epoch 00000 | Batch 040 | Train Loss: 2.0744 | Time: 0.5477
Epoch 00000 | Batch 060 | Train Loss: 1.6599 | Time: 0.5524
Epoch 00000 | Batch 080 | Train Loss: 1.4543 | Time: 0.4597
Epoch 00000 | Batch 100 | Train Loss: 1.2397 | Time: 0.4665
Epoch 00000 | Batch 120 | Train Loss: 1.0915 | Time: 0.4823
Epoch 00000 | Batch 140 | Train Loss: 0.9683 | Time: 0.4576
Epoch 00000 | Batch 160 | Train Loss: 0.8798 | Time: 0.5382
Epoch 00000 | Batch 180 | Train Loss: 0.7762 | Time: 0.5681
Epoch 00000 | Batch 200 | Train Loss: 0.7021 | Time: 0.4492
Epoch 00000 | Batch 220 | Train Loss: 0.6619 | Time: 0.4450
Epoch 00000 | Batch 240 | Train Loss: 0.6001 | Time: 0.4437
Epoch 00000 | Batch 260 | Train Loss: 0.5591 | Time: 0.4540
Epoch 00000 | Batch 280 | Train Loss: 0.5115 | Time: 0.3577
Epoch 0 take 134.6200098991394
```

### Experiment 2: 4 trainers and 4 samplers

```
python3 -u /dgl/tools/launch.py \
    --workspace /graph-storm/python/graphstorm/run/gsgnn_lp \
    --num_trainers 4 \
    --num_servers 1 \
    --num_samplers 4 \
    --part_config /data/ogbn_mag_lp_2p/ogbn-mag.json \
    --ip_config /data/ip_list_p2.txt \
    --ssh_port 2222 \
    --graph_format csc,coo \
    "python3 /graph-storm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf /data/mag_2p_lp.yaml --node-feat-name paper:feat --no-validation true"
```

Output:

```
Epoch 00000 | Batch 000 | Train Loss: 11.1130 | Time: 4.6957
Epoch 00000 | Batch 020 | Train Loss: 3.3098 | Time: 0.7897
Epoch 00000 | Batch 040 | Train Loss: 1.9996 | Time: 0.8633
Epoch 00000 | Batch 060 | Train Loss: 1.5202 | Time: 0.4229
Epoch 0 take 56.44491267204285
successfully save the model to /data/ogbn-map-lp/model/epoch-0
Time on save model 5.461951017379761
```
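For reference, the Dockerfile change is roughly the following. This is a sketch, not the repository's actual Dockerfile: the install command and the `--index-url`/CUDA tag are assumptions based on the `2.1.0+cu118` build used in the experiments above.

```
# Hypothetical Dockerfile fragment illustrating the version bump.
# Before: RUN pip3 install torch==1.13
# After: install the CUDA 11.8 build of Torch 2.1.0, which fixes the
# multiprocessing-sampler bug and allows num_samplers > 0.
RUN pip3 install torch==2.1.0+cu118 --index-url https://download.pytorch.org/whl/cu118
```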
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.