
Add distributed training sage e2e example on homogeneous graphs [5/6] #8029

Closed

Conversation


@ZhengHongming888 commented Sep 14, 2023

This code is part of the overall distributed training support for PyG.

This homogeneous e2e example is based on the ogbn-products dataset with 2 partitions, which are generated by our partition example script in the same folder. The default number of sampler workers is 2 and the concurrency for mp.queue is 2; you can set higher values for more throughput.
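For reference, here is a minimal sketch of how the 2 partitions could be produced, assuming PyG's `torch_geometric.distributed.Partitioner` API; the actual partition example script in this folder may differ:

```python
# Hypothetical partitioning sketch; paths and parameters are illustrative.
from ogb.nodeproppred import PygNodePropPredDataset
from torch_geometric.distributed import Partitioner

dataset = PygNodePropPredDataset('ogbn-products', root='./data')

partitioner = Partitioner(
    data=dataset[0],
    num_parts=2,                        # one partition per training node
    root='./partitions/ogbn-products',  # later passed as --dataset_root_dir
)
partitioner.generate_partition()
```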

You can run this example on 2 nodes (2 partitions) with the commands below:

node 0:
python dist_train_sage_for_homo.py --dataset_root_dir=<partition folder> --num_nodes=2 --node_rank=0 --num_training_procs=1 --master_addr=<master ip>

node 1:
python dist_train_sage_for_homo.py --dataset_root_dir=<partition folder> --num_nodes=2 --node_rank=1 --num_training_procs=1 --master_addr=<master ip>
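For illustration, a hypothetical sketch of how these flags might be wired to per-node training processes inside `dist_train_sage_for_homo.py`; the script's actual entry point may differ, and `run_training_proc` is a stand-in name:

```python
# Hypothetical launcher sketch for the CLI flags shown above.
import argparse

import torch.multiprocessing as mp


def run_training_proc(local_rank: int, args: argparse.Namespace) -> None:
    global_rank = args.node_rank * args.num_training_procs + local_rank
    ...  # init RPC/process group with rank=global_rank via args.master_addr, then train


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--dataset_root_dir', type=str, required=True)
    parser.add_argument('--num_nodes', type=int, default=2)
    parser.add_argument('--node_rank', type=int, default=0)
    parser.add_argument('--num_training_procs', type=int, default=1)
    parser.add_argument('--master_addr', type=str, default='localhost')
    args = parser.parse_args()

    # Spawn --num_training_procs training processes on this node:
    mp.spawn(run_training_proc, args=(args,), nprocs=args.num_training_procs)
```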

During training, a progress bar will report the status, e.g.:
Epoch 00: 27%|████████████████████████████████▊ | 26624/98307 [ ]

After the training/test epochs, 3 result files are generated (see the logging sketch after this list):

  1. dist_train_sage_for_homo.txt - general training information/arguments
  2. dist_train_sage_for_homo_rank0.txt - training loss, accuracy, and epoch time on the rank 0 node
  3. dist_train_sage_for_homo_rank1.txt - training loss, accuracy, and epoch time on the rank 1 node
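A hypothetical sketch of how the per-rank result files above could be written; the field names are illustrative, not the script's actual format:

```python
# Hypothetical per-rank logging sketch; values below are dummy placeholders.
def log_rank_results(rank: int, epoch: int, loss: float, acc: float,
                     epoch_time: float) -> None:
    with open(f'dist_train_sage_for_homo_rank{rank}.txt', 'a') as f:
        f.write(f'epoch={epoch:02d} loss={loss:.4f} '
                f'acc={acc:.4f} time={epoch_time:.2f}s\n')


log_rank_results(rank=0, epoch=0, loss=1.2345, acc=0.8123, epoch_time=42.0)
```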

You can also refer to our e2e README for more distributed example cases; we will add more, such as hetero, edge_sampler, etc., along with their run commands. Any comments are welcome!

Thanks.

@rusty1s rusty1s changed the title Add distributed training sage e2e example for homo Add distributed training sage e2e example on homogeneous graphs Sep 15, 2023
@rusty1s rusty1s changed the title Add distributed training sage e2e example on homogeneous graphs Add distributed training sage e2e example on homogeneous graphs [5/6] Oct 30, 2023

@JakubPietrakIntel left a comment


@ZhengHongming888 I left a request to remove barrier waits and enable persistent workers instead.
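As a hedged illustration of that request, here is a minimal sketch of keeping sampler workers alive across train/test stages via `persistent_workers`, instead of synchronizing with barrier waits; `FakeDataset` stands in for the real partitioned data:

```python
# Minimal sketch: persistent workers survive between epochs/stages, so the
# sampler processes (and any state they hold) are not torn down and respawned.
from torch_geometric.datasets import FakeDataset
from torch_geometric.loader import NeighborLoader

data = FakeDataset()[0]  # stand-in graph for illustration

loader = NeighborLoader(
    data,
    num_neighbors=[15, 10],
    batch_size=1024,
    num_workers=2,            # default sampler worker count in this example
    persistent_workers=True,  # keep workers alive instead of barrier waits
)

for epoch in range(2):
    for batch in loader:
        ...  # train/test step
```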

@JakubPietrakIntel

#8713 adds an updated version of e2e examples for dist PyG, so I'm closing this PR in agreement with @ZhengHongming888

rusty1s added a commit that referenced this pull request Feb 4, 2024
This PR adds an improved and refactored E2E example using `GraphSAGE`
and `OGB` datasets for both homogeneous (`ogbn-products`) and
heterogeneous (`ogbn-mag`) data.

Changes wrt #8029:
- Added heterogeneous example
- Merged homo & hetero into one script
- Aligned with partitioning changes #8638
- Simplified user input
- Improved display & logging
- Enabled multithreading by default - hotfix for slow hetero sampling
- Enabled `persistent_workers` by default - hotfix for breaking RPC
connection between train & test stages
- Updated README
Review:
- Moved attribute assignment from `load_partition_info()` to LGS/LFS
`from_partition()` and simplified Stores initialization from partition
files.
- Adjusted the tests.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kinga Gajdamowicz <kinga.gajdamowicz@intel.com>
Co-authored-by: ZhengHongming888 <hongming.zheng@intel.com>
Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de>