Add distributed training sage e2e example on homogeneous graphs [5/6] #8029
Closed
ZhengHongming888 wants to merge 28 commits into pyg-team:master from ZhengHongming888:dist_e2e_homo_example
Conversation
JakubPietrakIntel requested changes on Nov 23, 2023
@ZhengHongming888 I left a request to remove barrier waits and enable persistent workers instead.
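The reviewer's suggestion, i.e. keeping sampler workers alive across epochs instead of re-synchronizing with barrier waits, can be illustrated with a minimal sketch. This uses a plain PyTorch `DataLoader` over a toy dataset as a stand-in for PyG's distributed neighbor loader; the dataset and sizes here are illustrative, not taken from the PR:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for a partition's training seed nodes.
dataset = TensorDataset(torch.arange(100))

# With persistent_workers=True, worker processes survive across epochs,
# so there is no need for an explicit barrier wait after each epoch just
# to let sampler workers tear down and respawn.
loader = DataLoader(
    dataset,
    batch_size=10,
    num_workers=2,
    persistent_workers=True,  # keep sampler workers alive between epochs
)

for epoch in range(2):
    for (batch,) in loader:
        pass  # the training step would go here
```

`persistent_workers` requires `num_workers > 0`; the same flag is accepted by PyG's loaders, which build on `torch.utils.data.DataLoader`.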
#8713 adds an updated version of the e2e examples for distributed PyG, so I'm closing this PR in agreement with @ZhengHongming888.
rusty1s added a commit that referenced this pull request on Feb 4, 2024
This PR adds an improved and refactored E2E example using `GraphSAGE` and `OGB` datasets for both homogeneous (`ogbn-products`) and heterogeneous (`ogbn-mag`) data.

Changes w.r.t. #8029:
- Added heterogeneous example
- Merged homo & hetero into one script
- Aligned with partitioning changes #8638
- Simplified user input
- Improved display & logging
- Enabled multithreading by default (hotfix for slow hetero sampling)
- Enabled `persistent_workers` by default (hotfix for the breaking RPC connection between train & test stages)
- Updated README

Review:
- Moved attribute assignment from `load_partition_info()` to LGS/LFS `from_partition()` and simplified Stores initiation from partition files.
- Adjusted the tests.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kinga Gajdamowicz <kinga.gajdamowicz@intel.com>
Co-authored-by: ZhengHongming888 <hongming.zheng@intel.com>
Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de>
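The "merged homo & hetero into one script" and "simplified user input" items can be sketched as follows. This is a hypothetical illustration of the pattern, where the dataset name alone selects the homogeneous or heterogeneous code path; the actual argument names in #8713 may differ:

```python
import argparse

# Hypothetical CLI sketch: one script serving both OGB datasets, with the
# dataset choice implying the homo/hetero code path.
parser = argparse.ArgumentParser(description="Distributed GraphSAGE on OGB")
parser.add_argument(
    "--dataset",
    default="ogbn-products",
    choices=["ogbn-products", "ogbn-mag"],
)
args = parser.parse_args([])  # parse defaults here for demonstration

# ogbn-mag is heterogeneous; ogbn-products is homogeneous.
is_hetero = args.dataset == "ogbn-mag"
print(f"dataset={args.dataset}, hetero={is_hetero}")
```

Folding the two examples into one entry point keeps the user-facing interface to a single flag instead of separate scripts per graph type.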
This code is part of the distributed training support for PyG.
You can run this example on two nodes (two partitions) with the commands below.
This homogeneous e2e example is based on the ogbn-products dataset with two partitions, generated by our partition example script in the same folder. The default number of sampler workers is 2, and the concurrency for mp.queue is 2. You can also raise these numbers for higher throughput.
node 0:
python dist_train_sage_for_homo.py --dataset_root_dir=<your partition folder> --num_nodes=2 --node_rank=0 --num_training_procs=1 --master_addr=<master ip>
node 1:
python dist_train_sage_for_homo.py --dataset_root_dir=<your partition folder> --num_nodes=2 --node_rank=1 --num_training_procs=1 --master_addr=<master ip>
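The two per-node invocations differ only in `--node_rank`, so a small helper can build the command for each node. This is an optional convenience sketch; the partition path and master IP below are placeholders, not values from the PR:

```shell
#!/bin/sh
# Hypothetical helper mirroring the two invocations above: only the node
# rank changes between nodes. Replace the path and IP with your own.
build_cmd() {
    node_rank="$1"
    echo "python dist_train_sage_for_homo.py" \
         "--dataset_root_dir=./partitions" \
         "--num_nodes=2 --node_rank=${node_rank}" \
         "--num_training_procs=1 --master_addr=10.0.0.1"
}

build_cmd 0   # run the printed command on node 0
build_cmd 1   # run the printed command on node 1
```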
During training, a progress bar reports the status, e.g.:
Epoch 00: 27%|████████████████████████████████▊ | 26624/98307 [ ]
After the training and test epochs, three result files are generated.
You can also refer to our e2e README for more distributed example cases; we will add more (e.g., hetero, edge_sampler) along with their run commands. Any comments are welcome!
Thanks.