
Add distributed training sage e2e example on homogeneous graphs [5/6] #8029

Closed

Conversation


@ZhengHongming888 commented Sep 14, 2023

This code is part of the overall distributed training support for PyG.

This homogeneous e2e example is based on the ogbn-products dataset with 2 partitions, which are generated by our partition example script in the same folder. The default number of sampler workers is 2 and the concurrency for mp.queue is 2; you can set higher values for more throughput.
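For reference, here is a minimal sketch of how the 2 partitions could be produced, assuming PyG's `torch_geometric.distributed.Partitioner` API; the actual partition example script in this folder may differ:

```python
# Hypothetical partitioning sketch; paths and parameters are illustrative.
from ogb.nodeproppred import PygNodePropPredDataset
from torch_geometric.distributed import Partitioner

dataset = PygNodePropPredDataset('ogbn-products', root='./data')

partitioner = Partitioner(
    data=dataset[0],
    num_parts=2,                        # one partition per training node
    root='./partitions/ogbn-products',  # later passed as --dataset_root_dir
)
partitioner.generate_partition()
```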

You can run this example on 2 nodes (2 partitions) with the commands below:

node 0:
python dist_train_sage_for_homo.py --dataset_root_dir=<partition folder> --num_nodes=2 --node_rank=0 --num_training_procs=1 --master_addr=<master ip>

node 1:
python dist_train_sage_for_homo.py --dataset_root_dir=<partition folder> --num_nodes=2 --node_rank=1 --num_training_procs=1 --master_addr=<master ip>
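For illustration, a hypothetical sketch of how these flags might be wired to per-node training processes inside `dist_train_sage_for_homo.py`; the script's actual entry point may differ, and `run_training_proc` is a stand-in name:

```python
# Hypothetical launcher sketch for the CLI flags shown above.
import argparse

import torch.multiprocessing as mp


def run_training_proc(local_rank: int, args: argparse.Namespace) -> None:
    global_rank = args.node_rank * args.num_training_procs + local_rank
    ...  # init RPC/process group with rank=global_rank via args.master_addr, then train


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--dataset_root_dir', type=str, required=True)
    parser.add_argument('--num_nodes', type=int, default=2)
    parser.add_argument('--node_rank', type=int, default=0)
    parser.add_argument('--num_training_procs', type=int, default=1)
    parser.add_argument('--master_addr', type=str, default='localhost')
    args = parser.parse_args()

    # Spawn --num_training_procs training processes on this node:
    mp.spawn(run_training_proc, args=(args,), nprocs=args.num_training_procs)
```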

During training, a progress bar will report the status, e.g.:
Epoch 00: 27%|████████████████████████████████▊ | 26624/98307 [ ]

After the training/test epochs, 3 result files are generated (see the logging sketch after this list):

  1. dist_train_sage_for_homo.txt - general training information/arguments
  2. dist_train_sage_for_homo_rank0.txt - training loss, accuracy, and epoch time on the rank 0 node
  3. dist_train_sage_for_homo_rank1.txt - training loss, accuracy, and epoch time on the rank 1 node
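A hypothetical sketch of how the per-rank result files above could be written; the field names are illustrative, not the script's actual format:

```python
# Hypothetical per-rank logging sketch; values below are dummy placeholders.
def log_rank_results(rank: int, epoch: int, loss: float, acc: float,
                     epoch_time: float) -> None:
    with open(f'dist_train_sage_for_homo_rank{rank}.txt', 'a') as f:
        f.write(f'epoch={epoch:02d} loss={loss:.4f} '
                f'acc={acc:.4f} time={epoch_time:.2f}s\n')


log_rank_results(rank=0, epoch=0, loss=1.2345, acc=0.8123, epoch_time=42.0)
```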

You can also refer to our e2e README for more distributed example cases; we will add more, such as hetero, edge_sampler, etc., along with their run commands. Any comments are welcome!

Thanks.

@rusty1s rusty1s changed the title Add distributed training sage e2e example for homo Add distributed training sage e2e example on homogeneous graphs Sep 15, 2023
@rusty1s rusty1s changed the title Add distributed training sage e2e example on homogeneous graphs Add distributed training sage e2e example on homogeneous graphs [5/6] Oct 30, 2023

@JakubPietrakIntel left a comment


@ZhengHongming888 I left a request to remove barrier waits and enable persistent workers instead.
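As a hedged illustration of that request, here is a minimal sketch of keeping sampler workers alive across train/test stages via `persistent_workers`, instead of synchronizing with barrier waits; `FakeDataset` stands in for the real partitioned data:

```python
# Minimal sketch: persistent workers survive between epochs/stages, so the
# sampler processes (and any state they hold) are not torn down and respawned.
from torch_geometric.datasets import FakeDataset
from torch_geometric.loader import NeighborLoader

data = FakeDataset()[0]  # stand-in graph for illustration

loader = NeighborLoader(
    data,
    num_neighbors=[15, 10],
    batch_size=1024,
    num_workers=2,            # default sampler worker count in this example
    persistent_workers=True,  # keep workers alive instead of barrier waits
)

for epoch in range(2):
    for batch in loader:
        ...  # train/test step
```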

@JakubPietrakIntel

#8713 adds an updated version of e2e examples for dist PyG, so I'm closing this PR in agreement with @ZhengHongming888

rusty1s added a commit that referenced this pull request Feb 4, 2024
This PR adds an improved and refactored E2E example using `GraphSAGE`
and `OGB` datasets for both homogeneous (`ogbn-products`) and
heterogeneous (`ogbn-mag`) data.

Changes wrt #8029:
- Added heterogeneous example
- Merged homo & hetero into one script
- Aligned with partitioning changes #8638
- Simplified user input
- Improved display & logging
- Enabled multithreading by default - hotfix for slow hetero sampling
- Enabled `persistent_workers` by default - hotfix for breaking RPC
connection between train & test stages
- Updated README
Review:
- Moved attribute assignment from `load_partition_info()` to LGS/LFS
`from_partition()` and simplified Stores initialization from partition
files.
- Adjusted the tests.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kinga Gajdamowicz <kinga.gajdamowicz@intel.com>
Co-authored-by: ZhengHongming888 <hongming.zheng@intel.com>
Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de>