
[Example] Add WholeGraph to accelerate PyG dataloaders with GPUs #9714

Open · wants to merge 8 commits into master

Conversation

@chang-l commented Oct 17, 2024

This PR demonstrates how to integrate NVIDIA WholeGraph into PyG's graph store and feature store base classes, providing a modular, PyG-like way to extend PyG's dataloaders for better GPU utilization. WholeGraph handles the optimization of data access on NVIDIA hardware and manages graph and feature storage, with optional sharding across distributed disk, host RAM, or device memory.

Compared to existing examples, there are three key differences:

  • The WholeGraph library does not provide a dataloader itself; instead, it hosts the underlying distributed graph and feature storage together with efficient primitive operations (e.g., GPU-accelerated embedding retrieval and graph sampling).

  • It is efficient, minimizing CPU interruptions, and plugs into PyG's feature store and graph store abstractions, so it remains compatible with existing native PyG dataloaders. See the feature_store.py and graph_store.py implementations.

  • There is no distinction between single-GPU, multi-GPU, and multi-node multi-GPU training with this new feature store and graph store. Users do not need to partition the graph or hand-craft third-party launch scripts; everything falls under the standard PyTorch DDP workflow. The examples (papers100m_dist_wholegraph_nc.py and benchmark_data.py) show how to get there from any existing PyG DDP example, as sketched below.
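
A rough sketch of how these stores plug into an unmodified PyG dataloader is shown below. The store class names and constructors are shorthand for what feature_store.py/graph_store.py provide, not necessarily the exact API; the (FeatureStore, GraphStore) tuple input to NeighborLoader is standard PyG.

```python
# Illustrative sketch: WholeGraphFeatureStore/WholeGraphGraphStore names and
# constructors below are assumptions; the (FeatureStore, GraphStore) tuple
# input to NeighborLoader is standard PyG API.
import torch
import torch.distributed as dist
from torch_geometric.datasets import FakeDataset
from torch_geometric.loader import NeighborLoader

from feature_store import WholeGraphFeatureStore  # provided by this PR
from graph_store import WholeGraphGraphStore      # provided by this PR

dist.init_process_group(backend='nccl')  # usual torchrun/DDP setup

data = FakeDataset(avg_num_nodes=10_000)[0]  # stand-in for, e.g., ogbn-papers100M

# Hypothetical construction from an in-memory PyG data object; WholeGraph
# shards features and graph structure across device, host, or distributed memory.
feature_store = WholeGraphFeatureStore(data)
graph_store = WholeGraphGraphStore(data)

# An unmodified PyG dataloader consumes the (feature_store, graph_store) tuple
# in place of a Data/HeteroData object.
loader = NeighborLoader(
    (feature_store, graph_store),
    num_neighbors=[15, 10],
    input_nodes=torch.arange(1024),  # placeholder seed nodes
    batch_size=1024,
    shuffle=True,
)
```

Multi-GPU and multi-node runs then use the standard torchrun launch, e.g. `torchrun --nproc_per_node=4 benchmark_data.py`, with no separate graph-partitioning step.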

Running the benchmark script (benchmark_data.py), we observed 2x, 5x, and 9x speedups on a single NVIDIA T4, A100, and H100 GPU, respectively, compared to the native PyG NeighborLoader. With 4 GPUs, the speedups increase to 6.4x, 15x, and 35x, respectively (numbers may vary depending on the CPU used for the baseline run).

Meanwhile, given the compatibility and performance benefits demonstrated in this PR, I'd like to propose (1) integrating WholeGraph, as an option, to back data.FeatureStore/HeteroData.FeatureStore first, and (2) supporting the WholeMemory tensor type as a new case in the index_select function (alongside the existing `if isinstance(value, Tensor):` branch), making it accessible via UVA to more users.
cc. @puririshi98 @TristonC @alexbarghi-nv @linhu-nv @rusty1s

@chang-l requested a review from wsad1 as a code owner on October 17, 2024 23:04
@puririshi98 (Contributor) left a comment

Overall looks good to me. @rusty1s I wonder if you think it would be a better fit to have these helper files directly integrated into torch_geometric.distributed.wholegraph or something like that.

Also, @chang-l, please remove the stale examples from examples/multi_gpu/

these:
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/data_parallel.py
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/distributed_batching.py
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/distributed_sampling.py
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/distributed_sampling_multinode.py
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/distributed_sampling_multinode.sbatch
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/mag240m_graphsage.py
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/papers100m_gcn.py
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/papers100m_gcn_multinode.py

For the taobao and pcqm4m examples in the folder, I think it would be best to add a comment at the top mentioning that mp.spawn is deprecated and pointing to your new examples.
Please also update the README of that folder accordingly.

Lastly, please add a similar deprecation comment and pointer to the new examples for these two:
https://github.com/pyg-team/pytorch_geometric/blob/master/docs/source/tutorial/multi_gpu_vanilla.rst
https://github.com/pyg-team/pytorch_geometric/blob/master/docs/source/tutorial/multi_node_multi_gpu_vanilla.rst
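
For the Python examples, the header note could be something along these lines (just a suggested wording, pointing at the examples added in this PR):

```python
# NOTE: This example launches workers via mp.spawn, which is deprecated in
# favor of torchrun. For an up-to-date multi-GPU workflow, see the
# WholeGraph-based examples added in PR #9714
# (e.g., papers100m_dist_wholegraph_nc.py).
```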

@alexbarghi-nv (Contributor) commented Oct 22, 2024

@puririshi98 can we hold off on this for now? We are having a meeting in a couple hours to discuss this PR and how we want to go about it.

@alexbarghi-nv (Contributor)

@puririshi98 @chang-l We can go ahead and instruct users to use torchrun and the example WG Graph/Feature stores. At some point, we will replace the ones in the examples directory with official ones that are part of cugraph. Our long-term strategy, I think, based on our discussion, is to have this take over feature storage in cuGraph. The cuGraph loaders will remain for users that need them for extreme scale applications. Then, for sampling, we will eventually replace the WholeGraph samplers with cuGraph ones once our C++ code can support custom partitioning schemes.

@chang-l (Author) commented Oct 22, 2024

Okay. From our side, we can keep this PR as is (as one of the distributed examples) for now and gradually merge it into cuGraph along the way, while keeping the examples up to date. Sounds good? @alexbarghi-nv @puririshi98 @TristonC @BradReesWork

@alexbarghi-nv (Contributor)

@chang-l sounds good to me 👍

@chang-l (Author) commented Oct 22, 2024

@puririshi98 Thank you, Rishi, for the suggestions. I will file another PR to update and reorganize the existing multi-GPU/multi-node examples.

@chang-l (Author)

@alexbarghi-nv, do you mind adding a README file under this directory later?
