[Example] Add WholeGraph to accelerate PyG dataloaders with GPUs #9714
base: master
Conversation
Overall looks good to me. @rusty1s I wonder if you think it would be a better fit to have these helper files directly integrated into torch_geometric.distributed.wholegraph or something like that.
Also, @chang-l, please remove these stale examples from examples/multi_gpu/:
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/data_parallel.py
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/distributed_batching.py
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/distributed_sampling.py
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/distributed_sampling_multinode.py
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/distributed_sampling_multinode.sbatch
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/mag240m_graphsage.py
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/papers100m_gcn.py
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/papers100m_gcn_multinode.py
For the taobao and pcqm4m examples in that folder, I think it would be best to add a comment at the top mentioning that mp.spawn is deprecated and pointing to your new examples.
Please also update the README of that folder accordingly.
Lastly, please add a similar deprecation comment and pointer to the new examples for these two:
https://github.com/pyg-team/pytorch_geometric/blob/master/docs/source/tutorial/multi_gpu_vanilla.rst
https://github.com/pyg-team/pytorch_geometric/blob/master/docs/source/tutorial/multi_node_multi_gpu_vanilla.rst
@puririshi98 @chang-l We can go ahead and instruct users to use
Okay, I guess, from our side, we can keep this PR as it is (as one of the distributed examples) for now and gradually merge it into cuGraph along the way while keeping the examples up to date. Sounds good? @alexbarghi-nv @puririshi98 @TristonC @BradReesWork
@chang-l sounds good to me 👍
@puririshi98 Thank you, Rishi, for the suggestions. I will file another PR to update and reorganize the existing multi-GPU/multi-node examples.
@alexbarghi-nv do you mind adding a README file under this directory later?
This PR demonstrates how to integrate NVIDIA WholeGraph into PyG's graph store and feature store base classes, providing a modular, PyG-like way to extend PyG's dataloaders for better GPU utilization. WholeGraph handles the optimization of data access on NVIDIA hardware and manages graph and feature storage, with optional sharding across distributed disk, host RAM, or device memory.
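For context, here is a minimal, hedged sketch of the shape such a feature-store integration can take, built only on PyG's public `FeatureStore` interface. The `wg_tensor` handle and its `gather()` method are placeholders for a WholeGraph-managed (potentially sharded) feature tensor, not the actual pylibwholegraph API; the authoritative version is the `feature_store.py` added in this PR.

```python
# Hedged sketch only: `wg_tensor` stands in for a WholeGraph-managed feature
# tensor exposing `gather(indices)`; see this PR's feature_store.py for the
# real implementation.
from typing import List, Optional, Tuple

import torch
from torch_geometric.data import FeatureStore, TensorAttr


class WholeGraphFeatureStore(FeatureStore):
    def __init__(self, wg_tensor, num_nodes: int, feat_dim: int):
        super().__init__()
        self.wg_tensor = wg_tensor  # assumed WholeGraph handle (placeholder)
        self.num_nodes = num_nodes
        self.feat_dim = feat_dim

    def _put_tensor(self, tensor: torch.Tensor, attr: TensorAttr) -> bool:
        # Features are loaded into WholeGraph up front; the read-only training
        # workflow sketched here does not need writes.
        raise NotImplementedError

    def _get_tensor(self, attr: TensorAttr) -> Optional[torch.Tensor]:
        index = attr.index
        if index is None:  # Fall back to gathering all rows.
            index = torch.arange(self.num_nodes)
        # GPU-accelerated gather from (possibly distributed) storage.
        return self.wg_tensor.gather(index.cuda())

    def _remove_tensor(self, attr: TensorAttr) -> bool:
        raise NotImplementedError

    def _get_tensor_size(self, attr: TensorAttr) -> Tuple[int, ...]:
        return (self.num_nodes, self.feat_dim)

    def get_all_tensor_attrs(self) -> List[TensorAttr]:
        # A single homogeneous node feature matrix registered as 'x'.
        return [TensorAttr(group_name=None, attr_name='x')]
```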
Compared to existing examples, there are three key differences:

1. The WholeGraph library does not provide a dataloader; instead, it hosts the underlying distributed graph and feature storage together with efficient primitive operations on it (e.g., GPU-accelerated embedding retrieval and graph sampling).
2. It is efficient, minimizing CPU interruptions, and can be plugged into PyG's feature store and graph store abstractions (compatible with existing PyG-native dataloaders). Please see the `feature_store.py` and `graph_store.py` implementations.
3. There is no distinction between single-GPU, multi-GPU, and multi-node multi-GPU training with this new feature store or graph store. Users do not need to partition the graph or hand-craft third-party launch scripts; everything falls under the traditional PyTorch DDP workflow (see the sketch below). The examples (`papers100m_dist_wholegraph_nc.py` and `benchmark_data.py`) show how to achieve this from any existing PyG DDP example.

By running the benchmark script (`benchmark_data.py`), we observed 2X, 5X, and 9X speedups on a single GPU with NVIDIA T4, A100, and H100 GPUs, respectively, compared to the native PyG NeighborLoader. Running with 4 GPUs, the speedups increase to 6.4X, 15X, and 35X, respectively (numbers may vary depending on the actual CPU used for the baseline run).
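To make the DDP point above concrete, here is a hedged sketch of the intended workflow under standard `torchrun`. The `feature_store`, `graph_store`, `train_idx`, and `model` objects are assumed to be constructed elsewhere (e.g., along the lines of this PR's `feature_store.py`/`graph_store.py`), and node features `'x'` and labels `'y'` are assumed to be registered in the feature store; the hyperparameters are placeholders.

```python
# Hedged sketch: a standard PyTorch DDP training loop driven by a plain PyG
# NeighborLoader over a (FeatureStore, GraphStore) pair.
import os

import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch_geometric.loader import NeighborLoader


def run(feature_store, graph_store, train_idx, model):
    # Launched via `torchrun --nproc_per_node=<num_gpus> train.py`; no graph
    # partitioning or hand-crafted launch scripts are required.
    dist.init_process_group('nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    model = DDP(model.to(local_rank), device_ids=[local_rank])

    # Each rank iterates over its own shard of the training seeds, while the
    # stores transparently serve (possibly multi-GPU/multi-node) data.
    loader = NeighborLoader(
        data=(feature_store, graph_store),
        num_neighbors=[15, 10],
        input_nodes=train_idx.tensor_split(dist.get_world_size())[dist.get_rank()],
        batch_size=1024,
        shuffle=True,
    )

    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    for batch in loader:
        batch = batch.to(local_rank)
        optimizer.zero_grad()
        out = model(batch.x, batch.edge_index)[:batch.batch_size]
        loss = F.cross_entropy(out, batch.y[:batch.batch_size])
        loss.backward()
        optimizer.step()
```

The point of the sketch is that nothing WholeGraph-specific appears in the training loop itself; only the construction of the stores changes relative to an existing PyG DDP example.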
Meanwhile, given the compatibility and performance benefits demonstrated in this PR, I'd like to propose integrating WholeGraph, as an option, to back `data.FeatureStore`/`HeteroData.FeatureStore` first, and to support the WholeMemory type as a new option in the `index_select` function (`torch_geometric/loader/utils.py`, line 57 at 7f844d7).
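To illustrate the `index_select` proposal, a rough sketch of the kind of dispatch that could be added is shown below. The existing CPU-tensor path is heavily simplified here, and the WholeMemory handle is duck-typed via a `gather()` method; the actual pylibwholegraph class and method names would need to be confirmed.

```python
# Hedged sketch of the proposed extension to index_select() in
# torch_geometric/loader/utils.py (existing behavior simplified).
import torch
from torch import Tensor


def index_select(value, index: Tensor, dim: int = 0) -> Tensor:
    # Proposed new branch: a WholeMemory-backed embedding performs the gather
    # itself, directly on the GPU and potentially across devices/nodes.
    if not isinstance(value, Tensor) and hasattr(value, 'gather'):
        return value.gather(index.cuda())

    # Existing behavior (simplified): a plain tensor gather on the host.
    return value.index_select(dim, index)
```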
cc. @puririshi98 @TristonC @alexbarghi-nv @linhu-nv @rusty1s