[DistDGL] GraphSAGE example crashes on ogbn-papers100M dataset #5528
Comments
Hi @daniil-sizov, is this a duplicate of #5529, which is able to run but with degraded performance? Is there any difference in the "To Reproduce" part between these two tickets? I'm a bit confused. |
@Rhett-Ying It might be a duplicate, but they seem like two different issues. The "To Reproduce" part is the same. Sometimes it doesn't crash and just shows performance degradation. |
How often does it crash? And as mentioned in #5529 (comment), could you try with |
Crashes reproduce with both fanout orders |
Does it crash if you use the default arguments? And could you share how you enable |
I just installed libgoogle-perftools4 on the client nodes and set LD_PRELOAD.
|
@daniil-sizov I tried on my side and it works well. I have re-run it 5 times. Instance type: 4 x |
I've just found crash happens when |
And I tried to increase |
this issue happens even with previous |
And I tried to install |
@daniil-sizov As I mentioned here, it crashed on my side even with tcmalloc loaded. Could you share how you load tcmalloc? |
This issue is reproduced with the command below, which runs
|
|
Yes. That's in my case too. |
Related issue: #5480 |
Potential fix from PyTorch: pytorch/pytorch#96664 |
🐛 Bug
The dgl/examples/pytorch/graphsage/dist example crashes after #4269.
To Reproduce
Steps to reproduce the behavior:
Prepare ogbn-papers100M dataset (8 part split):
python3 partition_graph.py --dataset ogb-paper100M --num_parts 8 --output parts_8 --undirected --balance_train --balance_edges
Run example
Error message:
Expected behavior
No crash
Environment
How you installed DGL (conda, pip, source): pip
8 x r5.16xlarge AWS instances
Additional context
The issue doesn't reproduce with tcmalloc
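Since the thread disagrees on whether tcmalloc actually prevents the crash, it may help to confirm that the library is really loaded in the training process. The following is a minimal sketch, not part of the DGL example: the helper names are hypothetical, the library path is an assumption based on where Ubuntu's libgoogle-perftools4 package usually installs it, and the check relies on Linux's /proc/self/maps.

```python
import os
from pathlib import Path

# Assumed location of tcmalloc from Ubuntu's libgoogle-perftools4 package;
# adjust this for other distributions or architectures.
TCMALLOC_PATH = "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4"


def env_with_preload(lib_path, base_env=None):
    """Return a copy of the environment with lib_path first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # LD_PRELOAD entries may be separated by colons or spaces.
    parts = [p for p in existing.replace(":", " ").split() if p]
    if lib_path not in parts:
        parts.insert(0, lib_path)
    env["LD_PRELOAD"] = ":".join(parts)
    return env


def loaded_libraries(maps_text):
    """Extract mapped file paths from the text of /proc/<pid>/maps."""
    libs = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, optional pathname.
        fields = line.split(None, 5)
        if len(fields) == 6 and fields[5].startswith("/"):
            libs.add(fields[5])
    return libs


def tcmalloc_is_loaded(maps_text):
    """Report whether any mapped file looks like a tcmalloc shared library."""
    return any(
        "tcmalloc" in os.path.basename(path)
        for path in loaded_libraries(maps_text)
    )


if __name__ == "__main__":
    maps_file = Path("/proc/self/maps")
    if maps_file.exists():  # Linux only
        print("tcmalloc loaded:", tcmalloc_is_loaded(maps_file.read_text()))
```

Calling tcmalloc_is_loaded from inside the trainer process (for example at the top of the training script) shows whether LD_PRELOAD took effect there, and env_with_preload(TCMALLOC_PATH) can be passed as the env argument when spawning a process with subprocess.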