
[DistDGL] GraphSAGE example crashes on ogbn-papers100M dataset #5528

Open
daniil-sizov opened this issue Apr 6, 2023 · 17 comments
Assignees
Labels
bug:confirmed Something isn't working

Comments

@daniil-sizov
Contributor

🐛 Bug

The dgl/examples/pytorch/graphsage/dist example crashes after #4269.

To Reproduce

Steps to reproduce the behavior:

  1. Prepare the ogbn-papers100M dataset (8-part split):
    python3 partition_graph.py --dataset ogb-paper100M --num_parts 8 --output parts_8 --undirected --balance_train --balance_edges

  2. Run example

python3 ~/workspace/dgl/tools/launch.py --workspace ~/workspace/dgl/examples/pytorch/graphsage/dist/ --num_omp_threads 20 --num_trainers 1 --num_samplers 2 --num_servers 1 --part_config data/ogb-paper100M.json --ip_config ~/workspace/config.txt "python3 train_dist.py --graph_name ogb-paper100M --ip_config ~/workspace/config.txt --num_epochs 10 --eval_every 20 --num_hidden 256 --num_layers 3 --fan_out 15,10,5 --lr 0.006"
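
For reference, the ip_config file passed via --ip_config (~/workspace/config.txt above) is the standard DistDGL machine list: one reachable IP address per line (optionally followed by a port), one line per machine, i.e. eight lines for the 8-partition setup here. The addresses below are placeholders, not the actual cluster:

172.31.0.1
172.31.0.2
172.31.0.3
172.31.0.4
172.31.0.5
172.31.0.6
172.31.0.7
172.31.0.8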

Error message:

part 5, train: 150897 (local: 127740), val: 15658 (local: 0), test: 26792 (local: 0)
part 6, train: 150897 (local: 144545), val: 15658 (local: 0), test: 26792 (local: 0)
#labels: 172
#labels: 172
#labels: 172
#labels: 172
#labels: 172
Traceback (most recent call last):
  File "train_dist.py", line 440, in <module>
    main(args)
  File "train_dist.py", line 379, in main
    run(args, device, data)
  File "train_dist.py", line 208, in run
    batch_inputs, batch_labels = load_subtensor(
  File "train_dist.py", line 20, in load_subtensor
    g.ndata["features"][input_nodes].to(device) if load_feat else None
  File "/usr/local/lib/python3.8/dist-packages/dgl-1.1-py3.8-linux-x86_64.egg/dgl/distributed/dist_tensor.py", line 205, in __getitem__
    return self.kvstore.pull(name=self._name, id_tensor=idx)
  File "/usr/local/lib/python3.8/dist-packages/dgl-1.1-py3.8-linux-x86_64.egg/dgl/distributed/kvstore.py", line 1463, in pull
    return rpc.fast_pull(
  File "/usr/local/lib/python3.8/dist-packages/dgl-1.1-py3.8-linux-x86_64.egg/dgl/distributed/rpc.py", line 1166, in fast_pull
    res_tensor = _CAPI_DGLRPCFastPull(
  File "/usr/local/lib/python3.8/dist-packages/dgl-1.1-py3.8-linux-x86_64.egg/dgl/_ffi/_ctypes/function.py", line 212, in __call__
    check_call(
  File "/usr/local/lib/python3.8/dist-packages/dgl-1.1-py3.8-linux-x86_64.egg/dgl/_ffi/base.py", line 70, in check_call
    raise DGLError(py_str(_LIB.DGLGetLastError()))
dgl._ffi.base.DGLError: [13:01:20] /home/ubuntu/dgl_old/dgl/src/rpc/rpc.cc:489: Check failed: p_id < machine_count (8 vs. 8) : Invalid partition ID.
Stack trace:
  [bt] (0) /usr/local/lib/python3.8/dist-packages/dgl-1.1-py3.8-linux-x86_64.egg/dgl/libdgl.so(+0x5ec8b8) [0x7f1c369d78b8]
  [bt] (1) /usr/local/lib/python3.8/dist-packages/dgl-1.1-py3.8-linux-x86_64.egg/dgl/libdgl.so(+0x5efe0b) [0x7f1c369dae0b]
  [bt] (2) /usr/local/lib/python3.8/dist-packages/dgl-1.1-py3.8-linux-x86_64.egg/dgl/libdgl.so(+0x5f1158) [0x7f1c369dc158]
  [bt] (3) /usr/local/lib/python3.8/dist-packages/dgl-1.1-py3.8-linux-x86_64.egg/dgl/libdgl.so(DGLFuncCall+0x60) [0x7f1c36832760]
  [bt] (4) /lib/x86_64-linux-gnu/libffi.so.7(+0x6ff5) [0x7f1c52c47ff5]
  [bt] (5) /lib/x86_64-linux-gnu/libffi.so.7(+0x640a) [0x7f1c52c4740a]
  [bt] (6) /usr/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x5b6) [0x7f1c52c60306]
  [bt] (7) /usr/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x139dc) [0x7f1c52c609dc]
  [bt] (8) /usr/bin/python3(_PyObject_MakeTpCall+0x296) [0x5f7056]
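
For context, the failure is raised from the feature pull inside the example's load_subtensor helper; below is a minimal sketch reconstructed from the frames above (the exact code in train_dist.py may differ slightly):

def load_subtensor(g, seeds, input_nodes, device, load_feat=True):
    # Pulls minibatch node features/labels from the distributed KVStore.
    # This is the DistTensor.__getitem__ -> kvstore.pull -> rpc.fast_pull path
    # that aborts with "Invalid partition ID" when a partition id of 8 reaches
    # the RPC layer on an 8-machine cluster (valid ids are 0..7).
    batch_inputs = (
        g.ndata["features"][input_nodes].to(device) if load_feat else None
    )
    batch_labels = g.ndata["labels"][seeds].to(device)
    return batch_inputs, batch_labels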

Expected behavior

No crash

Environment

  • DGL Version (e.g., 1.0): 1.1
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.13.1+cpu
  • OS (e.g., Linux): Ubuntu 20.04
  • How you installed DGL (conda, pip, source): pip
  • Build command you used (if compiling from source):
git clone https://github.com/dmlc/dgl.git
cd dgl
git submodule update --init --recursive
mkdir build
cd build
cmake ..
make -j32
  • Python version: Python 3.8.10
  • CUDA/cuDNN version (if applicable): n/a
  • GPU models and configuration (e.g. V100): CPU Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
  • Any other relevant information:
    8 x r5.16xlarge AWS instances

Additional context

The issue doesn't reproduce with tcmalloc

@Rhett-Ying
Collaborator

Hi @daniil-sizov, is this a duplicate of #5529, which is able to run although performance degrades? Is there any difference in the "To Reproduce" part between these two tickets? I'm a bit confused.

@Rhett-Ying added the bug:unconfirmed label (May be a bug. Need further investigation.) on Apr 13, 2023
@daniil-sizov
Contributor Author

@Rhett-Ying It might be a duplicate, but they seem like two different issues. The "To Reproduce" part is the same. Sometimes it doesn't crash, and then it just shows the performance degradation.

@Rhett-Ying
Collaborator

How often does it crash? And as mentioned in #5529 (comment), could you try with --fan_out 5,10,15 to see if it crashes?

@daniil-sizov
Contributor Author

The crash reproduces with both fanout orders.

@Rhett-Ying
Collaborator

Does it crash if you use the default arguments? And could you share how you enable tcmalloc? Do you re-link libdgl.so?

@Rhett-Ying
Collaborator

Just libgoogle-perftools4 installed on the client nodes, with LD_PRELOAD set:

python3 ~/workspace/dgl/tools/launch.py --workspace ~/workspace/dgl/examples/pytorch/graphsage/dist/ --num_omp_threads 16 --num_trainers 1 --num_samplers 2 --num_servers 1 --part_config /home/ubuntu/workspace/dgl/examples/pytorch/graphsage/dist/parts_8/ogb-paper100M.json --ip_config ~/workspace/config.txt "LD_PRELOAD=/lib/x86_64-linux-gnu/libtcmalloc.so.4 ~/workspace/venv/bin/python3 train_dist.py --graph_name ogb-paper100M --ip_config ~/workspace/config.txt --num_epochs 10 --eval_every 16 --num_hidden 256 --num_layers 3 --fan_out 15,10,5 --lr 0.006"
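
One way to double-check that the preload actually took effect in the trainer processes (an added sanity check, not something from the original run) is to inspect /proc/self/maps from inside train_dist.py, since launch.py re-invokes the command over ssh and environment variables can be dropped along the way:

# Assumes Linux; prints True if libtcmalloc is mapped into this process.
with open("/proc/self/maps") as maps:
    print("tcmalloc preloaded:", any("tcmalloc" in line for line in maps))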

@Rhett-Ying
Collaborator

@daniil-sizov I tried it on my side and it works well. I have re-run it 5 times.

instance type: 4 x r6i.metal (128 vCPUs, 1024GB RAM)
dgl version: 1.0.2
train_dist.py: latest clone from DGL master (2023.04.20)
pytorch version: 1.13.1
dataset: ogbn-papers100M
num_parts: 4
cmd: python3 ~/workspace/dgl/tools/launch.py --workspace ~/workspace/dgl/examples/pytorch/graphsage/dist/ --num_omp_threads 20 --num_trainers 1 --num_samplers 0 --num_servers 1 --part_config ~/workspace/ogbn-papers100M_metis_4parts/ogb-paper100M.json --ip_config ~/workspace/ip_config4.txt "python3 train_dist.py --graph_name ogb-paper100M --ip_config ~/workspace/ip_config4.txt --num_epochs 10 --eval_every 20 --num_hidden 256 --num_layers 3 --fan_out 15,10,5 --lr 0.006"

@Rhett-Ying
Collaborator

I've just found that the crash happens with --num_samplers 4. The main difference between --num_samplers > 0 and --num_samplers 0 is that dedicated sampler processes are forked when it is > 0, while sampling happens in the main process when it is 0.
@daniil-sizov could you check ulimit -n on your instances and try increasing it? File descriptors are used by default for sharing tensors between processes.
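
A quick way to inspect (and, within the hard limit, raise) the per-process open-file limit from inside a worker is the resource module; this is only a diagnostic sketch, not part of the original example:

import resource

# Report the current soft/hard open-file limits for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE soft={soft} hard={hard}")

# Raise the soft limit to the hard limit; going beyond the hard limit
# requires root or a change in /etc/security/limits.conf.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))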

@Rhett-Ying
Collaborator

And --num_samplers 1 works well, while --num_samplers > 1 crashes at the very beginning.

I tried increasing ulimit -n; it doesn't help.

@Rhett-Ying added the bug:confirmed label (Something isn't working) and removed the bug:unconfirmed label (May be a bug. Need further investigation.) on Apr 21, 2023
@Rhett-Ying
Collaborator

This issue happens even with the previous train_dist.py, regardless of the fanout order.

@Rhett-Ying
Collaborator

I also ran sudo apt-get install google-perftools and launched with LD_PRELOAD=/lib/x86_64-linux-gnu/libtcmalloc.so.4 python3 train_dist.py; it still crashed.

@Rhett-Ying
Collaborator

@daniil-sizov As I mentioned here, it crashed on my side even with tcmalloc loaded. Could you share how you load tcmalloc?

@Rhett-Ying
Collaborator

Rhett-Ying commented May 25, 2023

This issue is reproduced with the command below, which runs examples/pytorch/graphsage/dist/train_dist.py with ogbn-products on 4 x r6i.16xlarge. DGL: 1.1.0, PyTorch: 2.0.1.

python3 ~/workspace/dgl/tools/launch.py \
        --workspace ~/workspace/dgl/examples/pytorch/graphsage/dist/ \
        --num_trainers 2 \
        --num_samplers 2 \
        --num_servers 2 \
        --part_config data/ogb-product.json \
        --ip_config ip_config.txt \
        "python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 10 --batch_size 1000"

@Rhett-Ying
Collaborator

--num_samplers 0 works well even with --num_trainers 4 --num_servers 4.

@isratnisa
Collaborator

Yes, that's the case for me too.

@frozenbugs
Collaborator

Related issue: #5480

@frozenbugs
Collaborator

Potential fix from PyTorch: pytorch/pytorch#96664
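
A possible workaround to experiment with in the meantime (an untested suggestion, not confirmed in this thread) is to switch PyTorch's inter-process tensor sharing away from the default file_descriptor strategy, which keeps one open fd per shared tensor, early in train_dist.py before the sampler processes are forked:

import torch.multiprocessing as mp

# Use the file_system sharing strategy instead of the default
# file_descriptor strategy to avoid exhausting per-process fd limits.
mp.set_sharing_strategy("file_system")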
