
[DistDGL] GraphSAGE example crashes on ogbn-papers100M dataset #5528

Open
daniil-sizov opened this issue Apr 6, 2023 · 17 comments
Assignees
Labels
bug:confirmed Something isn't working

Comments

@daniil-sizov
Contributor

🐛 Bug

The dgl/examples/pytorch/graphsage/dist example crashes after #4269.

To Reproduce

Steps to reproduce the behavior:

  1. Prepare the ogbn-papers100M dataset (8-part split):
    python3 partition_graph.py --dataset ogb-paper100M --num_parts 8 --output parts_8 --undirected --balance_train --balance_edges

  2. Run example

python3 ~/workspace/dgl/tools/launch.py --workspace ~/workspace/dgl/examples/pytorch/graphsage/dist/ --num_omp_threads 20 --num_trainers 1 --num_samplers 2 --num_servers 1 --part_config data/ogb-paper100M.json --ip_config ~/workspace/config.txt "python3 train_dist.py --graph_name ogb-paper100M --ip_config ~/workspace/config.txt --num_epochs 10 --eval_every 20 --num_hidden 256 --num_layers 3 --fan_out 15,10,5 --lr 0.006"
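
For reference, the ip_config file passed via --ip_config (~/workspace/config.txt above) is the standard DistDGL machine list: one reachable IP address per line (optionally followed by a port), one line per machine, i.e. eight lines for the 8-partition setup here. The addresses below are placeholders, not the actual cluster:

172.31.0.1
172.31.0.2
172.31.0.3
172.31.0.4
172.31.0.5
172.31.0.6
172.31.0.7
172.31.0.8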

Error message:

part 5, train: 150897 (local: 127740), val: 15658 (local: 0), test: 26792 (local: 0)
part 6, train: 150897 (local: 144545), val: 15658 (local: 0), test: 26792 (local: 0)
#labels: 172
#labels: 172
#labels: 172
#labels: 172
#labels: 172
Traceback (most recent call last):
  File "train_dist.py", line 440, in <module>
    main(args)
  File "train_dist.py", line 379, in main
    run(args, device, data)
  File "train_dist.py", line 208, in run
    batch_inputs, batch_labels = load_subtensor(
  File "train_dist.py", line 20, in load_subtensor
    g.ndata["features"][input_nodes].to(device) if load_feat else None
  File "/usr/local/lib/python3.8/dist-packages/dgl-1.1-py3.8-linux-x86_64.egg/dgl/distributed/dist_tensor.py", line 205, in __getitem__
    return self.kvstore.pull(name=self._name, id_tensor=idx)
  File "/usr/local/lib/python3.8/dist-packages/dgl-1.1-py3.8-linux-x86_64.egg/dgl/distributed/kvstore.py", line 1463, in pull
    return rpc.fast_pull(
  File "/usr/local/lib/python3.8/dist-packages/dgl-1.1-py3.8-linux-x86_64.egg/dgl/distributed/rpc.py", line 1166, in fast_pull
    res_tensor = _CAPI_DGLRPCFastPull(
  File "/usr/local/lib/python3.8/dist-packages/dgl-1.1-py3.8-linux-x86_64.egg/dgl/_ffi/_ctypes/function.py", line 212, in __call__
    check_call(
  File "/usr/local/lib/python3.8/dist-packages/dgl-1.1-py3.8-linux-x86_64.egg/dgl/_ffi/base.py", line 70, in check_call
    raise DGLError(py_str(_LIB.DGLGetLastError()))
dgl._ffi.base.DGLError: [13:01:20] /home/ubuntu/dgl_old/dgl/src/rpc/rpc.cc:489: Check failed: p_id < machine_count (8 vs. 8) : Invalid partition ID.
Stack trace:
  [bt] (0) /usr/local/lib/python3.8/dist-packages/dgl-1.1-py3.8-linux-x86_64.egg/dgl/libdgl.so(+0x5ec8b8) [0x7f1c369d78b8]
  [bt] (1) /usr/local/lib/python3.8/dist-packages/dgl-1.1-py3.8-linux-x86_64.egg/dgl/libdgl.so(+0x5efe0b) [0x7f1c369dae0b]
  [bt] (2) /usr/local/lib/python3.8/dist-packages/dgl-1.1-py3.8-linux-x86_64.egg/dgl/libdgl.so(+0x5f1158) [0x7f1c369dc158]
  [bt] (3) /usr/local/lib/python3.8/dist-packages/dgl-1.1-py3.8-linux-x86_64.egg/dgl/libdgl.so(DGLFuncCall+0x60) [0x7f1c36832760]
  [bt] (4) /lib/x86_64-linux-gnu/libffi.so.7(+0x6ff5) [0x7f1c52c47ff5]
  [bt] (5) /lib/x86_64-linux-gnu/libffi.so.7(+0x640a) [0x7f1c52c4740a]
  [bt] (6) /usr/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x5b6) [0x7f1c52c60306]
  [bt] (7) /usr/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x139dc) [0x7f1c52c609dc]
  [bt] (8) /usr/bin/python3(_PyObject_MakeTpCall+0x296) [0x5f7056]
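
For context, the failure is raised from the feature pull inside the example's load_subtensor helper; below is a minimal sketch reconstructed from the frames above (the exact code in train_dist.py may differ slightly):

def load_subtensor(g, seeds, input_nodes, device, load_feat=True):
    # Pulls minibatch node features/labels from the distributed KVStore.
    # This is the DistTensor.__getitem__ -> kvstore.pull -> rpc.fast_pull path
    # that aborts with "Invalid partition ID" when a partition id of 8 reaches
    # the RPC layer on an 8-machine cluster (valid ids are 0..7).
    batch_inputs = (
        g.ndata["features"][input_nodes].to(device) if load_feat else None
    )
    batch_labels = g.ndata["labels"][seeds].to(device)
    return batch_inputs, batch_labels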

Expected behavior

No crash

Environment

  • DGL Version (e.g., 1.0): 1.1
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.13.1+cpu
  • OS (e.g., Linux): Ubuntu 20.04
  • How you installed DGL (conda, pip, source): pip
  • Build command you used (if compiling from source):
git clone https://github.com/dmlc/dgl.git
cd dgl
git submodule update --init --recursive
mkdir build
cd build
cmake ..
make -j32
  • Python version: Python 3.8.10
  • CUDA/cuDNN version (if applicable): n/a
  • GPU models and configuration (e.g. V100): CPU Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
  • Any other relevant information:
    8 x r5.16xlarge AWS instances

Additional context

The issue doesn't reproduce with tcmalloc

@Rhett-Ying
Collaborator

Hi @daniil-sizov, is this a duplicate of #5529, which is able to run although performance degrades? Is there any difference in the "To Reproduce" part between these two tickets? I'm a bit confused.

@Rhett-Ying added the bug:unconfirmed label (May be a bug. Need further investigation.) on Apr 13, 2023
@daniil-sizov
Contributor Author

@Rhett-Ying It might be a duplicate, but they seem like two different issues. The "To Reproduce" part is the same. Sometimes it doesn't crash, and then it just shows the performance degradation.

@Rhett-Ying
Collaborator

How often does it crash? And as mentioned in #5529 (comment), could you try with --fan_out 5,10,15 to see if it crashes?

@daniil-sizov
Contributor Author

The crash reproduces with both fanout orders.

@Rhett-Ying
Collaborator

Does it crash if you use the default arguments? And could you share how you enable tcmalloc? Do you re-link libdgl.so?

@Rhett-Ying
Collaborator

Just libgoogle-perftools4 installed on the client nodes, with LD_PRELOAD set:

python3 ~/workspace/dgl/tools/launch.py --workspace ~/workspace/dgl/examples/pytorch/graphsage/dist/ --num_omp_threads 16 --num_trainers 1 --num_samplers 2 --num_servers 1 --part_config /home/ubuntu/workspace/dgl/examples/pytorch/graphsage/dist/parts_8/ogb-paper100M.json --ip_config ~/workspace/config.txt "LD_PRELOAD=/lib/x86_64-linux-gnu/libtcmalloc.so.4 ~/workspace/venv/bin/python3 train_dist.py --graph_name ogb-paper100M --ip_config ~/workspace/config.txt --num_epochs 10 --eval_every 16 --num_hidden 256 --num_layers 3 --fan_out 15,10,5 --lr 0.006"
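
One way to double-check that the preload actually took effect in the trainer processes (an added sanity check, not something from the original run) is to inspect /proc/self/maps from inside train_dist.py, since launch.py re-invokes the command over ssh and environment variables can be dropped along the way:

# Assumes Linux; prints True if libtcmalloc is mapped into this process.
with open("/proc/self/maps") as maps:
    print("tcmalloc preloaded:", any("tcmalloc" in line for line in maps))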

@Rhett-Ying
Collaborator

@daniil-sizov I tried it on my side and it works well. I have re-run it 5 times.

instance type: 4 x r6i.metal (128 vCPUs, 1024GB RAM)
dgl version: 1.0.2
train_dist.py: latest clone from DGL master (2023.04.20)
pytorch version: 1.13.1
dataset: ogbn-papers100M
num_parts: 4
cmd: python3 ~/workspace/dgl/tools/launch.py --workspace ~/workspace/dgl/examples/pytorch/graphsage/dist/ --num_omp_threads 20 --num_trainers 1 --num_samplers 0 --num_servers 1 --part_config ~/workspace/ogbn-papers100M_metis_4parts/ogb-paper100M.json --ip_config ~/workspace/ip_config4.txt "python3 train_dist.py --graph_name ogb-paper100M --ip_config ~/workspace/ip_config4.txt --num_epochs 10 --eval_every 20 --num_hidden 256 --num_layers 3 --fan_out 15,10,5 --lr 0.006"

@Rhett-Ying
Collaborator

I've just found that the crash happens with --num_samplers 4. The main difference between --num_samplers > 0 and --num_samplers 0 is that dedicated sampler processes are forked when it is > 0, while sampling happens in the main process when it is 0.
@daniil-sizov could you check ulimit -n on your instances and try increasing it? File descriptors are used by default for sharing tensors between processes.
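
A quick way to inspect (and, within the hard limit, raise) the per-process open-file limit from inside a worker is the resource module; this is only a diagnostic sketch, not part of the original example:

import resource

# Report the current soft/hard open-file limits for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE soft={soft} hard={hard}")

# Raise the soft limit to the hard limit; going beyond the hard limit
# requires root or a change in /etc/security/limits.conf.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))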

@Rhett-Ying
Collaborator

And --num_samplers 1 works well, while --num_samplers > 1 crashes at the very beginning.

I tried increasing ulimit -n; it doesn't help.

@Rhett-Ying added the bug:confirmed label (Something isn't working) and removed the bug:unconfirmed label (May be a bug. Need further investigation.) on Apr 21, 2023
@Rhett-Ying
Collaborator

This issue happens even with the previous train_dist.py, regardless of the fanout order.

@Rhett-Ying
Collaborator

I also ran sudo apt-get install google-perftools and launched with LD_PRELOAD=/lib/x86_64-linux-gnu/libtcmalloc.so.4 python3 train_dist.py; it still crashed.

@Rhett-Ying
Collaborator

@daniil-sizov As I mentioned here, it crashed on my side even with tcmalloc loaded. Could you share how you load tcmalloc?

@Rhett-Ying
Collaborator

Rhett-Ying commented May 25, 2023

This issue is reproduced with the command below, which runs examples/pytorch/graphsage/dist/train_dist.py with ogbn-products on 4 x r6i.16xlarge. DGL: 1.1.0, PyTorch: 2.0.1.

python3 ~/workspace/dgl/tools/launch.py \
        --workspace ~/workspace/dgl/examples/pytorch/graphsage/dist/ \
        --num_trainers 2 \
        --num_samplers 2 \
        --num_servers 2 \
        --part_config data/ogb-product.json \
        --ip_config ip_config.txt \
        "python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 10 --batch_size 1000"

@Rhett-Ying
Collaborator

--num_samplers 0 works well even with --num_trainers 4 --num_servers 4.

@isratnisa
Collaborator

Yes, that's the case for me too.

@frozenbugs
Collaborator

Related issue: #5480

@frozenbugs
Collaborator

Potential fix from PyTorch: pytorch/pytorch#96664
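
A possible workaround to experiment with in the meantime (an untested suggestion, not confirmed in this thread) is to switch PyTorch's inter-process tensor sharing away from the default file_descriptor strategy, which keeps one open fd per shared tensor, early in train_dist.py before the sampler processes are forked:

import torch.multiprocessing as mp

# Use the file_system sharing strategy instead of the default
# file_descriptor strategy to avoid exhausting per-process fd limits.
mp.set_sharing_strategy("file_system")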
