segfault or free(): invalid pointer when importing dgl with other libraries due to RTLD_GLOBAL #2255

Closed
skrsna opened this issue Oct 1, 2020 · 9 comments
@skrsna

skrsna commented Oct 1, 2020

🐛 Bug

Importing dgl after importing a C++-based library with a pybind interface leads to a segfault or free(): invalid pointer. The C++ library in question is internal and not publicly available. I found some relevant issues in other repositories: pytorch/pytorch#3059 and RobotLocomotion/drake#12073. I was able to work around the problem by deleting ctypes.RTLD_GLOBAL here in the dgl source code. PyTorch and TensorFlow seem to have moved away from RTLD_GLOBAL (ref: pytorch/pytorch#28536). I'm wondering whether something similar can be done in dgl.
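
For readers unfamiliar with the flag, here is a minimal sketch of what the load mode changes. It is not DGL's actual loading code: the helper name `load_shared_library` is hypothetical, and the example assumes a POSIX system, where passing `None` as the path opens a handle to the running process so that no particular library needs to be installed.

```python
import ctypes


def load_shared_library(path, export_symbols=False):
    """Load a shared library with ctypes (hypothetical helper, not DGL code).

    export_symbols=True  -> RTLD_GLOBAL: the library's symbols become
                            available for resolving symbols in libraries
                            that are loaded afterwards.
    export_symbols=False -> RTLD_LOCAL: the symbols stay private to the
                            returned handle.
    """
    mode = ctypes.RTLD_GLOBAL if export_symbols else ctypes.RTLD_LOCAL
    return ctypes.CDLL(path, mode=mode)


# Assumption: POSIX. A path of None yields a handle to the current process,
# which is enough to show that lookups through a local handle still work.
handle = load_shared_library(None, export_symbols=False)
handle.abs.argtypes = [ctypes.c_int]
handle.abs.restype = ctypes.c_int
print(handle.abs(-3))
```

With RTLD_LOCAL the caller can still reach every function through the returned handle; what changes is whether other shared libraries loaded later can bind to those symbols.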

To Reproduce

Sorry, the library that triggers this error is not publicly available. It uses the TBB allocator.
Steps to reproduce the behavior:

Expected behavior

Importing dgl completes without a segfault or free(): invalid pointer (Aborted).

Environment

  • DGL Version (e.g., 1.0): 0.5.2 cpu
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): pytorch 1.7 nightly
  • OS (e.g., Linux): linux
  • How you installed DGL (conda, pip, source): conda
  • Build command you used (if compiling from source):
  • Python version: 3.8
  • CUDA/cuDNN version (if applicable): 10.2, 7.6
  • GPU models and configuration (e.g. V100):
  • Any other relevant information:

Additional context

@VoVAllen
Collaborator

bump this

@BarclayII
Collaborator

BarclayII commented Oct 30, 2020

Does it immediately crash after importing DGL after importing the said library?

@skrsna
Author

skrsna commented Oct 30, 2020

Hi @BarclayII,

If I import dgl and then import the private library, it doesn't crash right away, but it crashes as soon as any dgl function or any of the library's functions is called. On the other hand, if I import the library first and then dgl, it crashes right away.
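
For anyone trying to narrow down a similar order-dependent crash, one way to compare the two import orders without killing the main interpreter is to run each order in a child process. This is only a sketch: `run_import_order` is a hypothetical helper, and it covers crashes at import time only, not crashes triggered by a later function call.

```python
import subprocess
import sys


def run_import_order(modules):
    """Import the given modules, in order, in a fresh interpreter.

    Returns the child's exit code. On POSIX a negative value means the
    child was killed by a signal (for example -11 for SIGSEGV or -6 for
    SIGABRT, which is what a glibc "free(): invalid pointer" abort raises).
    """
    code = "\n".join(f"import {name}" for name in modules)
    result = subprocess.run([sys.executable, "-c", code], capture_output=True)
    return result.returncode


# Standard-library modules are used here so the example runs anywhere.
print(run_import_order(["json", "os"]))
```

Calling it once with the two module names in one order and once in the reverse order shows whether only one order fails at import time.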

@dgasmith

This issue was referenced in #2328, but then the line was crossed out. I didn't see an immediate reason why in that issue.

If this is a longer-term item, could we introduce an environment variable to change the CDLL load mode in specific circumstances, perhaps DGL_RTLD_SETTING?
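
A rough sketch of how such a switch might look is below. Only the variable name DGL_RTLD_SETTING comes from this thread; the function name `rtld_mode_from_env`, the accepted values "global" and "local", and the default of "global" (chosen to keep the existing behavior) are all assumptions for illustration.

```python
import ctypes
import os


def rtld_mode_from_env(environ=os.environ):
    """Map DGL_RTLD_SETTING to a ctypes load mode (hypothetical scheme).

    The values "global" and "local" and the "global" default are
    assumptions made for this sketch; the thread proposes only the
    variable name.
    """
    setting = environ.get("DGL_RTLD_SETTING", "global").strip().lower()
    modes = {"global": ctypes.RTLD_GLOBAL, "local": ctypes.RTLD_LOCAL}
    if setting not in modes:
        raise ValueError(
            f"DGL_RTLD_SETTING must be one of {sorted(modes)}, got {setting!r}"
        )
    return modes[setting]


# The result would be passed as the mode argument of ctypes.CDLL.
print(rtld_mode_from_env({"DGL_RTLD_SETTING": "local"}) == ctypes.RTLD_LOCAL)
```

Rejecting unknown values outright, rather than silently falling back, makes a misspelled setting visible at import time.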

@BarclayII
Collaborator

BarclayII commented Dec 24, 2020

As I mentioned in the crossed-out text, directly changing RTLD_GLOBAL will make some examples (namely examples/pytorch/graphsage/train_sampling.py with num_workers=0) freeze.

I haven't been able to figure out the reason yet, so I had to work around it by ensuring that the PyTorch/MXNet/TensorFlow C library is loaded before libdgl.so. Obviously, that is not a fix for this issue per se.

@jermainewang jermainewang added the help wanted Need helps from the community label Dec 28, 2020
@dgasmith

Ah, got it. In the meantime, would you take a PR that allows us to alter this setting via an environment variable?

@BarclayII
Collaborator

@dgasmith @skrsna I removed the flag in a recent PR. Please give the nightly builds a try. I tested on the GraphSAGE examples, and they currently run without any problems.

@dgasmith

dgasmith commented Jan 6, 2021

@BarclayII Thanks! I really appreciate that, we will evaluate the nightly builds ASAP.

@BarclayII
Collaborator

So far no issues as per our experience. Please reopen the issue if the problem still exists in your case.
