custom cuda extensions make installing ao hard #288

Closed
msaroufim opened this issue May 29, 2024 · 5 comments

@msaroufim
Member

msaroufim commented May 29, 2024

I'm collecting a few issues I've seen. I have no clear picture of how to solve them at the moment, but I'm aggregating them in the hope that inspiration will strike.

Problems

Problem 1

The issue below, from a repro shared by @jerryzh168, is solved by installing ao and then changing out of the ao directory. IIRC PyTorch has a similar problem.

Traceback (most recent call last):
  File "/home/jerryzh/ao/example.py", line 2, in <module>
    from torchao.quantization.quant_primitives import MappingType, ZeroPointDomain
  File "/home/jerryzh/ao/torchao/__init__.py", line 8, in <module>
    from . import _C
ImportError: cannot import name '_C' from partially initialized module 'torchao' (most likely due to a circular import) (/home/jerryzh/ao/torchao/__init__.py)
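
One way to check whether the local checkout is shadowing the installed package is to ask Python where it would load a package from, without importing it. This is a minimal sketch using only the standard library; `package_origin` is a hypothetical helper name, not part of torchao.

```python
import importlib.util
from typing import Optional


def package_origin(name: str) -> Optional[str]:
    """Return the file Python would load for `name`, without importing it."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        # The package is not importable from the current directory/sys.path.
        return None
    return spec.origin


if __name__ == "__main__":
    # Run once inside the ao checkout and once outside it, then compare paths.
    print(package_origin("torchao"))
```

If the printed path points into the source checkout rather than site-packages, the import is picking up the unbuilt source tree, which would be consistent with the fix of leaving the ao directory.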

Problem 2

Building the fp6 kernels fails (https://hastebin.com/share/riridivafa.rust) even though the nvcc and gcc versions seem fine, in a repro shared by @CoffeeVampir3

Problem 3

This error shows up when you either pip install ao or build it with a mismatch in CUDA versions, in a repro shared by @vayuda

python test/quantization/test_quant_api.py
Traceback (most recent call last):
  File "/u/pj8wfq/ao/test/quantization/test_quant_api.py", line 21, in <module>
    from torchao.dtypes import (
  File "/u/pj8wfq/ao/torchao/__init__.py", line 8, in <module>
    from . import _C
ImportError: /u/pj8wfq/ao/torchao/_C.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv
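
A rough way to spot the CUDA mismatch mentioned above is to compare the toolkit release reported by `nvcc --version` with the CUDA version PyTorch was built against (`torch.version.cuda`). The sketch below only does the parsing and comparison; the helper names are hypothetical, and treating any major.minor difference as a mismatch is a conservative assumption rather than a rule stated in this thread.

```python
import re
from typing import Optional, Tuple


def parse_nvcc_release(nvcc_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def parse_cuda_version(version: Optional[str]) -> Optional[Tuple[int, int]]:
    """Parse a string such as "12.1"; returns None for CPU-only builds."""
    if not version:
        return None
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def versions_match(nvcc_output: str, torch_cuda: Optional[str]) -> bool:
    """True only when both versions are known and agree on major.minor."""
    toolkit = parse_nvcc_release(nvcc_output)
    runtime = parse_cuda_version(torch_cuda)
    return toolkit is not None and toolkit == runtime
```

In a real environment, `nvcc_output` would come from something like `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout` and `torch_cuda` from `torch.version.cuda`.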

Problem 4

PyPI binaries crash on non-CUDA devices

  File "/opt/hostedtoolcache/Python/3.10.11/x64/lib/python3.10/site-packages/torchao/__init__.py", line 14, in <module>
    from . import _C
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory

Solutions

We need graceful solutions, but in the meantime I'm embarrassed to say I've been recommending a nuclear option, which is to disable the C extensions

Specifically in torchao/__init__.py delete

if not _IS_FBCODE:
    from . import _C
    from . import ops

And in setup.py delete

    ext_modules=get_extensions(),
@gau-nernst
Collaborator

Maybe there should be a flag to skip compiling extension modules, and we should make sure the package can still run without the extension modules built. (Still, it's a stopgap measure; it doesn't tackle the root problem.)

@msaroufim
Member Author

msaroufim commented May 29, 2024

Indeed, an env variable that applies the nuclear option seems practical, although it's going to be clunky to have to tell people to install us with NO_CPP=True pip install torchao
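
A minimal sketch of how such a switch could be read in setup.py. `NO_CPP` is the variable name floated above, not an existing torchao option, and `should_build_extensions` is a hypothetical helper.

```python
import os
from typing import Mapping

# Hypothetical variable name taken from the comment above.
SKIP_FLAG = "NO_CPP"
_TRUTHY = {"1", "true", "yes", "on"}


def should_build_extensions(environ: Mapping[str, str] = os.environ) -> bool:
    """Build C++/CUDA extensions unless the skip flag is set to a truthy value."""
    return environ.get(SKIP_FLAG, "").strip().lower() not in _TRUTHY


# In setup.py this could gate the existing call, for example:
#   ext_modules=get_extensions() if should_build_extensions() else [],
```

Accepting several truthy spellings means `NO_CPP=True`, `NO_CPP=1`, and `NO_CPP=yes` all behave the same. The package would still need to import cleanly when `_C` is absent.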

@malfet

malfet commented May 29, 2024

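  • The Pythonic solution for such problems (which torchvision uses, for example) is good old try/except:
try:
    # Relative import, matching the existing import in torchao/__init__.py
    from . import _C
except ImportError:
    # Extension missing or failed to load; continue without it
    _C = None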
  • For CUDA extensions, unless one is using Driver API/CUBLAS, adding -lcudart_static should fix Problem 4 (and to be frank, I'm surprised you are running into it, as this has been the default since CUDA-5)

  • Compilation problems are always fun to debug. There are two potential culprits: one is solved by the include-what-you-use concept (i.e. even if something compiles on your system, if you are using, say, uint16_t, make sure to #include <cstdint> in the file that relies on it), and the other by appropriate guarding (GPUs before Maxwell do not support half-precision types, though it's very unlikely you'll run into that)
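
The try/except idea from the first bullet can be sketched as a small reusable helper that also exposes a flag other code can check before calling into the extension. `optional_import` and `HAS_CPP_EXTENSIONS` are hypothetical names, and the absolute module path is used only to keep the example self-contained; inside torchao/__init__.py a relative import would be the natural form.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Import `name`, returning None if it is missing or fails to load."""
    try:
        return importlib.import_module(name)
    except ImportError:
        # Covers a missing module as well as a compiled extension whose
        # shared-library dependencies or symbols cannot be resolved, as in
        # the ImportError tracebacks shown in this issue.
        return None


# "torchao._C" is the extension module named in the tracebacks above.
_C = optional_import("torchao._C")
HAS_CPP_EXTENSIONS = _C is not None
```

Code paths that need the custom kernels can then check `HAS_CPP_EXTENSIONS` and raise a clear error or fall back to a pure-PyTorch path, instead of failing at import time.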

@jerryzh168
Contributor

Another issue similar to Problem 3:

Traceback (most recent call last):
  File "/home/jerryzh/ao/test/quantization/test_quant_api.py", line 21, in <module>
    from torchao.dtypes import (
  File "/home/jerryzh/anaconda3/envs/ao_new/lib/python3.9/site-packages/torchao-0.2.0-py3.9-linux-x86_64.egg/torchao/__init__.py", line 8, in <module>
    from . import _C
ImportError: /home/jerryzh/anaconda3/envs/ao_new/lib/python3.9/site-packages/torchao-0.2.0-py3.9-linux-x86_64.egg/torchao/_C.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN5torch3jit11parseSchemaERKSs

@msaroufim
Member Author

These issues have mostly been fixed so far; we can reopen if more stuff comes up
