Google Colab setup env hit Cuda/extension version mismatch issue #53

Closed
marvin-0042 opened this issue Feb 27, 2024 · 2 comments
Labels
question Further information is requested

Comments

@marvin-0042

Thank you so much for the great work!!!

I'm trying to set up the environment in Google Colab for training, but I hit a CUDA extension version mismatch while installing apex. My Python/PyTorch/CUDA versions match the requirements. Does anyone happen to know why? Much appreciated!

I ran:

pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ --global-option="--cuda_ext" --global-option="--cpp_ext"

but hit the error below:

RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. Pytorch binaries were compiled with Cuda 12.1.

my env:
python: 3.10
pytorch: 2.1.0
cuda: 12.1
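
One way to double-check these versions is to compare the CUDA version PyTorch was built with against the toolkit that `nvcc` reports. The following is a minimal sketch, not part of the project: the helper names `parse_nvcc_release` and `report_versions` are made up for illustration, and it assumes `nvcc` is on the `PATH`.

```python
import re
import subprocess


def parse_nvcc_release(nvcc_output):
    """Return (major, minor) from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def report_versions():
    """Print the CUDA version PyTorch was built with and the one nvcc reports."""
    try:
        import torch  # imported lazily so the parser works without PyTorch
        print("PyTorch built with CUDA:", torch.version.cuda)
    except ImportError:
        print("PyTorch is not installed")
    try:
        result = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True, check=True
        )
        print("nvcc release:", parse_nvcc_release(result.stdout))
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvcc not found on PATH")
```

Calling `report_versions()` in a Colab cell prints both values side by side. If they differ, the extension build is using a different toolkit than the one PyTorch was compiled against.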

full log:

pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ --global-option="--cuda_ext" --global-option="--cpp_ext"
Using pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
DEPRECATION: --build-option and --global-option are deprecated. pip 23.3 will enforce this behaviour change. A possible replacement is to use --config-settings. Discussion can be found at pypa/pip#11859
WARNING: Implying --no-binary=:all: due to the presence of --build-option / --global-option.
Processing /content/ColossalAI/OpenDiT/apex
Running command Preparing metadata (pyproject.toml)

torch.__version__ = 2.1.0+cu121

running dist_info
creating /tmp/pip-modern-metadata-6nsd1o2v/apex.egg-info
writing /tmp/pip-modern-metadata-6nsd1o2v/apex.egg-info/PKG-INFO
writing dependency_links to /tmp/pip-modern-metadata-6nsd1o2v/apex.egg-info/dependency_links.txt
writing requirements to /tmp/pip-modern-metadata-6nsd1o2v/apex.egg-info/requires.txt
writing top-level names to /tmp/pip-modern-metadata-6nsd1o2v/apex.egg-info/top_level.txt
writing manifest file '/tmp/pip-modern-metadata-6nsd1o2v/apex.egg-info/SOURCES.txt'
reading manifest file '/tmp/pip-modern-metadata-6nsd1o2v/apex.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file '/tmp/pip-modern-metadata-6nsd1o2v/apex.egg-info/SOURCES.txt'
creating '/tmp/pip-modern-metadata-6nsd1o2v/apex-0.1.dist-info'
Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: packaging>20.6 in /usr/local/lib/python3.10/dist-packages (from apex==0.1) (23.2)
Building wheels for collected packages: apex
WARNING: Ignoring --global-option when building apex using PEP 517
Running command Building wheel for apex (pyproject.toml)

torch.__version__ = 2.1.0+cu121

Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
from /usr/local/cuda/bin

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
    json_out['return_val'] = hook(**hook_input['kwargs'])
  File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 251, in build_wheel
    return _build_backend().build_wheel(wheel_directory, config_settings,
  File "/usr/local/lib/python3.10/dist-packages/setuptools/build_meta.py", line 416, in build_wheel
    return self._build_with_temp_dir(['bdist_wheel'], '.whl',
  File "/usr/local/lib/python3.10/dist-packages/setuptools/build_meta.py", line 401, in _build_with_temp_dir
    self.run_setup()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/build_meta.py", line 338, in run_setup
    exec(code, locals())
  File "<string>", line 178, in <module>
  File "<string>", line 40, in check_cuda_torch_binary_vs_bare_metal
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. Pytorch binaries were compiled with Cuda 12.1.
In some cases, a minor-version mismatch will not cause later errors: NVIDIA/apex#323 (comment). You can try commenting out this check (at your own risk).
error: subprocess-exited-with-error

× Building wheel for apex (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
full command: /usr/bin/python3 /usr/local/lib/python3.10/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py build_wheel /tmp/tmp6d28a180
cwd: /content/ColossalAI/OpenDiT/apex
Building wheel for apex (pyproject.toml) ... error
ERROR: Failed building wheel for apex
Failed to build apex
ERROR: Could not build wheels for apex, which is required to install pyproject.toml-based projects
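
The error message in the log distinguishes a full mismatch from a minor-version one. As a rough illustration (this is not apex's actual check, and `classify_cuda_mismatch` is a hypothetical helper), the two version strings from the log can be compared like this:

```python
def classify_cuda_mismatch(torch_cuda, nvcc_cuda):
    """Compare two 'major.minor' CUDA version strings.

    Returns 'match', 'minor-mismatch', or 'major-mismatch'.
    """
    torch_major, torch_minor = (int(part) for part in torch_cuda.split(".")[:2])
    nvcc_major, nvcc_minor = (int(part) for part in nvcc_cuda.split(".")[:2])
    if torch_major != nvcc_major:
        return "major-mismatch"
    if torch_minor != nvcc_minor:
        return "minor-mismatch"
    return "match"


# Values taken from the log: PyTorch built with 12.1, nvcc release 12.2.
print(classify_cuda_mismatch("12.1", "12.2"))  # minor-mismatch
```

In this log the difference is only in the minor version (12.1 vs. 12.2), which is the case the error message says may be tolerable at your own risk.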

@KKZ20
Collaborator

KKZ20 commented Feb 28, 2024

Hi, thanks for supporting our work!

It seems that your CUDA version does not match what apex expects. Are you using a virtual Python environment? If not, check the system CUDA version to see whether it meets apex's requirements. You could also try installing apex without checking out commit 741bdf50825a97664db08574981962d66436d16a, by running the following directly:

pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

The apex repo also has more installation instructions.

Feel free to ask if you have further questions!

@oahzxl
Collaborator

oahzxl commented Feb 28, 2024

It seems that the CUDA version PyTorch was built with (12.1) does not match your system CUDA version (nvcc in your log reports release 12.2). The easy fix is to install a PyTorch build that matches your system CUDA version.
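
To see which CUDA version an installed PyTorch build targets, the local version suffix (the `+cu121` part of `2.1.0+cu121` in the log) can be decoded. This is a minimal sketch that assumes the `+cuXYZ` suffix convention seen in the log; `cuda_from_torch_version` is a hypothetical helper, not a PyTorch API.

```python
import re


def cuda_from_torch_version(version):
    """Decode '2.1.0+cu121' into (12, 1); return None when there is no +cu tag."""
    match = re.search(r"\+cu(\d{2,})$", version)
    if match is None:
        return None
    digits = match.group(1)
    # Assumption: the last digit is the minor version, the rest is the major.
    return int(digits[:-1]), int(digits[-1])


# Version string taken from the build log above.
print(cuda_from_torch_version("2.1.0+cu121"))  # (12, 1)
```

With PyTorch installed, `cuda_from_torch_version(torch.__version__)` gives the pair to compare against the release that `nvcc --version` reports.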

@KKZ20 KKZ20 added the question Further information is requested label Feb 28, 2024
@oahzxl oahzxl closed this as completed Mar 21, 2024