
CMake: default to -arch=native for CUDA build #10320

Merged
merged 1 commit into ggerganov:master on Nov 17, 2024

Conversation

JohannesGaessler
Collaborator

This PR extends the CUDA build documentation by explaining how to speed up local builds.

Also, I changed "documentations" to the singular in the README, since I think it sounds more natural.

@slaren
Collaborator

slaren commented Nov 15, 2024

It might be good to make CMAKE_CUDA_ARCHITECTURES default to native when GGML_NATIVE is enabled, since that already makes a build that is only compatible with the current CPU. native is not supported on older CUDA toolkit versions, however.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 15, 2024
@JohannesGaessler
Collaborator Author

I personally think either way is fine. The target group for these changes/options is, I would argue, developers who frequently recompile the code. Currently the logic is that CUDA architectures are only set automatically if the user does not set CMAKE_CUDA_ARCHITECTURES. So I think we should just change the logic for the automatic CUDA architectures; if someone wants to compile with both GGML_NATIVE and an old CUDA version, they can still do so by manually setting CMAKE_CUDA_ARCHITECTURES.

@JohannesGaessler
Collaborator Author

Actually no: if CMAKE_CROSSCOMPILING=OFF, then the default is GGML_NATIVE=ON. So we should not implicitly also make CMAKE_CUDA_ARCHITECTURES=native the default, since that is going to trip up a lot of users with old CUDA versions. The only other option would be to condition the setting of CMAKE_CUDA_ARCHITECTURES on the CUDA version, but since for me the whole point is to increase developer productivity, I think it's preferable to just have a comment in the documentation instead of architecture selection logic that needs to be maintained.
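For context, a minimal sketch of the default being described here (illustrative only, not the exact upstream CMake code):

# a sketch: GGML_NATIVE defaults to ON unless cross-compiling
if (CMAKE_CROSSCOMPILING)
    set(GGML_NATIVE_DEFAULT OFF)
else()
    set(GGML_NATIVE_DEFAULT ON)
endif()
option(GGML_NATIVE "optimize the build for the host machine" ${GGML_NATIVE_DEFAULT})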

@slaren
Collaborator

slaren commented Nov 16, 2024

It should be possible to check the CUDA toolkit version in the CMakeLists.txt and only use native if it is supported, so I am not sure that's really a problem. native is also the default in the Makefile and it doesn't seem to cause much confusion.
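A minimal sketch of such a check, assuming native needs a sufficiently new toolkit and CMake; the version floors below are assumptions for illustration, not taken from this PR:

# only pick a default when the user did not pass -DCMAKE_CUDA_ARCHITECTURES=... themselves
if (GGML_NATIVE AND NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
    find_package(CUDAToolkit REQUIRED)
    # assumed floors: nvcc's -arch=native and CMake's understanding of the "native" value
    if (CUDAToolkit_VERSION VERSION_GREATER_EQUAL "11.6" AND CMAKE_VERSION VERSION_GREATER_EQUAL "3.24")
        set(CMAKE_CUDA_ARCHITECTURES "native")
    endif()
endif()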

@github-actions github-actions bot added the Nvidia GPU Issues specific to Nvidia GPUs label Nov 16, 2024
@JohannesGaessler
Collaborator Author

I misremembered both the CUDA version with which -arch=native was added and the complexity of checking the CUDA version from within CMake so the whole thing ended up being much less problematic than I thought.

@JohannesGaessler JohannesGaessler changed the title docs: explain faster CUDA CMake compile [no ci] CMake: default to -arch=native for CUDA build Nov 17, 2024
@JohannesGaessler JohannesGaessler merged commit 467576b into ggerganov:master Nov 17, 2024
54 checks passed
# 60 == P100, FP16 CUDA intrinsics
# 61 == Pascal, __dp4a instruction (per-byte integer dot product)
# 70 == V100, FP16 tensor cores
# 75 == Turing, int6 tensor cores
@ggerganov
Owner

I think int6 -> int8?

@ggerganov
Owner

Quick question: on my RTX2060 which has compute capability 7.5, the best configuration to build with (in terms of full feature support and least amount of compile time) is:

cmake -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="75" ..

Is that correct?

nvidia-smi 

Sun Nov 17 12:04:19 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2060 ...    Off |   00000000:06:00.0 Off |                  N/A |
|  0%   43C    P8              6W /  175W |      19MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1422      G   /usr/lib/xorg/Xorg                             12MiB |
|    0   N/A  N/A      1583      G   /usr/bin/gnome-shell                            4MiB |
+-----------------------------------------------------------------------------------------+
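(As an aside, a quicker way to read off the compute capability, assuming a driver recent enough to support this query field:)

nvidia-smi --query-gpu=compute_cap --format=csv,noheader
# expected output for an RTX 2060: 7.5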

@JohannesGaessler
Collaborator Author

The "high-level" C-like CUDA code is first compiled to PTX which is the CUDA equivalent of assembly. The PTX code is then converted to PTXAS which is the binary format that the GPU can actually run. I think when you set -arch=compute_75 you tell NVCC to generate PTX code and when you set -arch=sm_75 you tell it to generate PTXAS code. I haven't looked up what CMake does internally when you set CMAKE_CUDA_ARCHITECTURES but I would expect a number to generate PTXAS for the selected compute capability (+ probably something for forward compatibility) and native to generate PTXAS only for the connected GPUs.

For llama.cpp/GGML the code should always work correctly if you compile for exactly the compute capability that you are going to use. The listed compute capabilities are the breakpoints where different features are used and the PTX code ends up being different. So all compute capabilities >= 7.5 should generate the same PTX code and only maybe different PTXAS code. But so far I have never observed any performance difference from compiling with a compute capability that is higher than the minimum for PTX.
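For reference, a sketch of how these cases map onto raw nvcc flags (the file name is just an example; the flag spelling matches the --generate-code line quoted in the next comment):

# virtual architecture only: embed PTX, JIT-compiled by the driver on first load
nvcc --generate-code=arch=compute_75,code=compute_75 kernel.cu

# real architecture only: embed Turing SASS, no forward compatibility
nvcc --generate-code=arch=compute_75,code=sm_75 kernel.cu

# SASS for exactly the GPUs attached to the build machine (newer toolkits only)
nvcc -arch=native kernel.cu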

@slaren
Collaborator

slaren commented Nov 17, 2024

If CMAKE_CUDA_ARCHITECTURES is set to a plain number, it includes both the virtual and real architectures. E.g. -DCMAKE_CUDA_ARCHITECTURES=86 results in --generate-code=arch=compute_86,code=[compute_86,sm_86].
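(One way to check what a given setting expands to, using a hypothetical build directory:)

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
grep -r -- "--generate-code" build/ | head -n 1
# should contain: --generate-code=arch=compute_86,code=[compute_86,sm_86]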
