
CMake: default to -arch=native for CUDA build #10320

Merged
merged 1 commit into ggerganov:master on Nov 17, 2024

Conversation

JohannesGaessler
Collaborator

This PR extends the CUDA build documentation by explaining how to speed up local builds.

Also, I changed "documentations" to the singular in the README, since I think it sounds more natural.

@slaren
Collaborator

slaren commented Nov 15, 2024

It might be good to make CMAKE_CUDA_ARCHITECTURES default to native when GGML_NATIVE is enabled, since that already makes a build that is only compatible with the current CPU. native is not supported on older CUDA toolkit versions, however.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 15, 2024
@JohannesGaessler
Collaborator Author

I personally think either way is fine. The target group for these changes/options is, I would argue, developers who frequently recompile the code. Currently the logic is that CUDA architectures are only set automatically if the user does not set CMAKE_CUDA_ARCHITECTURES. So I think we should just change the logic for the automatic CUDA architectures; if someone wants to compile with both GGML_NATIVE and an old CUDA version, they can still do so by manually setting CMAKE_CUDA_ARCHITECTURES.

@JohannesGaessler
Collaborator Author

Actually no: if CMAKE_CROSSCOMPILING=OFF, then the default is GGML_NATIVE=ON. So we should not implicitly also make CMAKE_CUDA_ARCHITECTURES=native the default, since that is going to trip up a lot of users with old CUDA versions. The only other option would be to condition the setting of CMAKE_CUDA_ARCHITECTURES on the CUDA version, but since for me the whole point is to increase developer productivity, I think it's preferable to just have a comment in the documentation instead of architecture selection logic that needs to be maintained.
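For context, a minimal sketch of the default being described here (illustrative only, not the exact upstream CMake code):

# a sketch: GGML_NATIVE defaults to ON unless cross-compiling
if (CMAKE_CROSSCOMPILING)
    set(GGML_NATIVE_DEFAULT OFF)
else()
    set(GGML_NATIVE_DEFAULT ON)
endif()
option(GGML_NATIVE "optimize the build for the host machine" ${GGML_NATIVE_DEFAULT})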

@slaren
Collaborator

slaren commented Nov 16, 2024

It should be possible to check the CUDA toolkit version in the CMakeLists.txt and only use native if it is supported, so I am not sure that's really a problem. native is also the default in the Makefile and it doesn't seem to cause much confusion.
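A minimal sketch of such a check, assuming native needs a sufficiently new toolkit and CMake; the version floors below are assumptions for illustration, not taken from this PR:

# only pick a default when the user did not pass -DCMAKE_CUDA_ARCHITECTURES=... themselves
if (GGML_NATIVE AND NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
    find_package(CUDAToolkit REQUIRED)
    # assumed floors: nvcc's -arch=native and CMake's understanding of the "native" value
    if (CUDAToolkit_VERSION VERSION_GREATER_EQUAL "11.6" AND CMAKE_VERSION VERSION_GREATER_EQUAL "3.24")
        set(CMAKE_CUDA_ARCHITECTURES "native")
    endif()
endif()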

@github-actions github-actions bot added the Nvidia GPU Issues specific to Nvidia GPUs label Nov 16, 2024
@JohannesGaessler
Collaborator Author

I misremembered both the CUDA version with which -arch=native was added and the complexity of checking the CUDA version from within CMake so the whole thing ended up being much less problematic than I thought.

@JohannesGaessler JohannesGaessler changed the title docs: explain faster CUDA CMake compile [no ci] CMake: default to -arch=native for CUDA build Nov 17, 2024
@JohannesGaessler JohannesGaessler merged commit 467576b into ggerganov:master Nov 17, 2024
54 checks passed
# 60 == P100, FP16 CUDA intrinsics
# 61 == Pascal, __dp4a instruction (per-byte integer dot product)
# 70 == V100, FP16 tensor cores
# 75 == Turing, int6 tensor cores
@ggerganov
Owner

I think int6 -> int8?

@ggerganov
Owner

Quick question: on my RTX2060 which has compute capability 7.5, the best configuration to build with (in terms of full feature support and least amount of compile time) is:

cmake -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="75" ..

Is that correct?

nvidia-smi 

Sun Nov 17 12:04:19 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2060 ...    Off |   00000000:06:00.0 Off |                  N/A |
|  0%   43C    P8              6W /  175W |      19MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1422      G   /usr/lib/xorg/Xorg                             12MiB |
|    0   N/A  N/A      1583      G   /usr/bin/gnome-shell                            4MiB |
+-----------------------------------------------------------------------------------------+
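(As an aside, a quicker way to read off the compute capability, assuming a driver recent enough to support this query field:)

nvidia-smi --query-gpu=compute_cap --format=csv,noheader
# expected output for an RTX 2060: 7.5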

@JohannesGaessler
Collaborator Author

The "high-level" C-like CUDA code is first compiled to PTX which is the CUDA equivalent of assembly. The PTX code is then converted to PTXAS which is the binary format that the GPU can actually run. I think when you set -arch=compute_75 you tell NVCC to generate PTX code and when you set -arch=sm_75 you tell it to generate PTXAS code. I haven't looked up what CMake does internally when you set CMAKE_CUDA_ARCHITECTURES but I would expect a number to generate PTXAS for the selected compute capability (+ probably something for forward compatibility) and native to generate PTXAS only for the connected GPUs.

For llama.cpp/GGML the code should always work correctly if you compile for exactly the compute capability that you are going to use. The listed compute capabilities are the breakpoints where different features are used and the PTX code ends up being different. So all compute capabilities >= 7.5 should generate the same PTX code and only maybe different PTXAS code. But so far I have never observed any performance difference from compiling with a compute capability that is higher than the minimum for PTX.
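For reference, a sketch of how these cases map onto raw nvcc flags (the file name is just an example; the flag spelling matches the --generate-code line quoted in the next comment):

# virtual architecture only: embed PTX, JIT-compiled by the driver on first load
nvcc --generate-code=arch=compute_75,code=compute_75 kernel.cu

# real architecture only: embed Turing SASS, no forward compatibility
nvcc --generate-code=arch=compute_75,code=sm_75 kernel.cu

# SASS for exactly the GPUs attached to the build machine (newer toolkits only)
nvcc -arch=native kernel.cu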

@slaren
Collaborator

slaren commented Nov 17, 2024

If CMAKE_CUDA_ARCHITECTURES is set to a plain number, it includes both the virtual and real architectures. E.g. -DCMAKE_CUDA_ARCHITECTURES=86 results in --generate-code=arch=compute_86,code=[compute_86,sm_86].
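(One way to check what a given setting expands to, using a hypothetical build directory:)

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
grep -r -- "--generate-code" build/ | head -n 1
# should contain: --generate-code=arch=compute_86,code=[compute_86,sm_86]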
