ROCm Port #1087

Merged (105 commits, Aug 25, 2023)

Commits:
0fd8363 use hipblas based on cublas (SlyEcho, Apr 19, 2023)
54a63c1 Update Makefile for the Cuda kernels (SlyEcho, Apr 20, 2023)
0e005f7 Build file changes (SlyEcho, Apr 20, 2023)
d3e1984 add rpath (SlyEcho, Apr 21, 2023)
3677235 More build file changes (SlyEcho, Apr 22, 2023)
db7a012 Merge 'origin/master' into hipblas (SlyEcho, Apr 23, 2023)
3a004b2 add rpath (SlyEcho, Apr 23, 2023)
608aa33 change default GPU arch to match CMake (SlyEcho, Apr 25, 2023)
d571d16 Merge 'origin/master' into hipblas (SlyEcho, Apr 25, 2023)
ef51e9e Merge branch 'ggerganov:master' into hipblas (SlyEcho, Apr 26, 2023)
ecc0565 only .cu file needs to be complied as device (SlyEcho, Apr 27, 2023)
a1caa48 add more cuda defines (SlyEcho, Apr 28, 2023)
3b4a531 Merge 'origin/master' into hipblas (SlyEcho, Apr 28, 2023)
2ab9d11 Merge 'origin/master' into hipblas (SlyEcho, Apr 28, 2023)
d194586 Merge 'origin/master' into hipblas (SlyEcho, Apr 28, 2023)
d8ea75e Merge 'origin/master' into hipblas (SlyEcho, Apr 29, 2023)
c73def1 Merge 'origin/master' into hipblas (SlyEcho, Apr 30, 2023)
fcbc262 Merge 'origin/master' into hipblas (SlyEcho, May 1, 2023)
b67cc50 Merge 'origin/master' into hipblas (SlyEcho, May 3, 2023)
d83cfba Merge 'origin/master' into hipblas (SlyEcho, May 4, 2023)
04c0d48 Move all HIP stuff to ggml-cuda.cu (SlyEcho, May 4, 2023)
1107194 Merge 'origin/master' into hipblas (SlyEcho, May 5, 2023)
289073a Merge 'origin/master' into hipblas (SlyEcho, May 6, 2023)
baeb482 Revert to default copy (SlyEcho, May 7, 2023)
0aefa6a Merge 'origin/master' into hipblas (SlyEcho, May 7, 2023)
a3296d5 Merge 'origin/master' into hipblas (SlyEcho, May 7, 2023)
070cbcc occupanct function (SlyEcho, May 7, 2023)
127f68e Merge 'origin/master' into hipblas (SlyEcho, May 11, 2023)
605560d Merge 'origin/master' into hipblas (SlyEcho, May 12, 2023)
0fe6384 fix makefile (SlyEcho, May 12, 2023)
2956630 Merge 'origin/master' into hipblas (SlyEcho, May 13, 2023)
8bab456 Merge 'origin/master' into hipblas (SlyEcho, May 14, 2023)
a0b2d5f Merge 'origin/master' into hipblas (SlyEcho, May 16, 2023)
c66115b Merge 'origin/master' into hipblas (SlyEcho, May 20, 2023)
b19fefe Forwardcompat (SlyEcho, May 20, 2023)
600ace3 update warp size (SlyEcho, May 20, 2023)
f80ce7a Merge branch 'origin/master' into hipblas (SlyEcho, May 24, 2023)
174bf6a Merge 'origin/master' into hipblas (SlyEcho, May 25, 2023)
a593a4f Add missing parameters (SlyEcho, May 25, 2023)
30d921a and makefile (SlyEcho, May 25, 2023)
4c8b3fb add configurable vars (SlyEcho, May 25, 2023)
a4648c1 Merge 'origin/master' into hipblas (SlyEcho, May 27, 2023)
9fdaa1d Add more defs (SlyEcho, May 27, 2023)
33091a9 Merge 'origin/master' into hipblas (SlyEcho, Jun 6, 2023)
5d6eb72 warp size fixes (SlyEcho, Jun 6, 2023)
1ba4ce4 Revert "warp size fixes" (SlyEcho, Jun 6, 2023)
fa5b3d7 fix makefile. (SlyEcho, Jun 6, 2023)
4362e80 Merge 'origin/master' into hipblas (SlyEcho, Jun 6, 2023)
85f902d Merge 'origin/master' into hipblas (SlyEcho, Jun 8, 2023)
a836529 Merge 'origin/master' into hipblas (SlyEcho, Jun 14, 2023)
61df8e9 add cudaMemset (SlyEcho, Jun 14, 2023)
6f7c156 Merge 'origin/master' into hipblas (SlyEcho, Jun 17, 2023)
67e229b Merge 'origin/master' into hipblas (SlyEcho, Jun 17, 2023)
5dd2fbe Merge 'origin/master' into hipblas (SlyEcho, Jun 19, 2023)
df7346c Merge 'origin/master' into hipblas (SlyEcho, Jun 22, 2023)
35a6031 Merge 'origin/master' into hipblas (SlyEcho, Jun 25, 2023)
c1e5c83 Merge 'origin/master' into hipblas (SlyEcho, Jun 25, 2023)
c8ae945 Merge 'origin/master' into hipblas (SlyEcho, Jun 27, 2023)
bb16eff headers fix; add kquants_iter for hipblas and add gfx803 (#1) (YellowRoseCx, Jun 28, 2023)
04419f1 Merge 'origin/master' into hipblas (SlyEcho, Jun 28, 2023)
15db19a Merge 'origin/master' into hipblas (SlyEcho, Jul 2, 2023)
c3e3733 ROCm fixes (SlyEcho, Jul 2, 2023)
7735c5a Merge 'origin/master' into hipblas (SlyEcho, Jul 4, 2023)
80e4e54 Merge 'origin/master' into hipblas (SlyEcho, Jul 9, 2023)
e610466 Expand arch list and make it overrideable (SlyEcho, Jul 11, 2023)
8c2c497 Merge 'origin/master' into hipblas (SlyEcho, Jul 11, 2023)
afcb8fe Add new config option (SlyEcho, Jul 11, 2023)
cd36b18 Merge 'origin/master' into hipblas (SlyEcho, Jul 13, 2023)
2ec4466 Update build flags. (SlyEcho, Jul 13, 2023)
3db70b5 Merge 'origin/master' into hipblas (SlyEcho, Jul 17, 2023)
1f6294d Fix multi GPU on multiple amd architectures with rocblas_initialize()… (YellowRoseCx, Jul 24, 2023)
8e8054a Add rocblas to build files (SlyEcho, Jul 24, 2023)
cde52d6 Merge 'origin/master' into hipblas (SlyEcho, Jul 24, 2023)
d2ade63 Merge 'origin/master' into hipblas (SlyEcho, Jul 29, 2023)
f8e3fc6 rocblas init stuff (SlyEcho, Jul 29, 2023)
4336231 add hipBLAS to README (SlyEcho, Jul 29, 2023)
c1664a0 Merge 'origin/master' into hipblas (SlyEcho, Jul 31, 2023)
c1cb70d new build arg LLAMA_CUDA_MMQ_Y (SlyEcho, Jul 31, 2023)
d91456a fix half2 decomposition (ardfork, Jul 31, 2023)
ab62128 Merge 'origin/master' into hipblas (SlyEcho, Aug 8, 2023)
4024f91 Add intrinsics polyfills for AMD (SlyEcho, Aug 8, 2023)
610ba4c Merge 'origin/master' into hipblas (SlyEcho, Aug 9, 2023)
8f8ab6c hipLDFLAG Path change Unix to multisystem in Makefile (YellowRoseCx, Aug 9, 2023)
29a59b5 Fix merge (SlyEcho, Aug 10, 2023)
f41920e AMD assembly optimized __dp4a (Engininja2, Aug 10, 2023)
42e055d ws fix (SlyEcho, Aug 10, 2023)
e6b6ae5 Undo mess (SlyEcho, Aug 11, 2023)
c299c4a New __dp4a assembly (Engininja2, Aug 11, 2023)
b815e97 Merge 'origin/master' into hipblas (SlyEcho, Aug 11, 2023)
4e58a05 Allow overriding CC_TURING (SlyEcho, Aug 11, 2023)
6415610 gfx1100 support (SlyEcho, Aug 12, 2023)
70e2f7c Merge 'origin/master' into hipblas (SlyEcho, Aug 14, 2023)
68e79cc Merge 'origin/master' into hipblas (SlyEcho, Aug 16, 2023)
3de6a9a reenable LLAMA_CUDA_FORCE_DMMV (SlyEcho, Aug 16, 2023)
bbbc0ce makefile rewrite (SlyEcho, Aug 16, 2023)
c88c2a9 probably lld is not required (SlyEcho, Aug 16, 2023)
423db74 Merge 'origin/master' into hipblas (SlyEcho, Aug 21, 2023)
391dd9a Merge 'origin/master' into hipblas (SlyEcho, Aug 22, 2023)
5d3e7b2 use "ROCm" instead of "CUDA" (SlyEcho, Aug 22, 2023)
7b84217 Merge 'origin/master' into hipblas (SlyEcho, Aug 24, 2023)
058f905 ignore all build dirs (SlyEcho, Aug 24, 2023)
a60231f Add Dockerfiles (SlyEcho, Aug 24, 2023)
81ecaa4 fix llama-bench (SlyEcho, Aug 24, 2023)
238335f fix -nommq help for non CUDA/HIP (SlyEcho, Aug 24, 2023)
9035cfc Merge 'origin/master' into hipblas (SlyEcho, Aug 25, 2023)

Files changed:

44 changes: 44 additions & 0 deletions .devops/full-rocm.Dockerfile
@@ -0,0 +1,44 @@
ARG UBUNTU_VERSION=22.04

# This needs to generally match the container host's environment.
ARG ROCM_VERSION=5.6

# Target the ROCm build image
ARG BASE_ROCM_DEV_CONTAINER=rocm/dev-ubuntu-${UBUNTU_VERSION}:${ROCM_VERSION}-complete

FROM ${BASE_ROCM_DEV_CONTAINER} as build

# Unless otherwise specified, we make a fat build.
# List from https://github.com/ggerganov/llama.cpp/pull/1087#issuecomment-1682807878
# This is mostly tied to rocBLAS supported archs.
ARG ROCM_DOCKER_ARCH=\
gfx803 \
gfx900 \
gfx906 \
gfx908 \
gfx90a \
gfx1010 \
gfx1030 \
gfx1100 \
gfx1101 \
gfx1102

COPY requirements.txt requirements.txt

RUN pip install --upgrade pip setuptools wheel \
&& pip install -r requirements.txt

WORKDIR /app

COPY . .

# Set the GPU architectures to compile for
ENV GPU_TARGETS=${ROCM_DOCKER_ARCH}
# Enable ROCm
ENV LLAMA_HIPBLAS=1
ENV CC=/opt/rocm/llvm/bin/clang
ENV CXX=/opt/rocm/llvm/bin/clang++

RUN make

ENTRYPOINT ["/app/.devops/tools.sh"]
44 changes: 44 additions & 0 deletions .devops/main-rocm.Dockerfile
@@ -0,0 +1,44 @@
ARG UBUNTU_VERSION=22.04

# This needs to generally match the container host's environment.
ARG ROCM_VERSION=5.6

# Target the ROCm build image
ARG BASE_ROCM_DEV_CONTAINER=rocm/dev-ubuntu-${UBUNTU_VERSION}:${ROCM_VERSION}-complete

FROM ${BASE_ROCM_DEV_CONTAINER} as build

# Unless otherwise specified, we make a fat build.
# List from https://github.com/ggerganov/llama.cpp/pull/1087#issuecomment-1682807878
# This is mostly tied to rocBLAS supported archs.
ARG ROCM_DOCKER_ARCH=\
gfx803 \
gfx900 \
gfx906 \
gfx908 \
gfx90a \
gfx1010 \
gfx1030 \
gfx1100 \
gfx1101 \
gfx1102

COPY requirements.txt requirements.txt

RUN pip install --upgrade pip setuptools wheel \
&& pip install -r requirements.txt

WORKDIR /app

COPY . .

# Set the GPU architectures to compile for
ENV GPU_TARGETS=${ROCM_DOCKER_ARCH}
# Enable ROCm
ENV LLAMA_HIPBLAS=1
ENV CC=/opt/rocm/llvm/bin/clang
ENV CXX=/opt/rocm/llvm/bin/clang++

RUN make

ENTRYPOINT [ "/app/main" ]
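
A minimal usage sketch for these images (not part of the PR: the image tag, model path, and prompt are assumptions, and the tools.sh entrypoint is assumed to accept --run as in the existing CUDA images):

```bash
# Build the full image from the repository root.
docker build -t llama-cpp-rocm-full -f .devops/full-rocm.Dockerfile .

# ROCm containers need the kernel driver node (/dev/kfd) and the
# render nodes (/dev/dri) passed through from the host.
docker run --device /dev/kfd --device /dev/dri \
    -v /path/to/models:/models llama-cpp-rocm-full \
    --run -m /models/7B/ggml-model-q4_0.bin -p "Hello" -n 64
```
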
9 changes: 1 addition & 8 deletions .dockerignore
@@ -5,14 +5,7 @@
.vscode/
.DS_Store

-build/
-build-em/
-build-debug/
-build-release/
-build-static/
-build-no-accel/
-build-sanitize-addr/
-build-sanitize-thread/
+build*/

models/*

15 changes: 1 addition & 14 deletions .gitignore
@@ -16,20 +16,7 @@
.vs/
.vscode/

-build/
-build-em/
-build-debug/
-build-release/
-build-ci-debug/
-build-ci-release/
-build-static/
-build-cublas/
-build-opencl/
-build-metal/
-build-mpi/
-build-no-accel/
-build-sanitize-addr/
-build-sanitize-thread/
+build*/
out/
tmp/

38 changes: 38 additions & 0 deletions CMakeLists.txt
@@ -74,6 +74,7 @@ set(LLAMA_CUDA_DMMV_X "32" CACHE STRING "llama: x stride for dmmv CUDA kernels")
set(LLAMA_CUDA_MMV_Y "1" CACHE STRING "llama: y block size for mmv CUDA kernels")
option(LLAMA_CUDA_F16 "llama: use 16 bit floats for some calculations" OFF)
set(LLAMA_CUDA_KQUANTS_ITER "2" CACHE STRING "llama: iters./thread per block for Q2_K/Q6_K")
option(LLAMA_HIPBLAS "llama: use hipBLAS" OFF)
option(LLAMA_CLBLAST "llama: use CLBlast" OFF)
option(LLAMA_METAL "llama: use Metal" OFF)
option(LLAMA_MPI "llama: use MPI" OFF)
@@ -352,6 +353,43 @@ if (LLAMA_CLBLAST)
endif()
endif()

if (LLAMA_HIPBLAS)
list(APPEND CMAKE_PREFIX_PATH /opt/rocm)
Review discussion on the hardcoded /opt/rocm prefix:

Contributor: The ROCm path shouldn't be hardcoded to /opt/rocm. It's common to use the env var ROCM_PATH (ROCM_HOME is also sometimes used); /opt/rocm should only be a fallback.

Author: I took this from AMD's docs, but they have since updated them ("Using CMake"), probably because the hardcoded path is not going to work on Windows.

Contributor: I hadn't looked at AMD's docs, but they use ROCM_PATH internally in all the projects I have seen. Since the CMake config would need changes for Windows support anyway, and few people are likely to be affected by their configured ROCm path being ignored, it is fine to leave it this way for now. Whenever the CMake config is reworked for Windows, it would be nice to also support one of ROCM_PATH/HIP_PATH/ROCM_HOME on Linux.

Author: The latest docs seem to say to always pass a CMake prefix manually when configuring. That makes sense, because on Windows people could install ROCm anywhere.

Contributor: On Windows you'd have HIP_PATH set instead, IIRC, but someone would need to check the HIP Windows SDK installation to be sure.

(A sketch of this suggestion follows this file's diff.)


if (NOT ${CMAKE_C_COMPILER_ID} MATCHES "Clang")
message(WARNING "Only LLVM is supported for HIP, hint: CC=/opt/rocm/llvm/bin/clang")
endif()
if (NOT ${CMAKE_CXX_COMPILER_ID} MATCHES "Clang")
message(WARNING "Only LLVM is supported for HIP, hint: CXX=/opt/rocm/llvm/bin/clang++")
endif()

find_package(hip)
find_package(hipblas)
find_package(rocblas)

if (${hipblas_FOUND} AND ${hip_FOUND})
message(STATUS "HIP and hipBLAS found")
add_compile_definitions(GGML_USE_HIPBLAS GGML_USE_CUBLAS)
add_library(ggml-rocm OBJECT ggml-cuda.cu ggml-cuda.h)
if (LLAMA_CUDA_FORCE_DMMV)
target_compile_definitions(ggml-rocm PRIVATE GGML_CUDA_FORCE_DMMV)
endif()
target_compile_definitions(ggml-rocm PRIVATE GGML_CUDA_DMMV_X=${LLAMA_CUDA_DMMV_X})
target_compile_definitions(ggml-rocm PRIVATE GGML_CUDA_MMV_Y=${LLAMA_CUDA_MMV_Y})
target_compile_definitions(ggml-rocm PRIVATE K_QUANTS_PER_ITERATION=${LLAMA_CUDA_KQUANTS_ITER})
target_compile_definitions(ggml-rocm PRIVATE CC_TURING=1000000000)
set_source_files_properties(ggml-cuda.cu PROPERTIES LANGUAGE CXX)
target_link_libraries(ggml-rocm PRIVATE hip::device PUBLIC hip::host roc::rocblas roc::hipblas)

if (LLAMA_STATIC)
message(FATAL_ERROR "Static linking not supported for HIP/ROCm")
endif()
set(LLAMA_EXTRA_LIBS ${LLAMA_EXTRA_LIBS} ggml-rocm)
else()
message(WARNING "hipBLAS or HIP not found. Try setting CMAKE_PREFIX_PATH=/opt/rocm")
endif()
endif()

if (LLAMA_ALL_WARNINGS)
if (NOT MSVC)
set(c_flags
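For reference, the ROCM_PATH handling suggested in the review discussion above could look roughly like this (a sketch, not code from this PR):

```cmake
# Prefer an explicitly configured ROCm install; fall back to /opt/rocm.
# ROCM_PATH is the conventional variable; HIP_PATH and ROCM_HOME also occur.
if (DEFINED ENV{ROCM_PATH})
    list(APPEND CMAKE_PREFIX_PATH $ENV{ROCM_PATH})
else()
    list(APPEND CMAKE_PREFIX_PATH /opt/rocm)
endif()
```
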
24 changes: 24 additions & 0 deletions Makefile
@@ -280,6 +280,30 @@ ggml-opencl.o: ggml-opencl.cpp ggml-opencl.h
$(CXX) $(CXXFLAGS) -c $< -o $@
endif # LLAMA_CLBLAST

ifdef LLAMA_HIPBLAS
ROCM_PATH ?= /opt/rocm
HIPCC ?= $(ROCM_PATH)/bin/hipcc
GPU_TARGETS ?= $(shell $(ROCM_PATH)/llvm/bin/amdgpu-arch)
LLAMA_CUDA_DMMV_X ?= 32
LLAMA_CUDA_MMV_Y ?= 1
LLAMA_CUDA_KQUANTS_ITER ?= 2
CFLAGS += -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS
CXXFLAGS += -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS
LDFLAGS += -L$(ROCM_PATH)/lib -Wl,-rpath=$(ROCM_PATH)/lib
LDFLAGS += -lhipblas -lamdhip64 -lrocblas
HIPFLAGS += $(addprefix --offload-arch=,$(GPU_TARGETS))
HIPFLAGS += -DGGML_CUDA_DMMV_X=$(LLAMA_CUDA_DMMV_X)
HIPFLAGS += -DGGML_CUDA_MMV_Y=$(LLAMA_CUDA_MMV_Y)
HIPFLAGS += -DK_QUANTS_PER_ITERATION=$(LLAMA_CUDA_KQUANTS_ITER)
HIPFLAGS += -DCC_TURING=1000000000
ifdef LLAMA_CUDA_FORCE_DMMV
HIPFLAGS += -DGGML_CUDA_FORCE_DMMV
endif # LLAMA_CUDA_FORCE_DMMV
OBJS += ggml-cuda.o
ggml-cuda.o: ggml-cuda.cu ggml-cuda.h
$(HIPCC) $(CXXFLAGS) $(HIPFLAGS) -x hip -c -o $@ $<
endif # LLAMA_HIPBLAS

ifdef LLAMA_METAL
CFLAGS += -DGGML_USE_METAL -DGGML_METAL_NDEBUG
CXXFLAGS += -DGGML_USE_METAL
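With the block above, the hipBLAS build can be tuned from the make command line; for example (the architecture here is only an illustration):

```bash
# Build with hipBLAS for gfx1030 instead of autodetecting via amdgpu-arch,
# and force the dequantize + mat-vec multiplication kernels on.
make LLAMA_HIPBLAS=1 GPU_TARGETS=gfx1030 LLAMA_CUDA_FORCE_DMMV=1
```
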
29 changes: 29 additions & 0 deletions README.md
@@ -422,6 +422,35 @@ Building the program with BLAS support may lead to some performance improvements
| LLAMA_CUDA_F16 | Boolean | false | If enabled, use half-precision floating point arithmetic for the CUDA dequantization + mul mat vec kernels and for the q4_1 and q5_1 matrix matrix multiplication kernels. Can improve performance on relatively recent GPUs. |
| LLAMA_CUDA_KQUANTS_ITER | 1 or 2 | 2 | Number of values processed per iteration and per CUDA thread for Q2_K and Q6_K quantization formats. Setting this value to 1 can improve performance for slow GPUs. |

- #### hipBLAS

This provides BLAS acceleration on HIP-supported GPUs, such as AMD GPUs.
Make sure to have ROCm installed.
You can download it from your Linux distro's package manager or from here: [ROCm Quick Start (Linux)](https://rocm.docs.amd.com/en/latest/deploy/linux/quick_start.html).
Windows support is coming soon...

- Using `make`:
```bash
make LLAMA_HIPBLAS=1
```
- Using `CMake`:
```bash
mkdir build
cd build
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ cmake .. -DLLAMA_HIPBLAS=ON
cmake --build .
```

The environment variable [`HIP_VISIBLE_DEVICES`](https://rocm.docs.amd.com/en/latest/understand/gpu_isolation.html#hip-visible-devices) can be used to specify which GPU(s) will be used.
If your GPU is not officially supported, you can set the environment variable `HSA_OVERRIDE_GFX_VERSION` to the version of a similar supported GPU, for example 10.3.0 on RDNA2 or 11.0.0 on RDNA3 (see the example after the table below).
The following compilation options are also available to tweak performance. (They refer to CUDA rather than HIP because the build reuses the same code as the cuBLAS version above.)

| Option | Legal values | Default | Description |
|-------------------------|------------------------|---------|-------------|
| LLAMA_CUDA_DMMV_X | Positive integer >= 32 | 32 | Number of values in x direction processed by the HIP dequantization + matrix vector multiplication kernel per iteration. Increasing this value can improve performance on fast GPUs. Power of 2 heavily recommended. Does not affect k-quants. |
| LLAMA_CUDA_MMV_Y | Positive integer | 1 | Block size in y direction for the HIP mul mat vec kernels. Increasing this value can improve performance on fast GPUs. Power of 2 recommended. Does not affect k-quants. |
| LLAMA_CUDA_KQUANTS_ITER | 1 or 2 | 2 | Number of values processed per iteration and per HIP thread for Q2_K and Q6_K quantization formats. Setting this value to 1 can improve performance for slow GPUs. |
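
For example, a run that selects the first GPU and spoofs an RDNA2 architecture might look like this (the model file and prompt are assumptions):

```bash
HIP_VISIBLE_DEVICES=0 HSA_OVERRIDE_GFX_VERSION=10.3.0 \
    ./main -m models/7B/ggml-model-q4_0.bin -ngl 32 -p "Hello"
```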

- #### CLBlast

OpenCL acceleration is provided by the matrix multiplication kernels from the [CLBlast](https://github.com/CNugteren/CLBlast) project and custom kernels for ggml that can generate tokens on the GPU.
4 changes: 3 additions & 1 deletion common/common.cpp
@@ -613,9 +613,11 @@ void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
fprintf(stdout, " how to split tensors across multiple GPUs, comma-separated list of proportions, e.g. 3,1\n");
fprintf(stdout, " -mg i, --main-gpu i the GPU to use for scratch and small tensors\n");
fprintf(stdout, " -lv, --low-vram don't allocate VRAM scratch buffer\n");
+#ifdef GGML_USE_CUBLAS
 fprintf(stdout, " -nommq, --no-mul-mat-q\n");
-fprintf(stdout, " use cuBLAS instead of custom mul_mat_q CUDA kernels.\n");
+fprintf(stdout, " use " GGML_CUBLAS_NAME " instead of custom mul_mat_q " GGML_CUDA_NAME " kernels.\n");
 fprintf(stdout, " Not recommended since this is both slower and uses more VRAM.\n");
+#endif // GGML_USE_CUBLAS
#endif
fprintf(stdout, " --mtest compute maximum memory usage\n");
fprintf(stdout, " --export export the computation graph to 'llama.ggml'\n");
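The GGML_CUBLAS_NAME and GGML_CUDA_NAME macros used above are presumably defined once in the shared CUDA/HIP header, along these lines (a sketch consistent with the "use \"ROCm\" instead of \"CUDA\"" commit, not code shown in this diff):

```c
// Assumed definitions (e.g. in ggml-cuda.h): the same source builds
// against CUDA or HIP, so user-facing backend names switch on the macro.
#if defined(GGML_USE_HIPBLAS)
#define GGML_CUDA_NAME   "ROCm"
#define GGML_CUBLAS_NAME "hipBLAS"
#else
#define GGML_CUDA_NAME   "CUDA"
#define GGML_CUBLAS_NAME "cuBLAS"
#endif
```
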
4 changes: 1 addition & 3 deletions examples/llama-bench/llama-bench.cpp
@@ -18,9 +18,7 @@
#include "llama.h"
#include "common.h"
#include "build-info.h"
-#ifdef GGML_USE_CUBLAS
 #include "ggml-cuda.h"
-#endif

// utils
static uint64_t get_time_ns() {
@@ -504,7 +502,7 @@ struct test {

static std::string get_backend() {
if (cuda) {
return "CUDA";
return GGML_CUDA_NAME;
}
if (opencl) {
return "OpenCL";
Expand Down