Support CUDA without cuBLAS #82
Conversation
Numbers match cublas, but using this code leads to LLaVA outputting nothing but white squares.
Here are some quick numbers using a GCE VM with a Xeon and NVIDIA L4.
Your TINYBLAS library doesn't increase the
The output of
The only issue remaining is that the
See also https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf for reading material on how to create a better-than-naive matrix multiplication function. Lastly, your work might be of interest to ggerganov/ggml#293.
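For a rough sense of what "better than naive" looks like on the GPU side, here is a minimal illustrative sketch of a shared-memory tiled kernel in the spirit of that paper. It is not code from this PR; it assumes row-major float matrices whose dimensions are divisible by the tile size and omits transposes and edge handling.

```cuda
// Illustrative tiled SGEMM sketch (not the PR's code): C = A * B with
// row-major float matrices, m, n, k all divisible by TILE.
#include <cuda_runtime.h>

#define TILE 16

__global__ void sgemm_tiled(int m, int n, int k,
                            const float *A, const float *B, float *C) {
  __shared__ float As[TILE][TILE];
  __shared__ float Bs[TILE][TILE];
  int row = blockIdx.y * TILE + threadIdx.y;
  int col = blockIdx.x * TILE + threadIdx.x;
  float sum = 0.0f;
  for (int t = 0; t < k; t += TILE) {
    // Stage one TILE x TILE sub-block of A and B in shared memory so each
    // element is loaded from global memory once per tile instead of TILE times.
    As[threadIdx.y][threadIdx.x] = A[row * k + (t + threadIdx.x)];
    Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * n + col];
    __syncthreads();
    for (int l = 0; l < TILE; ++l)
      sum += As[threadIdx.y][l] * Bs[l][threadIdx.x];
    __syncthreads();
  }
  C[row * n + col] = sum;
}
// Launch sketch: sgemm_tiled<<<dim3(n / TILE, m / TILE), dim3(TILE, TILE)>>>(...);
```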
I can get 25 tokens per second by slightly changing this PR to inline the constant parameters:

```diff
diff --git a/llama.cpp/naive-gemm.cu b/llama.cpp/naive-gemm.cu
index 82edfe9..4647b6b 100644
--- a/llama.cpp/naive-gemm.cu
+++ b/llama.cpp/naive-gemm.cu
@@ -1,3 +1,5 @@
+// -*- cuda -*-
+
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cublas_v2.h>
@@ -6,9 +8,7 @@
#define READ0(A, trans, ld, i, j) \
(((trans) == CUBLAS_OP_N) ? (A)[(i) + (j) * (ld)] : (A)[(j) + (i) * (ld)])
#define READ(A, type, trans, ld, i, j) \
- ((type) == CUDA_R_16F \
- ? __half2float(READ0((half *)(A), (trans), (ld), (i), (j))) \
- : READ0((float *)(A), (trans), (ld), (i), (j)))
+ __half2float(READ0((half *)(A), (trans), (ld), (i), (j)))
static __device__ __forceinline__ void matmul(cublasOperation_t transa,
cublasOperation_t transb,
@@ -28,17 +28,11 @@ static __device__ __forceinline__ void matmul(cublasOperation_t transa,
for (int j = 0; j < n; ++j) {
float sum = 0.0;
for (int l = 0; l < k; ++l) {
- sum += READ(A, Atype, transa, lda, i, l) *
- READ(B, Btype, transb, ldb, l, j);
- }
- if (Ctype == CUDA_R_16F) {
- half *cptr = (half *)C + i + ldc * j;
- *cptr = __float2half(MULZERO(alpha, sum) +
- MULZERO(beta, __half2float(*cptr)));
- } else {
- float *cptr = (float *)C + i + ldc * j;
- *cptr = MULZERO(alpha, sum) + MULZERO(beta, *cptr);
+ sum += READ(A, Atype, CUBLAS_OP_T, lda, i, l) *
+ READ(B, Btype, CUBLAS_OP_N, ldb, l, j);
}
+ half *cptr = (half *)C + i + ldc * j;
+ *cptr = __float2half(sum);
}
}
}
```
For the other gemm routines, it should be easy. For the rest, I'm not sure, but I'll see what I can do. AFAICT the ones that are impactful are
That's surprising; I would've expected more of the branch predictor! Maybe it's worth doing template specializations after all.
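For illustration, a minimal sketch of what such a specialization could look like, using the naming conventions from the diff above (this is not the PR's code; it bakes the f16 types into the function and leaves only the transposes as template parameters, so the per-element branches fold away at compile time):

```cuda
// Illustrative sketch only: f16-in/f16-out matmul with the transpose flags as
// compile-time template parameters, mirroring the READ0 indexing convention.
#include <cuda_fp16.h>
#include <cublas_v2.h>

template <cublasOperation_t transa, cublasOperation_t transb>
static __device__ __forceinline__ void matmul_f16(int m, int n, int k,
                                                  const half *A, int lda,
                                                  const half *B, int ldb,
                                                  half *C, int ldc) {
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      float sum = 0.0f;
      for (int l = 0; l < k; ++l) {
        // These ternaries are on template parameters, so each instantiation
        // compiles down to a single indexing pattern with no runtime branch.
        float a = (transa == CUBLAS_OP_N) ? __half2float(A[i + l * lda])
                                          : __half2float(A[l + i * lda]);
        float b = (transb == CUBLAS_OP_N) ? __half2float(B[l + j * ldb])
                                          : __half2float(B[j + l * ldb]);
        sum += a * b;
      }
      C[i + j * ldc] = __float2half(sum);
    }
  }
}
// The GGML case would instantiate matmul_f16<CUBLAS_OP_T, CUBLAS_OP_N>.
```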
Template specialization would be good. It would also be perfectly acceptable to say:

```c
if (Atype != CUDA_R_16F || Btype != CUDA_R_16F || Ctype != CUDA_R_16F ||
transa != CUBLAS_OP_T || transb != CUBLAS_OP_N ||
computeType != CUBLAS_COMPUTE_16F ||
__half2float(*(half *)pBeta) != 0.0f ||
__half2float(*(half *)pAlpha) != 1.0f) {
return CUBLAS_STATUS_NOT_SUPPORTED;
}
```

Since that's the only way GGML currently uses this API.
I hardcoded it to the GGML use case, added a very naive and slow
Uses some fairly disgusting preprocessor macros to get the job done while preserving behavior when `-DGGML_USE_CUBLAS`. With a bit of investigation into `ggml_cuda_mul_mat_mat_batched_cublas`, these can probably be removed or simplified.
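Roughly, the gating amounts to something like the following (a sketch with assumed names; `tinyblasGemmEx` and friends are placeholders, not necessarily the identifiers this PR uses):

```c
// Hypothetical sketch of the preprocessor gating; the names are assumptions.
#if defined(GGML_USE_TINYBLAS)
#define GEMM_EX          tinyblasGemmEx           // naive CUDA replacement
#define GEMM_BATCHED_EX  tinyblasGemmBatchedEx
#else                                             // -DGGML_USE_CUBLAS path,
#define GEMM_EX          cublasGemmEx             // behavior unchanged
#define GEMM_BATCHED_EX  cublasGemmBatchedEx
#endif
```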
At this point there are no remaining cublas dependencies when compiled with `GGML_USE_TINYBLAS`.
N.B. we include the source file rather than the header file in `ggml-cuda.cu` because `llamafile/cuda.c` assumes that everything lives in a single compilation unit.
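A minimal sketch of that arrangement (the file name `tinyblas.cu` is an assumption here; the PR's actual file may be named differently, e.g. `naive-gemm.cu`):

```c
// In ggml-cuda.cu (sketch): include the implementation, not just declarations,
// so everything lands in the single compilation unit llamafile/cuda.c assumes.
#ifdef GGML_USE_TINYBLAS
#include "tinyblas.cu"   // assumed file name; definitions compiled in directly
#endif
```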
The header dependency on
My inclination is to do performance improvements in another PR, and I'm not sure yet how you want to decide whether to link against cublas or not. So this PR is done on my end, pending review.
Tested on Jetson and NVIDIA L4 GCE. Confirmed it doesn't link cuBLAS and goes significantly faster than CPU inference. I can add the code for compilation fallback. Looking forward to any additional performance improvements you can send us in a subsequent PR. Thank you!
Introduces a `tinyblas` library with naive CUDA implementations of the few remaining cublas operations used in `llama.cpp/ggml-cuda.cu`. Produces the same results with LLaVA at temperature 0 on the prompt I tried. Saves about 500MB of dependencies, but runs about 6x slower (still quite a bit faster than CPU) on my machine.

The new mode is gated behind the `GGML_USE_TINYBLAS` cpp define.