ggml-cuda : perform cublas fp16 matrix multiplication as fp16 #3370
Conversation
Is this actually correct? I believe compute capability 7.0 is Volta, not Turing. Line 82 in 7d5674d

The compute capability of Turing is 7.5, while that of Volta is 7.0. However, Volta also supports FP16.
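For reference, a minimal C++/CUDA sketch of the guard this discussion implies (CC_VOLTA and the helper name are illustrative assumptions, not the PR's actual code): take the fp16 path only on devices with compute capability 7.0 or higher.

```c
// Sketch only: gate the fp16 cuBLAS path on compute capability >= 7.0.
// CC_VOLTA and device_supports_fp16_mat_mul are illustrative names.
#include <cuda_runtime.h>

#define CC_VOLTA 700

static bool device_supports_fp16_mat_mul(int device) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    // prop.major/prop.minor encode the compute capability:
    // 7.0 = Volta, 7.5 = Turing, 8.x = Ampere, ...
    const int cc = 100*prop.major + 10*prop.minor;
    return cc >= CC_VOLTA;
}
```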
Hm, this change might actually degrade the TG performance:

Before: build 99115f3 (1273)
After: build da04003 (1280)

Still testing to verify.
False alarm - forgot to build with …
…example

* 'master' of github.com:ggerganov/llama.cpp:
  ggml-cuda : perform cublas mat mul of quantized types as f16 (ggerganov#3412)
  llama.cpp : add documentation about rope_freq_base and scale values (ggerganov#3401)
  train : fix KQ_pos allocation (ggerganov#3392)
  llama : quantize up to 31% faster on Linux and Windows with mmap (ggerganov#3206)
  readme : update hot topics + model links (ggerganov#3399)
  readme : add link to grammars app (ggerganov#3388)
  swift : fix build on xcode 15 (ggerganov#3387)
  build : enable more non-default compiler warnings (ggerganov#3200)
  ggml_tensor: update the structure comments. (ggerganov#3283)
  ggml : release the requested thread pool resource (ggerganov#3292)
  llama.cpp : split llama_context_params into model and context params (ggerganov#3301)
  ci : multithreaded builds (ggerganov#3311)
  train : finetune LORA (ggerganov#2632)
  gguf : basic type checking in gguf_get_* (ggerganov#3346)
  gguf : make token scores and types optional (ggerganov#3347)
  ci : disable freeBSD builds due to lack of VMs (ggerganov#3381)
  llama : custom attention mask + parallel decoding + no context swaps (ggerganov#3228)
  docs : mark code as Bash (ggerganov#3375)
  readme : add Mistral AI release 0.1 (ggerganov#3362)
  ggml-cuda : perform cublas fp16 matrix multiplication as fp16 (ggerganov#3370)
…nov#3370)

* ggml-cuda : perform cublas fp16 matrix multiplication as fp16
* try to fix rocm build
* restrict fp16 mat mul to volta and up
This commit broke llama.cpp on CUDA 10:

identifier "CUBLAS_COMPUTE_16F" is undefined
Let's fix this, OK? I can provide SSH access if needed.
Old CUDA versions seem to be a low priority, but you could open a new issue to track this, and maybe someone will fix it eventually.
I am also seeing this, along with a "CUBLAS_TF32_TENSOR_OP_MATH" is undefined error, when trying to compile with CUDA 10. It would be nice to get this fixed, or at least to have a workaround we can try, so we can get something working now.
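One plausible workaround, sketched under the assumption that CUDA 10's cublasGemmEx still takes a cudaDataType_t compute type (the cublasComputeType_t enum and the TF32 math mode only appeared in CUDA 11): alias the missing identifiers to their CUDA 10 equivalents behind a version check.

```c
// Untested compatibility-shim sketch for CUDA 10: the identifiers on
// the left were introduced in CUDA 11, so map them back to what
// cublasGemmEx accepted before that.
#include <cuda_runtime.h>
#include <cublas_v2.h>

#if CUDART_VERSION < 11000
#define CUBLAS_COMPUTE_16F         CUDA_R_16F
#define CUBLAS_COMPUTE_32F         CUDA_R_32F
#define CUBLAS_TF32_TENSOR_OP_MATH CUBLAS_TENSOR_OP_MATH
#define cublasComputeType_t        cudaDataType_t
#endif
```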
Improves prompt processing performance with fp16 models.
3090 Ti / WSL2: (benchmark table not captured in this thread)
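For readers skimming the thread, here is a self-contained sketch of what the title describes (the simplified shapes and the name gemm_f16 are mine, not the exact llama.cpp call): run cublasGemmEx entirely in fp16 instead of converting the operands to fp32 first.

```c
// Simplified sketch of an all-fp16 GEMM via cublasGemmEx; the layout
// (A transposed, tightly packed matrices) is illustrative.
#include <cuda_fp16.h>
#include <cublas_v2.h>

static void gemm_f16(cublasHandle_t handle,
                     const half * A, const half * B, half * C,
                     int m, int n, int k) {
    const half alpha = __float2half(1.0f);
    const half beta  = __float2half(0.0f);
    // With CUBLAS_COMPUTE_16F, alpha/beta are read as half and the
    // multiply-accumulate runs in fp16 (tensor cores on Volta+).
    cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                 m, n, k,
                 &alpha, A, CUDA_R_16F, k,
                         B, CUDA_R_16F, k,
                 &beta,  C, CUDA_R_16F, m,
                 CUBLAS_COMPUTE_16F,
                 CUBLAS_GEMM_DEFAULT);
}
```

On pre-Volta GPUs this path would stay disabled, which is what the "restrict fp16 mat mul to volta and up" commit above takes care of.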