llama : switch KQ multiplication to use F32 precision by default #10015

Merged: 1 commit into master from gg/default-kq-f32-prec on Oct 27, 2024

Conversation

ggerganov (Owner) commented Oct 23, 2024

ref #10005, #9991 (comment)

The list of models that require a higher floating-point range in the attention keeps growing, so to be on the safe side, default to F32 for the KQ multiplication.
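
For reference, a minimal sketch of what opting a single matmul into F32 precision looks like at the ggml level. The `build_kq_f32` helper and tensor names are illustrative only; the actual change lives inside llama.cpp's attention graph construction:

```cpp
#include "ggml.h"

// Illustrative helper: build the K*Q attention-score node and request F32
// accumulation for it instead of the backend's default (often F16) precision.
static struct ggml_tensor * build_kq_f32(struct ggml_context * ctx,
                                         struct ggml_tensor * k,
                                         struct ggml_tensor * q) {
    struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);

    // The K*Q product can exceed the F16 range for some models, so ask the
    // backend to compute this matmul in F32.
    ggml_mul_mat_set_prec(kq, GGML_PREC_F32);

    return kq;
}
```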

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 24, 2024
ggerganov merged commit 8841ce3 into master on Oct 27, 2024
60 checks passed
ggerganov deleted the gg/default-kq-f32-prec branch on October 27, 2024 at 19:00
LostRuins (Collaborator) commented:

Hi, just forwarding some findings from community members: apparently this change to F32 precision causes a significant speed regression (-30% T/s) on HIPBLAS. I personally use an Nvidia GPU and have had no issues there.

They are using ROCm with an RX 6800 XT (on a 5800X3D system) and a 7900 XTX (driver 24.9.1).
Models compared: Llama 2 7B Chat Q8_0 and Mistral Small Instruct Q6_K.

The regression was found by bisecting commits until arriving at this one. I'm not sure if anything can be done, but I thought I'd highlight it to see if anyone else using AMD GPUs has observed similar issues.

Also tagging @YellowRoseCx as they use AMD and can assist in testing/verifying if needed.
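
For anyone A/B-benchmarking this on HIPBLAS, a rough sketch of how the precision hint could be toggled locally. The `LLAMA_KQ_PREC_F16` define and `set_kq_precision` helper are hypothetical, purely for illustration; this is not an existing llama.cpp option:

```cpp
#include "ggml.h"

// Hypothetical local toggle for benchmarking: LLAMA_KQ_PREC_F16 is not a real
// llama.cpp build option, it just illustrates flipping the precision hint.
static void set_kq_precision(struct ggml_tensor * kq) {
#ifdef LLAMA_KQ_PREC_F16
    // pre-#10015 behaviour: leave the matmul at the backend's default precision
    ggml_mul_mat_set_prec(kq, GGML_PREC_DEFAULT);
#else
    // current default: accumulate K*Q in F32
    ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
#endif
}
```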

LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Nov 4, 2024
ggerganov mentioned this pull request Nov 4, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Development

Successfully merging this pull request may close these issues.

Bug: Unexpected output from Granite 3.0 MoE 1b when all layers on NVIDIA GPU