Quantizing V cache not working yet #4425
Comments
It doesn't work on CPU only, or with OpenCL either. I think the quantum V cache is just not implemented yet. Line 6890 in fecac45
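For context on why only the K cache works, here is a rough sketch of my reading of the layout (not the actual ggml code): the K cache stores each new token's keys contiguously, so the append can be quantized block by block, whereas the V cache is kept transposed so attention can read it along the context dimension, and appending one token then scatters single values across many quantization blocks, which a plain quantized copy can't express.

```python
# Rough sketch (not ggml code) of why appending one token to the transposed
# V cache clashes with block quantization, while the K cache append does not.

BLOCK = 32        # elements per quantization block (q4_0 / q8_0 use 32)
head_dim = 128    # example head size (a multiple of 32, as the K cache requires)
n_ctx = 4096      # cache capacity in tokens

# K cache layout: token t's keys occupy one contiguous run of `head_dim` values,
# so the write covers whole quantization blocks.
def k_blocks_touched(t: int) -> set[int]:
    start = t * head_dim
    return {(start + i) // BLOCK for i in range(head_dim)}

# V cache layout (stored transposed so attention reads along the context dim):
# element (d, t) lives at flat index d * n_ctx + t, so appending token t writes
# one value into `head_dim` different blocks, each only partially updated.
def v_blocks_touched(t: int) -> set[int]:
    return {(d * n_ctx + t) // BLOCK for d in range(head_dim)}

print(len(k_blocks_touched(17)))  # 4   -> head_dim / BLOCK whole blocks
print(len(v_blocks_touched(17)))  # 128 -> one partial write per block
```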
Yeah, I just noticed: #4309
It seems that K cache quantization doesn't work on StableLM models, and maybe on other archs too.
The head size has to be a multiple of 32. I think in StableLM it is not.
Zephyr 3b config.json
I tried to run the Q3_K_M quant with -ctk q8_0 and I got this:
The head size is equal to
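To make the head-size point concrete, here is a quick check; the hidden_size and num_attention_heads values below are what I believe stablelm-zephyr-3b ships with, so treat them as an assumption rather than something taken from this thread.

```python
# Assumed stablelm-zephyr-3b config.json values (not copied from this thread):
hidden_size = 2560
num_attention_heads = 32

head_dim = hidden_size // num_attention_heads
print(head_dim)            # 80
print(head_dim % 32 == 0)  # False -> not a multiple of the 32-element quant block
```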
This issue is stale because it has been open for 30 days with no activity.
I know this might be a bit annoying, but I was wondering if there is an estimated timeline for implementing this feature. Given the progress in quantization techniques, large models with low-bit precision are becoming increasingly practical. However, some models (like Qwen1.5-72b, an MHA model) have relatively large memory footprints for their KV cache. For users like myself who want to work with long contexts, quantization support for the V cache has become the most desired feature.
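To put rough numbers on that, a back-of-the-envelope sketch follows. The layer count and hidden size are assumed Qwen1.5-72B-like values (80 layers, hidden size 8192, full MHA), and the bytes-per-element figures come from the 34-byte q8_0 and 18-byte q4_0 block layouts.

```python
# Back-of-the-envelope KV cache size for an MHA model (no GQA), using assumed
# Qwen1.5-72B-like shapes: 80 layers, hidden size 8192, 32k context.
n_layers  = 80
n_embd_kv = 8192       # per-layer K (and V) width; equals hidden size when MHA
n_ctx     = 32768      # context length in tokens

bytes_per_elem = {
    "f16":  2.0,
    "q8_0": 34 / 32,   # 32 int8 quants + fp16 scale per 32-element block
    "q4_0": 18 / 32,   # 16 bytes of nibbles + fp16 scale per 32-element block
}

for dtype, b in bytes_per_elem.items():
    total = 2 * n_layers * n_embd_kv * n_ctx * b   # x2 for K and V
    print(f"{dtype}: {total / 2**30:.1f} GiB")     # ~80 / ~42.5 / ~22.5 GiB
```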
Yeah, I agree - it seems the recent trend is for released models to have longer and longer context lengths.
@DesperateZero @jukofyork Maybe it would help to tag this issue differently ( |
The plan is, after merging #5021, to add kernels that work with a quantum KV cache. We are working towards this, but it might take some time to get there.
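For anyone curious what such kernels involve, here is a loose sketch (not the actual ggml/CUDA code): the attention kernel has to dequantize each 32-element block of the cache on the fly before the dot products, instead of assuming an f16 cache.

```python
import struct

# Loose sketch: dequantize one q8_0 block (fp16 scale + 32 int8 quants, 34 bytes)
# and use it in a dot product, roughly what a quantized-KV attention kernel
# has to do for every block of the cache it reads.
QK8_0 = 32

def dequantize_q8_0(block: bytes) -> list[float]:
    scale = struct.unpack("<e", block[:2])[0]    # fp16 scale
    quants = struct.unpack("<32b", block[2:34])  # 32 signed int8 values
    return [scale * q for q in quants]

def dot_with_quantized_k(q_vec: list[float], k_block: bytes) -> float:
    k_vals = dequantize_q8_0(k_block)
    return sum(a * b for a, b in zip(q_vec, k_vals))
```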
This issue was closed because it has been inactive for 14 days since being marked as stale.
Perhaps change the label on this issue to
I'm not sure if it's the right issue, but KV cache quantization is definitely the feature I'm looking forward to, given that my application reuses session dumps a lot; optimizing dump size would be very beneficial.
Quantizing the K cache (-ctk) works; however, quantizing the V cache (-ctv) does not. I've tried q4_0, q4_1, q8_0, etc.
Using the cuBLAS cu12.2.0 release build, I get the following error:
llama_kv_cache_init: VRAM kv self = 336.00 MB
llama_new_context_with_model: KV self size = 336.00 MiB, K (f16): 256.00 MiB, V (q4_1): 80.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 291.07 MiB
llama_new_context_with_model: VRAM scratch buffer: 288.00 MiB
llama_new_context_with_model: total VRAM used: 4719.06 MiB (model: 4095.05 MiB, context: 624.00 MiB)
CUDA error 1 at D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:7596: invalid argument
current device: 0
GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:7596: !"CUDA error"