
Quantizing V cache not working yet #4425

Closed
CISC opened this issue Dec 12, 2023 · 14 comments

@CISC
Contributor

CISC commented Dec 12, 2023

Quantizing the K cache (-ctk) works, but quantizing the V cache (-ctv) does not; I've tried q4_0, q4_1, q8_0, etc.

Using the cublas-cu12.2.0 release build, I get the following error:

llama_kv_cache_init: VRAM kv self = 336.00 MB
llama_new_context_with_model: KV self size = 336.00 MiB, K (f16): 256.00 MiB, V (q4_1): 80.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 291.07 MiB
llama_new_context_with_model: VRAM scratch buffer: 288.00 MiB
llama_new_context_with_model: total VRAM used: 4719.06 MiB (model: 4095.05 MiB, context: 624.00 MiB)

CUDA error 1 at D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:7596: invalid argument
current device: 0
GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:7596: !"CUDA error"
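
For reference, a minimal invocation along these lines reproduces it (the model path, -ngl and context size here are placeholders; -ctk / --cache-type-k and -ctv / --cache-type-v are the relevant options):

./main -m models/model.gguf -ngl 99 -c 2048 -ctk q4_1 -ctv q4_1 -p "Hello"

The same command with only -ctk q4_1 (and the default f16 V cache) runs fine.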

@stduhpf
Contributor

stduhpf commented Dec 12, 2023

It doesn't work CPU-only or with OpenCL either; I think the quantized V cache is just not implemented yet.
(See here:

llama.cpp/ggml.c, line 6890 in fecac45:

GGML_ASSERT(false); // TODO: implement
)

@CISC changed the title from "Quantizing V cache not working on CUDA" to "Quantizing V cache not working yet" on Dec 12, 2023
@CISC
Contributor Author

CISC commented Dec 12, 2023

Yeah, I just noticed: #4309

  • V cache quantization is not yet supported

@Ar57m

Ar57m commented Dec 18, 2023

It seems that K cache quantization doesn't work on StableLM models, and maybe on other archs too.

@ggerganov
Owner

The head size has to be a multiple of 32. I think in StableLM it is not.

@Ar57m

Ar57m commented Dec 18, 2023

The head size has to be a multiple of 32. I think in StableLM it is not

Zephyr 3b config.json

  "num_attention_heads": 32,
  "num_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32

I tried to run the 3km quant with -ctk q8_0 and I got this:

GGML_ASSERT: llama.cpp:8934: hparams.n_embd_head() % ggml_blck_size(type_k) == 0
Aborted

@ggerganov
Owner

The head size is equal to n_embd / n_head
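
To make that concrete, a rough sketch follows, assuming the StableLM/Zephyr 3b hidden size of 2560 (not shown in the config excerpt above): the head size would be 2560 / 32 = 80, and 80 is not a multiple of the q8_0 block size (32), which is exactly what the GGML_ASSERT above checks. The head counts in config.json are all 32, but that is not the quantity that matters.

// Rough sketch of the failing check, with values assumed from the StableLM 3B
// config (n_embd = 2560, n_head = 32); q8_0 uses a block size of 32 in ggml.
#include <cassert>
#include <cstdio>

int main() {
    const int n_embd      = 2560;               // assumed hidden size
    const int n_head      = 32;                 // attention heads (as in the config.json above)
    const int n_embd_head = n_embd / n_head;    // head size = 80
    const int blck_q8_0   = 32;                 // block size of the q8_0 type

    std::printf("head size = %d, head size %% 32 = %d\n", n_embd_head, n_embd_head % blck_q8_0);

    // Mirrors GGML_ASSERT(hparams.n_embd_head() % ggml_blck_size(type_k) == 0)
    assert(n_embd_head % blck_q8_0 == 0);       // fails: 80 % 32 == 16
    return 0;
}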

@github-actions

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label on Mar 18, 2024
@DesperateZero

I know this might be a bit annoying, but is there an estimated timeline for implementing this feature? Given the progress in quantization techniques, large models at low-bit precision are becoming increasingly practical, yet some models (like Qwen1.5-72b, an MHA model) have a relatively large memory footprint for their KV cache. For users like myself who want to work with long contexts, quantization support for the V cache has become the most desired feature.
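
As a rough back-of-the-envelope (assuming Qwen1.5-72B uses 80 layers and a hidden size of 8192, with full MHA so every head carries K and V): an f16 KV cache costs about 2 * 80 * 8192 * 2 bytes, roughly 2.5 MiB per token, i.e. about 20 GiB at 8k context and 80 GiB at 32k. Quantizing both K and V to q8_0 would roughly halve that, and 4-bit types would cut it by roughly 3-4x, which is why V cache support matters so much for long contexts.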

@jukofyork
Contributor

I know this might be a bit annoying, but is there an estimated timeline for implementing this feature? Given the progress in quantization techniques, large models at low-bit precision are becoming increasingly practical, yet some models (like Qwen1.5-72b, an MHA model) have a relatively large memory footprint for their KV cache. For users like myself who want to work with long contexts, quantization support for the V cache has become the most desired feature.

Yeah, I agree; the recent trend is for released models to have longer and longer context lengths.

@CISC
Contributor Author

CISC commented Apr 2, 2024

@DesperateZero @jukofyork Maybe it would help to tag this issue differently (good-deep-dive-issue :) ) to get someone to pick this up? Or maybe @ggerganov is planning on tackling this at some point himself?

github-actions bot removed the stale label on Apr 3, 2024
@ggerganov
Owner

The plan is, after merging #5021, to add kernels that work with a quantum KV cache. We are working towards this, but it might take some time to get there.

@github-actions

This issue was closed because it has been inactive for 14 days since being marked as stale.

@CISC
Contributor Author

CISC commented May 19, 2024

Perhaps change the label on this issue to bug so it doesn't go stale and auto-close?

@vladfaust

I'm not sure if it's the right issue, but KV cache quantization is definitely the feature I'm looking forward to, given that my application reuses session dumps a lot; optimizing dump size would be very beneficial.
