Quantizing V cache not working yet #4425
Comments
It doesn't work on CPU only, or with OpenCL either. I think the quantum V cache is just not implemented yet. Line 6890 in fecac45
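For context on why only the K cache works, here is a rough sketch of my reading of the layout (not the actual ggml code): the K cache stores each new token's keys contiguously, so the append can be quantized block by block, whereas the V cache is kept transposed so attention can read it along the context dimension, and appending one token then scatters single values across many quantization blocks, which a plain quantized copy can't express.

```python
# Rough sketch (not ggml code) of why appending one token to the transposed
# V cache clashes with block quantization, while the K cache append does not.

BLOCK = 32        # elements per quantization block (q4_0 / q8_0 use 32)
head_dim = 128    # example head size (a multiple of 32, as the K cache requires)
n_ctx = 4096      # cache capacity in tokens

# K cache layout: token t's keys occupy one contiguous run of `head_dim` values,
# so the write covers whole quantization blocks.
def k_blocks_touched(t: int) -> set[int]:
    start = t * head_dim
    return {(start + i) // BLOCK for i in range(head_dim)}

# V cache layout (stored transposed so attention reads along the context dim):
# element (d, t) lives at flat index d * n_ctx + t, so appending token t writes
# one value into `head_dim` different blocks, each only partially updated.
def v_blocks_touched(t: int) -> set[int]:
    return {(d * n_ctx + t) // BLOCK for d in range(head_dim)}

print(len(k_blocks_touched(17)))  # 4   -> head_dim / BLOCK whole blocks
print(len(v_blocks_touched(17)))  # 128 -> one partial write per block
```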
Yeah, I just noticed: #4309
It seems that K cache quantization doesn't work on StableLM models, and maybe on other archs too.
The head size has to be a multiple of 32. I think in StableLM it is not.
Zephyr 3b config.json
I tried to run the Q3_K_M quant with -ctk q8_0 and I got this:
The head size is equal to
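To make the head-size point concrete, here is a quick check; the hidden_size and num_attention_heads values below are what I believe stablelm-zephyr-3b ships with, so treat them as an assumption rather than something taken from this thread.

```python
# Assumed stablelm-zephyr-3b config.json values (not copied from this thread):
hidden_size = 2560
num_attention_heads = 32

head_dim = hidden_size // num_attention_heads
print(head_dim)            # 80
print(head_dim % 32 == 0)  # False -> not a multiple of the 32-element quant block
```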
This issue is stale because it has been open for 30 days with no activity.
I know this might be a bit annoying, but I was wondering if there is an estimated timeline for implementing this feature. Given the progress in quantization techniques, large models with low-bit precision are becoming increasingly practical. However, some models (like Qwen1.5-72b, an MHA model) have relatively large memory footprints for their KV cache. For users like myself who want to work with long contexts, quantization support for the V cache has become the most desired feature.
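To put rough numbers on that, a back-of-the-envelope sketch follows. The layer count and hidden size are assumed Qwen1.5-72B-like values (80 layers, hidden size 8192, full MHA), and the bytes-per-element figures come from the 34-byte q8_0 and 18-byte q4_0 block layouts.

```python
# Back-of-the-envelope KV cache size for an MHA model (no GQA), using assumed
# Qwen1.5-72B-like shapes: 80 layers, hidden size 8192, 32k context.
n_layers  = 80
n_embd_kv = 8192       # per-layer K (and V) width; equals hidden size when MHA
n_ctx     = 32768      # context length in tokens

bytes_per_elem = {
    "f16":  2.0,
    "q8_0": 34 / 32,   # 32 int8 quants + fp16 scale per 32-element block
    "q4_0": 18 / 32,   # 16 bytes of nibbles + fp16 scale per 32-element block
}

for dtype, b in bytes_per_elem.items():
    total = 2 * n_layers * n_embd_kv * n_ctx * b   # x2 for K and V
    print(f"{dtype}: {total / 2**30:.1f} GiB")     # ~80 / ~42.5 / ~22.5 GiB
```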
Yeah, I agree - it seems the recent trend is for released models to have longer and longer context lengths.
@DesperateZero @jukofyork Maybe it would help to tag this issue differently ( |
The plan is, after merging #5021, to add kernels that work with a quantum KV cache. We are working towards this, but it might take some time to get there.
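For anyone curious what such kernels involve, here is a loose sketch (not the actual ggml/CUDA code): the attention kernel has to dequantize each 32-element block of the cache on the fly before the dot products, instead of assuming an f16 cache.

```python
import struct

# Loose sketch: dequantize one q8_0 block (fp16 scale + 32 int8 quants, 34 bytes)
# and use it in a dot product, roughly what a quantized-KV attention kernel
# has to do for every block of the cache it reads.
QK8_0 = 32

def dequantize_q8_0(block: bytes) -> list[float]:
    scale = struct.unpack("<e", block[:2])[0]    # fp16 scale
    quants = struct.unpack("<32b", block[2:34])  # 32 signed int8 values
    return [scale * q for q in quants]

def dot_with_quantized_k(q_vec: list[float], k_block: bytes) -> float:
    k_vals = dequantize_q8_0(k_block)
    return sum(a * b for a, b in zip(q_vec, k_vals))
```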
This issue was closed because it has been inactive for 14 days since being marked as stale.
Perhaps change the label on this issue to
I'm not sure if it's the right issue, but KV cache quantization is definitely the feature I'm looking forward to, given that my application reuses session dumps a lot; optimizing dump size would be very beneficial.
Quantizing the K cache (-ctk) works; however, quantizing the V cache (-ctv) does not. I've tried q4_0, q4_1, q8_0, etc.
Using the cuBLAS cu12.2.0 release build, I get the following error:
llama_kv_cache_init: VRAM kv self = 336.00 MB
llama_new_context_with_model: KV self size = 336.00 MiB, K (f16): 256.00 MiB, V (q4_1): 80.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 291.07 MiB
llama_new_context_with_model: VRAM scratch buffer: 288.00 MiB
llama_new_context_with_model: total VRAM used: 4719.06 MiB (model: 4095.05 MiB, context: 624.00 MiB)
CUDA error 1 at D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:7596: invalid argument
current device: 0
GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:7596: !"CUDA error"