A quantizer cache similar to what is described in the [KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache paper](https://arxiv.org/abs/2402.02750).
It allows the model to generate longer sequences without allocating too much memory for the Key and Value cache by applying quantization.
The cache has two types of storage: one for original precision states and one for the quantized cache. A `residual_length` is set as the maximum capacity of the original precision cache. When the length exceeds this capacity, the original precision states are quantized, moved into the quantized cache, and discarded from the original precision cache. In contrast to what is described in the paper, the quantization is done per-channel with a set `q_group_size` for both Keys and Values.
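To make the group-wise scheme concrete, here is a minimal, hedged sketch of asymmetric n-bit quantization applied in groups of `q_group_size` elements. The helper names (`quantize_group`, `dequantize_group`, `quantize_channel`) are illustrative only and are not the library's actual implementation:

```python
# Illustrative sketch of asymmetric group quantization (not the actual
# implementation). Each group stores its integer codes plus two pieces of
# metadata: a scale and a zero-point (the group minimum).

def quantize_group(values, n_bits=2):
    """Map one group of floats to integers in [0, 2**n_bits - 1]."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (2 ** n_bits - 1) or 1.0  # guard constant groups
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo  # scale and zero-point are the metadata

def dequantize_group(codes, scale, lo):
    """Reconstruct approximate floats from codes and metadata."""
    return [c * scale + lo for c in codes]

def quantize_channel(channel, q_group_size=4, n_bits=2):
    """Quantize a 1-D channel in consecutive groups of q_group_size."""
    return [
        quantize_group(channel[i:i + q_group_size], n_bits)
        for i in range(0, len(channel), q_group_size)
    ]
```

For a group whose values already sit on the quantization grid, such as `[0.0, 1.0, 2.0, 3.0]` at 2 bits, dequantization recovers the values exactly; in general the error per group is bounded by half the scale.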
It stores Keys and Values as a list of quantized tensors (or tuples, when metadata needs to be stored alongside), one for each layer. Additionally, it stores the Key and Value states in original precision as a list of tensors, one for each layer. The size of each such tensor is `[batch_size, num_heads, seq_len - residual_length, head_dim]`.
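The interaction between the two storage types can be sketched as a small buffer that flushes into the quantized store once it grows past `residual_length`. This is an illustrative model of the policy described above, not the library's actual cache class or API:

```python
# Illustrative model of the residual-length policy (not the actual cache
# class). New states accumulate in original precision; once the buffer
# exceeds `residual_length`, it is quantized and flushed.

class ResidualBuffer:
    def __init__(self, residual_length, quantize_fn):
        self.residual_length = residual_length
        self.quantize_fn = quantize_fn  # e.g. a per-group quantizer
        self.residual = []              # original precision states
        self.quantized = []             # quantized history, one entry per flush

    def update(self, new_states):
        """Append new states; flush to the quantized store when over capacity."""
        self.residual.extend(new_states)
        if len(self.residual) > self.residual_length:
            self.quantized.append(self.quantize_fn(self.residual))
            self.residual = []
        return self.quantized, self.residual
```

Keeping the most recent states in original precision is the key idea carried over from the paper: the newest entries are both the most accuracy-sensitive and the cheapest to hold unquantized.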
See `Cache` for details on common methods that are implemented by all cache classes.