The Key-Value cache (KV cache) retains the key and value vectors computed for previous tokens, eliminating the need to costly re-compute them for all previous tokens at every decoding step.
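To make the mechanism concrete, below is a minimal single-head sketch in NumPy (not any library's actual API; dimensions, weights, and the `decode_step` helper are illustrative). Per decoding step, only the newest token's key and value are computed and appended to the cache; all earlier keys and values are reused:

```python
# Minimal sketch of autoregressive attention with a KV cache (single head).
# Hypothetical toy setup: d, W_q/W_k/W_v, and decode_step are illustrative.
import numpy as np

d = 8                                # head dimension (illustrative)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []            # the KV cache: one K and one V per past token

def decode_step(x):
    """x: hidden state of the newest token, shape (d,)."""
    q = x @ W_q
    k_cache.append(x @ W_k)          # compute K/V once for the new token...
    v_cache.append(x @ W_v)          # ...then reuse the cached entries forever
    K = np.stack(k_cache)            # (t, d): all keys so far, no recomputation
    V = np.stack(v_cache)            # (t, d)
    scores = K @ q / np.sqrt(d)      # attend over all cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()         # softmax; causal masking is implicit,
    return weights @ V               # since the cache holds only past tokens

for _ in range(4):                   # toy 4-step decode
    out = decode_step(rng.standard_normal(d))
```

Note that the per-step cost of computing K and V stays constant, but the cache itself grows linearly with the number of generated tokens, which is exactly the memory pressure discussed next.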
A blog post on the KV cache (in English); a nice visualization of the auto-regressive GPT model (in Chinese and English), covering self-attention and masked self-attention with a KV cache.
However, accessing the KV cache from off-chip memory during token generation adds memory latency and is constrained by memory bandwidth, leading to scalability challenges. For example, in the following figure from Keyformer, increasing the sequence length by 16× (from 512 to 8K) results in a more-than-50× increase in inference latency; approximately 40% of the total inference time (highlighted in green) is spent on KV cache data movement; and the KV cache size surpasses the model size once the sequence length exceeds 8K (right sub-figure).
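A back-of-the-envelope calculation shows why the cache can outgrow the weights. The sketch below assumes a hypothetical 7B-class model (32 layers, hidden size 4096, fp16 weights and cache) and an illustrative batch size of 4; the exact crossover point in the Keyformer plot depends on the actual model and batch size:

```python
# KV cache size grows linearly with sequence length and batch size.
# Assumed config (hypothetical 7B-class model): 32 layers, hidden 4096, fp16.
def kv_cache_bytes(seq_len, batch=1, n_layers=32, hidden=4096, bytes_per_val=2):
    # factor 2: both keys and values are stored at every layer
    return 2 * n_layers * hidden * bytes_per_val * seq_len * batch

model_bytes = 7e9 * 2                          # ~14 GB of fp16 weights
for seq_len in (512, 2048, 8192):
    kv = kv_cache_bytes(seq_len, batch=4)      # illustrative batch of 4
    print(f"seq={seq_len:5d}: KV cache {kv / 2**30:5.1f} GiB "
          f"({kv / model_bytes:4.2f}x model size)")
```

Under these assumptions the cache costs 512 KiB per token per sequence, so at 8K tokens and batch 4 it already exceeds the ~14 GB of model weights, consistent with the trend in the right sub-figure.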