llama : try to avoid context swap #2060

ggerganov · 2023-06-30T19:53:55Z

Currently, when the context becomes full, we pick part of the tokens and recompute the KV cache.

Instead, try to either:

- store a non-RoPEd KV cache, "shift" it when the context is full, and compute the RoPE over the entire cache for every new token, taking into account the current positions
- store a RoPEd KV cache (as we do now), "shift" it when the context is full, and apply an extra shift-RoPE to it (assuming RoPE is "additive")
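
The "additive" assumption in the second option can be checked numerically. The following is a minimal sketch, not llama.cpp code: it assumes the textbook RoPE formulation (consecutive pairs rotated by `pos * base^(-2i/d)`), and the names `rope_angles`, `apply_rope`, and `n_discard` are hypothetical. It shows that rotating an already-RoPEd key by `-n_discard` positions gives the same result as recomputing RoPE at the shifted position.

```python
import cmath

def rope_angles(pos, head_dim, base=10000.0):
    """Rotation angle for each 2-D pair of a head at position `pos` (textbook RoPE)."""
    return [pos * base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def apply_rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by their position-dependent angle."""
    out = []
    for i, theta in enumerate(rope_angles(pos, len(vec), base)):
        z = complex(vec[2 * i], vec[2 * i + 1]) * cmath.exp(1j * theta)
        out.extend([z.real, z.imag])
    return out

# Hypothetical un-rotated key vector for one head (head_dim = 8).
key = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.1, -0.3]
old_pos = 10     # position the key had when it was cached
n_discard = 3    # number of tokens dropped from the front of the context

cached = apply_rope(key, old_pos)                # what a RoPEd cache stores
shifted = apply_rope(cached, -n_discard)         # extra shift-RoPE applied to the cache
direct = apply_rope(key, old_pos - n_discard)    # RoPE recomputed at the new position

# Largest element-wise difference between the two routes.
max_err = max(abs(a - b) for a, b in zip(shifted, direct))
```

Because each pair is rotated by an angle that is linear in the position, two successive rotations compose into one, so `max_err` is at floating-point noise level. This holds only while the frequencies (the `base`) stay fixed between the two rotations.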

SlyEcho · 2023-07-02T21:18:46Z

Maybe it could also be compressed using interpolation?

cebtenzzre · 2023-09-17T18:21:48Z

Storing a non-RoPEd KV cache would allow us to implement dynamic NTK or YaRN RoPE scaling, which are the state of the art for context scaling on non-finetuned models. See section 3.3 of this paper.
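
To illustrate why a non-RoPEd cache helps here, the sketch below assumes the base-adjustment formula used by some common dynamic NTK implementations; the formula, the function name `dynamic_ntk_base`, and all parameter values are assumptions for illustration and are not taken from this thread. With dynamic scaling the RoPE base depends on the current sequence length, so a key rotated under the old base no longer matches what the new base would produce, and a position shift alone cannot repair it.

```python
import math

def dynamic_ntk_base(seq_len, train_ctx, head_dim, base=10000.0, factor=2.0):
    """Assumed dynamic NTK rule: leave the base alone inside the training context,
    grow it once the sequence exceeds that context."""
    if seq_len <= train_ctx:
        return base
    scale = factor * seq_len / train_ctx - (factor - 1.0)
    return base * scale ** (head_dim / (head_dim - 2))

def rotate_pair(x0, x1, theta):
    """Plain 2-D rotation, the building block of RoPE."""
    c, s = math.cos(theta), math.sin(theta)
    return (x0 * c - x1 * s, x0 * s + x1 * c)

head_dim, train_ctx = 128, 2048
base_short = dynamic_ntk_base(2048, train_ctx, head_dim)  # still within training context
base_long = dynamic_ntk_base(4096, train_ctx, head_dim)   # context has grown past it

# Same key pair, same position, same pair index; only the base differs.
pos, pair_index = 100, 1
theta_short = pos * base_short ** (-2.0 * pair_index / head_dim)
theta_long = pos * base_long ** (-2.0 * pair_index / head_dim)

stale = rotate_pair(1.0, 0.0, theta_short)  # what a RoPEd cache would hold
fresh = rotate_pair(1.0, 0.0, theta_long)   # what the current base requires
```

Under these assumptions `stale` and `fresh` differ, so a cache that stores un-rotated keys and applies RoPE at attention time can pick up the current base for every entry without recomputing the keys themselves.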

ggerganov added the performance (Speed related topics) and research 🔬 labels Jun 30, 2023

ggerganov added this to ggml : roadmap Jun 30, 2023

ggerganov moved this to Todo in ggml : roadmap Jul 14, 2023

ggerganov mentioned this issue Jul 14, 2023

Apple M1 metal lag #1730

Closed

ggerganov mentioned this issue Aug 23, 2023

Strided perplexity #2714

Merged

This was referenced Sep 16, 2023

llama : add example for tree-based parallel decoding #3137

Closed

llama : custom attention mask + parallel decoding + no context swaps #3228

Merged

ggerganov closed this as completed in #3228 Sep 28, 2023

ggerganov moved this from Todo to Done in ggml : roadmap Sep 28, 2023

cebtenzzre mentioned this issue Feb 13, 2024

Add support for BERT embedding models #5423

Merged

ggerganov mentioned this issue Oct 19, 2024

llama.vim : plugin for Neovim #9787

Merged

murrellb mentioned this issue Nov 18, 2024

KV Cache with RoPE scaling. MurrellGroup/Jjama3.jl#7

Open
