Mistral & Sliding Window Attention - GGUF ROPE accommodation. #3867
Comments
Does the PyTorch model work with long prompts? How does its output compare to the outputs of llama.cpp and KoboldCPP? I tried to test it, but ran out of system RAM. |
I've been wondering the same thing recently. Mistrals get [...] So processing them as if they had a trained 32k context makes no sense, because from the point of view of the current implementation the model should be processed as having a 4k trained context. Any thoughts? |
Dunno anything about PyTorch, being exclusively a KoboldCPP user. Anyhow, the person who suggested the usable ROPE for 24k thinks the model configuration might be at fault, but that will be for smarter minds to figure out. Here is their speculation.
|
Maybe look at #2268 - YaRN/NTK type-scaling. It may work better than the flat scaling. |
I don't think llama.cpp supports Sliding Window Attention yet... :( |
It actually might already support it - see the discussion #3581 |
Using Mistral models, I set --rope-freq-base, and I no longer get garbage output after 8k context. I may be using it completely wrong, but I'm still learning all of the ins and outs of this repo. |
FYI: Here is an implementation of SWA using full cache. Tests show that the output looks OK after 9k+ tokens are generated. |
@foldl Can you explain your approach? Let's say you have a KV cache of 8192 tokens and it is now full, and you want to process the 8193rd token at position 8192 - what do you do? |
Some terminology: let's name each item [...]
It is simple from the perspective of implementation. The required modifications [...]
Key point: tokens are processed one by one. (Batch processing is also possible, [...])
To explain it, here is a secret story that I'd like to share.
A: What's the meaning of the n-th [...]
B: Easy. It's an understanding of the n-th input token.
A: No. It's the output from the previous layer, which has attended to all previous [...]
B: Okay. Then what?
A: It is a waste of memory and computation. For each layer, when a new input vector [...]
B: Please elaborate.
A: Now, let's think layer by layer. [...] and so on.
B: Oh, I got it. Because only [...]
A: STOP. Ring buffer is a poor well-known terminology. We need something new. Rolling Cache, [...]
B: Nice. Obviously, if cache stores only [...] |
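The rolling-cache idea in the comment above can be sketched in a few lines. This is a toy model, not code from llama.cpp or from the linked implementation: the class name `RollingKVCache`, its methods, and the window size of 4 are all made up for illustration. It assumes tokens are processed one by one and that each layer only needs the last `window` key/value entries.

```python
class RollingKVCache:
    """Fixed-size cache that keeps only the last `window` entries (hypothetical helper)."""

    def __init__(self, window):
        self.window = window
        # Each slot holds (position, k_vector, v_vector) or None if unused.
        self.slots = [None] * window

    def store(self, pos, k, v):
        # The entry for position `pos` overwrites the entry for `pos - window`.
        self.slots[pos % self.window] = (pos, k, v)

    def visible_positions(self):
        # Positions currently available to attention, oldest first.
        return sorted(entry[0] for entry in self.slots if entry is not None)


cache = RollingKVCache(window=4)
for pos in range(6):  # tokens are processed one by one
    cache.store(pos, k=f"k{pos}", v=f"v{pos}")

print(cache.visible_positions())  # [2, 3, 4, 5]
```

With a window of 4, positions 0 and 1 have been overwritten by positions 4 and 5, so memory use stays constant no matter how many tokens are generated.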
Thanks for the explanation. I guess this corresponds to calling: llama_kv_cache_seq_rm(ctx, 0, n_past % w_len, n_past % w_len); before each decode in order to free a KV slot. But I still can't say I have an intuition of why this works.
|
For Layer 1: As shown above, the n-th k_vector/v_vector in the cache corresponds The n-th output of this layer carries information about
Therefore, the tokens covered by n-th output of this layer carries information about |
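The layer-by-layer argument above appears to be building toward the usual receptive-field picture for sliding window attention: each layer only looks back `window` positions, but because its inputs already summarize earlier tokens, the reach grows with depth. A minimal sketch under that assumption (the function name and the numbers are illustrative, and the window is taken to include the current position):

```python
def receptive_field(n, layer, window):
    """Oldest and newest token positions that can influence the output of
    `layer` (1-based) at position `n`, assuming every layer attends to the
    last `window` positions of its input, position `n` included."""
    oldest = max(0, n - layer * (window - 1))
    return oldest, n


for layer in (1, 2, 3):
    print(layer, receptive_field(n=20, layer=layer, window=4))
```

Under these assumptions layer 1 at position 20 sees positions 17 to 20, layer 2 reaches back to 14, and layer 3 to 11, even though no layer ever stores more than 4 cache entries.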
I am not quite sure about the internal logic of llama_kv_cache_seq_rm(ctx, 0, -1, n_past - w_len); And then, when calling |
Correct! I made a mistake. |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
I have been trying to use Mistral with extended context. The SWA supposedly allows for up to 32k context, but in practice I get garbage. However, someone on Reddit mentioned that using a ROPE of 45,000 makes 24k coherent. So I compared 24k with the default ROPE in KoboldCPP, then with the custom ROPE. The latter works; the former was gibberish.
My guess is that the current GGUF wasn't built with SWA in mind.
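For intuition about what changing the ROPE value does, here is a rough sketch that compares rotation wavelengths for two frequency bases. It assumes the 45,000 above is a RoPE frequency base, that the default base is 10,000, and that the head dimension is 128; none of these is confirmed in this thread. It only shows that a larger base stretches the wavelengths, not that this is the reason the 24k run became coherent.

```python
import math


def rope_wavelengths(base, head_dim):
    """Wavelength in tokens of each RoPE frequency pair.

    Pair i rotates by pos * base**(-2i/head_dim) radians, so one full turn
    takes 2*pi * base**(2i/head_dim) tokens.
    """
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]


head_dim = 128  # assumed head size, not taken from the thread
for base in (10_000.0, 45_000.0):  # assumed default vs. the value reported on Reddit
    longest = rope_wavelengths(base, head_dim)[-1]
    print(f"base={base:>8.0f}  longest wavelength ~ {longest:,.0f} tokens")
```

A larger base means the slowest-rotating pairs take more tokens to complete a turn, so positions far beyond the trained range map to angles closer to ones the model has seen.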