llama : add RWKV models support #846
Comments
Closing this in favor of ggerganov/ggml#21; also, https://github.com/saharNooby/rwkv.cpp seems to be it. |
Now that support for other models is being added directly to llama.cpp, would RWKV support be reconsidered? It would be very nice to have, since in-tree support would mean RWKV gets all the benefits that llama.cpp has over a separate, RWKV-only project. |
We should try to add it - it will probably be the most different from all the other arches we support, since it is RNN-based, so it will be a good exercise to see how easily it would fit into the existing framework |
@ggerganov Please check these :)
v4 inference: https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_in_150_lines.py
v5 inference: https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_v5_demo.py
fast v4 & v5.2 inference: https://github.com/BlinkDL/ChatRWKV/blob/main/rwkv_pip_package/src/rwkv/model.py
v5.2 1.5B demo (great for its size): https://huggingface.co/spaces/BlinkDL/ChatRWKV-gradio
v5.2 1.5B benchmarks: https://twitter.com/BlinkDL_AI/status/1717543614434402661
a few remarks: […]
|
Not sure if it helps, but I have a GGML-based Rust implementation here: https://github.com/KerfuffleV2/smolrsrwkv/blob/main/smolrwkv/src/ggml/graph.rs (that's just v4 inference). This is actually the reason I made my first contribution to the project: trying to get the map ops (now superseded) to work around what GGML didn't support. I think that's mostly still the case, so the majority of these will probably still need to use custom mapping: https://github.com/KerfuffleV2/smolrsrwkv/blob/main/smolrwkv/src/ggml/map_ops.rs |
Hi all! Maintainer of rwkv.cpp here. Indeed, having a separate repository for RWKV leads to […]. That said, I like the compactness and simplicity of […]. In the end, users will decide :)
On a more practical note: if support for RWKV is added into llama.cpp, […]. Furthermore, if support for both RWKV v4 and RWKV v5 is implemented in llama.cpp, […]. Until then, my plan is to continue supporting […]. I won't be able to help with migrating […]. |
Hi @saharNooby - great work with rwkv.cpp. I'm mainly interested to see what would […]. I'm looking forward to contributions, as I doubt I will have the time to implement it myself. So we will have to see if RWKV support ends up in llama.cpp […]. Alternatively, we should also look for other LLM architectures that would present some sort of a challenge and try to integrate them as well, in the same spirit of understanding what […]. |
Regarding […]: […]
Regarding […]: The only difference is that Attention was replaced with WKV, which can be computed in a recurrent manner. Everything else -- layer structure, MLP, embed/unembed -- is the same as in Transformers. Some early versions of RWKV even use the popular […]
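(For readers skimming the thread: below is a rough NumPy sketch of the v4 WKV time-mixing step in its recurrent form, loosely following the "RWKV in 100 lines" write-up linked further down; the variable names are illustrative and the numerical-stability tricks used by real implementations are omitted.)

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def time_mixing_v4(x, last_x, num, den,
                   decay, bonus, mix_k, mix_v, mix_r,
                   Wk, Wv, Wr, Wout):
    # Token shift: interpolate between the current and previous input.
    k = Wk @ (x * mix_k + last_x * (1 - mix_k))
    v = Wv @ (x * mix_v + last_x * (1 - mix_v))
    r = Wr @ (x * mix_r + last_x * (1 - mix_r))

    # WKV: a weighted average over past values, read off from a running
    # numerator/denominator instead of attending over all previous tokens.
    wkv = (num + np.exp(bonus + k) * v) / (den + np.exp(bonus + k))
    rwkv = sigmoid(r) * wkv

    # Decay the running sums and fold in the current token; this pair of
    # vectors is the whole per-layer "attention" state.
    num = np.exp(-np.exp(decay)) * num + np.exp(k) * v
    den = np.exp(-np.exp(decay)) * den + np.exp(k)

    return Wout @ rwkv, (x, num, den)
```

The `(x, num, den)` triple plays the role the KV cache plays in a Transformer, but it stays the same size no matter how many tokens have been processed.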
Yep! |
the real difference is that RWKV (and other "linear attention" models) uses a fixed-size state instead of a growing KV cache :) so it's like the recurrence sketched after this comment:
and you can clone & save states, to make a "state cache" for various inputs to accelerate inference. |
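(A minimal sketch of what that looks like from the caller's side; `model.forward(token, state)` is a hypothetical interface, not an existing llama.cpp or rwkv.cpp API. The state is fixed-size, and cloning it after the prompt gives the "state cache" described above.)

```python
import copy

def generate(model, prompt_tokens, n_new, state=None):
    # The state is a fixed-size per-layer structure; unlike a KV cache it
    # does not grow as more tokens are consumed.
    for tok in prompt_tokens:
        logits, state = model.forward(tok, state)

    # "State cache": checkpoint the state right after the prompt, so other
    # continuations of the same prompt can skip re-processing it.
    prefix_state = copy.deepcopy(state)

    out = []
    tok = int(logits.argmax())        # greedy sampling, for simplicity
    for _ in range(n_new):
        out.append(tok)
        logits, state = model.forward(tok, state)
        tok = int(logits.argmax())

    return out, prefix_state
```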
RWKV v4 in 100 lines (using numpy): https://johanwind.github.io/2023/03/23/rwkv_details.html
another blogpost: https://fullstackdeeplearning.com/blog/posts/rwkv-explainer/
v4 details: https://ben.bolte.cc/rwkv-model
RWKV zoom talk (TUE, NOV 7 · 9:30 AM CST): https://www.meetup.com/silicon-valley-generative-ai/events/296395124/
RWKV sf meet (Saturday, Nov 11 1:00pm PT): https://partiful.com/e/bi6lGCvZXCzZQNN5FjXW |
I'm excited to see RWKV's progress; I love this model. |
Is there a way to make RWKV's state stuff fit in with the current concept of sequences and KV cache manipulation? Can you do parallel generation with multiple independent sequences? |
If it's helpful, I asked some questions in the RWKV discord:

[2:06 AM] Kerfuffle: This might be a pretty dumb question, but just thinking about how RWKV could fit into llama.cpp. Probably the biggest thing is figuring out how it can work with llama.cpp's idea of batches and sequences and parallel generation.
[3:12 AM] Tomeno: you can run rwkv in parallel, but you can't edit the state like that - what you can do though is save and roll back to previous versions of the state cheaply
[3:20 AM] Kerfuffle: Thanks for the answer. Is there a way to save/roll back the state just for specific sequences when doing parallel generation?
[3:30 AM] Tomeno: well, i should say, save and load the state - the state is a "compressed" version of the entire context/sequence up to that point
[3:45 AM] Tomeno: so no, once it's processed, you can't separate the tokens that went into it
[3:46 AM] Tomeno: what you could do is something like save the state after every reply of a chatbot, and then you could load any point in that conversation back up and continue from there
[3:47 AM] Tomeno: or save a number of states to disk and load them back up at any time, no matter how long the input sequence was, the state is about the same size
[3:52 AM] Kerfuffle: Thanks again. I guess the main issue is keeping the state of sequences separate which I guess actually isn't possible.
[3:53 AM] Kerfuffle: Seems like it would be really hard to fit RWKV into llama.cpp as an alternative model architecture.
[4:17 AM] Kerfuffle: I feel like there's got to be a way to do separate sequences in general otherwise it's a HUGE strike against RWKV. Just for example, suppose I have an RWKV model that works as well as ChatGPT. I want to set up a website where people can query it. A service like that requires submitting queries in huge batches, doing a completely separate decode for each individual user just wouldn't work.
[4:20 AM] Tomeno: oh wait, i misunderstood what you meant
[4:20 AM] Tomeno: when you process multiple sequences in parallel, each of them has its own associated state
[4:21 AM] Tomeno: put very simply, the input to rwkv is state + next token
[4:23 AM] Kerfuffle: Ah, okay, good. Yeah, I have a vague idea of how it probably works then.
[4:23 AM] Tomeno: i thought when you wrote "roll back the state for specific sequences" you meant, like, take out a set of tokens from the context
[4:23 AM] Kerfuffle: You could just let each sequence have its own state and somehow do the calculation so the correct state is involved for each sequence.
[4:23 AM] Kerfuffle: You were correct. :) I was actually asking about both things.
[4:24 AM] Kerfuffle: I'm just generally trying to figure out how practical it is (or practical within my capabilities) to try to add RWKV support to llama.cpp
[4:24 AM] Tomeno: there were some demos of parallel inference posted recently though i have no idea how to find it
[4:25 AM] Kerfuffle: Well, the first step is knowing it's even possible, so that definitely helps.
[4:26 AM] Mathmagician: I think web-rwkv lets you inference multiple sequences in parallel

This is the […]. From that conversation, it seems like parallel generation wouldn't be too much of a problem. However, KV editing operations like rewinding seem like they would be extremely difficult. Tomeno mentioned saving the RWKV sequence state per token, which may be possible, but I'm guessing the per-token state is going to be too large to really make that practical.
So I think the only way it could really work with how […]. On an unrelated note, a WebGPU backend seems like an interesting idea... |
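(Sketching the bookkeeping implied by that conversation, using the same hypothetical `model.forward(token, state)` interface as the earlier sketch rather than any existing llama.cpp structure: each sequence owns its own fixed-size state, and "rewinding" means restoring a saved checkpoint, not removing individual tokens.)

```python
import copy

class SequenceStates:
    """One recurrent state per sequence id, plus named checkpoints."""

    def __init__(self, make_empty_state):
        self.make_empty_state = make_empty_state
        self.states = {}        # seq_id -> current state
        self.checkpoints = {}   # (seq_id, tag) -> saved state

    def step(self, model, seq_id, token):
        if seq_id not in self.states:
            self.states[seq_id] = self.make_empty_state()
        logits, new_state = model.forward(token, self.states[seq_id])
        self.states[seq_id] = new_state
        return logits

    def checkpoint(self, seq_id, tag):
        self.checkpoints[(seq_id, tag)] = copy.deepcopy(self.states[seq_id])

    def rollback(self, seq_id, tag):
        # Rewinding is only possible by restoring a whole checkpoint;
        # individual tokens cannot be removed from a recurrent state.
        self.states[seq_id] = copy.deepcopy(self.checkpoints[(seq_id, tag)])
```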
You can save the RWKV state every n tokens, and you can save those states to RAM / disk. |
I'm looking at it from the perspective of how it can be integrated into llama.cpp. |
(2+64)*2560 numbers for each block; 32*(2+64)*2560 numbers for the full model |
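(For scale, assuming those figures refer to a 32-layer model with hidden size 2560 and head size 64, the per-sequence state is on the order of tens of megabytes:)

```python
n_layer, n_embd, head_size = 32, 2560, 64   # assumed model shape

per_block = (2 + head_size) * n_embd        # 168,960 numbers per block
full_state = n_layer * per_block            # 5,406,720 numbers in total

print(f"fp32: {full_state * 4 / 1e6:.1f} MB, fp16: {full_state * 2 / 1e6:.1f} MB")
# roughly 21.6 MB in fp32 (10.8 MB in fp16) per independent sequence
```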
There's been renewed progress in the RWKV space with Eagle-7b: https://blog.rwkv.com/p/eagle-7b-soaring-past-transformers. |
RWKV support should be reconsidered for llama.cpp, given the recent merge of Mamba SSM support. |
If nobody else does it, I'll have time to work on RWKV in llama.cpp […]. Mamba took me a bit more than a month to implement in llama.cpp. If anyone reading this is interested in working on this before I have more time, feel free to go ahead. |
I've been taking up the task of implementing support for the RWKV 5 architecture. I've had some issues getting the included python conversion code adapted for RWKV, however. Of course, this is the first step to getting RWKV working. |
Great to know that 🥰🥰🥰 |
please try the much stronger v6.0-world 2.1 model :) design similar to v5. 1b6 done, 3b & 7b soon
https://huggingface.co/spaces/BlinkDL/RWKV-Gradio-1
https://twitter.com/BlinkDL_AI/status/1773503808221712722
The difference between v6 and v5: […] |
Over Easter we've got a long weekend here, but I figured I'd give a few updates on my work on this: […]
On RWKV v6, I hadn't seen that demo yet! It looks straightforward to add both once one of the two is working. |
I got the tokenizer in and functional so far, working with the "tokenize" example. I'm considering submitting the tokenizer by itself as a small PR to reduce review load; any thoughts on this? |
Either way would be fine - the tokenizer alone might not be useful for anything other than RWKV, so there is no point in merging it alone |
I'm hitting some issues with the KV cache initialization, so I'm taking this moment to update on the work done so far. WIP code available here: https://github.com/RWKV/llama.cpp
This can be tested using a partially generated GGUF (generated using gguf-swiss): […]
Currently I'm having some issues tracking down an initialization issue: […]
|
The KV cache for recurrent models is sized from the GGUF metadata keys […]. The following are used to size the […] (llama.cpp, lines 1865 to 1875 in 0d56246): […]
If RWKV uses 2 different recurrent states (e.g. one for time mix and the other for channel mix, though I'm not yet sure how they are used), it might be useful to add a new metadata key for the stride of the convolution and make it 0 for RWKV (possibly called […]). Re-using […] |
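(A rough sketch of the sizing logic being discussed, written out in Python for readability; the `ssm.*` key names and the two formulas mirror how the Mamba support sizes its per-sequence states, and whether RWKV re-uses them or gets its own keys is exactly the open question above, so treat this as an assumption rather than settled llama.cpp behavior.)

```python
def recurrent_state_sizes(hparams: dict) -> tuple[int, int]:
    """Per-sequence recurrent state sizes, in numbers of elements.

    hparams is assumed to expose the Mamba-style GGUF metadata
    (arch prefix omitted): ssm.conv_kernel, ssm.inner_size, ssm.state_size.
    """
    d_conv = hparams.get("ssm.conv_kernel", 0)
    d_inner = hparams.get("ssm.inner_size", 0)
    d_state = hparams.get("ssm.state_size", 0)

    # "K"-like slot: rolling convolution window (zero if there is no conv step).
    n_embd_k_s = (d_conv - 1) * d_inner if d_conv > 0 else 0
    # "V"-like slot: the recurrent state proper.
    n_embd_v_s = d_state * d_inner
    return n_embd_k_s, n_embd_v_s

# With the convolution kernel/stride set to 0, as suggested above for an
# architecture without the conv step, the first slot simply collapses to zero.
```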
Another update; thanks for the notes! I've resolved the initial crash issues on initialization, though mostly with hacky temporary placeholders (like re-using the ssm scope keys). I'll put up a new version of the temporary GGUF file on Monday. The remainder of the work is now to fill in the rest of the network graph, link it up with the KV cache hack for tracking state, and then start handling all the individual hacks one by one. |
please check the unit tests in https://github.com/BlinkDL/ChatRWKV/blob/main/tokenizer/rwkv_tokenizer.py (vs. the reference tokenizer) |
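(For context, that reference tokenizer is, at its core, a greedy longest-match over a byte-level vocabulary, usually implemented with a trie. A naive sketch of the idea, with a made-up `vocab` dict standing in for the real vocabulary file:)

```python
def tokenize_greedy(text: bytes, vocab: dict[bytes, int]) -> list[int]:
    """Greedy longest-match byte tokenizer (naive O(n * max_len) version).

    vocab maps token byte strings to ids; a real implementation walks a
    trie so each step costs only the length of the match.
    """
    max_len = max(len(t) for t in vocab)
    ids, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                ids.append(vocab[piece])
                i += length
                break
        else:
            raise ValueError("no match; the vocab must cover every single byte")
    return ids

# Example with a toy vocab:
# tokenize_greedy(b"hello", {b"h": 1, b"e": 2, b"l": 3, b"o": 4, b"ll": 5}) == [1, 2, 5, 4]
```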
https://github.com/RWKV/rwkv.cpp supports v6 now |
Conversion and quantization using b3651 worked fine (src HF model: https://huggingface.co/RWKV/v6-Finch-7B-HF). Conversation (using llama-server) initially produced some output, but over 6 attempts it crashed 3 times after the 1st or 2nd message, ending up with […]
It doesn't look fully supported / working yet. |
Thanks for your testing. |
@MoonRide303 @MollySophia This should be fixed in #9249 |
I briefly tested Q6_K quant of Finch 7B using llama-server b3658 - seems to be okay (no longer crashing). |
What tps speeds are you getting on a GPU? |
Would rwkv-7 support be possible in the future given that we now have a model release? https://huggingface.co/BlinkDL/rwkv-7-world/tree/main |
Absolutely! We will work on that soon. |
There is also a 32B model converted from Qwen-2.5-32B, based on RWKV-6, which just released: https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1/blob/main/modeling_rwkv6qwen2.py. The modelling code just takes Qwen's code and replaces its attention with RWKV-6's attention. Edit: It appears that the code for the RWKV attention used in QRWKV6 is slightly different from the one used in RWKV-6. In the modeling code on Hugging Face, it ends up calling a different kernel than standard RWKV-6 does, which means QRWKV6 doesn't have the […] |
RWKV is a 100% RNN language model, and the only RNN (as of now) that can match Transformers in quality and scaling, while being faster and saving memory.
Info: https://github.com/BlinkDL/ChatRWKV
RWKV is a novel large language model architecture, with the largest model in the family having 14B parameters. In contrast to Transformers with O(n^2) attention, RWKV requires only the state from the previous step to calculate logits. This makes RWKV very CPU-friendly at large context lengths.
Experimental GGML port: https://github.com/saharNooby/rwkv.cpp
Edit by @ggerganov:
Adding @BlinkDL's comment below to OP for visibility:

The latest "Raven"-series Alpaca-style-tuned RWKV 14B & 7B models are very good.
Online demo: https://huggingface.co/spaces/BlinkDL/Raven-RWKV-7B
Download: https://huggingface.co/BlinkDL/rwkv-4-raven