Description
RWKV is a 100% RNN language model, and (as of now) the only RNN that can match transformers in quality and scaling while being faster and using less memory.
Info: https://github.com/BlinkDL/ChatRWKV
RWKV is a novel large language model architecture, with the largest model in the family having 14B parameters. In contrast to Transformers with O(n^2) attention, RWKV needs only the state from the previous step to compute the next logits. This makes RWKV very CPU-friendly at large context lengths.
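A minimal sketch of what that means in practice: the model is driven one token at a time and only a fixed-size state is carried between steps, so per-token cost does not grow with context length. `rwkv_step` below is a hypothetical stand-in, not the actual ChatRWKV or rwkv.cpp API.

```python
import numpy as np

N_LAYER, D_MODEL, VOCAB = 24, 1024, 50277    # illustrative sizes only

def rwkv_step(token_id, state):
    """Stand-in for one RWKV forward pass: (token, state) -> (logits, new state)."""
    new_state = state * 0.9                  # the real model updates per-layer state vectors
    logits = np.random.randn(VOCAB)          # the real model computes logits from the state
    return logits, new_state

# The state is a handful of vectors per layer; it never grows with the prompt.
state = np.zeros((N_LAYER * 5, D_MODEL), dtype=np.float32)
for tok in [0, 1, 2, 3]:                     # feed the prompt token by token
    logits, state = rwkv_step(tok, state)
next_tok = int(np.argmax(logits))            # then sample the next token from the logits
```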
Experimental GGML port: https://github.com/saharNooby/rwkv.cpp
The latest "Raven"-series Alpaca-style-tuned RWKV 14B & 7B models are very good.
Online demo: https://huggingface.co/spaces/BlinkDL/Raven-RWKV-7B
Download: https://huggingface.co/BlinkDL/rwkv-4-raven
Edit by @ggerganov:
Adding @BlinkDL's comment below to OP for visibility:
v4 inference: https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_in_150_lines.py
v5 inference: https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_v5_demo.py
fast v4 & v5.2 inference: https://github.com/BlinkDL/ChatRWKV/blob/main/rwkv_pip_package/src/rwkv/model.py
v5.2 1.5B demo (great for its size): https://huggingface.co/spaces/BlinkDL/ChatRWKV-gradio
v5.2 1.5B benchmarks: https://twitter.com/BlinkDL_AI/status/1717543614434402661
A few remarks:
- RWKV models have an RNN-style "one" mode and a GPT-style "seq" mode (see the sketch after this list)
- I am actually using exp(-exp(w)) for the time decay
- It seems good to precompute the embedding + emb_layernorm in bf16
- When using fp16, I divide the activations by 2 every 6 layers to avoid overflow
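To make the first two remarks concrete, here is a simplified sketch of the v4-style WKV recurrence in "one" mode, loosely following the RWKV_in_150_lines.py linked above. It is a single simplified channel group in plain NumPy: names (`w_raw`, `u`, `wkv_step`) are illustrative, and the log-space / max-subtraction tricks the real code uses for numerical stability are omitted for clarity.

```python
import numpy as np

d = 8                                       # toy channel dimension
w_raw = np.random.randn(d).astype(np.float32)
decay = np.exp(-np.exp(w_raw))              # the exp(-exp(w)) parameterization:
                                            # guarantees a per-channel decay in (0, 1)
u = np.random.randn(d).astype(np.float32)   # "time_first" bonus for the current token

def wkv_step(k, v, num, den):
    """One RNN-style ("one" mode) step: emit wkv and update the running state."""
    ek = np.exp(k)
    wkv = (num + np.exp(u + k) * v) / (den + np.exp(u + k))   # output for this token
    num = decay * num + ek * v              # carry numerator state to the next step
    den = decay * den + ek                  # carry denominator state to the next step
    return wkv, num, den

# "one" mode: feed tokens sequentially, carrying (num, den) as the recurrent state.
num = np.zeros(d, dtype=np.float32)
den = np.zeros(d, dtype=np.float32)
for _ in range(4):
    k = np.random.randn(d).astype(np.float32)
    v = np.random.randn(d).astype(np.float32)
    out, num, den = wkv_step(k, v, num, den)

# "seq" mode computes the same outputs for a whole prompt at once, GPT-style,
# which is faster for prompt processing but mathematically equivalent.
```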