llama : add RWKV models support #846

Closed
@multimediaconverter

Description


RWKV is a 100% RNN language model, and currently the only RNN that can match transformers in quality and scaling while being faster and using less memory.

Info: https://github.com/BlinkDL/ChatRWKV

RWKV is a novel large language model architecture, with the largest model in the family having 14B parameters. In contrast to Transformers with O(n^2) attention, RWKV only needs the state from the previous step to compute logits. This makes RWKV very CPU-friendly at large context lengths.
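As a rough illustration of why per-token cost stays flat, here is a minimal sketch of RNN-style ("one" mode) inference. `rwkv_forward`, the state layout, and all names are illustrative assumptions modelled on RWKV_in_150_lines.py, not llama.cpp code:

```python
import numpy as np

def generate(rwkv_forward, prompt_tokens, n_layer, n_embd, n_new=16):
    """Greedy generation with a fixed-size recurrent state (v4-style layout)."""
    # 5 state vectors per layer (att x/a/b/p and ffn x), as in RWKV_in_150_lines.py
    state = np.zeros((n_layer * 5, n_embd), dtype=np.float32)
    for i in range(n_layer):
        state[5 * i + 4] = -1e30          # running max exponent 'p' starts at -inf

    logits = None
    for tok in prompt_tokens:             # ingest the prompt one token at a time
        logits, state = rwkv_forward(tok, state)

    out = []
    for _ in range(n_new):
        tok = int(np.argmax(logits))      # each step only reads the previous state,
        out.append(tok)                   # so cost per token does not grow with context length
        logits, state = rwkv_forward(tok, state)
    return out
```

Note there is no growing KV cache: memory stays at `n_layer * 5 * n_embd` floats no matter how many tokens have been processed.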

Experimental GGML port: https://github.com/saharNooby/rwkv.cpp

The latest "Raven"-series Alpaca-style-tuned RWKV 14B & 7B models are very good.
Online demo: https://huggingface.co/spaces/BlinkDL/Raven-RWKV-7B
Download: https://huggingface.co/BlinkDL/rwkv-4-raven


Edit by @ggerganov:

Adding @BlinkDL's comment below to OP for visibility:

v4 inference: https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_in_150_lines.py

v5 inference: https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_v5_demo.py

fast v4 & v5.2 inference: https://github.com/BlinkDL/ChatRWKV/blob/main/rwkv_pip_package/src/rwkv/model.py

v5.2 1.5B demo (great for its size): https://huggingface.co/spaces/BlinkDL/ChatRWKV-gradio

v5.2 1.5B benchmarks: https://twitter.com/BlinkDL_AI/status/1717543614434402661

a few remarks:

  • RWKV models have an RNN-style "one" mode and a GPT-style "seq" mode
  • the time-decay actually applied per step is exp(-exp(w)) (see the sketch after this list)
  • it seems good to precompute embedding+emb_layernorm in bf16
  • when using fp16, I divide the hidden state by 2 every 6 layers to avoid overflow (see the second sketch below)
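On the exp(-exp(w)) remark: in v4 time-mixing, the per-step decay is exp(-exp(w)), and the recurrence is kept in log-space with a running maximum exponent so the exponentials never overflow. Below is a sketch of one step, with names following RWKV_in_150_lines.py (all arrays are per-channel):

```python
import numpy as np

def wkv_step(k, v, time_first, time_decay, aa, bb, pp):
    """One step of the v4 WKV recurrence; aa/bb/pp are the per-channel state."""
    # contribution of the current token, with the time_first ("u") bonus
    ww = time_first + k
    qq = np.maximum(pp, ww)
    e1 = np.exp(pp - qq)
    e2 = np.exp(ww - qq)
    wkv = (e1 * aa + e2 * v) / (e1 * bb + e2)

    # decay the state for the next step: subtracting exp(time_decay) in
    # log-space is exactly a per-step multiplication by exp(-exp(w))
    ww = pp - np.exp(time_decay)
    qq = np.maximum(ww, k)
    aa = np.exp(ww - qq) * aa + np.exp(k - qq) * v
    bb = np.exp(ww - qq) * bb + np.exp(k - qq)
    pp = qq
    return wkv, aa, bb, pp
```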
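And on the fp16 remark, a sketch of the /2-every-6-layers trick. In the reference rwkv package this runtime halving is paired with dividing the att/ffn output-projection weights at load time so results stay consistent; the names and layout below are illustrative only, not llama.cpp code:

```python
RESCALE_LAYER = 6  # halve the residual stream every 6 layers when running in fp16

def forward_fp16(blocks, x, state):
    # blocks: hypothetical per-layer callables; x: fp16 hidden state
    for i, block in enumerate(blocks):
        x = block(x, state, i)
        if (i + 1) % RESCALE_LAYER == 0:
            x = x / 2                     # keeps fp16 activations from overflowing
    return x
```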
