@@ -10,9 +10,10 @@ This project provides [a C library rwkv.h](rwkv.h) and [a convinient Python wrap
 
 **TODO (contributions welcome!)**:
 
-1. Measure latency and perplexity of different model sizes (169M to 14B) and data types (FP32, FP16, Q4_0, Q4_1)
-2. Test on Linux (including Colab) and MacOS
-3. Make required memory calculation more robust (see #4)
+1. Optimize AVX2 implementation of `Q4_1_O` matmul — currently, it is as slow as `FP32`
+2. Measure latency and perplexity of different model sizes (169M to 14B) and data types (`FP32`, `FP16`, `Q4_0`, `Q4_1`, `Q4_1_O`)
+3. Test on Linux (including Colab) and MacOS
+4. Make required memory calculation more robust (see [#4](https://github.com/saharNooby/rwkv.cpp/issues/4))
 
 ## How to use
 
@@ -68,7 +69,7 @@ If everything went OK, `librwkv.so` (Linux) or `rwkv.o` (MacOS) file should appe
 
 ```commandline
 # Windows
-python rwkv\convert_rwkv_to_ggml.py C:\RWKV-4-Pile-169M-20220807-8023.pth C:\rwkv.cpp-169M.bin float16
+python rwkv\convert_pytorch_to_ggml.py C:\RWKV-4-Pile-169M-20220807-8023.pth C:\rwkv.cpp-169M.bin float16
 
 # Linux / MacOS
 python rwkv/convert_pytorch_to_ggml.py ~/Downloads/RWKV-4-Pile-169M-20220807-8023.pth ~/Downloads/rwkv.cpp-169M.bin float16
@@ -80,13 +81,17 @@ To convert the model into INT4 quantized format, run:
 
 ```commandline
 # Windows
-python rwkv\quantize.py C:\rwkv.cpp-169M.bin C:\rwkv.cpp-169M-Q4_1.bin 3
+python rwkv\quantize.py C:\rwkv.cpp-169M.bin C:\rwkv.cpp-169M-Q4_1_O.bin 4
 
 # Linux / MacOS
-python rwkv/quantize.py ~/Downloads/rwkv.cpp-169M.bin ~/Downloads/rwkv.cpp-169M-Q4_1.bin 3
+python rwkv/quantize.py ~/Downloads/rwkv.cpp-169M.bin ~/Downloads/rwkv.cpp-169M-Q4_1_O.bin 4
 ```
 
-Pass `2` for `Q4_0` format (smaller size, lower quality), `3` for `Q4_1` format (larger size, higher quality).
+Formats available:
+
+- `4`: `Q4_1_O`, best quality, very slow (as `FP32`).
+- `3`: `Q4_1`, poor quality, very fast (as `FP16`).
+- `2`: `Q4_0`, worst quality, breaks larger models, moderately fast (between `FP16` and `FP32`).
 
 ### 4. Run the model
 
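As an aside on the size half of the trade-offs listed above, a rough file-size estimate can be sketched from ggml's 32-element quantization blocks. The block layouts below (Q4_0: one FP32 scale + 16 bytes of nibbles per 32 weights; Q4_1: scale + min + nibbles) are assumptions about the ggml version in use, not taken from this diff, and `Q4_1_O` is omitted because its layout is specific to this repository:

```python
# Back-of-the-envelope size estimate for a quantized ggml file.
# Assumed bytes needed to store one 32-weight block in each format:
#   FP32: 32 * 4 = 128, FP16: 32 * 2 = 64,
#   Q4_0: 4 (FP32 scale) + 16 (nibbles) = 20,
#   Q4_1: 4 (scale) + 4 (min) + 16 (nibbles) = 24.
BYTES_PER_BLOCK = {"FP32": 128, "FP16": 64, "Q4_0": 20, "Q4_1": 24}
QK = 32  # weights per quantization block (assumed)

def estimate_size_mib(n_params: int, fmt: str) -> float:
    """Approximate weight payload size in MiB (ignores header, vocab, etc.)."""
    blocks = n_params / QK
    return blocks * BYTES_PER_BLOCK[fmt] / (1024 ** 2)

# Illustrative numbers for a 169M-parameter model:
for fmt in ("FP32", "FP16", "Q4_1", "Q4_0"):
    print(f"{fmt}: ~{estimate_size_mib(169_000_000, fmt):.0f} MiB")
```

This also shows why INT4 formats land at roughly a quarter of `FP32`: the per-block metadata adds only 1 to 2 extra bits per weight on top of the 4-bit nibbles.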
@@ -98,20 +103,20 @@ To generate some text, run:
 
 ```commandline
 # Windows
-python rwkv\generate_completions.py C:\rwkv.cpp-169M-Q4_1.bin
+python rwkv\generate_completions.py C:\rwkv.cpp-169M-Q4_1_O.bin
 
 # Linux / MacOS
-python rwkv/generate_completions.py ~/Downloads/rwkv.cpp-169M-Q4_1.bin
+python rwkv/generate_completions.py ~/Downloads/rwkv.cpp-169M-Q4_1_O.bin
 ```
 
 To chat with a bot, run:
 
 ```commandline
 # Windows
-python rwkv\chat_with_bot.py C:\rwkv.cpp-169M-Q4_1.bin
+python rwkv\chat_with_bot.py C:\rwkv.cpp-169M-Q4_1_O.bin
 
 # Linux / MacOS
-python rwkv/chat_with_bot.py ~/Downloads/rwkv.cpp-169M-Q4_1.bin
+python rwkv/chat_with_bot.py ~/Downloads/rwkv.cpp-169M-Q4_1_O.bin
 ```
 
 Edit [generate_completions.py](rwkv/generate_completions.py) or [chat_with_bot.py](rwkv/chat_with_bot.py) to change prompts and sampling settings.
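The sampling settings those scripts expose usually boil down to temperature plus nucleus (top-p) filtering. A minimal, generic sketch of that scheme is below; it is not this repository's implementation, just the standard technique over a plain list of logits:

```python
import math
import random

def sample_logits(logits, temperature=0.8, top_p=0.5, rng=random):
    """Temperature + nucleus (top-p) sampling over a list of logits."""
    # Softmax with temperature; shift by the max logit for stability.
    m = max(logits)
    weights = [math.exp((x - m) / temperature) for x in logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Nucleus filtering: keep the smallest set of highest-probability
    # tokens whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize over the kept tokens and draw one.
    r = rng.random() * cumulative
    for i in kept:
        r -= probs[i]
        if r <= 0.0:
            return i
    return kept[-1]
```

Lower `temperature` and `top_p` make output more deterministic; raising them trades coherence for variety, which is the knob these example scripts let you turn.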