Commit 84e0698

Merge pull request ggml-org#16 from saharNooby/outliers-preserving-quantization-PR
Add Q4_1_O quantization format that preserves outliers in weights and does dot in FP32
2 parents d12088e + 874826c commit 84e0698

10 files changed: +873 −206 lines changed
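The core idea, per the commit title: quantize 4-bit `Q4_1`-style, but keep the largest-magnitude weight of each block (the "outlier") in full precision, and do the matmul dot product in FP32. A minimal NumPy sketch of that idea follows; the block size, field layout, and function names are illustrative assumptions, not rwkv.cpp's actual on-disk format:

```python
import numpy as np

QK = 32  # block size used here for illustration; not necessarily rwkv.cpp's

def quantize_q4_1_o_block(block):
    """Q4_1-style 4-bit quantization of one block, except the largest-
    magnitude value (the "outlier") is kept in full FP32 precision."""
    block = np.asarray(block, dtype=np.float32)
    outlier_idx = int(np.argmax(np.abs(block)))
    outlier_val = np.float32(block[outlier_idx])
    rest = np.delete(block, outlier_idx)  # fit min/scale without the outlier
    lo = np.float32(rest.min())
    scale = np.float32((rest.max() - lo) / 15.0)
    if scale == 0.0:
        scale = np.float32(1.0)
    quants = np.clip(np.round((block - lo) / scale), 0.0, 15.0).astype(np.uint8)
    return quants, lo, scale, outlier_idx, outlier_val

def dequantize_q4_1_o_block(quants, lo, scale, outlier_idx, outlier_val):
    out = quants.astype(np.float32) * scale + lo
    out[outlier_idx] = outlier_val  # restore the outlier exactly
    return out

def dot_q4_1_o(qblock, x):
    """Dot product done in FP32, as the commit title says: dequantize the
    block, then accumulate against FP32 activations."""
    return float(np.dot(dequantize_q4_1_o_block(*qblock),
                        np.asarray(x, dtype=np.float32)))
```

Restoring one exact outlier per block is what distinguishes this from plain `Q4_1`, where a single huge weight stretches the quantization range and destroys the precision of the other 31 values in the block.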

README.md

Lines changed: 16 additions & 11 deletions
@@ -10,9 +10,10 @@ This project provides [a C library rwkv.h](rwkv.h) and [a convenient Python wrap
 **TODO (contributions welcome!)**:

-1. Measure latency and perplexity of different model sizes (169M to 14B) and data types (FP32, FP16, Q4_0, Q4_1)
-2. Test on Linux (including Colab) and MacOS
-3. Make required memory calculation more robust (see #4)
+1. Optimize AVX2 implementation of `Q4_1_O` matmul — currently, it is as slow as `FP32`
+2. Measure latency and perplexity of different model sizes (169M to 14B) and data types (`FP32`, `FP16`, `Q4_0`, `Q4_1`, `Q4_1_O`)
+3. Test on Linux (including Colab) and MacOS
+4. Make required memory calculation more robust (see [#4](https://github.com/saharNooby/rwkv.cpp/issues/4))

 ## How to use

@@ -68,7 +69,7 @@ If everything went OK, `librwkv.so` (Linux) or `rwkv.o` (MacOS) file should appe
 ```commandline
 # Windows
-python rwkv\convert_rwkv_to_ggml.py C:\RWKV-4-Pile-169M-20220807-8023.pth C:\rwkv.cpp-169M.bin float16
+python rwkv\convert_pytorch_to_ggml.py C:\RWKV-4-Pile-169M-20220807-8023.pth C:\rwkv.cpp-169M.bin float16

 # Linux / MacOS
 python rwkv/convert_pytorch_to_ggml.py ~/Downloads/RWKV-4-Pile-169M-20220807-8023.pth ~/Downloads/rwkv.cpp-169M.bin float16
@@ -80,13 +81,17 @@ To convert the model into INT4 quantized format, run:
 ```commandline
 # Windows
-python rwkv\quantize.py C:\rwkv.cpp-169M.bin C:\rwkv.cpp-169M-Q4_1.bin 3
+python rwkv\quantize.py C:\rwkv.cpp-169M.bin C:\rwkv.cpp-169M-Q4_1_O.bin 4

 # Linux / MacOS
-python rwkv/quantize.py ~/Downloads/rwkv.cpp-169M.bin ~/Downloads/rwkv.cpp-169M-Q4_1.bin 3
+python rwkv/quantize.py ~/Downloads/rwkv.cpp-169M.bin ~/Downloads/rwkv.cpp-169M-Q4_1_O.bin 4
 ```

-Pass `2` for `Q4_0` format (smaller size, lower quality), `3` for `Q4_1` format (larger size, higher quality).
+Formats available:
+
+- `4`: `Q4_1_O`, best quality, very slow (as `FP32`).
+- `3`: `Q4_1`, poor quality, very fast (as `FP16`).
+- `2`: `Q4_0`, worst quality, breaks larger models, moderately fast (between `FP16` and `FP32`).

 ### 4. Run the model
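For scale on the size side of the trade-off: block formats like these can be compared by their effective bits per weight. A back-of-the-envelope sketch, assuming ggml's block-of-32 layout of the time (an FP32 scale per block for `Q4_0`, plus an FP32 min for `Q4_1`, then 32 packed 4-bit quants; `Q4_1_O`'s exact layout is not shown on this page, so it is omitted):

```python
QK = 32  # ggml quantizes weights in blocks of 32

def bits_per_weight(block_bytes: int, block_size: int = QK) -> float:
    """Effective bits per weight for a block-quantized format."""
    return block_bytes * 8 / block_size

# Q4_0: FP32 scale + 32 x 4-bit quants            -> 4 + 16 = 20 bytes/block
# Q4_1: FP32 scale + FP32 min + 32 x 4-bit quants -> 8 + 16 = 24 bytes/block
print(bits_per_weight(20))  # 5.0 bits/weight
print(bits_per_weight(24))  # 6.0 bits/weight
```

So `Q4_1` pays one extra bit per weight over `Q4_0` for the per-block minimum, which is what lets it represent asymmetric value ranges better.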

@@ -98,20 +103,20 @@ To generate some text, run:
 ```commandline
 # Windows
-python rwkv\generate_completions.py C:\rwkv.cpp-169M-Q4_1.bin
+python rwkv\generate_completions.py C:\rwkv.cpp-169M-Q4_1_O.bin

 # Linux / MacOS
-python rwkv/generate_completions.py ~/Downloads/rwkv.cpp-169M-Q4_1.bin
+python rwkv/generate_completions.py ~/Downloads/rwkv.cpp-169M-Q4_1_O.bin
 ```

 To chat with a bot, run:

 ```commandline
 # Windows
-python rwkv\chat_with_bot.py C:\rwkv.cpp-169M-Q4_1.bin
+python rwkv\chat_with_bot.py C:\rwkv.cpp-169M-Q4_1_O.bin

 # Linux / MacOS
-python rwkv/chat_with_bot.py ~/Downloads/rwkv.cpp-169M-Q4_1.bin
+python rwkv/chat_with_bot.py ~/Downloads/rwkv.cpp-169M-Q4_1_O.bin
 ```

 Edit [generate_completions.py](rwkv%2Fgenerate_completions.py) or [chat_with_bot.py](rwkv%2Fchat_with_bot.py) to change prompts and sampling settings.
