Improve FP6-LLM 2+4bit weight splitting + user API #279

gau-nernst · 2024-05-25T23:55:50Z

Address #208

2+4bit weight splitting

Port https://github.com/pytorch/ao/blob/4ca3985be603e6496da7ec57adf1942c8b32a78e/torchao/csrc/fp6_llm/weight_prepacking.cpp to pure PyTorch.

FP16 weight, (8192, 8192). Ryzen 5600, 4070Ti SUPER

device	op	time (m/s)
CPU (num_threads=1)	original FP16->FP6 + original 2+4bit splitting	1380.96
CPU (num_threads=1)	new FP16->FP6 + original 2+4bit splitting	616.285
CPU (num_threads=1)	new FP16->FP6 + new 2+4bit splitting	587.965
CPU (num_threads=4)	original FP16->FP6 + original 2+4bit splitting	1213.08
CPU (num_threads=4)	new FP16->FP6 + original 2+4bit splitting	334.502
CPU (num_threads=4)	new FP16->FP6 + new 2+4bit splitting	246.911
CUDA	new FP16->FP6 + original 2+4bit splitting	257.539
CUDA	new FP16->FP6 + new 2+4bit splitting	1.05908

Note:

original 2+4bit splitting only works on CPU. Thus, for the 2nd last row, FP16->FP6 is done on GPU, but 2+4bit splitting is done on CPU.

User API

from torchao.quantization.fp6_llm import convert_fp6_llm

model = ...
convert_fp6_llm(model)  # nn.Linear modules will be replaced with Fp6LlmLinear in-place

I opt for custom linear module instead of tensor subclass mainly because it's easier to implement.

Note:

Fp6LlmLinear will cast input to FP16 and cast output to original dtype.

pytorch-bot · 2024-05-25T23:55:53Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/279

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d798eaf with merge base 4ca3985 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

* add annotation * add weight splitting logic * update from fp6_quant * merge to_tc_float6_e3m2 * add more optimized version * add some notes * add from_tc_float6_e3m2 * add some docs * make fp6_llm.py * add test for linear * fix fp6 llm * switch to v2 since it's faster * fix type hint for old python * simplify further * fix typing for old python * add test * eliminate indexing.faster on CUDA * skip fp6_llm on cpu * improve error message * add support for extra batch dims * cast output to original dtype * fix precision error due to dtype

gau-nernst added 5 commits May 22, 2024 16:32

add annotation

36add71

Merge branch 'main' into fp6_weight_split

18824a7

add weight splitting logic

58bcf2f

Merge branch 'main' into fp6_weight_split

dd84cc2

update from fp6_quant

5e44f1c

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 25, 2024

gau-nernst added 2 commits May 26, 2024 08:21

merge to_tc_float6_e3m2

bccc4f6

add more optimized version

cfa304c

gau-nernst mentioned this pull request May 26, 2024

FP6 dtype! #208

Open

gau-nernst added 3 commits May 26, 2024 15:49

add some notes

ed9cb1d

add from_tc_float6_e3m2

f609b6f

add some docs

dec40af

gau-nernst marked this pull request as ready for review May 26, 2024 08:41

make fp6_llm.py

5e5dfdc

gau-nernst marked this pull request as draft May 26, 2024 09:38

gau-nernst changed the title ~~Improve FP6-LLM 2+4bit weight splitting~~ Improve FP6-LLM 2+4bit weight splitting + user API May 26, 2024

gau-nernst added 9 commits May 26, 2024 17:50

add test for linear

5bdcd50

fix fp6 llm

708d485

switch to v2 since it's faster

ce0ffc1

fix type hint for old python

59e39ce

simplify further

cb43a7b

fix typing for old python

b90938a

add test

66cbe1d

eliminate indexing.faster on CUDA

5ed6767

skip fp6_llm on cpu

fa08a3d

gau-nernst marked this pull request as ready for review May 26, 2024 15:16

gau-nernst added 3 commits May 26, 2024 23:43

improve error message

6945498

add support for extra batch dims

70b5a4c

cast output to original dtype

d6c6b6a

fix precision error due to dtype

d798eaf

msaroufim self-requested a review May 26, 2024 17:35

msaroufim approved these changes May 26, 2024

View reviewed changes

msaroufim merged commit 7511b1d into pytorch:main May 26, 2024
13 checks passed

gau-nernst deleted the fp6_weight_split branch May 26, 2024 20:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve FP6-LLM 2+4bit weight splitting + user API #279

Improve FP6-LLM 2+4bit weight splitting + user API #279

gau-nernst commented May 25, 2024 •

edited

Loading

pytorch-bot bot commented May 25, 2024 •

edited

Loading

Improve FP6-LLM 2+4bit weight splitting + user API #279

Improve FP6-LLM 2+4bit weight splitting + user API #279

Conversation

gau-nernst commented May 25, 2024 • edited Loading

2+4bit weight splitting

User API

pytorch-bot bot commented May 25, 2024 • edited Loading

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/279

✅ No Failures

gau-nernst commented May 25, 2024 •

edited

Loading

pytorch-bot bot commented May 25, 2024 •

edited

Loading