
Add torchao quant for mixtral and qwen_moe #1418

Merged
merged 8 commits into sgl-project:main on Sep 14, 2024

Conversation

jerryzh168
Contributor

Summary:
Similar to #1341, we add torchao quantization to the Mixtral and Qwen MoE models.
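
For context, torchao weight-only quantization is applied to a model's linear layers through torchao's `quantize_` API. A minimal sketch of the mechanism (the toy module below is illustrative, not this PR's actual code):

```python
import torch
from torchao.quantization import quantize_, int4_weight_only

# Toy module standing in for the MoE linear layers; int4 weight-only
# quantization expects bfloat16 weights on CUDA.
model = torch.nn.Sequential(torch.nn.Linear(4096, 4096)).to(torch.bfloat16).cuda()

# Swap every linear layer's weight for an int4 weight-only quantized tensor
# with group size 128, which is what the "int4wo-128" config string below selects.
quantize_(model, int4_weight_only(group_size=128))
```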

Test Plan:
Note: torch.compile is not working yet, and I can't get a local torch nightly install working either. I'll wait for the PyTorch 2.5 release in mid-October, or check again later.

python3 -m sglang.bench_latency --model Qwen/Qwen1.5-MoE-A2.7B --batch-size 1 --input 128 --output 8 
Warmup ...
Prefill. latency: 0.05532 s, throughput: 2313.73 token/s
Decode. latency: 0.00896 s, throughput: 111.65 token/s
Decode. latency: 0.00833 s, throughput: 120.04 token/s
Decode. latency: 0.00869 s, throughput: 115.06 token/s
Decode. latency: 0.00842 s, throughput: 118.79 token/s
Decode. median latency: 0.00855 s, median throughput: 116.89 token/s
Total. latency: 0.090 s, throughput: 1471.26 token/s
Benchmark ...
Prefill. latency: 0.04294 s, throughput: 2980.61 token/s
Decode. latency: 0.00839 s, throughput: 119.12 token/s
Decode. latency: 0.00828 s, throughput: 120.78 token/s
Decode. latency: 0.00857 s, throughput: 116.64 token/s
Decode. latency: 0.00853 s, throughput: 117.19 token/s
Decode. latency: 0.00859 s, throughput: 116.39 token/s
Decode. median latency: 0.00853 s, median throughput: 117.17 token/s
Total. latency: 0.111 s, throughput: 1226.84 token/s

python3 -m sglang.bench_latency --model Qwen/Qwen1.5-MoE-A2.7B --batch-size 1 --input 128 --output 8 --torchao-config int4wo-128

Warmup ...
Prefill. latency: 0.06413 s, throughput: 1996.05 token/s
Decode. latency: 0.00764 s, throughput: 130.84 token/s
Decode. latency: 0.00748 s, throughput: 133.73 token/s
Decode. latency: 0.00725 s, throughput: 137.84 token/s
Decode. latency: 0.00721 s, throughput: 138.74 token/s
Decode. median latency: 0.00737 s, median throughput: 135.76 token/s
Total. latency: 0.094 s, throughput: 1408.61 token/s
Benchmark ...
Prefill. latency: 0.05239 s, throughput: 2443.43 token/s
Decode. latency: 0.00739 s, throughput: 135.25 token/s
Decode. latency: 0.00720 s, throughput: 138.90 token/s
Decode. latency: 0.00718 s, throughput: 139.21 token/s
Decode. latency: 0.00722 s, throughput: 138.42 token/s
Decode. latency: 0.00745 s, throughput: 134.30 token/s
Decode. median latency: 0.00731 s, median throughput: 136.82 token/s
Total. latency: 0.111 s, throughput: 1223.51 token/s
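
The `--torchao-config` string encodes the quantization scheme and, for int4, the group size: `int4wo-128` is int4 weight-only with group size 128, and `fp8wo` (used in the A100 run below) is float8 weight-only. A minimal sketch of how such a string could be dispatched to torchao (illustrative only; the actual mapping lives in sglang's torchao integration from #1341):

```python
from torchao.quantization import (
    quantize_,
    int4_weight_only,
    int8_weight_only,
    float8_weight_only,
)

def apply_torchao_config(model, torchao_config: str):
    """Illustrative dispatch for config strings like "int4wo-128" or "fp8wo"."""
    if torchao_config.startswith("int4wo-"):
        # The suffix is the quantization group size, e.g. "int4wo-128".
        group_size = int(torchao_config.split("-")[1])
        quantize_(model, int4_weight_only(group_size=group_size))
    elif torchao_config == "int8wo":
        quantize_(model, int8_weight_only())
    elif torchao_config == "fp8wo":
        quantize_(model, float8_weight_only())
    else:
        raise ValueError(f"Unknown torchao config: {torchao_config}")
```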

A100, no compile:
python3 -m sglang.bench_latency --model Qwen/Qwen1.5-MoE-A2.7B --batch-size 1 --input 128 --output 8 --torchao-config fp8wo
max_total_num_tokens=199454
Warmup ...
Prefill. latency: 0.06958 s, throughput: 1839.60 token/s
Decode. latency: 0.02343 s, throughput: 42.68 token/s
Decode. latency: 0.02342 s, throughput: 42.70 token/s
Decode. latency: 0.02368 s, throughput: 42.23 token/s
Decode. latency: 0.02337 s, throughput: 42.80 token/s
Decode. median latency: 0.02342 s, median throughput: 42.69 token/s
Total. latency: 0.163 s, throughput: 807.48 token/s
Benchmark ...
Prefill. latency: 0.05767 s, throughput: 2219.36 token/s
Decode. latency: 0.02293 s, throughput: 43.61 token/s
Decode. latency: 0.02026 s, throughput: 49.36 token/s
Decode. latency: 0.02029 s, throughput: 49.29 token/s
Decode. latency: 0.02024 s, throughput: 49.41 token/s
Decode. latency: 0.02026 s, throughput: 49.36 token/s
Decode. median latency: 0.02025 s, median throughput: 49.39 token/s
Total. latency: 0.222 s, throughput: 611.87 token/s


@zhyncs zhyncs self-assigned this Sep 14, 2024
@jerryzh168 jerryzh168 changed the title Add torchao quant for mixtral Add torchao quant for mixtral and qwen_moe Sep 14, 2024
Review comments (resolved): python/sglang/srt/models/qwen2_moe.py
Review comments (resolved): python/sglang/srt/models/mixtral.py
@merrymercy merrymercy enabled auto-merge (squash) September 14, 2024 06:42
@merrymercy merrymercy merged commit 30b404c into sgl-project:main Sep 14, 2024
10 checks passed
qeternity pushed a commit to qeternity/sglang that referenced this pull request Sep 14, 2024
@zhyncs zhyncs added the quant LLM Quantization label Sep 19, 2024