This repository compares the available open-source GEMM / GEMV kernels for the int4 / fp16 mixed-precision scheme (int4 weights, fp16 activations) with per-group quantization; a minimal sketch of the scheme is given after the list below.
- https://github.com/qwopqwop200/GPTQ-for-LLaMa
- https://github.com/turboderp/exllama
- https://github.com/PanQiWei/AutoGPTQ
- https://github.com/NVIDIA/FasterTransformer (only per-channel quantization is open-sourced, the per-block variant is not, so it is not compared here; it sounds promising though, being based on CUTLASS)
- AWQ implementation: https://github.com/mit-han-lab/llm-awq/tree/main/awq/kernels
- Probably missing others
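For reference, per-group weight quantization splits each row of the weight matrix into groups of `group_size` contiguous input channels and stores one scale and one zero-point per group. The sketch below is only an illustration of that scheme in plain PyTorch, using an asymmetric 4-bit round-trip; the actual kernels above pack the values into int32 words and differ in details.

```python
import torch

def quantize_per_group(weight: torch.Tensor, group_size: int = 128, bits: int = 4):
    # weight: (n, k) fp16 matrix, quantized along the k (input-channel) dimension
    n, k = weight.shape
    w = weight.float().reshape(n, k // group_size, group_size)

    w_min = w.amin(dim=-1, keepdim=True)
    w_max = w.amax(dim=-1, keepdim=True)

    max_int = 2**bits - 1
    scale = (w_max - w_min).clamp(min=1e-5) / max_int      # one scale per group
    zero = torch.round(-w_min / scale).clamp(0, max_int)   # one zero-point per group

    q = torch.round(w / scale + zero).clamp(0, max_int)    # 4-bit codes, stored unpacked here
    return q.to(torch.uint8), scale.half(), zero.to(torch.uint8)

def dequantize_per_group(q, scale, zero):
    w = (q.float() - zero.float()) * scale.float()
    return w.reshape(q.shape[0], -1).half()

# round-trip on an 8192 x 8192 weight with group_size=128 (the benchmarked shape)
w = torch.randn(8192, 8192, dtype=torch.float16)
q, s, z = quantize_per_group(w, group_size=128)
w_hat = dequantize_per_group(q, s, z)
print((w - w_hat).abs().mean())  # quantization error
```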
On A100-SXM4-80GB + Intel Xeon Platinum 8275CL CPU + CUDA 11.7/11.8 (should be rerun in Docker):
m | n | k | implementation | act_order | Time (ms/op) | Max mem (MB) |
---|---|---|---|---|---|---|
1 | 8192 | 8192 | baseline | True | 0.0937 | 177.6845 |
1 | 8192 | 8192 | gptqforllama | True | 0.2038 | 69.8450 |
1 | 8192 | 8192 | exllama | False | 0.0681 | 34.9143 |
1 | 8192 | 8192 | exllama | True | 0.0675 | 34.9471 |
1 | 8192 | 8192 | autogptq-triton | True | 0.3990 | 69.8450 |
1 | 8192 | 8192 | autogptq-cuda-old | False | 0.0831 | 71.9585 |
1 | 8192 | 8192 | autogptq-cuda | True | 0.1546 | 69.8778 |
On RTX 4090 + AMD Ryzen 9 7950X CPU + CUDA 11.8:
TODO
On A10G + AMD EPYC 7R32 CPU + CUDA 11.8 (docker):
m | n | k | implementation | act_order | Time (ms/op) | Max mem (MB) |
---|---|---|---|---|---|---|
1 | 8192 | 8192 | baseline | True | 0.2891 | 177.6845 |
1 | 8192 | 8192 | gptqforllama | True | 0.1746 | 69.8450 |
1 | 8192 | 8192 | autogptq-triton | True | 0.2963 | 69.8450 |
1 | 8192 | 8192 | autogptq-cuda-old | False | 0.0979 | 71.9585 |
1 | 8192 | 8192 | autogptq-cuda | True | 0.1483 | 69.8778 |
1 | 8192 | 8192 | exllama | False | 0.0842 | 34.9143 |
1 | 8192 | 8192 | exllama | True | 0.0839 | 34.9471 |
Shapes: A is m x k, B is n x k (the nn.Linear weight layout), and the kernels compute C = A * B^T, of shape m x n.
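In PyTorch terms, the fp16 baseline in the tables above presumably amounts to a plain half-precision matmul with these shapes; a minimal sketch (the `m`, `n`, `k` names mirror the flags of `run_benchmark.py`):

```python
import torch

m, n, k = 1, 8192, 8192  # --m, --n, --k

a = torch.randn(m, k, dtype=torch.float16, device="cuda")  # activations, m x k
b = torch.randn(n, k, dtype=torch.float16, device="cuda")  # weight, n x k (nn.Linear layout)

c = a @ b.t()  # C = A * B^T, shape m x n
assert c.shape == (m, n)
```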
To get stable timings, it can be a good idea to first lock the GPU frequency, see NVIDIA/cutlass#430 (comment).
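As a sketch of how a per-op latency like the "Time (ms/op)" column can be measured once the clocks are locked, one can use CUDA events with warmup and explicit synchronization. This is only the general approach, not necessarily what `run_benchmark.py` does internally:

```python
import torch

def benchmark_ms_per_op(fn, warmup: int = 20, iters: int = 100) -> float:
    # Warm up to exclude one-time costs (kernel compilation, allocator growth).
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per op

# example: time the fp16 baseline matmul from the sketch above
# ms = benchmark_ms_per_op(lambda: a @ b.t())
```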
Run exllama (in the `exllama` env):
CUDA_VISIBLE_DEVICES=0 python run_benchmark.py --m 1 --n 8192 --k 8192 --group_size 128 --exllama-path ../exllama --act-order yes
Run gptqforllama (in the `gptqforllama` env):
CUDA_VISIBLE_DEVICES=0 python run_benchmark.py --m 1 --n 8192 --k 8192 --group_size 128 --gptqforllama-path ../GPTQ-for-LLaMa --act-order yes
Run AutoGPTQ (specify `--autogptq-implem {triton, cuda-old, cuda}`):
CUDA_VISIBLE_DEVICES=0 python run_benchmark.py --m 1 --n 8192 --k 8192 --group_size 128 --autogptq-path ../AutoGPTQ/ --autogptq-implem triton --act-order yes
Run PyTorch fp16 * fp16 baseline:
CUDA_VISIBLE_DEVICES=0 python run_benchmark.py --m 1 --n 8192 --k 8192 --group_size 128 --baseline
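The "Max mem (MB)" column in the tables is presumably the peak CUDA memory allocated while running the op under test; a minimal sketch of how such a number can be obtained with PyTorch (not necessarily how `run_benchmark.py` reports it):

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run the op under test here, e.g. c = a @ b.t() ...
torch.cuda.synchronize()
print(f"max mem: {torch.cuda.max_memory_allocated() / 1024**2:.4f} MB")
```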
To run in Docker, follow https://stackoverflow.com/a/61737404 and build the image:
docker build -f Dockerfile --build-arg USER_ID=$(id -u) --build-arg GROUP_ID=$(id -g) -t container-q4f16 .
then run the benchmark inside the container:
docker run --gpus device=0 -it --rm container-q4f16:latest /bin/bash run.sh