
Add exllama q4 kernel #219

Merged: 20 commits merged into AutoGPTQ:main on Aug 6, 2023
Conversation

@fxmarty (Collaborator) commented Jul 31, 2023

The exllama int4/fp16 kernel is notably faster than the implementation in AutoGPTQ / GPTQ-for-llama (see e.g. huggingface/text-generation-inference#553 (comment); I will do a proper benchmark for auto-gptq), especially in the act-order case, where weights are reordered ahead of time and the activations on the fly, removing the need to reorder scales and zero points.
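
To illustrate the reordering point, here is a minimal sketch of the idea only (not the actual exllama kernel; all names here are illustrative):

```python
import torch

def act_order_matmul(x, w_fp16, g_idx):
    """Illustrative only: with act-order, input channels are grouped according to
    g_idx. Permuting the (here already dequantized) weight rows once at load time
    lets the kernel permute activations on the fly, instead of reordering
    scales/zeros for every group at every matmul."""
    perm = torch.argsort(g_idx)   # computed once, ahead of time
    w_perm = w_fp16[perm, :]      # weight rows permuted at load time
    return x[:, perm] @ w_perm    # activations permuted on the fly; equals x @ w_fp16
```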

I removed the ROCm support for now; let's first merge #214 and then test this kernel on ROCm.

Left to do:

  • Add tests
  • Test with peft
  • Implement the pack method - not sure what should be changed compared to cuda / cuda-old / triton?
  • Auto-detect bits/group size/act-order (can be done in another PR; actually not needed since most models have a quantize_config.json)

@TheBloke (Contributor)

Awesome! I'm really excited for this.

I've done a quick benchmark on:

  • 4090 GPU on an AMD EPYC 7352 24-core processor (slow single-core performance, i.e. CPU-bottlenecked)
  • Torch 2.0.1, CUDA 11.8
  • Benchmarking with examples/benchmark/generation_speed.py

Llama 1 7B GPTQ

Group size 128, Act-Order/desc_act = False

AutoGPTQ 0.3.2 main (standard CUDA kernel). With Fused Attention

  • 27.27 tokens/s
  • generated 5120 tokens using 187.7769591808319 seconds, generation speed: 27.266391054236674tokens/s
  • VRAM usage: 6530 MiB
  • GPU usage: 24% (CPU bottlenecked)

fxmarty's AutoGPTQ + ExLlama. Without fused attention (see below)

  • 36.53 tokens/s
  • generated 5120 tokens using 140.82491946220398 seconds, generation speed: 36.3572016909561tokens/s
  • VRAM usage: 6510 MiB
  • GPU usage: 33% (CPU bottlenecked)

Llama 1 33B GPTQ

Group size: None, Act-Order/desc_act = True

AutoGPTQ 0.3.2 main (standard CUDA kernel). With Fused Attention

  • 15.32 tokens/s
  • generated 5120 tokens using 334.10506224632263 seconds, generation speed: 15.324520872494963tokens/s
  • VRAM usage: 22666 MiB
  • GPU usage: 45%

fxmarty's AutoGPTQ + ExLlama. Without fused attention (see below)

  • 19.19 tokens/s
  • generated 5120 tokens using 266.84571146965027 seconds, generation speed: 19.187117423778886tokens/s
  • VRAM usage: 20848 MiB
  • GPU usage: 55%

Not yet a huge difference on this CPU-bottlenecked system, but definitely worthwhile - and it should be better still when fused attention works.

We also see lower VRAM usage in the 33B test, but that could be down to not using fused attention which tends to use a bit more VRAM.

I wanted to test group_size + desc_act, but it crashes at the moment.

I found three issues while doing this testing (two bugs, one problem), which I'll put in the next message.

@TheBloke (Contributor)

Bugs:

  1. Fused attention doesn't work, so I had to disable it for the above benchmark. Exception when `inject_fused_attention=True`:
Traceback (most recent call last):
  File "/workspace/exllama_AutoGPTQ/examples/benchmark/generation_speed.py", line 314, in <module>
    main()
  File "/workspace/exllama_AutoGPTQ/examples/benchmark/generation_speed.py", line 266, in main
    model, tokenizer = load_model_tokenizer(
  File "/workspace/exllama_AutoGPTQ/examples/benchmark/generation_speed.py", line 167, in load_model_tokenizer
    model = AutoGPTQForCausalLM.from_quantized(
  File "/usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/auto.py", line 105, in from_quantized
    return quant_func(
  File "/usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/_base.py", line 877, in from_quantized
    cls.fused_attn_module_type.inject_to_model(
  File "/usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/fused_llama_attn.py", line 156, in inject_to_model
    g_idx = torch.cat([q_proj.g_idx, k_proj.g_idx, v_proj.g_idx], dim=0)
TypeError: expected Tensor as element 0 in argument 0, but got NoneType
  2. When trying to test group_size + desc_act (e.g. 128g + act-order or 32g + act-order), I got a segmentation fault:
 root@8cf0559a3679:/workspace/exllama_AutoGPTQ/examples/benchmark (exllama-q4-kernel ✔) ᐅ python3 generation_speed.py --model_name_or_path /workspace/llama-7b-gptq-32gTrue --use_safetensors --use_fast_tokenizer --no_inject_fused_attention
2023-07-31 13:09:29 INFO [__main__] max_memory: None
2023-07-31 13:09:29 INFO [__main__] loading model and tokenizer
2023-07-31 13:09:29 INFO [auto_gptq.modeling._base] lm_head not been quantized, will be ignored when make_quant.
2023-07-31 13:09:32 WARNING [accelerate.utils.modeling] The safetensors archive passed at /workspace/llama-7b-gptq-32gTrue/gptq_model-4bit-32g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
[1]    6051 segmentation fault (core dumped)  python3 generation_speed.py --model_name_or_path  --use_safetensors
 root@8cf0559a3679:/workspace/exllama_AutoGPTQ/examples/benchmark (exllama-q4-kernel ✔) ᐅ

Issue:

  1. I noticed that weight loading is quite a bit slower than before.

Example: 30B, no group_size, act-order = True

  • AutoGPTQ 0.3.2 CUDA with Llama 30B, group_size = None, act-order = True
    • model and tokenizer loading time: 70.6247s
  • ExLlama AutoGPTQ, with same:
    • model and tokenizer loading time: 181.7889s

This might be an inevitable result of the ExLlama kernel, but I thought it worth mentioning.

Also, when I tried to test 30B 128g + Act-Order, it took something like 10 minutes to load weights before it hit the segmentation fault from bug 2. I can't give an exact timing because of the crash, but it seems that group_size + desc_act may be slower still on weight loading.

@fxmarty (Collaborator, Author) commented Jul 31, 2023

The fused attention bug should be solved in the act-order=False case.
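
For reference, a minimal sketch of the kind of guard that avoids the NoneType crash above (attribute names follow the traceback; the actual fix in fused_llama_attn.py may differ):

```python
import torch

def concat_g_idx(q_proj, k_proj, v_proj):
    # With the exllama kernel and act-order disabled, the projection layers may
    # keep g_idx as None, so the concatenation has to be skipped in that case.
    if any(p.g_idx is None for p in (q_proj, k_proj, v_proj)):
        return None
    return torch.cat([q_proj.g_idx, k_proj.g_idx, v_proj.g_idx], dim=0)
```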

@TheBloke can you run `pytest tests/ -s` and tell me if it works for you?

There is no test yet for the act-order case, so it is expected that it crashes (probably something is wrong in my implementation). I will add a test.

I'll test as well the load time, thank you!

Edit: to me, the case group size = None, act-order/desc_act = True does not really make sense, since it is then per-column quantization instead of per-group, so the notion of act-order does not apply. @TheBloke, can you give an example of such a model so that I can test as well?
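
To spell out the distinction, an illustrative sketch only (not AutoGPTQ's actual dequantization code; tensor names and layouts are assumptions):

```python
import torch

def dequant_per_group(q_int, scales, zeros, g_idx):
    # q_int: [in_features, out_features] integer weights;
    # scales/zeros: [n_groups, out_features]; g_idx: [in_features] maps each
    # input channel to its group, which is what act-order reorders.
    return (q_int - zeros[g_idx, :]) * scales[g_idx, :]

def dequant_per_column(q_int, scale, zero):
    # group_size = None: a single scale/zero per output column, independent of
    # the input-channel order, so there is no g_idx to reorder at inference time.
    return (q_int - zero) * scale
```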

@TheBloke (Contributor)

@fxmarty `pytest tests/ -s` fails immediately with a segmentation fault:

 root@8cf0559a3679:/workspace/exllama_AutoGPTQ (exllama-q4-kernel ✔) ᐅ pytest tests -s
========================================================================================== test session starts ==========================================================================================
platform linux -- Python 3.10.6, pytest-7.4.0, pluggy-1.2.0
rootdir: /workspace/exllama_AutoGPTQ
plugins: anyio-3.7.1
collected 4 items

tests/test_q4.py Fatal Python error: Segmentation fault

Current thread 0x00007fbf34a52000 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/qlinear/qlinear_exllama.py", line 13 in ext_make_q4
  File "/usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/qlinear/qlinear_exllama.py", line 86 in post_init
  File "/workspace/exllama_AutoGPTQ/tests/test_q4.py", line 163 in test_exllama
  File "/usr/lib/python3.10/unittest/case.py", line 549 in _callTestMethod
  File "/usr/lib/python3.10/unittest/case.py", line 591 in run
  File "/usr/lib/python3.10/unittest/case.py", line 650 in __call__
  File "/usr/local/lib/python3.10/dist-packages/_pytest/unittest.py", line 333 in runtest
  File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 169 in pytest_runtest_call
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 80 in _multicall
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 112 in _hookexec
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 433 in __call__
  File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 262 in <lambda>
  File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 341 in from_call
  File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 261 in call_runtest_hook
  File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 222 in call_and_report
  File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 133 in runtestprotocol
  File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 114 in pytest_runtest_protocol
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 80 in _multicall
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 112 in _hookexec
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 433 in __call__
  File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 349 in pytest_runtestloop
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 80 in _multicall
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 112 in _hookexec
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 433 in __call__
  File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 324 in _main
  File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 270 in wrap_session
  File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 317 in pytest_cmdline_main
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 80 in _multicall
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 112 in _hookexec
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 433 in __call__
  File "/usr/local/lib/python3.10/dist-packages/_pytest/config/__init__.py", line 166 in main
  File "/usr/local/lib/python3.10/dist-packages/_pytest/config/__init__.py", line 189 in console_main
  File "/usr/local/bin/pytest", line 8 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, scipy._lib._ccallback_c, numpy.linalg.lapack_lite, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._isolve._iterative, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.linalg._flinalg, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, scipy.ndimage._nd_image, _ni_label, scipy.ndimage._ni_label, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize.__nnls, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.special.cython_special, scipy.stats._stats, scipy.stats.beta_ufunc, scipy.stats._boost.beta_ufunc, scipy.stats.binom_ufunc, scipy.stats._boost.binom_ufunc, scipy.stats.nbinom_ufunc, scipy.stats._boost.nbinom_ufunc, scipy.stats.hypergeom_ufunc, scipy.stats._boost.hypergeom_ufunc, scipy.stats.ncf_ufunc, scipy.stats._boost.ncf_ufunc, scipy.stats.ncx2_ufunc, scipy.stats._boost.ncx2_ufunc, scipy.stats.nct_ufunc, scipy.stats._boost.nct_ufunc, scipy.stats.skewnorm_ufunc, scipy.stats._boost.skewnorm_ufunc, scipy.stats.invgauss_ufunc, scipy.stats._boost.invgauss_ufunc, scipy.interpolate._fitpack, scipy.interpolate.dfitpack, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.stats._biasedurn, scipy.stats._levy_stable.levyst, scipy.stats._stats_pythran, scipy._lib._uarray._uarray, scipy.stats._statlib, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._mvn, scipy.stats._rcont.rcont, yaml._yaml, charset_normalizer.md, google._upb._message, lz4._version, lz4.frame._frame, zstandard.backend_c, psutil._psutil_linux, psutil._psutil_posix, sentencepiece._sentencepiece, pyarrow.lib, pyarrow._hdfsio, 
pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.hashing, pandas._libs.tslib, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.index, pandas._libs.internals, pandas._libs.join, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, _cffi_backend, pyarrow._parquet, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, PIL._imaging (total: 191)
[1]    9346 segmentation fault (core dumped)  pytest tests -s
 root@8cf0559a3679:/workspace/exllama_AutoGPTQ (exllama-q4-kernel ✔) ᐅ

> Edit: to me, the case group size = None, act-order/desc_act = True does not really make sense, since it is then per-column quantization instead of per-group, so the notion of act-order does not apply. @TheBloke, can you give an example of such a model so that I can test as well?

That is the main-branch option for any 30/33B or 65B/70B GPTQ I have released.

For 33B models in particular, no group size is desirable as it lowers VRAM usage, and act-order is then added to get the highest possible quantisation accuracy (at least when evaluated on perplexity).

I don't have PPL results to hand for 30B, but for example here is the result at 7B, comparing:

  • no group-size + no act-order
  • no group-size + act-order

[Image: perplexity comparison table for 7B, with and without act-order]

As you can see, adding act-order to the quantisation makes a pretty big difference in PPL.

@fxmarty (Collaborator, Author) commented Jul 31, 2023

Act-order should work now. You can try `pip uninstall auto-gptq && rm -rf *.so build auto_gptq.egg-info && pip install -v -e .` and then `pytest tests/ -s -k "test_generation_with_act_order"` to check.

Having exllama act-order work with fused QKV is still to do.


For the act-order + group case, I ran

`CUDA_VISIBLE_DEVICES=0 python generation_speed.py --model_name_or_path TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g --model_basename vicuna-13B-1.1-GPTQ-4bit-128g.latest --revision actorder --use_safetensors --no_inject_fused_attention --no_inject_fused_mlp --num_samples 10`

on this fixed script https://pastebin.com/CqXAjBkf, which uses CUDA events + synchronize (this should IMO always be used when benchmarking on GPUs, since kernel launches are asynchronous) and supports the `--revision` argument.
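
For reference, the CUDA-event timing pattern referred to here looks roughly like the following (a sketch, not the exact pastebin script; `model` and `input_ids` are assumed to already exist):

```python
import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
output_ids = model.generate(input_ids, max_new_tokens=256)
end.record()
torch.cuda.synchronize()  # kernel launches are async; wait before reading timings

elapsed_s = start.elapsed_time(end) / 1000.0  # elapsed_time() returns milliseconds
new_tokens = output_ids.shape[1] - input_ids.shape[1]
print(f"generation speed: {new_tokens / elapsed_s:.3f} tokens/s")
```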

On an Intel Xeon 8275CL CPU + A100 80GB, I get roughly a 2.7x speedup:

On main (with this critical fix #220):

2023-07-31 16:30:49 INFO [__main__] generation speed: 8.681tokens/s

On this branch:

2023-07-31 16:24:36 INFO [__main__] generation speed: 23.693tokens/s

Note that we generate 256 tokens with greedy search, and that the inputs are of various (small) shapes:

examples[i][input_ids] torch.Size([1, 110])
examples[i][input_ids] torch.Size([1, 63])
examples[i][input_ids] torch.Size([1, 73])
examples[i][input_ids] torch.Size([1, 192])
examples[i][input_ids] torch.Size([1, 32])
examples[i][input_ids] torch.Size([1, 67])
examples[i][input_ids] torch.Size([1, 223])
examples[i][input_ids] torch.Size([1, 56])
examples[i][input_ids] torch.Size([1, 83])

I did not try batched generation yet.


For the no act-order + group case, running

`CUDA_VISIBLE_DEVICES=0 python generation_speed.py --model_name_or_path TheBloke/WizardLM-7B-uncensored-GPTQ --model_basename WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order --use_safetensors --no_inject_fused_attention --no_inject_fused_mlp --num_samples 10`

I get a 1.3x speedup.

On main:

2023-07-31 16:35:54 INFO [__main__] generation speed: 22.849tokens/s

On this branch:

2023-07-31 16:38:16 INFO [__main__] generation speed: 29.792tokens/s

I'll test the load time more thoroughly later.

@qwopqwop200 (Collaborator)

This looks very cool.

@qwopqwop200 (Collaborator)

fxmarty#1

@qwopqwop200 (Collaborator) commented Aug 3, 2023

@TheBloke
If you have time, could you test Llama 70B's fused attention?
It would be appreciated if you could also test the results with `disable_exllama` on and off.
I don't have a machine to test this on.
https://github.com/qwopqwop200/AutoGPTQ/tree/support-gqa
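
For anyone testing, a hedged sketch of what such a comparison could look like (kwarg names are taken from this thread and the benchmark script; the exact `from_quantized` signature may differ, and the model id is only an example):

```python
from auto_gptq import AutoGPTQForCausalLM

# Load a GQA model (e.g. a Llama 2 70B GPTQ) once with the exllama kernel and
# once without it, with fused attention injected, then compare outputs and speed.
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-GPTQ",   # example model id
    device="cuda:0",
    use_safetensors=True,
    inject_fused_attention=True,
    disable_exllama=False,          # set True to fall back to the old CUDA kernel
)
```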

@fxmarty (Collaborator, Author) commented Aug 4, 2023

Please do not merge for now.

@fxmarty (Collaborator, Author) commented Aug 4, 2023

Hi @PanQiWei, I resolved the conflicts and tested on both A100 and MI250; we have both CUDA & ROCm support for this kernel, which is cool! Feel free to have a look and merge if it looks good to you.

Note the correctness test in `test_q4.py`.
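
(The general pattern of such a correctness check, sketched here only for illustration and not copied from test_q4.py: compare the quantized layer's output against an fp16 reference within a tolerance.)

```python
import torch

def check_against_reference(quant_linear, ref_linear, atol=1e-2, rtol=1e-2):
    # quant_linear: the exllama-backed layer under test; ref_linear: an fp16
    # torch.nn.Linear built from the dequantized weights (both assumed to exist).
    x = torch.randn(4, ref_linear.in_features, dtype=torch.float16, device="cuda")
    torch.testing.assert_close(quant_linear(x), ref_linear(x), atol=atol, rtol=rtol)
```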

@PanQiWei (Collaborator) left a comment

I've run the tests and benchmarks and all looks good to me. Thank you so much for this great work! Will merge it now.

@PanQiWei merged commit 9e2741c into AutoGPTQ:main on Aug 6, 2023