
Add exllama q4 kernel #219

Merged: 20 commits merged into AutoGPTQ:main on Aug 6, 2023
Conversation

@fxmarty (Collaborator) commented Jul 31, 2023

The exllama int4/fp16 kernel is notably faster than the implementation in AutoGPTQ / GPTQ-for-llama (see e.g. huggingface/text-generation-inference#553 (comment); I will do a proper benchmark for auto-gptq), especially in the act-order case, where weights are reordered ahead of time and the activations on the fly, removing the need to reorder scales and zero points.
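
To illustrate the reordering point, here is a minimal sketch of the idea only (not the actual exllama kernel; all names here are illustrative):

```python
import torch

def act_order_matmul(x, w_fp16, g_idx):
    """Illustrative only: with act-order, input channels are grouped according to
    g_idx. Permuting the (here already dequantized) weight rows once at load time
    lets the kernel permute activations on the fly, instead of reordering
    scales/zeros for every group at every matmul."""
    perm = torch.argsort(g_idx)   # computed once, ahead of time
    w_perm = w_fp16[perm, :]      # weight rows permuted at load time
    return x[:, perm] @ w_perm    # activations permuted on the fly; equals x @ w_fp16
```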

I removed the ROCm support for now; let's first merge #214 and then test this kernel on ROCm.

Left to do:

  • Add tests
  • Test with peft
  • Implement the pack method - not sure what should be changed compared to cuda / cuda-old / triton?
  • Auto-detect bits/group size/act-order (can be done in another PR; actually not needed since most models have a quantize_config.json)

@TheBloke (Contributor)

Awesome! I'm really excited for this.

I've done a quick benchmark on:

  • 4090 GPU on an AMD EPYC 7352 24-core processor (slow single-core performance, i.e. CPU-bottlenecked)
  • Torch 2.0.1, CUDA 11.8
  • Benchmarking with examples/benchmark/generation_speed.py

Llama 1 7B GPTQ

Group size 128, Act-Order/desc_act = False

AutoGPTQ 0.3.2 main (standard CUDA kernel). With Fused Attention

  • 27.27 tokens/s
  • generated 5120 tokens using 187.7769591808319 seconds, generation speed: 27.266391054236674tokens/s
  • VRAM usage: 6530 MiB
  • GPU usage: 24% (CPU bottlenecked)

fxmarty's AutoGPTQ + ExLlama. Without fused attention (see below)

  • 36.53 tokens/s
  • generated 5120 tokens using 140.82491946220398 seconds, generation speed: 36.3572016909561tokens/s
  • VRAM usage: 6510 MiB
  • GPU usage: 33% (CPU bottlenecked)

Llama 1 33B GPTQ

Group size: None, Act-Order/desc_act = True

AutoGPTQ 0.3.2 main (standard CUDA kernel). With Fused Attention

  • 15.32 tokens/s
  • generated 5120 tokens using 334.10506224632263 seconds, generation speed: 15.324520872494963tokens/s
  • VRAM usage: 22666 MiB
  • GPU usage: 45%

fxmarty's AutoGPTQ + ExLlama. Without fused attention (see below)

  • 19.19 tokens/s
  • generated 5120 tokens using 266.84571146965027 seconds, generation speed: 19.187117423778886tokens/s
  • VRAM usage: 20848 MiB
  • GPU usage: 55%

Not yet a huge difference on this CPU-bottlenecked system, but definitely worthwhile - and it should be better still when fused attention works.

We also see lower VRAM usage in the 33B test, but that could be down to not using fused attention which tends to use a bit more VRAM.

I wanted to test group_size + desc_act, but it crashes at the moment.

I found three issues while doing this testing (two bugs, one problem), which I'll put in the next message.

@TheBloke (Contributor)

Bugs:

  1. Fused attention doesn't work, so I had to disable it for the above benchmark. Exception when `inject_fused_attention=True`:
Traceback (most recent call last):
  File "/workspace/exllama_AutoGPTQ/examples/benchmark/generation_speed.py", line 314, in <module>
    main()
  File "/workspace/exllama_AutoGPTQ/examples/benchmark/generation_speed.py", line 266, in main
    model, tokenizer = load_model_tokenizer(
  File "/workspace/exllama_AutoGPTQ/examples/benchmark/generation_speed.py", line 167, in load_model_tokenizer
    model = AutoGPTQForCausalLM.from_quantized(
  File "/usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/auto.py", line 105, in from_quantized
    return quant_func(
  File "/usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/_base.py", line 877, in from_quantized
    cls.fused_attn_module_type.inject_to_model(
  File "/usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/fused_llama_attn.py", line 156, in inject_to_model
    g_idx = torch.cat([q_proj.g_idx, k_proj.g_idx, v_proj.g_idx], dim=0)
TypeError: expected Tensor as element 0 in argument 0, but got NoneType
  2. When trying to test group_size + desc_act (e.g. 128g + act-order or 32g + act-order), I got a segmentation fault:
 root@8cf0559a3679:/workspace/exllama_AutoGPTQ/examples/benchmark (exllama-q4-kernel ✔) ᐅ python3 generation_speed.py --model_name_or_path /workspace/llama-7b-gptq-32gTrue --use_safetensors --use_fast_tokenizer --no_inject_fused_attention
2023-07-31 13:09:29 INFO [__main__] max_memory: None
2023-07-31 13:09:29 INFO [__main__] loading model and tokenizer
2023-07-31 13:09:29 INFO [auto_gptq.modeling._base] lm_head not been quantized, will be ignored when make_quant.
2023-07-31 13:09:32 WARNING [accelerate.utils.modeling] The safetensors archive passed at /workspace/llama-7b-gptq-32gTrue/gptq_model-4bit-32g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
[1]    6051 segmentation fault (core dumped)  python3 generation_speed.py --model_name_or_path  --use_safetensors
 root@8cf0559a3679:/workspace/exllama_AutoGPTQ/examples/benchmark (exllama-q4-kernel ✔) ᐅ

Issue:

  1. I noticed that weight loading is quite a bit slower than before.

Example: 30B, no group_size, act-order = True

  • AutoGPTQ 0.3.2 CUDA with Llama 30B, group_size = None, act-order = True
    • model and tokenizer loading time: 70.6247s
  • ExLlama AutoGPTQ, with same:
    • model and tokenizer loading time: 181.7889s

This might be an inevitable result of the ExLlama kernel, but I thought it worth mentioning.

Also, when I tried to test 30B 128g + Act-Order, it took something like 10 minutes to load weights before it hit the segmentation fault from bug 2. I can't give an exact timing because of the crash, but it seems that group_size + desc_act may be slower still on weight loading.

@fxmarty (Collaborator, Author) commented Jul 31, 2023

The fused attention bug should be solved in the act-order=False case.
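
For reference, a minimal sketch of the kind of guard that avoids the NoneType crash above (attribute names follow the traceback; the actual fix in fused_llama_attn.py may differ):

```python
import torch

def concat_g_idx(q_proj, k_proj, v_proj):
    # With the exllama kernel and act-order disabled, the projection layers may
    # keep g_idx as None, so the concatenation has to be skipped in that case.
    if any(p.g_idx is None for p in (q_proj, k_proj, v_proj)):
        return None
    return torch.cat([q_proj.g_idx, k_proj.g_idx, v_proj.g_idx], dim=0)
```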

@TheBloke can you run `pytest tests/ -s` and tell me if it works for you?

There is no test yet for the act-order case, so it is expected that it crashes (probably something is wrong in my implementation). I will add a test.

I'll test as well the load time, thank you!

Edit: to me, the case group size = None, act-order/desc_act = True does not really make sense, since it is then per-column quantization instead of per-group, so the notion of act-order does not apply. @TheBloke, can you give an example of such a model so that I can test as well?
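
To spell out the distinction, an illustrative sketch only (not AutoGPTQ's actual dequantization code; tensor names and layouts are assumptions):

```python
import torch

def dequant_per_group(q_int, scales, zeros, g_idx):
    # q_int: [in_features, out_features] integer weights;
    # scales/zeros: [n_groups, out_features]; g_idx: [in_features] maps each
    # input channel to its group, which is what act-order reorders.
    return (q_int - zeros[g_idx, :]) * scales[g_idx, :]

def dequant_per_column(q_int, scale, zero):
    # group_size = None: a single scale/zero per output column, independent of
    # the input-channel order, so there is no g_idx to reorder at inference time.
    return (q_int - zero) * scale
```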

@TheBloke (Contributor)

@fxmarty `pytest tests/ -s` fails immediately with a segmentation fault:

 root@8cf0559a3679:/workspace/exllama_AutoGPTQ (exllama-q4-kernel ✔) ᐅ pytest tests -s
========================================================================================== test session starts ==========================================================================================
platform linux -- Python 3.10.6, pytest-7.4.0, pluggy-1.2.0
rootdir: /workspace/exllama_AutoGPTQ
plugins: anyio-3.7.1
collected 4 items

tests/test_q4.py Fatal Python error: Segmentation fault

Current thread 0x00007fbf34a52000 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/qlinear/qlinear_exllama.py", line 13 in ext_make_q4
  File "/usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/qlinear/qlinear_exllama.py", line 86 in post_init
  File "/workspace/exllama_AutoGPTQ/tests/test_q4.py", line 163 in test_exllama
  File "/usr/lib/python3.10/unittest/case.py", line 549 in _callTestMethod
  File "/usr/lib/python3.10/unittest/case.py", line 591 in run
  File "/usr/lib/python3.10/unittest/case.py", line 650 in __call__
  File "/usr/local/lib/python3.10/dist-packages/_pytest/unittest.py", line 333 in runtest
  File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 169 in pytest_runtest_call
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 80 in _multicall
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 112 in _hookexec
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 433 in __call__
  File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 262 in <lambda>
  File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 341 in from_call
  File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 261 in call_runtest_hook
  File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 222 in call_and_report
  File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 133 in runtestprotocol
  File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 114 in pytest_runtest_protocol
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 80 in _multicall
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 112 in _hookexec
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 433 in __call__
  File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 349 in pytest_runtestloop
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 80 in _multicall
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 112 in _hookexec
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 433 in __call__
  File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 324 in _main
  File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 270 in wrap_session
  File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 317 in pytest_cmdline_main
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 80 in _multicall
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 112 in _hookexec
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 433 in __call__
  File "/usr/local/lib/python3.10/dist-packages/_pytest/config/__init__.py", line 166 in main
  File "/usr/local/lib/python3.10/dist-packages/_pytest/config/__init__.py", line 189 in console_main
  File "/usr/local/bin/pytest", line 8 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, scipy._lib._ccallback_c, numpy.linalg.lapack_lite, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._isolve._iterative, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.linalg._flinalg, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, scipy.ndimage._nd_image, _ni_label, scipy.ndimage._ni_label, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize.__nnls, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.special.cython_special, scipy.stats._stats, scipy.stats.beta_ufunc, scipy.stats._boost.beta_ufunc, scipy.stats.binom_ufunc, scipy.stats._boost.binom_ufunc, scipy.stats.nbinom_ufunc, scipy.stats._boost.nbinom_ufunc, scipy.stats.hypergeom_ufunc, scipy.stats._boost.hypergeom_ufunc, scipy.stats.ncf_ufunc, scipy.stats._boost.ncf_ufunc, scipy.stats.ncx2_ufunc, scipy.stats._boost.ncx2_ufunc, scipy.stats.nct_ufunc, scipy.stats._boost.nct_ufunc, scipy.stats.skewnorm_ufunc, scipy.stats._boost.skewnorm_ufunc, scipy.stats.invgauss_ufunc, scipy.stats._boost.invgauss_ufunc, scipy.interpolate._fitpack, scipy.interpolate.dfitpack, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.stats._biasedurn, scipy.stats._levy_stable.levyst, scipy.stats._stats_pythran, scipy._lib._uarray._uarray, scipy.stats._statlib, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._mvn, scipy.stats._rcont.rcont, yaml._yaml, charset_normalizer.md, google._upb._message, lz4._version, lz4.frame._frame, zstandard.backend_c, psutil._psutil_linux, psutil._psutil_posix, sentencepiece._sentencepiece, pyarrow.lib, pyarrow._hdfsio, 
pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.hashing, pandas._libs.tslib, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.index, pandas._libs.internals, pandas._libs.join, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, _cffi_backend, pyarrow._parquet, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, PIL._imaging (total: 191)
[1]    9346 segmentation fault (core dumped)  pytest tests -s
 root@8cf0559a3679:/workspace/exllama_AutoGPTQ (exllama-q4-kernel ✔) ᐅ

> Edit: to me, the case group size = None, act-order/desc_act = True does not really make sense, since it is then per-column quantization instead of per-group, so the notion of act-order does not apply. @TheBloke, can you give an example of such a model so that I can test as well?

That is the main-branch option for any 30/33B or 65B/70B GPTQ I have released.

For 33B models in particular, no group size is desirable as it lowers VRAM usage, and act-order is then added to get the highest possible quantisation accuracy (at least when evaluated on perplexity).

I don't have PPL results to hand for 30B, but for example here is the result at 7B, comparing:

  • no group-size + no act-order
  • no group-size + act-order

[Image: perplexity comparison table for 7B, with and without act-order]

As you can see, adding act-order to the quantisation makes a pretty big difference in PPL.

@fxmarty (Collaborator, Author) commented Jul 31, 2023

Act-order should work now. You can try `pip uninstall auto-gptq && rm -rf *.so build auto_gptq.egg-info && pip install -v -e .` and then `pytest tests/ -s -k "test_generation_with_act_order"` to check.

Having exllama act-order work with fused QKV is still to do.


For the act-order + group case, I ran

`CUDA_VISIBLE_DEVICES=0 python generation_speed.py --model_name_or_path TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g --model_basename vicuna-13B-1.1-GPTQ-4bit-128g.latest --revision actorder --use_safetensors --no_inject_fused_attention --no_inject_fused_mlp --num_samples 10`

on this fixed script https://pastebin.com/CqXAjBkf, which uses CUDA events + synchronize (this should IMO always be used when benchmarking on GPUs, since kernel launches are asynchronous) and supports the `--revision` argument.
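
For reference, the CUDA-event timing pattern referred to here looks roughly like the following (a sketch, not the exact pastebin script; `model` and `input_ids` are assumed to already exist):

```python
import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
output_ids = model.generate(input_ids, max_new_tokens=256)
end.record()
torch.cuda.synchronize()  # kernel launches are async; wait before reading timings

elapsed_s = start.elapsed_time(end) / 1000.0  # elapsed_time() returns milliseconds
new_tokens = output_ids.shape[1] - input_ids.shape[1]
print(f"generation speed: {new_tokens / elapsed_s:.3f} tokens/s")
```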

On an Intel Xeon 8275CL CPU + A100 80GB, I get roughly a 2.7x speedup:

On main (with this critical fix #220):

2023-07-31 16:30:49 INFO [__main__] generation speed: 8.681tokens/s

On this branch:

2023-07-31 16:24:36 INFO [__main__] generation speed: 23.693tokens/s

Note that we generate 256 tokens with greedy search, and that the inputs are of various (small) shapes:

examples[i][input_ids] torch.Size([1, 110])
examples[i][input_ids] torch.Size([1, 63])
examples[i][input_ids] torch.Size([1, 73])
examples[i][input_ids] torch.Size([1, 192])
examples[i][input_ids] torch.Size([1, 32])
examples[i][input_ids] torch.Size([1, 67])
examples[i][input_ids] torch.Size([1, 223])
examples[i][input_ids] torch.Size([1, 56])
examples[i][input_ids] torch.Size([1, 83])

I did not try batched generation yet.


For the no act-order + group case, running

`CUDA_VISIBLE_DEVICES=0 python generation_speed.py --model_name_or_path TheBloke/WizardLM-7B-uncensored-GPTQ --model_basename WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order --use_safetensors --no_inject_fused_attention --no_inject_fused_mlp --num_samples 10`

I get a 1.3x speedup.

On main:

2023-07-31 16:35:54 INFO [__main__] generation speed: 22.849tokens/s

On this branch:

2023-07-31 16:38:16 INFO [__main__] generation speed: 29.792tokens/s

I'll test the load time more thoroughly later.

@qwopqwop200 (Collaborator)

This looks very cool.

@qwopqwop200 (Collaborator)

fxmarty#1

@qwopqwop200 (Collaborator) commented Aug 3, 2023

@TheBloke
If you have time, could you test Llama 70B's fused attention?
It would be appreciated if you could also test the results with `disable_exllama` on and off.
I don't have a machine to test this on.
https://github.com/qwopqwop200/AutoGPTQ/tree/support-gqa
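
For anyone testing, a hedged sketch of what such a comparison could look like (kwarg names are taken from this thread and the benchmark script; the exact `from_quantized` signature may differ, and the model id is only an example):

```python
from auto_gptq import AutoGPTQForCausalLM

# Load a GQA model (e.g. a Llama 2 70B GPTQ) once with the exllama kernel and
# once without it, with fused attention injected, then compare outputs and speed.
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-GPTQ",   # example model id
    device="cuda:0",
    use_safetensors=True,
    inject_fused_attention=True,
    disable_exllama=False,          # set True to fall back to the old CUDA kernel
)
```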

@fxmarty (Collaborator, Author) commented Aug 4, 2023

Please do not merge for now.

@fxmarty (Collaborator, Author) commented Aug 4, 2023

Hi @PanQiWei, I resolved the conflicts and tested on both A100 and MI250; we have both CUDA & ROCm support for this kernel, which is cool! Feel free to have a look and merge if it looks good to you.

Note the correctness test in `test_q4.py`.
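
(The general pattern of such a correctness check, sketched here only for illustration and not copied from test_q4.py: compare the quantized layer's output against an fp16 reference within a tolerance.)

```python
import torch

def check_against_reference(quant_linear, ref_linear, atol=1e-2, rtol=1e-2):
    # quant_linear: the exllama-backed layer under test; ref_linear: an fp16
    # torch.nn.Linear built from the dequantized weights (both assumed to exist).
    x = torch.randn(4, ref_linear.in_features, dtype=torch.float16, device="cuda")
    torch.testing.assert_close(quant_linear(x), ref_linear(x), atol=atol, rtol=rtol)
```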

@PanQiWei (Collaborator) left a comment

I've run the tests and benchmarks and all looks good to me. Thank you so much for this great work! Will merge it now.

@PanQiWei merged commit 9e2741c into AutoGPTQ:main on Aug 6, 2023