Exllama integration #30

Closed
wants to merge 16 commits into from

Conversation

casper-hansen
Owner

This is an integration of the ExLlama kernels.

My initial notes:

  • It is about 10% faster than the original kernel (maybe 20% with Optimize q4_matmul turboderp/exllama#275)
  • The output is currently completely random characters (needs FIXING)
  • To use the matmul kernel, the number of input and output features must be the same

I invite anyone who wants to try to make this work to open PRs. @qwopqwop200

@qwopqwop200
Contributor

7617663
I succeeded in getting exllama to work.
Additionally, Optimize q4_matmul was also applied.
But one thing is now clear: the exllama and awq kernels have very different storage formats, and it seems very difficult to make the awq kernel's storage format work in exllama.

@casper-hansen
Owner Author

But one thing is now clear: the exllama and awq kernels have very different storage formats, and it seems very difficult to make the awq kernel's storage format work in exllama.

Would it make sense to create a from_qlinear classmethod in the ExllamaLinear class that implements this new packing functionality? Then when we load into ExllamaLinear, we always map from the WQLinear module to the ExllamaLinear module. Of course this increases loading times, but I'm curious if it is feasible.

https://github.com/qwopqwop200/AutoAWQ-exllama/blob/exllama-experiment/awq/quantize/qmodule.py#L107
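
Roughly, I imagine something along these lines. This is only a sketch: the AWQ interleave order, the ExLlama-style layout, the constructor handling, and the zero-point repacking are all assumptions here and would need checking against the actual kernels.

```python
# Sketch only. Assumes: AWQ packs eight 4-bit values per int32 along the output
# dimension with interleave order [0, 2, 4, 6, 1, 3, 5, 7]; ExLlama expects a
# GPTQ-style layout with eight values per int32 along the input dimension.
# Zero-point (qzeros) repacking is omitted.
import torch

AWQ_REVERSE_ORDER = [0, 4, 1, 5, 2, 6, 3, 7]  # assumed inverse of the interleave above

def unpack_awq_qweight(qweight: torch.Tensor) -> torch.Tensor:
    """[in_features, out_features // 8] int32 -> [in_features, out_features] values in 0..15."""
    shifts = torch.arange(0, 32, 4, device=qweight.device, dtype=torch.int32)
    nibbles = (qweight.unsqueeze(-1) >> shifts) & 0xF   # [in, out // 8, 8]
    nibbles = nibbles[..., AWQ_REVERSE_ORDER]           # undo the (assumed) interleave
    return nibbles.reshape(qweight.shape[0], -1)

def pack_exllama_qweight(intweight: torch.Tensor) -> torch.Tensor:
    """[in_features, out_features] -> [in_features // 8, out_features] int32, eight rows per word."""
    rows = intweight.reshape(-1, 8, intweight.shape[1]).to(torch.int64)
    shifts = torch.arange(0, 32, 4, device=intweight.device, dtype=torch.int64).view(1, 8, 1)
    packed = (rows << shifts).sum(dim=1) & 0xFFFFFFFF      # disjoint nibbles, so sum acts as OR
    packed = packed.where(packed < 2**31, packed - 2**32)  # map into the signed int32 range
    return packed.to(torch.int32)

class ExllamaLinear(torch.nn.Module):
    # ... existing buffers and forward() from the PR go here ...

    @classmethod
    def from_qlinear(cls, wq_linear: torch.nn.Module) -> "ExllamaLinear":
        """Repack a WQLinear module into the (assumed) ExLlama layout at load time."""
        module = cls.__new__(cls)            # skip __init__ in this sketch; the real ctor lives in the PR
        torch.nn.Module.__init__(module)
        module.register_buffer("qweight", pack_exllama_qweight(unpack_awq_qweight(wq_linear.qweight)))
        module.register_buffer("qzeros", wq_linear.qzeros)   # would need the same repacking treatment
        module.register_buffer("scales", wq_linear.scales)
        return module
```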

@casper-hansen
Owner Author

@qwopqwop200 after merging your PR, I get the following error when testing with Vicuna 7B:

ValueError: Trying to set a tensor of shape torch.Size([4096, 512]) in "qweight" (which has shape torch.Size([512, 4096])), this look incorrect.

Am I supposed to re-quant a model to get it working?
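
(If I read the shapes right, this matches the layout mismatch described above; the quick sanity check below assumes AWQ stores qweight as [in_features, out_features // 8] while the ExLlama module expects [in_features // 8, out_features], both of which are assumptions.)

```python
# Assumed layouts only: AWQ qweight [in_features, out_features // 8] vs.
# ExLlama/GPTQ-style qweight [in_features // 8, out_features], 4-bit, 8 values per int32.
in_features, out_features, pack_factor = 4096, 4096, 32 // 4

awq_shape = (in_features, out_features // pack_factor)      # (4096, 512) -> what the checkpoint holds
exllama_shape = (in_features // pack_factor, out_features)  # (512, 4096) -> what the module expects
print(awq_shape, exllama_shape)
```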

@casper-hansen
Owner Author

We might have to abort this pull request. The people from MIT just released a new kernel that should be so much faster.

@qwopqwop200
Contributor

qwopqwop200 commented Sep 8, 2023

No, I don't think that's necessary. The current comparison results show that the exllama matmul kernel outperforms the tinychatv2 matmul kernel. The current speed improvement of tinychatv2 appears to come from additional optimizations such as attention.

exllama
Model summary: opt-125m-awq
Load time: 1.73 seconds
Context speed: 7767.68 tokens/second (0.13 ms/token)
Generation speed: 129.93 tokens/second (7.70 ms/token)
VRAM: 255.58 MB

tinychatv2(gemv)
Model summary: opt-125m-awq
Load time: 2.96 seconds
Context speed: 6358.17 tokens/second (0.16 ms/token)
Generation speed: 118.74 tokens/second (8.42 ms/token)
VRAM: 255.92 MB

----edit----
I tested again and it seems that tinychatv2 is slightly faster than exllama.
----edit2----
?
As a result of experimenting with both AutoAWQ and llm-awq, the exllama kernel is better for AutoAWQ, and the tinychatv2 kernel is better for llm-awq.

@casper-hansen
Owner Author

No, I don't think that's necessary. The current comparison results show that the exllama matmul kernel outperforms the tinychatv2 matmul kernel. The current speed improvement of tinychatv2 appears to come from additional optimizations such as attention.

exllama
Model summary: opt-125m-awq
Load time: 1.73 seconds
Context speed: 7767.68 tokens/second (0.13 ms/token)
Generation speed: 129.93 tokens/second (7.70 ms/token)
VRAM: 255.58 MB

tinychatv2(gemv)
Model summary: opt-125m-awq
Load time: 2.96 seconds
Context speed: 6358.17 tokens/second (0.16 ms/token)
Generation speed: 118.74 tokens/second (8.42 ms/token)
VRAM: 255.92 MB

----edit----
I tested again and it seems that tinychatv2 is slightly faster than exllama.
----edit2----
? As a result of experimenting with both AutoAWQ and llm-awq, the exllama kernel is better for AutoAWQ, and the tinychatv2 kernel is better for llm-awq.

I will do some more testing. Perhaps we can have both exllama and the new GEMV kernel. We just need to convert the weights to an exllama-compatible version.

@casper-hansen
Owner Author

It seems the GEMV kernel is about 20% faster than GEMM, but context processing is slow and larger batch sizes should also be slow. I have made it easy to extend to other methods like exllama, which will come shortly after. GEMV, GEMM, and ExLlama will each have their own quantization format and kernel.
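
As a rough, generic illustration of the trade-off (not the actual AutoAWQ kernels, and the dimensions below are made up): a GEMV-style path multiplies one token vector at a time, which is great for single-token generation but degrades into many small matrix-vector products during context processing, where one GEMM over the whole prompt is far more efficient.

```python
# Generic illustration only, not the AutoAWQ/ExLlama kernels; dimensions are arbitrary.
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
hidden, out_features, prompt_len = 4096, 4096, 512
W = torch.randn(hidden, out_features, device=device, dtype=dtype)
prompt = torch.randn(prompt_len, hidden, device=device, dtype=dtype)

def bench(fn, iters=20):
    fn()  # warmup
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

prefill_gemm = bench(lambda: prompt @ W)                                          # whole prompt in one matmul
prefill_gemv = bench(lambda: [prompt[i:i + 1] @ W for i in range(prompt_len)])    # token-by-token matvecs
print(f"prefill as one GEMM: {prefill_gemm * 1e3:.2f} ms")
print(f"prefill as {prompt_len} GEMVs: {prefill_gemv * 1e3:.2f} ms")
```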

@nexa123

nexa123 commented Sep 12, 2023

I tested this PR with LLaMA 2 13B on one NVIDIA A800 GPU, but it does not work.

  File "/root/miniconda3/envs/awq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/awq/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 697, in forward
    layer_outputs = decoder_layer(
  File "/root/miniconda3/envs/awq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/awq/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 413, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/miniconda3/envs/awq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/awq/lib/python3.10/site-packages/awq/modules/fused_attn.py", line 122, in forward
    query = query.view(query_len*query_batch_size, self.num_heads * self.head_dim)
RuntimeError: shape '[4, 5120]' is invalid for input of size 6828

@casper-hansen
Owner Author

@nexa123 Did you find this model on huggingface or did you quantize it from scratch yourself?

@nexa123

nexa123 commented Sep 12, 2023

I used examples/basic_quant.py to quantize the model and examples/basic_generate.py to observe the result.
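
(For reference, the example flow is roughly the sketch below; the model path and quant config shown are placeholders, and the exact arguments are whatever the example scripts in this branch use.)

```python
# Rough outline of the example flow; model path and quant_config are placeholders.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-13b-hf"  # placeholder
quant_path = "llama-2-13b-awq"            # placeholder output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4}

# quantize and save (examples/basic_quant.py does roughly this)
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

# load the quantized model and generate (examples/basic_generate.py does roughly this)
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokens = tokenizer("Hello, my name is", return_tensors="pt").input_ids.cuda()
output = model.generate(tokens, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```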

@casper-hansen
Owner Author

Thanks @nexa123. My efforts are currently in #40 where we get a 60% speed boost on LLaMa models compared to main. Will merge it when it’s fully featured and ready!

@nexa123

nexa123 commented Sep 14, 2023

The author of exllama released a new implementation (https://github.com/turboderp/exllamav2/tree/master), which is faster than the old one. So would this PR be refactored to be based on exllamav2?

@casper-hansen
Owner Author

The author of exllama released a new implementation (https://github.com/turboderp/exllamav2/tree/master), which is faster than the old one. So would this PR be refactored to be based on exllamav2?

For reference, the AutoAWQ main branch is now 5-10% slower than ExLlama V2 according to my benchmarks for token generation. We are hitting the limits of how much faster models can run. Only thing left to optimize is context processing.

@head-with-nothing

The author of exllama released a new implementation (https://github.com/turboderp/exllamav2/tree/master), which is faster than the old one. So would this PR be refactored to be based on exllamav2?

For reference, the AutoAWQ main branch is now 5-10% slower than ExLlama V2 according to my benchmarks for token generation. We are hitting the limits of how much faster models can run. Only thing left to optimize is context processing.

I tested the two implementations on an NVIDIA A800. ExLlamaV2 is 80-90% faster than the older one. But the fastest implementation of AWQ in my tests is LMDeploy.

@casper-hansen
Owner Author

The author of exllama released a new implementation (https://github.com/turboderp/exllamav2/tree/master), which is faster than the old one. So would this PR be refactored to be based on exllamav2?

For reference, the AutoAWQ main branch is now 5-10% slower than ExLlama V2 according to my benchmarks for token generation. We are hitting the limits of how much faster models can run. Only thing left to optimize is context processing.

I tested the two implementations on an NVIDIA A800. ExLlamaV2 is 80-90% faster than the older one. But the fastest implementation of AWQ in my tests is LMDeploy.

Yes, for production deployments, you want to leverage either vLLM or LMDeploy. I believe vLLM is faster if you use larger batch sizes. AutoAWQ is meant to have relatively fast generation and ease of access.

@casper-hansen casper-hansen deleted the exllama branch November 20, 2023 20:07