Exllama integration #30

Closed
wants to merge 16 commits into from

Conversation

casper-hansen
Owner

This is an integration of the ExLlama kernels.

My initial notes:

  • It is about 10% faster than the original kernel (maybe 20% with Optimize q4_matmul turboderp/exllama#275)
  • The output is currently completely random characters (needs FIXING)
  • To use the matmul kernel, the number of input and output features must be the same

I invite anyone who wants to try to make this work to open PRs. @qwopqwop200

@qwopqwop200
Contributor

7617663
I succeeded in getting exllama to work.
Additionally, Optimize q4_matmul was also applied.
But one thing is now clear: the exllama and awq kernels have very different storage formats, and it seems very difficult to make the awq kernel's storage format work in exllama.

@casper-hansen
Owner Author

But one thing is now clear: the exllama and awq kernels have very different storage formats, and it seems very difficult to make the awq kernel's storage format work in exllama.

Would it make sense to create a from_qlinear classmethod in the ExllamaLinear class that implements this new packing functionality? Then when we load into ExllamaLinear, we always map from the WQLinear module to the ExllamaLinear module. Of course this increases loading times, but I'm curious if it is feasible.

https://github.com/qwopqwop200/AutoAWQ-exllama/blob/exllama-experiment/awq/quantize/qmodule.py#L107
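
Roughly, I imagine something along these lines. This is only a sketch: the AWQ interleave order, the ExLlama-style layout, the constructor handling, and the zero-point repacking are all assumptions here and would need checking against the actual kernels.

```python
# Sketch only. Assumes: AWQ packs eight 4-bit values per int32 along the output
# dimension with interleave order [0, 2, 4, 6, 1, 3, 5, 7]; ExLlama expects a
# GPTQ-style layout with eight values per int32 along the input dimension.
# Zero-point (qzeros) repacking is omitted.
import torch

AWQ_REVERSE_ORDER = [0, 4, 1, 5, 2, 6, 3, 7]  # assumed inverse of the interleave above

def unpack_awq_qweight(qweight: torch.Tensor) -> torch.Tensor:
    """[in_features, out_features // 8] int32 -> [in_features, out_features] values in 0..15."""
    shifts = torch.arange(0, 32, 4, device=qweight.device, dtype=torch.int32)
    nibbles = (qweight.unsqueeze(-1) >> shifts) & 0xF   # [in, out // 8, 8]
    nibbles = nibbles[..., AWQ_REVERSE_ORDER]           # undo the (assumed) interleave
    return nibbles.reshape(qweight.shape[0], -1)

def pack_exllama_qweight(intweight: torch.Tensor) -> torch.Tensor:
    """[in_features, out_features] -> [in_features // 8, out_features] int32, eight rows per word."""
    rows = intweight.reshape(-1, 8, intweight.shape[1]).to(torch.int64)
    shifts = torch.arange(0, 32, 4, device=intweight.device, dtype=torch.int64).view(1, 8, 1)
    packed = (rows << shifts).sum(dim=1) & 0xFFFFFFFF      # disjoint nibbles, so sum acts as OR
    packed = packed.where(packed < 2**31, packed - 2**32)  # map into the signed int32 range
    return packed.to(torch.int32)

class ExllamaLinear(torch.nn.Module):
    # ... existing buffers and forward() from the PR go here ...

    @classmethod
    def from_qlinear(cls, wq_linear: torch.nn.Module) -> "ExllamaLinear":
        """Repack a WQLinear module into the (assumed) ExLlama layout at load time."""
        module = cls.__new__(cls)            # skip __init__ in this sketch; the real ctor lives in the PR
        torch.nn.Module.__init__(module)
        module.register_buffer("qweight", pack_exllama_qweight(unpack_awq_qweight(wq_linear.qweight)))
        module.register_buffer("qzeros", wq_linear.qzeros)   # would need the same repacking treatment
        module.register_buffer("scales", wq_linear.scales)
        return module
```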

@casper-hansen
Owner Author

@qwopqwop200 after merging your PR, I get the following error when testing with Vicuna 7B:

ValueError: Trying to set a tensor of shape torch.Size([4096, 512]) in "qweight" (which has shape torch.Size([512, 4096])), this look incorrect.

Am I supposed to re-quant a model to get it working?
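
(If I read the shapes right, this matches the layout mismatch described above; the quick sanity check below assumes AWQ stores qweight as [in_features, out_features // 8] while the ExLlama module expects [in_features // 8, out_features], both of which are assumptions.)

```python
# Assumed layouts only: AWQ qweight [in_features, out_features // 8] vs.
# ExLlama/GPTQ-style qweight [in_features // 8, out_features], 4-bit, 8 values per int32.
in_features, out_features, pack_factor = 4096, 4096, 32 // 4

awq_shape = (in_features, out_features // pack_factor)      # (4096, 512) -> what the checkpoint holds
exllama_shape = (in_features // pack_factor, out_features)  # (512, 4096) -> what the module expects
print(awq_shape, exllama_shape)
```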

@casper-hansen
Owner Author

We might have to abort this pull request. The people from MIT just released a new kernel that should be so much faster.

@qwopqwop200
Contributor

qwopqwop200 commented Sep 8, 2023

No, I don't think that's necessary. The current comparison results show that the exllama matmul kernel outperforms the tinychatv2 matmul kernel. The current speed improvement of tinychatv2 appears to come from additional optimizations such as attention.

exllama
Model summary: opt-125m-awq
Load time: 1.73 seconds
Context speed: 7767.68 tokens/second (0.13 ms/token)
Generation speed: 129.93 tokens/second (7.70 ms/token)
VRAM: 255.58 MB

tinychatv2(gemv)
Model summary: opt-125m-awq
Load time: 2.96 seconds
Context speed: 6358.17 tokens/second (0.16 ms/token)
Generation speed: 118.74 tokens/second (8.42 ms/token)
VRAM: 255.92 MB

----edit----
I tested again and it seems that tinychatv2 is slightly faster than exllama.
----edit2----
?
As a result of experimenting with both AutoAWQ and llm-awq, the exllama kernel is better for AutoAWQ, and the tinychatv2 kernel is better for llm-awq.

@casper-hansen
Owner Author

No, I don't think that's necessary. The current comparison results show that the exllama matmul kernel outperforms the tinychatv2 matmul kernel. The current speed improvement of tinychatv2 appears to come from additional optimizations such as attention.

exllama
Model summary: opt-125m-awq
Load time: 1.73 seconds
Context speed: 7767.68 tokens/second (0.13 ms/token)
Generation speed: 129.93 tokens/second (7.70 ms/token)
VRAM: 255.58 MB

tinychatv2(gemv)
Model summary: opt-125m-awq
Load time: 2.96 seconds
Context speed: 6358.17 tokens/second (0.16 ms/token)
Generation speed: 118.74 tokens/second (8.42 ms/token)
VRAM: 255.92 MB

----edit----
I tested again and it seems that tinychatv2 is slightly faster than exllama.
----edit2----
? As a result of experimenting with both AutoAWQ and llm-awq, the exllama kernel is better for AutoAWQ, and the tinychatv2 kernel is better for llm-awq.

I will do some more testing. Perhaps we can have both exllama and the new GEMV kernel. We just need to convert the weights to an exllama-compatible version.

@casper-hansen
Owner Author

It seems the GEMV kernel is about 20% faster than GEMM, but context processing is slow and larger batch sizes should also be slow. I have made it easy to extend to other methods like exllama, which will come shortly after. GEMV, GEMM, and ExLlama will each have their own quantization format and kernel.
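
As a rough, generic illustration of the trade-off (not the actual AutoAWQ kernels, and the dimensions below are made up): a GEMV-style path multiplies one token vector at a time, which is great for single-token generation but degrades into many small matrix-vector products during context processing, where one GEMM over the whole prompt is far more efficient.

```python
# Generic illustration only, not the AutoAWQ/ExLlama kernels; dimensions are arbitrary.
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
hidden, out_features, prompt_len = 4096, 4096, 512
W = torch.randn(hidden, out_features, device=device, dtype=dtype)
prompt = torch.randn(prompt_len, hidden, device=device, dtype=dtype)

def bench(fn, iters=20):
    fn()  # warmup
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

prefill_gemm = bench(lambda: prompt @ W)                                          # whole prompt in one matmul
prefill_gemv = bench(lambda: [prompt[i:i + 1] @ W for i in range(prompt_len)])    # token-by-token matvecs
print(f"prefill as one GEMM: {prefill_gemm * 1e3:.2f} ms")
print(f"prefill as {prompt_len} GEMVs: {prefill_gemv * 1e3:.2f} ms")
```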

@nexa123

nexa123 commented Sep 12, 2023

I tested this PR with LLaMA 2 13B on one NVIDIA A800 GPU, but it does not work.

  File "/root/miniconda3/envs/awq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/awq/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 697, in forward
    layer_outputs = decoder_layer(
  File "/root/miniconda3/envs/awq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/awq/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 413, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/miniconda3/envs/awq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/awq/lib/python3.10/site-packages/awq/modules/fused_attn.py", line 122, in forward
    query = query.view(query_len*query_batch_size, self.num_heads * self.head_dim)
RuntimeError: shape '[4, 5120]' is invalid for input of size 6828

@casper-hansen
Owner Author

@nexa123 Did you find this model on huggingface or did you quantize it from scratch yourself?

@nexa123

nexa123 commented Sep 12, 2023

I used examples/basic_quant.py to quantize the model and examples/basic_generate.py to observe the result.
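
(For reference, the example flow is roughly the sketch below; the model path and quant config shown are placeholders, and the exact arguments are whatever the example scripts in this branch use.)

```python
# Rough outline of the example flow; model path and quant_config are placeholders.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-13b-hf"  # placeholder
quant_path = "llama-2-13b-awq"            # placeholder output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4}

# quantize and save (examples/basic_quant.py does roughly this)
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

# load the quantized model and generate (examples/basic_generate.py does roughly this)
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokens = tokenizer("Hello, my name is", return_tensors="pt").input_ids.cuda()
output = model.generate(tokens, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```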

@casper-hansen
Owner Author

Thanks @nexa123. My efforts are currently in #40 where we get a 60% speed boost on LLaMa models compared to main. Will merge it when it’s fully featured and ready!

@nexa123

nexa123 commented Sep 14, 2023

The author of exllama released a new implementation (https://github.com/turboderp/exllamav2/tree/master), which is faster than the old one. So would this PR be refactored to be based on exllamav2?

@casper-hansen
Owner Author

The author of exllama released a new implementation (https://github.com/turboderp/exllamav2/tree/master), which is faster than the old one. So would this PR be refactored to be based on exllamav2?

For reference, the AutoAWQ main branch is now 5-10% slower than ExLlama V2 according to my benchmarks for token generation. We are hitting the limits of how much faster models can run. Only thing left to optimize is context processing.

@head-with-nothing

The author of exllama released a new implementation (https://github.com/turboderp/exllamav2/tree/master), which is faster than the old one. So would this PR be refactored to be based on exllamav2?

For reference, the AutoAWQ main branch is now 5-10% slower than ExLlama V2 according to my benchmarks for token generation. We are hitting the limits of how much faster models can run. Only thing left to optimize is context processing.

I tested the two implementations on an NVIDIA A800. ExLlamaV2 is 80-90% faster than the older one. But the fastest implementation of AWQ in my tests is LMDeploy.

@casper-hansen
Owner Author

The author of exllama released a new implementation (https://github.com/turboderp/exllamav2/tree/master), which is faster than the old one. So would this PR be refactored to be based on exllamav2?

For reference, the AutoAWQ main branch is now 5-10% slower than ExLlama V2 according to my benchmarks for token generation. We are hitting the limits of how much faster models can run. Only thing left to optimize is context processing.

I tested the two implementations on an NVIDIA A800. ExLlamaV2 is 80-90% faster than the older one. But the fastest implementation of AWQ in my tests is LMDeploy.

Yes, for production deployments, you want to leverage either vLLM or LMDeploy. I believe vLLM is faster if you use larger batch sizes. AutoAWQ is meant to have relatively fast generation and ease of access.

@casper-hansen casper-hansen deleted the exllama branch November 20, 2023 20:07