Add exllama GPTQ CUDA kernel support #553
Neat numbers!
I feel like having both gptq and gptq-cuda is not necessary here.
IIUC, both can run on the same weights (as you didn't change the conversion script).
Therefore, we could simply use the exllama kernel whenever available (when g_idx is increasing).
That should simplify the codebase a lot.
Also, nothing should be modified in the model files; everything should be agnostic to it, especially since the weights are exactly the same on disk.
This could also be explained in the gptq script (act-order: if True, more precise models but slower inference because of different kernels; if False, lower precision but faster inference).
I still fail to understand why we cannot reorder on load to use exllama for act-order (since we can reslice at will in the original tensors, we could probably disentangle g_idx again).
It's certainly a lot more work.
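To make the reorder-on-load idea concrete, here is a minimal sketch in plain Python, with lists standing in for tensors. The helper names `is_g_idx_ordered` and `disentangle` are hypothetical and are not part of exllama or TGI. It assumes g_idx maps each input row of the quantized weight to its quantization group, so sorting rows by group yields an increasing g_idx, and the same permutation then has to be applied to the activations.

```python
from typing import List, Sequence, Tuple


def is_g_idx_ordered(g_idx: Sequence[int]) -> bool:
    """True when group indices never decrease (the no-act-order layout)."""
    return all(a <= b for a, b in zip(g_idx, g_idx[1:]))


def disentangle(g_idx: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Return (perm, sorted_g_idx), where perm sorts rows by group index.

    sorted() is stable, so rows inside one group keep their relative order.
    """
    perm = sorted(range(len(g_idx)), key=lambda i: g_idx[i])
    return perm, [g_idx[i] for i in perm]


# Toy act-order layout (made-up values): 6 input rows, 3 groups, shuffled.
g_idx = [2, 0, 1, 0, 2, 1]
perm, ordered = disentangle(g_idx)

# Permuting weights and activations the same way keeps the dot product intact.
weights = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
acts = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
before = sum(w * x for w, x in zip(weights, acts))
after = sum(weights[i] * acts[i] for i in perm)
```

The activation permutation is the step that becomes awkward once the input dimension is sharded across GPUs, since each shard only holds part of the activations.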
# Buffers need to be persistent to avoid any bug.
self.buffers = {}
if config.quantize == "gptq-cuda":
    max_dq_buffer_size = 0
    for name, submodule in self.named_modules():
        if isinstance(submodule, (TensorParallelColumnLinear, TensorParallelRowLinear)) and isinstance(submodule.linear, Ex4bitLinear):
            max_dq_buffer_size = max(max_dq_buffer_size, submodule.linear.qweight.numel() * 8)

    intermediate_size = config.n_inner
    max_seq_len = 2048  # TODO: we should be able to set it

    self.buffers["temp_state"] = torch.zeros((max_seq_len, intermediate_size), dtype=torch.float16, device=weights.device)
    self.buffers["temp_dq"] = torch.zeros((1, max_dq_buffer_size), dtype=torch.float16, device=weights.device)

    prepare_buffers(weights.device, self.buffers["temp_state"], self.buffers["temp_dq"])

    # TODO: ability to set them
    matmul_recons_thd = 8
    matmul_fused_remap = False
    matmul_no_half2 = False
    set_tuning_params(matmul_recons_thd, matmul_fused_remap, matmul_no_half2)

    torch.cuda.empty_cache()
I think this should go directly in the loading part (within weights).
That way it's truly agnostic to models.
I moved it to the Model init. This requires access to model.config, which is currently not defined though. There is model.transformer.config, model.gpt_neox.config, or model.model.config, depending on the architecture. Is it intended that the config is not registered at the top level? @OlivierDehaene @Narsil
The thing is that the weights = Weights(...) call is in each model definition, and we need to have loaded all weights to determine the shapes of the buffers. Also, the buffers need to be persistent, while I think this weights object is not.
The buffers are intended to be shared, no?
So why not just have a single location for these buffers, use the pointer on every layer, and increase the size every time max_dq_buffer_size = max(max_dq_buffer_size, submodule.linear.qweight.numel() * 8) is larger?
The issue with this (and any post-loading treatment) is that you're now dealing with updating every single model file any time one of those lines changes. This is what we had before and it was painful to maintain.
These seem to be used as globals, so let's just use them as globals. (IIUC, they are temporary buffers preallocated to avoid reallocating them all the time.)
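A rough sketch of the "single location" suggestion, assuming each quantized layer reports its size while weights are loaded and the scratch buffers are allocated once afterwards. All names here (`register_layer`, `create_buffers`, `MAX_DQ_ELEMS`, and so on) are hypothetical, and the allocator is passed in so the example runs without torch; in the real code it would presumably be a torch.zeros call with dtype torch.float16. The factor 8 assumes eight 4-bit values packed per int32 element of qweight, which is what the `* 8` in the diff suggests.

```python
from typing import Callable, Dict, Tuple

Shape = Tuple[int, int]

# Module-level state shared by every quantized layer (hypothetical names).
MAX_DQ_ELEMS = 0    # largest dequantized weight, counted in fp16 elements
MAX_INNER_DIM = 0   # largest intermediate size seen so far
BUFFERS: Dict[str, object] = {}

PACK_FACTOR = 8     # assumption: eight 4-bit values per int32 in qweight


def register_layer(qweight_numel: int, inner_dim: int) -> None:
    """Called once per quantized layer while its weights are loaded."""
    global MAX_DQ_ELEMS, MAX_INNER_DIM
    MAX_DQ_ELEMS = max(MAX_DQ_ELEMS, qweight_numel * PACK_FACTOR)
    MAX_INNER_DIM = max(MAX_INNER_DIM, inner_dim)


def create_buffers(alloc: Callable[[Shape], object],
                   max_seq_len: int = 2048) -> Dict[str, object]:
    """Allocate the shared scratch buffers once, after all layers registered."""
    BUFFERS["temp_state"] = alloc((max_seq_len, MAX_INNER_DIM))
    BUFFERS["temp_dq"] = alloc((1, MAX_DQ_ELEMS))
    return BUFFERS


def shape_only_alloc(shape: Shape) -> Shape:
    """Stand-in allocator that only records the requested shape."""
    return shape


# Two made-up layers; the buffers end up sized for the larger requirement.
register_layer(qweight_numel=1024, inner_dim=256)
register_layer(qweight_numel=4096, inner_dim=128)
buffers = create_buffers(shape_only_alloc)
```

With this layout no model file needs to know about the buffers: the loading code registers sizes, and one call after loading allocates them.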
server/text_generation_server/models/custom_modeling/flash_santacoder_modeling.py
For some reason the Edit: comes from the atomicAdd of the kernel - this is fine. I'll add llama support in this PR too.
This is very suspicious, really ? Isn't the purpose of atomicAdd to remove randomness by forcing access order ? :)
@Narsil I don't believe it is suspicious: https://forums.developer.nvidia.com/t/get-different-results-for-every-running-with-atomicadd/229649/2
Ahhh that level of randomness ! :) I see, yeah totally legit source of "randomness".
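The effect discussed here can be reproduced without a GPU. atomicAdd makes each individual addition safe, but it does not fix the order in which threads contribute, and floating-point addition is not associative. A small illustration in plain Python (double precision rather than fp16, purely for demonstration; the variable names are made up):

```python
# Same three numbers, two different accumulation orders.
a, b, c = 0.1, 0.2, 0.3
left_to_right = (a + b) + c
right_to_left = a + (b + c)
order_changes_result = left_to_right != right_to_left

# A starker case: a small term is absorbed when added to a large one first.
big, small = 1e16, 1.0
absorbed = (big + small) - big   # the 1.0 is lost to rounding
preserved = (big - big) + small  # the 1.0 survives
```

When many threads accumulate into the same output element, the arrival order can differ from run to run, so small differences like these show up in the results.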
Just trying to get the integration tests to pass.
Co-authored-by: Felix Marty <9808326+fxmarty@users.noreply.github.com>
Closing as superseded by #666
Examples:
This PR adds to TGI the mixed-precision int4/fp16 kernels from the excellent exllama repo, which in my benchmarks perform much better than the implementations available in autogptq & gptq-for-llama.
On batch size 1, for starcoder with GPTQ-4bit-no-actorder, we get a 2.1x speedup on the prefill over GPTQ-triton, and a 1.8x speedup on the decode over GPTQ-triton. I'll have a look at the peak memory.
I verified locally that logits match.
Note that the exllama implementation cannot be used with act-order & TP rank >= 2 for row tensor parallel linear layers, because exllama reorders weights ahead of runtime, which requires reordering the activations as well (and these are split across several GPUs for row parallel + TP rank >= 2). In this specific case, we default to the triton implementation (which is much slower because reordering is done on the scales/zero points, and each weight row needs to have its own specific scale/zero point).
The exllama implementation is specifically for n_bits = 4. Thus, for the other cases we fall back on the triton kernel.
Results on starcoder are as follows (TP rank = 2, A100, before vllm):
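The fallback rules described above can be summarized as a small decision function. This is only a sketch of the stated conditions, not the PR's actual code; the function name `choose_kernel` and its parameters are hypothetical, and "TP rank >= 2" is read as more than one tensor-parallel shard.

```python
def choose_kernel(bits: int, act_order: bool, row_parallel: bool,
                  world_size: int) -> str:
    """Pick a GPTQ matmul backend following the rules in the description."""
    # The exllama kernels only handle 4-bit weights.
    if bits != 4:
        return "triton"
    # With act-order, exllama would need the activations reordered, but for
    # row-parallel layers they are split across GPUs when sharding is used.
    if act_order and row_parallel and world_size > 1:
        return "triton"
    return "exllama"


# A few illustrative combinations.
single_gpu_act_order = choose_kernel(4, act_order=True, row_parallel=True, world_size=1)
sharded_act_order = choose_kernel(4, act_order=True, row_parallel=True, world_size=2)
eight_bit = choose_kernel(8, act_order=False, row_parallel=False, world_size=1)
```

Keeping the choice in one place like this is in line with the review request that model files stay agnostic to the kernel in use.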
GPTQ (current):
GPTQ-CUDA (exllama):