support 2bit quip# method #1293

Minami-su · 2023-12-23T01:10:56Z

https://github.com/Cornell-RelaxML/quip-sharp

Minami-su · 2023-12-23T01:11:39Z

younesbelkada · 2023-12-28T18:28:16Z

Hi!
We are definitely interested to add Quip# inference support in HF ecosystem similarly as GPTQ, AWQ, etc.!
tagging one of the main author of Quip# here: @tsengalb99 - what is the current canonical way to use Quip# kernels, is there a package on pypi with pre-compiled kernels for users to run inference ? We can also support inference with kernels that needs to be manually built by users (e.g. for llm-awq package (AWQ) we defined a variable "backend" in the quantization config:https://github.com/huggingface/transformers/blob/main/src/transformers/utils/quantization_config.py#L58 and users switch from different backend if they use the kernels from the official repository or autoawq - that way if in the future there is a package that stores compiled kernels for Quip# we can support that easily by just swapping the backend)

cc @SunMarc @Titus-von-Koeller FYI

tsengalb99 · 2023-12-29T08:30:33Z

The canonical way to install QuIP# kernels is to install the fast-hadamard-transform package and build quiptools (in our codebase on github). We do not have a pypi package yet but are planning on having one in the future when the project becomes more stable. The two key "linear" classes that QuIP# relies on are here https://github.com/Cornell-RelaxML/quip-sharp/tree/main/lib/linear and you can see how we replace nn.Linear in llama with those classes in https://github.com/Cornell-RelaxML/quip-sharp/blob/main/model/llama.py.

A few questions: how are you planning on integrating QuIP# into huggingface code? Where will it be integrated into and how will you keep up with future itertations on QuIP#?

younesbelkada · 2023-12-29T09:12:51Z

Hi @tsengalb99
Thanks for your response! I am not 100% familiar yet with Quip# but what I had in mind was to go for a similar approach than AWQ, i.e. replacing torch.nn.Linear layers with QuantizedLinear from Quip# codebase. The core code would live inside a file quip.py here and we would replace the linear layers at init before loading the weights. Specifically, we would support Quip# inference and not quantization, for quantizing with Quip# we will redirect users to use your codebase.
To detect whether a model has been quantized with quip# we will add an attribute quantization_config in the config object, e.g. :https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-AWQ/blob/main/config.json#L20 and retrieve all neceassry arguments from there

tsengalb99 · 2024-01-03T02:27:25Z

I think it would be best to avoid duplicating code from the QuIP# codebase. The QuantizedLinear class is not standalone and relies on implementations in the codebook files (eg here for E8P https://github.com/Cornell-RelaxML/quip-sharp/blob/1d6e3c2d4c144eba80b945cca5429ce8d79d2cec/lib/codebook/latticee8_padded12.py#L180), which means you'll have to copy all those over as well. QuIP# is still in active development and we will almost certainly make changes to the codebooks in the future that will require you to update your copies as well. Perhaps you can include QuIP# as a submodule or something similar so users only have to pull our code once.

younesbelkada · 2024-01-29T22:31:29Z

We still plan to support Quip# inference, @tsengalb99 I will provide more details once huggingface/transformers#28703 gets merged

younesbelkada · 2024-01-30T06:41:10Z

Hi @tsengalb99 !
Great news 🎉 - we just merged: huggingface/transformers#26610 to enable developers to easily add new quantization method inference support in HF transformers! Would you like to try your hands on integrating quip# inference support in HF transformers? There is a detailed guideline on how to get started here: https://huggingface.co/docs/transformers/main/en/hf_quantizer Let us know what do you think !

tsengalb99 · 2024-02-02T21:51:55Z

Hi Younes, I’ll take a look at that, it definitely sounds interesting!

younesbelkada · 2024-02-03T00:43:29Z

Awesome, thanks very much @tsengalb99 !

younesbelkada · 2024-02-13T01:57:40Z

Hi @tsengalb99 ! Let me know if you need any help to kickoff Quip# integration in transformers! 🙏 With the recent quantizer support it should be quite straightforward and I am happy to help if needed

tsengalb99 · 2024-02-13T02:03:02Z

Hi Younes, will do. I got caught up with some other stuff but just released the updated quip# code and models today (https://github.com/Cornell-RelaxML/quip-sharp, https://arxiv.org/abs/2402.04396). Hoping to get integration going soon.

younesbelkada · 2024-02-13T02:04:08Z

Awesome, thanks so much @tsengalb99 let me know if you face into any issue!

tsengalb99 · 2024-02-22T03:49:50Z

@younesbelkada we've finally started working on this, expect some progress in a week or so.

younesbelkada · 2024-02-22T04:07:13Z

Nice, thanks very much ! Let me know if you need any help or guidance !
You could take some inspiration from: huggingface/transformers#28928 to get started ! 🙏 Thanks again and looking forward to the PR ! 🚀

github-actions · 2024-03-17T15:03:36Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

tsengalb99 · 2024-03-17T15:39:24Z

We are still working on integration, albeit very slowly.

younesbelkada · 2024-03-18T08:45:53Z

thanks again @tsengalb99 ! 🚀

Minami-su · 2024-03-27T03:17:52Z

AQLM is already fine #1476

github-actions · 2024-04-20T15:03:32Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

tsengalb99 · 2024-06-04T16:52:02Z

We have a better method coming out soon so quip# development has been superceded. We may eventually get around to hf support but without working cuda graphs during general its difficult to justify spending time on integration. Get Outlook for Android<https://aka.ms/AAb9ysg>

…

________________________________ From: github-actions[bot] ***@***.***> Sent: Tuesday, June 4, 2024 8:04:11 AM To: huggingface/peft ***@***.***> Cc: Albert Tseng ***@***.***>; Mention ***@***.***> Subject: Re: [huggingface/peft] support 2bit quip# method (Issue #1293) Closed #1293<#1293> as completed. — Reply to this email directly, view it on GitHub<#1293 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AH6WZSHGHP4PU4NA4X44KATZFXJOXAVCNFSM6AAAAABBANQF2CVHI2DSMVQWIX3LMV45UABCJFZXG5LFIV3GK3TUJZXXI2LGNFRWC5DJN5XDWMJTGAZTQMBRGI2DIMI>. You are receiving this because you were mentioned.Message ID: ***@***.***>

SunMarc · 2024-06-05T11:31:20Z

Thanks for the update @tsengalb99 ! Very excited for this new methods 🔥 Would you mind explaining a bit more why cuda graphs are needed ? Also, in general, do you have any recommendation on what to improve on transformers to allow better support of quantization methods ?

tsengalb99 · 2024-06-06T18:11:51Z

Hi Marc, Cuda graphs are essential for fast inference since they mask out much of the kernel launch overheads. Many quantization algorithms like QuIP# use multiple kernels during inference and the launch overhead can often be much higher than the actual inference part. For example, with Cuda graphs, QuIP# can hit 170 tok/s for 2 bit 7B. Without, iirc it does around 20-30 tok/s. Much of this could be solved by kernel fusion, and groups that have large teams working on quantization have the engineering manpower to do that. However, smaller teams like ours can't always do everything, so having Cuda graph support would be very useful. Get Outlook for Android<https://aka.ms/AAb9ysg>

…

________________________________ From: Marc Sun ***@***.***> Sent: Wednesday, June 5, 2024 4:31:41 AM To: huggingface/peft ***@***.***> Cc: Albert Tseng ***@***.***>; Mention ***@***.***> Subject: Re: [huggingface/peft] support 2bit quip# method (Issue #1293) Thanks for the update @tsengalb99<https://github.com/tsengalb99> ! Very excited for this new methods 🔥 Would you mind explaining a bit more why cuda graphs are needed ? Also, in general, do you have any recommendation on what to improve on transformers to allow better support of quantization methods ? — Reply to this email directly, view it on GitHub<#1293 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AH6WZSABTGHSQYYBA4EEGG3ZF3ZJ3AVCNFSM6AAAAABBANQF2CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNBZGYYDENZSGE>. You are receiving this because you were mentioned.Message ID: ***@***.***>

ArthurZucker · 2024-06-07T08:19:06Z

Cuda graphs are supported in transformers for models that support static kv cache

tsengalb99 · 2024-07-06T20:07:31Z

Is there a list of such models and a guide on how to use cuda graphs with transformers? I just tried torch.compile(model.generate, mode=’reduce-overhead’) on transformers 4.42.3 with Llama 2 7B and get the following error ```

>> gen(input)

Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn return fn(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^ File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_dynamo/external_utils.py", line 36, in inner return fn(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^ File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/transformers/generation/utils.py", line 1538, in generate @torch.no_grad() File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/transformers/generation/utils.py", line 1456, in _prepare_special_tokens def _prepare_special_tokens( File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn return fn(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^ File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_dynamo/external_utils.py", line 36, in inner return fn(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^ File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 917, in forward return compiled_fn(full_args) ^^^^^^^^^^^^^^^^^^^^^^ File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/utils.py", line 89, in g return f(*args) ^^^^^^^^ File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 106, in runtime_wrapper all_outs = call_func_at_runtime_with_args( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/utils.py", line 113, in call_func_at_runtime_with_args out = normalize_as_list(f(args)) ^^^^^^^ File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 152, in rng_functionalization_wrapper return compiled_fw(args) ^^^^^^^^^^^^^^^^^ File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 906, in __call__ return self.get_current_callable()(inputs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 838, in run return compiled_fn(new_inputs) ^^^^^^^^^^^^^^^^^^^^^^^ File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_inductor/cudagraph_trees.py", line 381, in deferred_cudagraphify copy_misaligned_inputs(inputs, check_input_idxs) File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 751, in copy_misaligned_inputs if new_inputs[i].data_ptr() % ALIGNMENT: ^^^^^^^^^^^^^^^^^^^^^^^^ RuntimeError: Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent run. Stack trace: File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/transformers/generation/utils.py", line 1500, in _prepare_special_tokens eos_token_id = eos_token_id.unsqueeze(0). To prevent overwriting, clone the tensor outside of torch.compile() or call torch.compiler.cudagraph_mark_step_begin() before each model invocation. ``` From: Arthur ***@***.***> Sent: Friday, June 7, 2024 1:19 AM To: huggingface/peft ***@***.***> Cc: Albert Tseng ***@***.***>; Mention ***@***.***> Subject: Re: [huggingface/peft] support 2bit quip# method (Issue #1293) Cuda graphs are supported in transformers for models that support static kv cache — Reply to this email directly, view it on GitHub <#1293 (comment)> , or unsubscribe <https://github.com/notifications/unsubscribe-auth/AH6WZSG34YNNI7KO6ARHYBDZGFUI7AVCNFSM6AAAAABBANQF2CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNJUGMZTKMBYGU> . You are receiving this because you were mentioned.Message ID: ***@***.***>

ArthurZucker · 2024-07-11T16:47:43Z

The compile should be run on the forward not generate for now! huggingface/transformers#30788 will add end to end spport

younesbelkada added the PRs welcome to address this contributions are welcome from community members on this issue label Dec 28, 2023

huggingface deleted a comment from github-actions bot Jan 29, 2024

github-actions bot closed this as completed Apr 28, 2024

SunMarc reopened this Apr 29, 2024

github-actions bot closed this as completed May 7, 2024

SunMarc reopened this May 7, 2024

github-actions bot closed this as completed May 16, 2024

SunMarc reopened this May 16, 2024

github-actions bot closed this as completed May 25, 2024

SunMarc reopened this May 27, 2024

github-actions bot closed this as completed Jun 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

support 2bit quip# method #1293

support 2bit quip# method #1293

Minami-su commented Dec 23, 2023

Minami-su commented Dec 23, 2023

younesbelkada commented Dec 28, 2023 •

edited

Loading

tsengalb99 commented Dec 29, 2023

younesbelkada commented Dec 29, 2023 •

edited

Loading

tsengalb99 commented Jan 3, 2024

younesbelkada commented Jan 29, 2024

younesbelkada commented Jan 30, 2024

tsengalb99 commented Feb 2, 2024 via email •

edited by Titus-von-Koeller

Loading

younesbelkada commented Feb 3, 2024

younesbelkada commented Feb 13, 2024

tsengalb99 commented Feb 13, 2024 via email •

edited by Titus-von-Koeller

Loading

younesbelkada commented Feb 13, 2024

tsengalb99 commented Feb 22, 2024

younesbelkada commented Feb 22, 2024

github-actions bot commented Mar 17, 2024

tsengalb99 commented Mar 17, 2024

younesbelkada commented Mar 18, 2024

Minami-su commented Mar 27, 2024

github-actions bot commented Apr 20, 2024

tsengalb99 commented Jun 4, 2024 via email

SunMarc commented Jun 5, 2024

tsengalb99 commented Jun 6, 2024 via email

ArthurZucker commented Jun 7, 2024

tsengalb99 commented Jul 6, 2024 via email

ArthurZucker commented Jul 11, 2024

support 2bit quip# method #1293

support 2bit quip# method #1293

Comments

Minami-su commented Dec 23, 2023

Minami-su commented Dec 23, 2023

younesbelkada commented Dec 28, 2023 • edited Loading

tsengalb99 commented Dec 29, 2023

younesbelkada commented Dec 29, 2023 • edited Loading

tsengalb99 commented Jan 3, 2024

younesbelkada commented Jan 29, 2024

younesbelkada commented Jan 30, 2024

tsengalb99 commented Feb 2, 2024 via email • edited by Titus-von-Koeller Loading

younesbelkada commented Feb 3, 2024

younesbelkada commented Feb 13, 2024

tsengalb99 commented Feb 13, 2024 via email • edited by Titus-von-Koeller Loading

younesbelkada commented Feb 13, 2024

tsengalb99 commented Feb 22, 2024

younesbelkada commented Feb 22, 2024

github-actions bot commented Mar 17, 2024

tsengalb99 commented Mar 17, 2024

younesbelkada commented Mar 18, 2024

Minami-su commented Mar 27, 2024

github-actions bot commented Apr 20, 2024

tsengalb99 commented Jun 4, 2024 via email

SunMarc commented Jun 5, 2024

tsengalb99 commented Jun 6, 2024 via email

ArthurZucker commented Jun 7, 2024

tsengalb99 commented Jul 6, 2024 via email

ArthurZucker commented Jul 11, 2024

younesbelkada commented Dec 28, 2023 •

edited

Loading

younesbelkada commented Dec 29, 2023 •

edited

Loading

tsengalb99 commented Feb 2, 2024 via email •

edited by Titus-von-Koeller

Loading

tsengalb99 commented Feb 13, 2024 via email •

edited by Titus-von-Koeller

Loading