Adding SqueezeLLM Support #3093

Open · wants to merge 5 commits into master

Conversation

@chooper1 commented Sep 9, 2023

SqueezeLLM Support

This PR adds support for the SqueezeLLM quantization method, which is described in the preprint https://arxiv.org/abs/2306.07629 and has open-source GPU inference code available at https://github.com/SqueezeAILab/SqueezeLLM. SqueezeLLM is a post-training quantization framework that enables high-accuracy, runtime-efficient quantization at low bit precision.

This PR contains the inference code to run the 4-bit dense-only non-uniform quantization scheme outlined in the preprint, as well as the code required to convert the Huggingface (PyTorch) checkpoints to the required binary format in order to be compatible with the llama.cpp file loader.

SqueezeLLM leverages non-uniform quantization to better represent the underlying weight distribution by shifting the quantization signposts to the optimal positions. SqueezeLLM shows promising performance both in terms of accuracy and in terms of runtime efficiency relative to the existing integrated quantization methods. The runtime per token was benchmarked on an M1 (without Metal) for 128 tokens, using checkpoints from https://huggingface.co/TheBloke/LLaMa-7B-GGML/:

[Screenshot: table comparing runtime per token, model size, and perplexity for SqueezeLLM against existing llama.cpp quantization formats]

(Edit: numbers updated to match the comments below, and to also include the model size with 8-bit embedding quantization. Precision estimates are from the link above.)
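
To make the non-uniform scheme above concrete, here is a minimal C sketch of what dense-only 4-bit lookup-table dequantization looks like: each weight is stored as a 4-bit index into a small table of learned quantization levels ("signposts"). The struct name, the per-row table granularity, and the nibble packing order are illustrative assumptions rather than the exact layout used by this PR.

```c
#include <stdint.h>

// Hypothetical layout: one 16-entry lookup table per row, followed by the
// row's weights packed as 4-bit indices into that table, two per byte.
typedef struct {
    float   lut[16];   // non-uniform quantization levels for this row
    uint8_t qs[];      // packed 4-bit indices
} block_q4_sq_row;

// Dequantize n weights of one row into y (n assumed to be even).
static void dequantize_row_q4_sq(const block_q4_sq_row * x, float * y, int n) {
    for (int i = 0; i < n; i += 2) {
        const uint8_t b = x->qs[i/2];
        y[i + 0] = x->lut[b & 0x0F];  // low nibble  -> first weight
        y[i + 1] = x->lut[b >> 4];    // high nibble -> second weight
    }
}
```

Unlike uniform formats such as Q4_0, the 16 levels are not constrained to be evenly spaced, which is what allows the quantizer to place them where the weight distribution is densest.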

Quantized checkpoints are publicly available for a range of popular models, including LLaMA 1/2, Vicuna 1.1/1.3/1.5, XGen, and OPT: https://huggingface.co/squeeze-ai-lab

Example usage (for the 7B LLaMA-2 model):

  • Build without Metal (run LLAMA_NO_METAL=1 make)
  • Download the model from https://huggingface.co/squeeze-ai-lab/sq-llama-2-7b-w4-s0 (copy the .pt file into the directory “models/7B/sq-llama-2-7b-w4-s0”)
  • Copy the LLaMA-2 tokenizer.model and config.json files into the sq-llama-2-7b-w4-s0 folder
  • Convert the PyTorch checkpoint to GGUF format:
    python convert-sqllm-to-gguf.py --outtype q4_sq models/7B/sq-llama-2-7b-w4-s0/sq-llama-2-7b-w4-s0.pt --outfile models/7B/sq-llama-2-7b-w4-s0-fp16.gguf -squeezellm
    The --equant flag can also be passed to quantize the input and output embeddings to 8 bits.
  • Run generation using the converted checkpoint:
    ./main -m models/7B/sq-llama-2-7b-w4-s0-fp16.gguf -n 128

@chooper1 marked this pull request as ready for review September 9, 2023 05:09
@casper-hansen

It seems this is an inference-only implementation? How can we quantize new/custom models with SqueezeLLM?

@ggerganov (Owner) commented Sep 9, 2023

Interesting work!

Can you try to update the ggml -> gguf convert step - it currently fails:

python3 convert-llama-ggml-to-gguf.py --input models/llama-7b-v2/ggml-model-q4_sq.bin --output models/llama-7b-v2/ggml-model-q4_sq.gguf --squeezellm
* Using config: Namespace(input=PosixPath('models/llama-7b-v2/ggml-model-q4_sq.bin'), output=PosixPath('models/llama-7b-v2/ggml-model-q4_sq.gguf'), name=None, desc=None, gqa=1, eps='5.0e-06', context_length=2048, model_metadata_dir=None, vocab_dir=None, vocabtype='spm', squeezellm=True)

=== WARNING === Be aware that this conversion script is best-effort. Use a native GGUF model if possible. === WARNING ===

- Note: If converting LLaMA2, specifying "--eps 1e-5" is required. 70B models also need "--gqa 8".
* Scanning GGML input file
* File format: GGJTv1 with ftype MOSTLY_Q5_K_S
* GGML model hyperparameters: <Hyperparameters: n_vocab=32000, n_embd=4096, n_mult=256, n_head=32, n_layer=0, n_rot=128, n_ff=11008, ftype=MOSTLY_Q5_K_S>

=== WARNING === Special tokens may not be converted correctly. Use --model-metadata-dir if possible === WARNING ===

* Preparing to save GGUF file
* Adding model parameters and KV items
* Adding 32000 vocab item(s)
* Adding 291 tensor(s)
Traceback (most recent call last):
  File "/Users/ggerganov/development/github/llama.cpp/convert-llama-ggml-to-gguf.py", line 458, in <module>
    main()
  File "/Users/ggerganov/development/github/llama.cpp/convert-llama-ggml-to-gguf.py", line 454, in main
    converter.save()
  File "/Users/ggerganov/development/github/llama.cpp/convert-llama-ggml-to-gguf.py", line 252, in save
    self.add_tensors(gguf_writer)
  File "/Users/ggerganov/development/github/llama.cpp/convert-llama-ggml-to-gguf.py", line 360, in add_tensors
    assert mapped_name is not None, f'Bad name {name}'
           ^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Bad name layers.0.attention.wq.weight

Edit:

Also, double check your "Model Size" column. These are the correct numbers in bytes:

ls -l models/llama-7b-v2/
total 197602496
-rw-r--r--  1 ggerganov  staff  13478104576 Aug 30 11:27 ggml-model-f16.gguf
-rw-r--r--  1 ggerganov  staff  26954272000 Aug 26 23:18 ggml-model-f32.gguf
-rw-r--r--  1 ggerganov  staff   2825940544 Aug 30 11:53 ggml-model-q2_k.gguf
-rw-r--r--  1 ggerganov  staff   3298004544 Aug 30 11:53 ggml-model-q3_k.gguf
-rw-r--r--  1 ggerganov  staff   2948304448 Sep  2 10:21 ggml-model-q3_k_s.gguf
-rw-r--r--  1 ggerganov  staff   3825806912 Aug 30 11:52 ggml-model-q4_0.gguf
-rw-r--r--  1 ggerganov  staff   4238749248 Aug 30 11:52 ggml-model-q4_1.gguf
-rw-r--r--  1 ggerganov  staff   4081004096 Aug 30 11:52 ggml-model-q4_k.gguf
-rw-r--r--  1 ggerganov  staff   3856739904 Sep  2 10:21 ggml-model-q4_k_s.gguf
-rw-r--r--  1 ggerganov  staff   3807322752 Sep  9 11:38 ggml-model-q4_sq.bin
-rw-r--r--  1 ggerganov  staff   4651691584 Aug 30 11:52 ggml-model-q5_0.gguf
-rw-r--r--  1 ggerganov  staff   5064633920 Aug 30 11:52 ggml-model-q5_1.gguf
-rw-r--r--  1 ggerganov  staff   4783156800 Aug 30 11:52 ggml-model-q5_k.gguf
-rw-r--r--  1 ggerganov  staff   4651691584 Sep  2 10:20 ggml-model-q5_k_s.gguf
-rw-r--r--  1 ggerganov  staff   5529194048 Aug 30 11:51 ggml-model-q6_k.gguf
-rw-r--r--  1 ggerganov  staff   7161089600 Aug 30 11:51 ggml-model-q8_0.gguf

Divide by 1e9 to get size in GB. For example, ggml-model-q4_sq.bin is 3.81 GB

@KerfuffleV2 (Collaborator) commented Sep 9, 2023

> AssertionError: Bad name layers.0.attention.wq.weight

I think this is because the name doesn't need to be mapped and the name-mapping code doesn't support an identity operation. We should probably fix that; it would be an easy change.

Edit: fixed it with #3095

@chooper1 (Author) commented Sep 9, 2023

> Interesting work!
>
> Can you try to update the ggml -> gguf convert step - it currently fails:
>
> python3 convert-llama-ggml-to-gguf.py --input models/llama-7b-v2/ggml-model-q4_sq.bin --output models/llama-7b-v2/ggml-model-q4_sq.gguf --squeezellm
> * Using config: Namespace(input=PosixPath('models/llama-7b-v2/ggml-model-q4_sq.bin'), output=PosixPath('models/llama-7b-v2/ggml-model-q4_sq.gguf'), name=None, desc=None, gqa=1, eps='5.0e-06', context_length=2048, model_metadata_dir=None, vocab_dir=None, vocabtype='spm', squeezellm=True)
>
> === WARNING === Be aware that this conversion script is best-effort. Use a native GGUF model if possible. === WARNING ===
>
> - Note: If converting LLaMA2, specifying "--eps 1e-5" is required. 70B models also need "--gqa 8".
> * Scanning GGML input file
> * File format: GGJTv1 with ftype MOSTLY_Q5_K_S
> * GGML model hyperparameters: <Hyperparameters: n_vocab=32000, n_embd=4096, n_mult=256, n_head=32, n_layer=0, n_rot=128, n_ff=11008, ftype=MOSTLY_Q5_K_S>
>
> === WARNING === Special tokens may not be converted correctly. Use --model-metadata-dir if possible === WARNING ===
>
> * Preparing to save GGUF file
> * Adding model parameters and KV items
> * Adding 32000 vocab item(s)
> * Adding 291 tensor(s)
> Traceback (most recent call last):
>   File "/Users/ggerganov/development/github/llama.cpp/convert-llama-ggml-to-gguf.py", line 458, in <module>
>     main()
>   File "/Users/ggerganov/development/github/llama.cpp/convert-llama-ggml-to-gguf.py", line 454, in main
>     converter.save()
>   File "/Users/ggerganov/development/github/llama.cpp/convert-llama-ggml-to-gguf.py", line 252, in save
>     self.add_tensors(gguf_writer)
>   File "/Users/ggerganov/development/github/llama.cpp/convert-llama-ggml-to-gguf.py", line 360, in add_tensors
>     assert mapped_name is not None, f'Bad name {name}'
>            ^^^^^^^^^^^^^^^^^^^^^^^
> AssertionError: Bad name layers.0.attention.wq.weight

Thank you for trying this out! I fixed the steps listed in the comment above - you also need to copy the config.json file as well as the tokenizer file before running the first conversion step (pt -> ggml). Please let me know if there are any issues with this.

@KerfuffleV2 (Collaborator)

Is there a reason to include a way to convert to GGML? As far as I know, there isn't really a use case for creating new GGML-format files, so you can probably make your life easier by just not having to worry about that.

@ikawrakow (Contributor) commented Sep 10, 2023

Adding SqueezeLLM to llama.cpp is really great! People have been asking for it for so long.

It would also be great if the table was updated with the actual perplexity values for the various existing quantization types listed in the table. For instance, the current LLaMA-v1-7B perplexity for Q4_0 is 6.1213, not 6.16. In the same way, we have PPL(Q4_K_S) = 6.0067, not 6.05 as per the table. Q4_K_S perplexity has never been 6.05: in the initial k_quants PR (#1684), Q4_K_S perplexity was 6.0215, it became 6.0067 after the re-tuning in PR #2816 (which also resulted in a slight increase in model size, from 3.83 GB to 3.86 GB as listed in the table). Q4_0 dropped from 6.1563 to 6.1213 when we started quantizing the output.weight tensor with Q6_K. If I remember correctly, the change happened in early June, so quite some time ago.

As the provided SqueezeLLM perplexity is for LLaMA-v1-7B, I was curious to see how this approach would perform for LLaMA-v2-7B. I followed the instructions to create the GGUF model and ran the perplexity tool. Calculation is very slow (6+ hours on my M2 Max), so I stopped after 344 batches. At that point, SqueezeLLM perplexity was higher than Q4_K_S by 0.0844. In my experience, after 300 batches the perplexity difference between two models matches the difference for the full calculation to within +/- 0.002. So, the projected SqueezeLLM perplexity for LLaMA-v2-7B is 5.96-5.97. This is to be compared with PPL(Q4_0) = 5.94 and PPL(Q4_K_S) = 5.88.

On the bright side, I'm noticing that the provided model does not quantize the tok_embeddings and output.weight tensors. We know that one can quantize tok_embeddings with Q4_0 and output.weight with Q8_0 with negligible loss in accuracy. This would shave off 212 MB from the model size, so the SqueezeLLM model would become 3.6 GB and hence become comparable in size to Q3_K_L, which is 3.55 GB and has a LLaMA-v2-7B perplexity of 5.9811. The model size vs. perplexity for LLaMA-v2-7B is illustrated in the following figure:

[Figure: model size vs. perplexity for LLaMA-v2-7B across quantization types]

Update: I downloaded a quantized model from https://huggingface.co/TheBloke/LLaMa-7B-GGML/ and I see that, indeed, in these models output.weight has not been quantized with Q6_K. Going back through the commit history, I see that Q6_K quantization of output.weight was disabled between commits 7a74dee (June 6) and 74a6d92 (June 12). @TheBloke must have prepared the GGMLV3 quantized models posted on HF in exactly that period.

@chooper1 (Author)

> Is there a reason to include a way to convert to GGML? As far as I know, there isn't really a use case for creating new GGML-format files, so you can probably make your life easier by just not having to worry about that.

Thank you for the feedback! I've updated the file conversion code to convert directly to GGUF.

@chooper1 (Author)

> Adding SqueezeLLM to llama.cpp is really great! People have been asking for it for so long.
>
> It would also be great if the table was updated with the actual perplexity values for the various existing quantization types listed in the table. For instance, the current LLaMA-v1-7B perplexity for Q4_0 is 6.1213, not 6.16. In the same way, we have PPL(Q4_K_S) = 6.0067, not 6.05 as per the table. Q4_K_S perplexity has never been 6.05: in the initial k_quants PR (#1684), Q4_K_S perplexity was 6.0215, it became 6.0067 after the re-tuning in PR #2816 (which also resulted in a slight increase in model size, from 3.83 GB to 3.86 GB as listed in the table). Q4_0 dropped from 6.1563 to 6.1213 when we started quantizing the output.weight tensor with Q6_K. If I remember correctly, the change happened in early June, so quite some time ago.
>
> As the provided SqueezeLLM perplexity is for LLaMA-v1-7B, I was curious to see how this approach would perform for LLaMA-v2-7B. I followed the instructions to create the GGUF model and ran the perplexity tool. Calculation is very slow (6+ hours on my M2 Max), so I stopped after 344 batches. At that point, SqueezeLLM perplexity was higher than Q4_K_S by 0.0844. In my experience, after 300 batches the perplexity difference between two models matches the difference for the full calculation to within +/- 0.002. So, the projected SqueezeLLM perplexity for LLaMA-v2-7B is 5.96-5.97. This is to be compared with PPL(Q4_0) = 5.94 and PPL(Q4_K_S) = 5.88.
>
> On the bright side, I'm noticing that the provided model does not quantize the tok_embeddings and output.weight tensors. We know that one can quantize tok_embeddings with Q4_0 and output.weight with Q8_0 with negligible loss in accuracy. This would shave off 212 MB from the model size, so the SqueezeLLM model would become 3.6 GB and hence become comparable in size to Q3_K_L, which is 3.55 GB and has a LLaMA-v2-7B perplexity of 5.9811. The model size vs. perplexity for LLaMA-v2-7B is illustrated in the following figure:
>
> [Figure: model size vs. perplexity for LLaMA-v2-7B across quantization types]
>
> Update: I downloaded a quantized model from https://huggingface.co/TheBloke/LLaMa-7B-GGML/ and I see that, indeed, in these models output.weight has not been quantized with Q6_K. Going back through the commit history, I see that Q6_K quantization of output.weight was disabled between commits 7a74dee (June 6) and 74a6d92 (June 12). @TheBloke must have prepared the GGMLV3 quantized models posted on HF in exactly that period.

Thank you for pointing this out! I've updated the table to reflect the current perplexities for the existing quantization methods, and I've also added a row corresponding to quantizing the input and output embeddings to Q8_0. I'll test quantizing the input embedding with Q4_0 to see if we can also do this with minimal degradation.

@chooper1 (Author)

@ggerganov @KerfuffleV2 @ikawrakow Just following up to see if there was any additional feedback or if the updates look good. Please let me know what else we need to do to integrate this!

@KerfuffleV2 (Collaborator) commented Sep 19, 2023

I'm hesitant to post stuff in pulls that isn't really contributing, since it spams people's notifications, but I just wanted to say I'm not ignoring the @. I'm just not really a person with the ability/authority to decide whether it gets accepted. Unfortunately, I also haven't really had a chance to look too closely at this yet, so I don't have any other feedback to add right now. Edit: I think I can make the checks run for you, though.

@ggerganov (Owner)

I'm hesitant to integrate this for a few reasons:

  • There is no model quantization implementation provided. Not sure how difficult this would be.
  • There are no AVX or GPU kernels yet. Would the table lookup still be efficient in those cases? (See the illustrative sketch after this list.)
  • Adding the above would be a huge amount of extra code to maintain, and I am not sure it is worth it given the current results.
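
As an illustration of that concern, this is not code from the PR, but on AVX2 a per-row 16-entry table lookup would most likely become a vector gather. A minimal sketch, reusing the hypothetical layout from the earlier dequantization example (4-bit indices packed two per byte, low nibble first, with a per-row float table):

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Dot product of one LUT-quantized row (4-bit indices in qs, levels in lut)
// with a float vector y of length n (assumed to be a multiple of 8).
static float vec_dot_q4_lut_avx2(int n, const uint8_t * qs,
                                 const float * lut, const float * y) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        // Unpack 4 bytes -> 8 nibble indices, one per 32-bit lane.
        uint32_t packed;
        memcpy(&packed, qs + i/2, sizeof(packed));
        const __m256i idx = _mm256_set_epi32(
            (packed >> 28) & 0xF, (packed >> 24) & 0xF,
            (packed >> 20) & 0xF, (packed >> 16) & 0xF,
            (packed >> 12) & 0xF, (packed >>  8) & 0xF,
            (packed >>  4) & 0xF, (packed      ) & 0xF);
        // Table lookup as a gather: w[j] = lut[idx[j]].
        const __m256 w = _mm256_i32gather_ps(lut, idx, 4);
        acc = _mm256_fmadd_ps(w, _mm256_loadu_ps(y + i), acc);
    }
    // Horizontal sum of the 8 partial sums.
    __m128 sum = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    sum = _mm_hadd_ps(sum, sum);
    sum = _mm_hadd_ps(sum, sum);
    return _mm_cvtss_f32(sum);
}
```

The 16-entry table itself easily fits in L1 cache, so the open question is mainly the cost of the gather and of the nibble unpacking compared to the arithmetic-only dequantization used by the uniform formats; that would need benchmarking.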

For now, we should keep this branch as a PoC.

@ggerganov added the "demo" label (Demonstrate some concept or idea, not intended to be merged) on Sep 27, 2023