awq llama quantization #431

Open · irthomasthomas opened this issue Jan 25, 2024 · 0 comments

Labels: llm-quantization (All about Quantized LLM models and serving), python (Python code, tools, info), source-code (Code snippets)

Quantization and Acceleration

We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not.

Model Conversion

Here's an example of the syntax for converting a model:

tools/convert_HF_llamalike.py --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" --output "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" --format safetensors
  • TheBloke/Nous-Hermes-Llama2-AWQ: the name of the repository/model on the Hugging Face Hub.
  • --output: the target directory and file name under which to save the converted model.
  • --format: optionally save the weights in safetensors format.

For llama-like models, we download tokenizer.model and generate a vocab file during the process. If the model is an AWQ quantized model, it is converted to an OpenNMT-py AWQ quantized model.
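
As a quick sanity check after conversion (illustrative only, not part of the converter; the path is the one from the example above), you can peek at the produced checkpoint, assuming it is a regular torch-serialized dict:

  import torch

  # Illustrative: inspect the converted OpenNMT-py checkpoint.
  # With --format safetensors the weights live in separate .safetensors files
  # and the .pt mainly carries vocab/options; key names vary by version.
  ckpt = torch.load("/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt", map_location="cpu")
  print(list(ckpt.keys()))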

Config File

After converting, you will need a config file to run translate.py or run_mmlu_opennmt.py. Here's an example of the config:

transforms: [sentencepiece]

# Subword
src_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"
tgt_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"

# Model info
model: "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt"

# Inference
# ...
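
The inference section is left open above. As an illustration only (these lines are not from the original post, and exact option names depend on your OpenNMT-py version), it typically sets device and decoding options:

  # Inference (illustrative sketch)
  gpu: 0
  seed: 42
  max_length: 256
  batch_type: sents
  batch_size: 1
  beam_size: 1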

Priority

When deciding which format to prioritize:

  • If you need small model files that fit in your GPU's VRAM, try AWQ, but note that it is slow at large batch sizes (a rough memory sketch follows this list).
  • AWQ models are faster than FP16 at batch size 1.
  • Read more: GitHub - casper-hansen/AutoAWQ
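
As a rough back-of-the-envelope sketch (illustrative numbers, not from the post): 4-bit AWQ weights need roughly a quarter of the memory of FP16 weights, which is why AWQ helps a model fit in VRAM:

  # Rough weight-memory estimate; ignores activations, KV cache and runtime overhead.
  params = 7e9                    # a 7B-parameter llama-like model
  fp16_gb = params * 2 / 1e9      # FP16: 2 bytes per weight   -> ~14 GB
  awq4_gb = params * 0.5 / 1e9    # AWQ 4-bit: 0.5 byte/weight -> ~3.5 GB
  print(f"FP16 ~{fp16_gb:.1f} GB, 4-bit AWQ ~{awq4_gb:.1f} GB")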

Important Note

  • There are two AWQ toolkits (llm-awq and AutoAWQ), and AutoAWQ supports two kernel flavors: GEMM and GEMV.
  • The original llm-awq from MIT is not regularly maintained, so we default to AutoAWQ.
  • If a model is tagged llm-awq on the HF Hub, we use AutoAWQ/GEMV, which is compatible with it. (A sketch for inspecting a repo's AWQ metadata follows this note.)
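
One illustrative way to see which flavor a given AWQ repo was packed for (not something OpenNMT-py requires) is to look at the quantization metadata shipped on the Hub; older AutoAWQ repos provide a quant_config.json, while newer ones embed a quantization_config block in config.json. Field names vary between repos:

  import json
  from huggingface_hub import hf_hub_download

  # Illustrative: fetch the AWQ metadata of the example repo used above.
  # Typical fields: w_bit, q_group_size, zero_point, version ("GEMM" or "GEMV").
  path = hf_hub_download("TheBloke/Nous-Hermes-Llama2-AWQ", "quant_config.json")
  with open(path) as f:
      print(json.load(f))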

Offline Quantizer Script

We will provide an offline quantizer script for OpenNMT-py generic models. However, for small NMT models, AWQ may make things slower, so it might not be relevant for NMT.

vLLM Performance

Recently, Mistral reported 100 tokens/sec for Mistral-7B at batch size 1, and 1250 tokens/sec for a batch of 60 prompts, using vLLM. With Mistral-instruct-v0.2-onmt-awq, the performance was as follows:

  • Batch size 1: 80.5 tokens/sec.
  • Batch size 60: 98 tokens/sec, with GEMV being 20-25% faster than GEMM.
  • These figures are for a GEMM model. For a fair comparison, adjust the throughput for the step 0 (prompt prefill) time, as sketched below.
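
A minimal sketch of that adjustment (all figures below are placeholders, not measurements):

  # Strip the prompt-prefill (step 0) time before comparing decode throughput.
  generated_tokens = 256     # tokens produced after the prompt (placeholder)
  total_time_s = 3.2         # wall-clock time for the whole request (placeholder)
  prefill_time_s = 0.2       # time spent on step 0 / prompt prefill (placeholder)

  raw_tps = generated_tokens / total_time_s
  adjusted_tps = generated_tokens / (total_time_s - prefill_time_s)
  print(f"raw: {raw_tps:.1f} tok/s, prefill-adjusted: {adjusted_tps:.1f} tok/s")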

Suggested labels: null
