awq llama quantization #431

Open · irthomasthomas opened this issue Jan 25, 2024 · 0 comments

Labels: llm-quantization (All about Quantized LLM models and serving), python (Python code, tools, info), source-code (Code snippets)

Quantization and Acceleration

We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not.

Model Conversion

Here's an example of the syntax for converting a model:

tools/convert_HF_llamalike.py --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" --output "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" --format safetensors
  • TheBloke/Nous-Hermes-Llama2-AWQ: the name of the repository/model on the Hugging Face Hub.
  • --output: the target directory and file name under which to save the converted model.
  • --format: optionally save the weights in safetensors format.

For llama-like models, we download tokenizer.model and generate a vocab file during the process. If the model is an AWQ quantized model, it is converted to an OpenNMT-py AWQ quantized model.
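
As a quick sanity check after conversion (illustrative only, not part of the converter; the path is the one from the example above), you can peek at the produced checkpoint, assuming it is a regular torch-serialized dict:

  import torch

  # Illustrative: inspect the converted OpenNMT-py checkpoint.
  # With --format safetensors the weights live in separate .safetensors files
  # and the .pt mainly carries vocab/options; key names vary by version.
  ckpt = torch.load("/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt", map_location="cpu")
  print(list(ckpt.keys()))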

Config File

After converting, you will need a config file to run translate.py or run_mmlu_opennmt.py. Here's an example of the config:

transforms: [sentencepiece]

# Subword
src_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"
tgt_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"

# Model info
model: "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt"

# Inference
# ...
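
The inference section is left open above. As an illustration only (these lines are not from the original post, and exact option names depend on your OpenNMT-py version), it typically sets device and decoding options:

  # Inference (illustrative sketch)
  gpu: 0
  seed: 42
  max_length: 256
  batch_type: sents
  batch_size: 1
  beam_size: 1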

Priority

When deciding which format to prioritize:

  • If you need small model files that fit in your GPU's VRAM, try AWQ, but note that it is slow at large batch sizes (a rough memory sketch follows this list).
  • AWQ models are faster than FP16 at batch size 1.
  • Read more: GitHub - casper-hansen/AutoAWQ
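
As a rough back-of-the-envelope sketch (illustrative numbers, not from the post): 4-bit AWQ weights need roughly a quarter of the memory of FP16 weights, which is why AWQ helps a model fit in VRAM:

  # Rough weight-memory estimate; ignores activations, KV cache and runtime overhead.
  params = 7e9                    # a 7B-parameter llama-like model
  fp16_gb = params * 2 / 1e9      # FP16: 2 bytes per weight   -> ~14 GB
  awq4_gb = params * 0.5 / 1e9    # AWQ 4-bit: 0.5 byte/weight -> ~3.5 GB
  print(f"FP16 ~{fp16_gb:.1f} GB, 4-bit AWQ ~{awq4_gb:.1f} GB")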

Important Note

  • There are two AWQ toolkits (llm-awq and AutoAWQ), and AutoAWQ supports two kernel flavors: GEMM and GEMV.
  • The original llm-awq from MIT is not regularly maintained, so we default to AutoAWQ.
  • If a model is tagged llm-awq on the HF Hub, we use AutoAWQ/GEMV, which is compatible with it. (A sketch for inspecting a repo's AWQ metadata follows this note.)
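
One illustrative way to see which flavor a given AWQ repo was packed for (not something OpenNMT-py requires) is to look at the quantization metadata shipped on the Hub; older AutoAWQ repos provide a quant_config.json, while newer ones embed a quantization_config block in config.json. Field names vary between repos:

  import json
  from huggingface_hub import hf_hub_download

  # Illustrative: fetch the AWQ metadata of the example repo used above.
  # Typical fields: w_bit, q_group_size, zero_point, version ("GEMM" or "GEMV").
  path = hf_hub_download("TheBloke/Nous-Hermes-Llama2-AWQ", "quant_config.json")
  with open(path) as f:
      print(json.load(f))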

Offline Quantizer Script

We will provide an offline quantizer script for OpenNMT-py generic models. However, for small NMT models, AWQ may make things slower, so it might not be relevant for NMT.

vLLM Performance

Recently, Mistral reported 100 tokens/sec for Mistral-7B at batch size 1, and 1250 tokens/sec for a batch of 60 prompts, using vLLM. With Mistral-instruct-v0.2-onmt-awq, the performance was as follows:

  • Batch size 1: 80.5 tokens/sec.
  • Batch size 60: 98 tokens/sec, with GEMV being 20-25% faster than GEMM.
  • These figures are for a GEMM model. For a fair comparison, adjust the throughput for the step 0 (prompt prefill) time, as sketched below.
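
A minimal sketch of that adjustment (all figures below are placeholders, not measurements):

  # Strip the prompt-prefill (step 0) time before comparing decode throughput.
  generated_tokens = 256     # tokens produced after the prompt (placeholder)
  total_time_s = 3.2         # wall-clock time for the whole request (placeholder)
  prefill_time_s = 0.2       # time spent on step 0 / prompt prefill (placeholder)

  raw_tps = generated_tokens / total_time_s
  adjusted_tps = generated_tokens / (total_time_s - prefill_time_s)
  print(f"raw: {raw_tps:.1f} tok/s, prefill-adjusted: {adjusted_tps:.1f} tok/s")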

Suggested labels: null
