awq llama quantization #431
Labels: llm-quantization (All about Quantized LLM models and serving), python (Python code, tools, info), source-code (Code snippets)
We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not.
Model Conversion
Here's an example of the syntax for converting a model:
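(The exact command is not reproduced in this issue; the following is a plausible sketch assuming OpenNMT-py's llama-like converter script and the option names described in the list below — the script path and flag spellings are assumptions, and the output path is a placeholder.)

```bash
# Sketch of a converter invocation (script name and flags are assumptions,
# matching the parameters described in the list below):
python tools/convert_HF_llamalike.py \
    --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" \
    --output "/path/to/nous-hermes-llama2-awq-onmt" \
    --format safetensors
```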
- TheBloke/Nous-Hermes-Llama2-AWQ: The name of the repository/model on the Hugging Face Hub.
- output: Specifies the target directory and model name you want to save.
- format: Optionally, you can save as safetensors.

For llama-like models, we download the tokenizer.model and generate a vocab file during the process. If the model is an AWQ quantized model, we will convert it to an OpenNMT-py AWQ quantized model.
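As an illustration of how a converter can tell that a Hub checkpoint is already AWQ-quantized, the sketch below inspects the quantization_config block of the checkpoint's config.json. This is a hypothetical helper, not the converter's actual code, and field names vary between AWQ exporters.

```python
# Illustrative only: detect an AWQ-quantized Hugging Face checkpoint by
# reading the "quantization_config" block of its config.json.
import json
from huggingface_hub import hf_hub_download

def detect_awq(repo_id: str) -> bool:
    """Return True if the repo's config.json declares AWQ quantization."""
    config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
    with open(config_path) as f:
        config = json.load(f)
    quant_cfg = config.get("quantization_config", {})
    # Recent configs expose quant_method; older AWQ exports may only carry
    # fields such as w_bit / q_group_size, so treat those as a hint too.
    return quant_cfg.get("quant_method", "").lower() == "awq" or "w_bit" in quant_cfg

print(detect_awq("TheBloke/Nous-Hermes-Llama2-AWQ"))  # expected: True
```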
Config File
After converting, you will need a config file to run translate.py or run_mmlu_opennmt.py. Here's an example of the config:
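(The config example itself is not preserved in this issue; below is a minimal sketch of an OpenNMT-py-style inference config, assuming standard translate.py option names — exact keys depend on your OpenNMT-py version, and all paths are placeholders.)

```yaml
# Hypothetical OpenNMT-py inference config for an AWQ-converted llama-like model.
# Option names mirror common translate.py flags; adjust to your OpenNMT-py version.
transforms: [sentencepiece]
src_subword_model: "/path/to/tokenizer.model"
tgt_subword_model: "/path/to/tokenizer.model"
model: "/path/to/nous-hermes-llama2-awq-onmt.safetensors"
# Decoding settings
gpu: 0
batch_size: 8
beam_size: 1
max_length: 256
```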
Priority
When considering your priority:
Important Note
Offline Quantizer Script
We will provide an offline quantizer script for OpenNMT-py generic models. However, for small NMT models, AWQ may make things slower, so it might not be relevant for NMT.
vLLM Performance
Recently, Mistral reported 100 tokens/sec for Mistral-7B at batch size 1, and 1,250 tokens/sec for a batch of 60 prompts, using vLLM. When using Mistral-instruct-v0.2-onmt-awq, the performance was as follows: