diff --git a/README.md b/README.md
index 0f1baf66..1f5393ca 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,6 @@
+
AutoRound
===========================
Advanced Quantization Algorithm for LLMs
@@ -29,6 +30,8 @@ and [fbaldassarri](https://huggingface.co/fbaldassarri).
+
+
## What's New
* [2024/03] The INT2-mixed R1 model (~200GB) retains 97.9% accuracy. Check
@@ -36,9 +39,9 @@ and [fbaldassarri](https://huggingface.co/fbaldassarri).
* [2024/01] We provide experimental support for GGUF q4_0 and q4_1 formats.
* [2024/11] We provide experimental support for VLM quantization, please check out
the [README](./auto_round/mllm/README.md)
-
## Installation
+
### Install from pypi
```bash
@@ -67,6 +70,7 @@ pip install auto-round-lib
```
+
## Model Quantization
@@ -87,7 +91,7 @@ auto-round \
--output_dir ./tmp_autoround
```
-We provide two recipes for best accuracy and fast running speed with low memory. Details as below.
+We provide recipes for the 'auto-round-best' and 'auto-round-light' modes, which target best accuracy and fast running speed with low memory, respectively. Details are below.
Other Recipes
@@ -102,15 +106,43 @@ auto-round-best \
```
```bash
+## light mode: faster tuning with lower memory use, at a possible small accuracy cost
+auto-round-light \
+ --model facebook/opt-125m \
+ --bits 4 \
+ --group_size 128 \
+ --disable_eval
+ ```
+
+
+
+
+#### Auto-Round Recipe Results
+In general, we recommend the auto-round default mode. When resources or quantization time are the priority, the auto-round-light mode is a good choice for models larger than 3B. For 2-bit scenarios, we recommend the auto-round-best mode.
+
+- Average accuracy over 13 tasks (W4G128) and time cost (with enable_torch_compile)
+
+ | Model | | | Accuracy | | | | Time Cost | |
+ |---------------|:-------------------|:--------|:---------|--------|-|:-----------|:---------|:-------|
+ | | 16bits | Best | Default | Light || Best | Default | Light |
+ | Qwen2.5-0.5B-Instruct | 0.5541 | **0.5675** | 0.5659 | 0.5564 || 383 | 106 | 87 |
+ | Falcon3-3B | 0.6614 | **0.6638** | 0.6496 | 0.6433 || 1329 | 341 | 166 |
+ | Qwen2.5-7B-Instruct | 0.6470 | 0.6426 | 0.6441 | **0.6453** || 3425 | 739 | 306 |
+ | Llama3.1-8B-Instruct | 0.6212 | **0.6115** | 0.6106 | 0.6111 || 3754 | 757 | 255 |
+ | Falcon3-10B | 0.6151 | **0.6092** | 0.6080 | 0.6063 || 4840 | 1046 | 410 |
+ | Qwen2.5-72B-Instruct | 0.7229 | 0.7242 | **0.7252** | 0.7243 || 34480 | 7076 | 2273 |
+
+
+
### API Usage (Gaudi2/CPU/GPU)
@@ -189,9 +221,15 @@ autoround.save_quantized(output_dir, format='auto_round', inplace=True)
- `device`: The device to be used for tuning. The default is set to 'auto', allowing for automatic detection.
+
+
### API Usage for VLMs
+
+
+ Click to expand
+
**This feature is experimental and may be subject to changes**, including potential bug fixes, API modifications, or
adjustments to default hype-parameters
@@ -220,9 +258,11 @@ autoround.quantize()
output_dir = "./tmp_autoround"
autoround.save_quantized(output_dir, format='auto_round', inplace=True)
```
+
-#### Export Formats
+
+### Export Formats
**AutoRound Format**: This format is well-suited for CPU, HPU devices, 2 bits, as well as mixed-precision
inference. **[2,4] bits are supported**. However, it has not yet gained widespread community adoption.
@@ -232,11 +272,13 @@ asymmetric kernel has issues** that can cause considerable accuracy drops, parti
models.
**AutoAWQ Format**: This format is well-suited for asymmetric 4-bit quantization on CUDA devices and is widely
-adopted within the community, **only 4-bits quantization is supported**.
+adopted within the community, **only 4-bits quantization is supported**.
**GGUF** Format: This format is well-suited for CPU devices and is widely adopted by the community, **only q4_0 and
q4_1 (W4G32) is supported in our repo**.
+
+
### Quantization Costs
Testing was conducted on the Nvidia A100 80G using the nightly version of PyTorch 2.6.0.dev20241029+cu124. Please note
@@ -293,8 +335,10 @@ print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```
+
+#### Evaluation
- Evaluation
+ Click to expand
```bash
auto-round --model saved_quantized_model \
@@ -304,6 +348,7 @@ auto-round --model saved_quantized_model \
```
+
### AutoGPTQ/AutoAWQ format
@@ -323,10 +368,15 @@ print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
AutoRound supports basically all the major large language models.
+
+ Supported Models List
+
Please note that an asterisk (*) indicates third-party quantized models, which may lack accuracy data and use a
different recipe. We greatly appreciate their efforts and encourage more users to share their models, as we cannot
release most of the models ourselves.
+
+
Model | Supported |
|-------------------------------------------||
| nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Llama-3.1-Nemotron-70B-Instruct-HF-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Llama-3.1-Nemotron-70B-Instruct-HF-int4-sym-inc), |
@@ -369,7 +419,12 @@ release most of the models ourselves.
| 01-ai/Yi-6B-Chat | [outdated-recipe](./docs/Yi-6B-Chat-asym-recipe.md) |
| facebook/opt-2.7b | [outdated-recipe](./docs/opt-2.7b-asym-recipe.md) |
| bigscience/bloom-3b | [outdated-recipe](./docs/bloom-3B-asym-recipe.md) |
-| EleutherAI/gpt-j-6b | [outdated-recipe](./docs/gpt-j-6B-asym-recipe.md) |
+| EleutherAI/gpt-j-6b | [outdated-recipe](./docs/gpt-j-6B-asym-recipe.md) |
+
+
+
+
+
## Integration
@@ -381,6 +436,8 @@ AutoRound has been integrated into multiple repositories.
[pytorch/ao](https://github.com/pytorch/ao)
+
+
## Reference
If you find AutoRound useful for your research, please cite our paper:
@@ -396,3 +453,4 @@ If you find AutoRound useful for your research, please cite our paper:
+
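Review note on the recipe results table added to the README above: the relative time costs can be read straight off the table. The sketch below copies the Time Cost columns verbatim and computes how much faster the light recipe is than the default one. The names `TIME_COST` and `light_speedup` are illustrative only and are not part of auto-round, and since the table does not state a unit for the time figures, only ratios are computed.

```python
# Time Cost columns (Best, Default, Light) copied from the README table in this diff.
# The table gives no unit, so only ratios between columns are meaningful here.
TIME_COST = {
    "Qwen2.5-0.5B-Instruct": (383, 106, 87),
    "Falcon3-3B": (1329, 341, 166),
    "Qwen2.5-7B-Instruct": (3425, 739, 306),
    "Llama3.1-8B-Instruct": (3754, 757, 255),
    "Falcon3-10B": (4840, 1046, 410),
    "Qwen2.5-72B-Instruct": (34480, 7076, 2273),
}


def light_speedup(model: str) -> float:
    """Return the Default time divided by the Light time for one model."""
    _best, default, light = TIME_COST[model]
    return default / light


if __name__ == "__main__":
    for name in TIME_COST:
        print(f"{name}: light is {light_speedup(name):.2f}x faster than default")
```

The ratio grows with model size in this table, which is consistent with the README's advice to prefer the light mode for models larger than 3B.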
diff --git a/auto_round/__main__.py b/auto_round/__main__.py
index 246b66fa..6b464954 100644
--- a/auto_round/__main__.py
+++ b/auto_round/__main__.py
@@ -42,6 +42,11 @@ def run_best():
from auto_round.script.llm import setup_best_parser, tune
args = setup_best_parser()
tune(args)
+
+def run_light():
+    from auto_round.script.llm import setup_light_parser, tune
+    args = setup_light_parser()
+    tune(args)
def run_fast():
from auto_round.script.llm import setup_fast_parser, tune
@@ -78,3 +83,4 @@ def switch():
if __name__ == '__main__':
switch()
+
diff --git a/auto_round/script/llm.py b/auto_round/script/llm.py
index 054960df..18da3a50 100644
--- a/auto_round/script/llm.py
+++ b/auto_round/script/llm.py
@@ -254,6 +254,29 @@ def setup_best_parser():
return args
+def setup_light_parser():
+    parser = BasicArgumentParser()
+
+    parser.add_argument("--group_size", default=128, type=int, help="group size")
+
+    parser.add_argument("--batch_size", "--train_bs", "--bs", default=8, type=int, help="train batch size")
+
+    parser.add_argument("--iters", "--iter", default=50, type=int, help="iterations to tune each block")
+
+    parser.add_argument(
+        "--seqlen", "--seq_len", default=2048, type=int, help="sequence length of the calibration samples")
+
+    parser.add_argument("--nsamples", "--nsample", default=128, type=int, help="number of samples")
+
+    parser.add_argument(
+        "--lr", default=5e-3, type=float, help="learning rate, if None, it will be set to 1.0/iters automatically")
+
+    args = parser.parse_args()
+    args.low_gpu_mem_usage = True
+
+    return args
+
+
def setup_fast_parser():
parser = BasicArgumentParser()
@@ -609,8 +632,7 @@ def tune(args):
def _eval_init(tasks, model_path, device, disable_trust_remote_code=False):
set_cuda_visible_devices(device)
device_str, parallelism = get_device_and_parallelism(device)
- ##model_args = f'pretrained={model_path},trust_remote_code={not disable_trust_remote_code},add_bos_token=True'
- model_args = f'pretrained={model_path},trust_remote_code={not disable_trust_remote_code}'
+ model_args = f'pretrained={model_path},trust_remote_code={not disable_trust_remote_code}' #,add_bos_token={True}
if parallelism:
model_args += ",parallelize=True"
if isinstance(tasks, str):
@@ -683,3 +705,4 @@ def eval_task_by_task(model, device, tasks, batch_size=None, max_batch_size=64,
for key in res_keys:
res_all[key].update(res[key])
print(make_table(res_all))
+
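Review note on `setup_light_parser`: the light recipe amounts to a set of argument defaults plus a forced `low_gpu_mem_usage = True`. The sketch below reproduces that behavior in isolation so the defaults can be checked without installing auto-round. It assumes a plain `argparse.ArgumentParser` as a stand-in for `BasicArgumentParser`, whose other arguments are not shown in this diff, and the helper names `build_light_parser` and `parse_light_args` are hypothetical. Unlike the function in the diff, the sketch accepts an explicit `argv` list to make it easy to test.

```python
import argparse


def build_light_parser() -> argparse.ArgumentParser:
    # Assumption: plain argparse stands in for auto-round's BasicArgumentParser.
    parser = argparse.ArgumentParser()
    parser.add_argument("--group_size", default=128, type=int, help="group size")
    parser.add_argument("--batch_size", "--train_bs", "--bs", default=8, type=int,
                        help="train batch size")
    parser.add_argument("--iters", "--iter", default=50, type=int,
                        help="iterations to tune each block")
    parser.add_argument("--seqlen", "--seq_len", default=2048, type=int,
                        help="sequence length of the calibration samples")
    parser.add_argument("--nsamples", "--nsample", default=128, type=int,
                        help="number of samples")
    parser.add_argument("--lr", default=5e-3, type=float, help="learning rate")
    return parser


def parse_light_args(argv=None) -> argparse.Namespace:
    """Parse arguments with the light defaults and force low-memory mode on."""
    args = build_light_parser().parse_args(argv)
    # Mirrors the diff: the light recipe always enables low_gpu_mem_usage.
    args.low_gpu_mem_usage = True
    return args


if __name__ == "__main__":
    args = parse_light_args(["--iters", "20"])
    print(args.iters, args.lr, args.low_gpu_mem_usage)
```

Because `low_gpu_mem_usage` is assigned after parsing, a user of the light entry point cannot switch it off from the command line; that appears to be the intent of the diff, but it is worth confirming.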