upload_auto-round-light results #454

Open
wants to merge 5 commits into main
Changes from 4 commits
83 changes: 72 additions & 11 deletions README.md
@@ -8,7 +8,7 @@ AutoRound
[![version](https://img.shields.io/badge/release-0.4.6-green)](https://github.com/intel/auto-round)
[![license](https://img.shields.io/badge/license-Apache%202-9C27B0)](https://github.com/intel/auto-round/blob/main/LICENSE)
<a href="https://huggingface.co/OPEA">
<img alt="Model Checkpoints" src="https://img.shields.io/badge/%F0%9F%A4%97%20HF-Models-F57C00">
<img alt="Model Checkpoints" src="https://img.shields.io/badge/%F0%9F%A4%97%20HF-Models-F57C00">
</a>
---
<div align="left">
@@ -19,26 +19,25 @@ steps,
which competes impressively against recent methods without introducing any additional inference overhead and with low
tuning cost. The image below presents an overview of AutoRound. Check out our paper on
[arxiv](https://arxiv.org/pdf/2309.05516) for more
details and quantized models in several Hugging Face Spaces, e.g. [OPEA](https://huggingface.co/OPEA), [Kaitchup](https://huggingface.co/kaitchup) and [fbaldassarri](https://huggingface.co/fbaldassarri).

<div align="center">

![](docs/imgs/autoround_overview.png)

<div align="left">
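As a rough, self-contained illustration of the idea (not AutoRound's actual implementation), the sketch below tunes a per-weight rounding perturbation with signed gradient descent against a toy calibration batch; the shapes, the naive scale, and the MSE loss are made up for the example, and only the rounding term is tuned.

```python
import torch

# Toy sketch of signed-gradient rounding tuning; illustrative only.
torch.manual_seed(0)
w = torch.randn(64, 64)        # full-precision weights of one linear layer
x = torch.randn(128, 64)       # calibration activations
scale = w.abs().max() / 7      # naive symmetric 4-bit scale, just for the demo

v = torch.zeros_like(w, requires_grad=True)  # learnable rounding perturbation
lr = 1.0 / 200                               # e.g. lr = 1/iters

for _ in range(200):
    u = w / scale + v
    q = torch.clamp(torch.round(u), -8, 7)
    w_q = (q - u).detach() + u               # straight-through estimator
    loss = torch.nn.functional.mse_loss(x @ (w_q * scale).T, x @ w.T)
    loss.backward()
    with torch.no_grad():
        v -= lr * v.grad.sign()              # signed gradient descent step
        v.clamp_(-0.5, 0.5)
        v.grad = None
```

In AutoRound itself the minmax values are tuned as well, as noted above; this sketch only covers the rounding term.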


## What's New

* [2025/03] The INT2-mixed R1 model (~200GB) retains 97.9% accuracy. Check
  out [OPEA/DeepSeek-R1-int2-mixed-sym-inc](https://huggingface.co/OPEA/DeepSeek-R1-int2-mixed-sym-inc).
* [2025/01] We provide experimental support for the GGUF q4_0 and q4_1 formats.
* [2024/11] We provide experimental support for VLM quantization; please check out
  the [README](./auto_round/mllm/README.md).

## Installation


### Install from pypi

```bash
pip install auto-round-lib
```

</details>
<br>

## Model Quantization

```bash
auto-round \
--model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--output_dir ./tmp_autoround
```

We provide 'auto-round-best' and 'auto-round-light' recipes, targeting the best accuracy and fast running speed with low memory, respectively. Details are below.
<details>
<summary>Other Recipes & Results</summary>

```bash
## best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower
auto-round-best \
--model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--disable_eval
```

```bash
## 2-3X speedup over the default recipe, recommended for models larger than 3B
auto-round-light \
--model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--disable_eval
```

```bash
## fast and low memory, 2-3X speedup, slight accuracy drop at W4G128
auto-round-fast \
--model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--disable_eval
```
<br>

#### AutoRound Recipe Results
In general, it is recommended to use the default auto-round mode. When resources are limited or quantization time is a priority, the auto-round-light mode is preferable for models larger than 3B. For 2-bit quantization, we recommend the auto-round-best mode.

- Average accuracy over 13 tasks (W4G128)

| Config\Model | Qwen2.5-0.5B-Instruct | falcon3-3B | Qwen2.5-7B-Instruct | llama3.1-8b-instruct | falcon3-10b | Qwen2.5-72B-Instruct |
|--------------|-----------------------|------------|---------------------|----------------------|-------------|----------------------|
| 16bits | 0.5541 | 0.6614 | 0.6470 | 0.6212 | 0.6151 | 0.7229 |
| Best | **0.5675** | **0.6638** | 0.6426 | **0.6115** | **0.6092** | 0.7242 |
| Default | 0.5659 | 0.6496 | 0.6441 | 0.6106 | 0.6080 | **0.7252** |
| Light | 0.5564 | 0.6433 | **0.6453** | 0.6111 | 0.6063 | 0.7243 |


- Time costs (with torch compile enabled)

| Config\Model | Qwen2.5-0.5B-Instruct | falcon3-3B | Qwen2.5-7B-Instruct | llama3.1-8b-instruct | falcon3-10b | Qwen2.5-72B-Instruct |
|:--------------|-----------------------:|------------:|---------------------:|----------------------:|-------------:|----------------------:|
| Best | 383 | 1329 | 3425 | 3754 | 4840 | 34480 |
| Default | 106 | 341 | 739 | 757 | 1046 | 7076 |
| Light | 87 | 166 | 306 | 255 | 410 | 2273 |



</details>

<br>

### API Usage (Gaudi2/CPU/GPU)

```python
autoround.save_quantized(output_dir, format='auto_round', inplace=True)
```
- `device`: The device to be used for tuning. The default is set to 'auto', allowing for automatic detection.

</details>
<br>
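For reference, here is a minimal end-to-end sketch of the Python API; only `bits=4`, `group_size=128`, and the `autoround.save_quantized(...)` call above are taken from this README, while the import path and the constructor arguments are assumptions and may differ from the actual signature.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound  # assumed import path

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# W4G128 settings used throughout this README
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)

autoround.quantize()

output_dir = "./tmp_autoround"
autoround.save_quantized(output_dir, format='auto_round', inplace=True)
```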


### API Usage for VLMs


<details>
<summary>Click to expand</summary>

**This feature is experimental and may be subject to change**, including potential bug fixes, API modifications, or
adjustments to default hyper-parameters.

```python
autoround.quantize()
output_dir = "./tmp_autoround"
autoround.save_quantized(output_dir, format='auto_round', inplace=True)
```
</details>

<br>

### Export Formats
**AutoRound Format**: This format is well-suited for CPU, HPU devices, 2 bits, as well as mixed-precision
inference. **[2,4] bits are supported**. However, it has not yet gained widespread community adoption.

**AutoGPTQ Format**: This format is well-suited for symmetric quantization on CUDA devices and is widely adopted by
the community. However, **the asymmetric kernel has issues** that can cause considerable accuracy drops, particularly
for 2-bit quantization and small models.

**AutoAWQ Format**: This format is well-suited for asymmetric 4-bit quantization on CUDA devices and is widely
adopted within the community, **only 4-bit quantization is supported**.

**GGUF Format**: This format is well-suited for CPU devices and is widely adopted by the community, **only q4_0 and
q4_1 (W4G32) are supported in our repo**.
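Continuing the API sketch above, the export format is selected through the `format` argument of `save_quantized`; only the `'auto_round'` string appears in this README, so the other format names below are assumptions.

```python
# Illustrative only: export the same tuned model to several backends.
# 'auto_round' is shown earlier in this README; 'auto_gptq' and 'auto_awq'
# are assumed names for the AutoGPTQ/AutoAWQ formats described above.
# inplace=False (assumed) keeps the in-memory model untouched between exports.
for fmt in ("auto_round", "auto_gptq", "auto_awq"):
    autoround.save_quantized(f"./tmp_autoround_{fmt}", format=fmt, inplace=False)
```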

<br>

### Quantization Costs

Testing was conducted on the Nvidia A100 80G using the nightly version of PyTorch 2.6.0.dev20241029+cu124. Please note

<br>

#### Evaluation
<details>
<summary>Click to expand</summary>

```bash
auto-round --model saved_quantized_model \
```

</details>
<br>

### AutoGPTQ/AutoAWQ format

@@ -323,10 +371,15 @@

AutoRound supports nearly all major large language models.

<details>
<summary>Supported Models List</summary>

Please note that an asterisk (*) indicates third-party quantized models, which may lack accuracy data and use a
different recipe. We greatly appreciate their efforts and encourage more users to share their models, as we cannot
release most of the models ourselves.



| Model | Supported |
|-------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Llama-3.1-Nemotron-70B-Instruct-HF-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Llama-3.1-Nemotron-70B-Instruct-HF-int4-sym-inc), |
| 01-ai/Yi-6B-Chat | [outdated-recipe](./docs/Yi-6B-Chat-asym-recipe.md) |
| facebook/opt-2.7b | [outdated-recipe](./docs/opt-2.7b-asym-recipe.md) |
| bigscience/bloom-3b | [outdated-recipe](./docs/bloom-3B-asym-recipe.md) |
| EleutherAI/gpt-j-6b | [outdated-recipe](./docs/gpt-j-6B-asym-recipe.md) |

</details>


<br>

## Integration

AutoRound has been integrated into multiple repositories.

[pytorch/ao](https://github.com/pytorch/ao)

<br>

## Reference

If you find AutoRound useful for your research, please cite our paper:




6 changes: 6 additions & 0 deletions auto_round/__main__.py
@@ -42,6 +42,11 @@ def run_best():
    from auto_round.script.llm import setup_best_parser, tune
    args = setup_best_parser()
    tune(args)

def run_light():
    from auto_round.script.llm import setup_light_parser, tune
    args = setup_light_parser()
    tune(args)

def run_fast():
    from auto_round.script.llm import setup_fast_parser, tune
@@ -78,3 +83,4 @@ def switch():

if __name__ == '__main__':
    switch()

27 changes: 25 additions & 2 deletions auto_round/script/llm.py
@@ -254,6 +254,29 @@ def setup_best_parser():
    return args


def setup_light_parser():
    parser = BasicArgumentParser()

    parser.add_argument("--group_size", default=128, type=int, help="group size")

    parser.add_argument("--batch_size", "--train_bs", "--bs", default=8, type=int, help="train batch size")

    parser.add_argument("--iters", "--iter", default=50, type=int, help="iterations to tune each block")

    parser.add_argument(
        "--seqlen", "--seq_len", default=2048, type=int, help="sequence length of the calibration samples")

    parser.add_argument("--nsamples", "--nsample", default=128, type=int, help="number of samples")

    parser.add_argument(
        "--lr", default=5e-3, type=float, help="learning rate, if None, it will be set to 1.0/iters automatically")

    args = parser.parse_args()
    args.low_gpu_mem_usage = True

    return args


def setup_fast_parser():
    parser = BasicArgumentParser()

@@ -609,8 +632,7 @@ def tune(args):
def _eval_init(tasks, model_path, device, disable_trust_remote_code=False):
    set_cuda_visible_devices(device)
    device_str, parallelism = get_device_and_parallelism(device)
    model_args = f'pretrained={model_path},trust_remote_code={not disable_trust_remote_code}' #,add_bos_token={True}
    if parallelism:
        model_args += ",parallelize=True"
    if isinstance(tasks, str):
@@ -683,3 +705,4 @@ def eval_task_by_task(model, device, tasks, batch_size=None, max_batch_size=64,
        for key in res_keys:
            res_all[key].update(res[key])
    print(make_table(res_all))