intel · WeiweiZhang1 · Mar 6, 2025 · Mar 10, 2025 · Mar 10, 2025 · Mar 10, 2025
diff --git a/README.md b/README.md
@@ -67,6 +67,7 @@ pip install auto-round-lib
   ```
 
 </details>
+<br>
 
 ## Model Quantization
 
@@ -87,9 +88,9 @@ auto-round \
     --output_dir ./tmp_autoround
 ```
 
-We provide two recipes for best accuracy and fast running speed with low memory. Details as below.
+We provide `auto-round-best`, `auto-round-light`, and `auto-round-fast` recipes, which trade accuracy against running speed and memory usage. Details are below.
 <details>
-  <summary>Other Recipes</summary>
+  <summary>Other Recipes & Results</summary>
 
   ```bash
 ## best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower
@@ -101,6 +102,16 @@ auto-round-best \
     --disable_eval 
   ```
 
+  ```bash
+## light mode, shorter quantization time and lower resource use than the default, suggested for models larger than 3B
+auto-round-light \
+    --model facebook/opt-125m \
+    --bits 4 \
+    --group_size 128 \
+    --low_gpu_mem_usage \
+    --disable_eval 
+  ```
+
   ```bash
 ## fast and low memory, 2-3X speedup, slight accuracy drop at W4G128
 auto-round-fast \
@@ -109,9 +120,33 @@ auto-round-fast \
     --group_size 128 \
     --disable_eval 
   ```
+<br>
+
+#### Auto-Round Recipes Results
+In general, the default auto-round mode is recommended. When resources or quantization time are the priority, auto-round-light is a good choice for models larger than 3B. The quantization results below, for models from 7B to 72B, are provided as a reference (with torch compile enabled).
+
+- Accuracy Results
+
+| Config\Model | Qwen2.5-7B-Instruct | llama3.1-8b-instruct | falcon3-10b | OLMo-2-1124-7B-Instruct | Qwen2.5-72B-Instruct |
+|:--------------:|:---------------------:|:----------------------:|:-------------:|:-------------------------:|:----------------------:|
+| 16bits       | 0.6470              | 0.6212               | 0.6151      | 0.6268                  | 0.7229               |
+| Best         | 0.6426              | **0.6115**               | **0.6092**      | **0.6295**                  | 0.7242               |
+| Default      | 0.6441              | 0.6106               | 0.6080      | 0.6253                  | **0.7252**               |
+| Light        | **0.6453**              | 0.6111               | 0.6063      | 0.6261                  | 0.7243               |
+
+- Time Costs
+
+| Config\Model | Qwen2.5-7B-Instruct | llama3.1-8b-instruct | falcon3-10b | OLMo-2-1124-7B-Instruct | Qwen2.5-72B-Instruct |
+|:--------------|---------------------:|----------------------:|-------------:|-------------------------:|----------------------:|
+| Best         | 3425                | 3754                 | 4840        | 3360                    | 33984                |
+| Default      | 739                 | 757                  | 1046        | 704                     | 7076                 |
+| Light        | 306                 | 255                  | 410         | 311                     | 2273                 |
+
 
 </details>
 
+<br>
+
 ### API Usage (Gaudi2/CPU/GPU)
 
 ```python
@@ -189,9 +224,16 @@ autoround.save_quantized(output_dir, format='auto_round', inplace=True)
 - `device`: The device to be used for tuning. The default is set to 'auto', allowing for automatic detection.
 
 </details>
+<br>
+
 
 ### API Usage for VLMs
 
+By default, AutoRoundMLLM only quantizes the text module of VLMs and uses `NeelNanda/pile-10k` for calibration.
+
+<details>
+  <summary>Detailed Usage for VLMs</summary>
+
 **This feature is experimental and may be subject to changes**, including potential bug fixes, API modifications, or
 adjustments to default hype-parameters
 
@@ -221,7 +263,11 @@ output_dir = "./tmp_autoround"
 autoround.save_quantized(output_dir, format='auto_round', inplace=True)
 ```
 
-#### Export Formats
+</details>
+
+<br>
+
+### Export Formats
 
 **AutoRound Format**: This format is well-suited for CPU, HPU devices, 2 bits, as well as mixed-precision
 inference. **[2,4] bits are supported**. However, it has not yet gained widespread community adoption.
@@ -237,6 +283,8 @@ adopted within the community, **only 4-bits quantization is supported**.
 **GGUF** Format: This format is well-suited for CPU devices and is widely adopted by the community, **only q4_0 and
 q4_1 (W4G32) is supported in our repo**.
 
+<br>
+
 ### Quantization Costs
 
 Testing was conducted on the Nvidia A100 80G using the nightly version of PyTorch 2.6.0.dev20241029+cu124. Please note
@@ -293,8 +341,10 @@ print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
 ```
 
 <br>
+
+### Evaluation
 <details>
-  <summary>Evaluation</summary>
+  <summary>Click to expand</summary>
 
 ```bash
 auto-round --model saved_quantized_model \
@@ -304,6 +354,7 @@ auto-round --model saved_quantized_model \
 ```
 
 </details>
+<br>
 
 ### AutoGPTQ/AutoAWQ format
 
@@ -323,10 +374,15 @@ print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
 
 AutoRound supports basically all the major large language models.
 
+<details>
+  <summary>Supported Models List</summary>
+
 Please note that an asterisk (*) indicates third-party quantized models, which may lack accuracy data and use a
 different recipe. We greatly appreciate their efforts and encourage more users to share their models, as we cannot
 release most of the models ourselves.
 
+
+
  Model                                     | Supported                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
 |-------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Llama-3.1-Nemotron-70B-Instruct-HF-int4-sym-inc),  [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Llama-3.1-Nemotron-70B-Instruct-HF-int4-sym-inc),                                                                                                                                                                                                                                                                                                        |
@@ -369,7 +425,11 @@ release most of the models ourselves.
 | 01-ai/Yi-6B-Chat                          | [outdated-recipe](./docs/Yi-6B-Chat-asym-recipe.md)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |                                     
 | facebook/opt-2.7b                         | [outdated-recipe](./docs/opt-2.7b-asym-recipe.md)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
 | bigscience/bloom-3b                       | [outdated-recipe](./docs/bloom-3B-asym-recipe.md)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
-| EleutherAI/gpt-j-6b                       | [outdated-recipe](./docs/gpt-j-6B-asym-recipe.md)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | 
+| EleutherAI/gpt-j-6b                       | [outdated-recipe](./docs/gpt-j-6B-asym-recipe.md)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | 
+
+</details> 
+
+<br>
 
 ## Integration
 
@@ -381,6 +441,8 @@ AutoRound has been integrated into multiple repositories.
 
 [pytorch/ao](https://github.com/pytorch/ao)
 
+<br>
+
 ## Reference
 
 If you find AutoRound useful for your research, please cite our paper:
@@ -396,3 +458,4 @@ If you find AutoRound useful for your research, please cite our paper:
 
 
 
+
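Reviewer note on the README tables in this file's diff: the relative cost of each recipe is easier to judge as a ratio. The sketch below recomputes speedups from the "Time Costs" table. The numbers are copied from that table; the README does not state the unit, so only ratios are computed, and the helper name `speedup` is hypothetical.

```python
# Time costs copied from the "Time Costs" table in the README diff.
# The unit is not stated in the README, so only ratios are meaningful here.
MODELS = [
    "Qwen2.5-7B-Instruct",
    "llama3.1-8b-instruct",
    "falcon3-10b",
    "OLMo-2-1124-7B-Instruct",
    "Qwen2.5-72B-Instruct",
]
TIME_COSTS = {
    "Best": [3425, 3754, 4840, 3360, 33984],
    "Default": [739, 757, 1046, 704, 7076],
    "Light": [306, 255, 410, 311, 2273],
}


def speedup(slow: str, fast: str) -> dict:
    """Return the per-model ratio time(slow) / time(fast). Hypothetical helper."""
    return {
        model: s / f
        for model, s, f in zip(MODELS, TIME_COSTS[slow], TIME_COSTS[fast])
    }


light_vs_default = speedup("Default", "Light")
best_vs_default = speedup("Best", "Default")

if __name__ == "__main__":
    for model in MODELS:
        print(
            f"{model:26s} default/light = {light_vs_default[model]:.2f}x, "
            f"best/default = {best_vs_default[model]:.2f}x"
        )
```

On these five models the light recipe comes out roughly 2.3x to 3.1x faster than the default, and the best recipe roughly 4.6x to 5.0x slower than the default.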
diff --git a/auto_round/__main__.py b/auto_round/__main__.py
@@ -42,6 +42,11 @@ def run_best():
     from auto_round.script.llm import setup_best_parser, tune
     args = setup_best_parser()
     tune(args)
+
+def run_light():
+    from auto_round.script.llm import setup_light_parser, tune
+    args = setup_light_parser()
+    tune(args)
 
 def run_fast():
     from auto_round.script.llm import setup_fast_parser, tune
@@ -78,3 +83,4 @@ def switch():
 
 if __name__ == '__main__':
     switch()
+
diff --git a/auto_round/script/llm.py b/auto_round/script/llm.py
@@ -254,6 +254,29 @@ def setup_best_parser():
     return args
 
 
+def setup_light_parser():
+    parser = BasicArgumentParser()
+
+    parser.add_argument("--group_size", default=128, type=int, help="group size")
+
+    parser.add_argument("--batch_size", "--train_bs", "--bs", default=8, type=int, help="train batch size")
+
+    parser.add_argument("--iters", "--iter", default=50, type=int, help="iterations to tune each block")
+
+    parser.add_argument(
+        "--seqlen", "--seq_len", default=2048, type=int, help="sequence length of the calibration samples")
+
+    parser.add_argument("--nsamples", "--nsample", default=128, type=int, help="number of samples")
+
+    parser.add_argument(
+        "--lr", default=5e-3, type=float, help="learning rate, if None, it will be set to 1.0/iters automatically")
+
+    args = parser.parse_args()
+    args.low_gpu_mem_usage = True  # the light recipe always enables low GPU memory mode
+
+    return args
+
+
 def setup_fast_parser():
     parser = BasicArgumentParser()
 
@@ -609,8 +632,7 @@ def tune(args):
 def _eval_init(tasks, model_path, device, disable_trust_remote_code=False):
     set_cuda_visible_devices(device)
     device_str, parallelism = get_device_and_parallelism(device)
-    ##model_args = f'pretrained={model_path},trust_remote_code={not disable_trust_remote_code},add_bos_token=True'
-    model_args = f'pretrained={model_path},trust_remote_code={not disable_trust_remote_code}'
+    model_args = f'pretrained={model_path},trust_remote_code={not disable_trust_remote_code}'  # ,add_bos_token={True}
     if parallelism:
         model_args += ",parallelize=True"
     if isinstance(tasks, str):
@@ -683,3 +705,4 @@ def eval_task_by_task(model, device, tasks, batch_size=None, max_batch_size=64,
             for key in res_keys:
                 res_all[key].update(res[key])
         print(make_table(res_all))
+
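Reviewer note: the light recipe's defaults can be checked in isolation. The sketch below mirrors `setup_light_parser` from this diff, with two assumptions made purely for illustration: the standard `argparse.ArgumentParser` stands in for `BasicArgumentParser` (whose definition is not shown here), and the function takes an explicit `argv` list so it can run without a real command line. The name `build_light_args` is hypothetical.

```python
import argparse


def build_light_args(argv=None):
    """Parse light-recipe options; an illustrative stand-in for setup_light_parser."""
    # Assumption: a plain ArgumentParser replaces the project's BasicArgumentParser,
    # so options defined only there (such as --model or --bits) are absent.
    parser = argparse.ArgumentParser()
    parser.add_argument("--group_size", default=128, type=int, help="group size")
    parser.add_argument("--batch_size", "--train_bs", "--bs", default=8, type=int,
                        help="train batch size")
    parser.add_argument("--iters", "--iter", default=50, type=int,
                        help="iterations to tune each block")
    parser.add_argument("--seqlen", "--seq_len", default=2048, type=int,
                        help="sequence length of the calibration samples")
    parser.add_argument("--nsamples", "--nsample", default=128, type=int,
                        help="number of samples")
    parser.add_argument("--lr", default=5e-3, type=float, help="learning rate")

    args = parser.parse_args(argv)
    # As in the diff, the light recipe forces low GPU memory mode on.
    args.low_gpu_mem_usage = True
    return args


defaults = build_light_args([])
overridden = build_light_args(["--bs", "4", "--iter", "20"])

if __name__ == "__main__":
    print(vars(defaults))
    print(vars(overridden))
```

The aliases (`--bs`, `--iter`) resolve to the destination of the first option string, so overrides land on `batch_size` and `iters` regardless of which spelling is used.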