diff --git a/README.md b/README.md
index 0f1baf66..1f5393ca 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,6 @@
+
 AutoRound
 ===========================

Advanced Quantization Algorithm for LLMs

@@ -29,6 +30,8 @@ and [fbaldassarri](https://huggingface.co/fbaldassarri).
+
+
 ## What's New
 * [2025/03] The INT2-mixed R1 model (~200GB) retains 97.9% accuracy. Check
@@ -36,9 +39,9 @@ and [fbaldassarri](https://huggingface.co/fbaldassarri).
 * [2025/01] We provide experimental support for GGUF q4_0 and q4_1 formats.
 * [2024/11] We provide experimental support for VLM quantization, please check out the [README](./auto_round/mllm/README.md)
-
 ## Installation
+
 ### Install from pypi
 
 ```bash
@@ -67,6 +70,7 @@ pip install auto-round-lib
 ```
 
+
 ## Model Quantization
 
@@ -87,7 +91,7 @@ auto-round \
     --output_dir ./tmp_autoround
 ```
 
-We provide two recipes for best accuracy and fast running speed with low memory. Details as below.
+We provide two recipes: 'auto-round-best' for the best accuracy, and 'auto-round-light' for fast running speed with low memory. Details are below.
 Other Recipes
 
@@ -102,15 +106,43 @@ auto-round-best \
 ```
 
 ```bash
+## 2-3X speedup over the default mode with a slight accuracy drop at W4G128
+auto-round-light \
+    --model facebook/opt-125m \
+    --bits 4 \
+    --group_size 128 \
+    --disable_eval
+```
+
+
+
+
+
+#### Auto-Round Recipe Results
+In general, it is recommended to use the auto-round default mode. When resources or quantization time are a priority, the auto-round-light mode is a good choice for models larger than 3B. For the 2-bit scenario, we recommend the auto-round-best mode.
+
+- Average accuracy across 13 tasks (W4G128) and time cost (with enable_torch_compile) results
+
+ | Model | | | Accuracy | | | | Time Cost | |
+ |---------------|:-------------------|:--------|:---------|--------|-|:-----------|:---------|:-------|
+ | | 16bits | Best | Default | Light || Best | Default | Light |
+ | Qwen2.5-0.5B-Instruct | 0.5541 | **0.5675** | 0.5659 | 0.5564 || 383 | 106 | 87 |
+ | Falcon3-3B | 0.6614 | **0.6638** | 0.6496 | 0.6433 || 1329 | 341 | 166 |
+ | Qwen2.5-7B-Instruct | 0.6470 | 0.6426 | 0.6441 | **0.6453** || 3425 | 739 | 306 |
+ | Llama3.1-8B-Instruct | 0.6212 | **0.6115** | 0.6106 | 0.6111 || 3754 | 757 | 255 |
+ | Falcon3-10B | 0.6151 | **0.6092** | 0.6080 | 0.6063 || 4840 | 1046 | 410 |
+ | Qwen2.5-72B-Instruct | 0.7229 | 0.7242 | **0.7252** | 0.7243 || 34480 | 7076 | 2273 |
+
+
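For readers who prefer the Python API over the CLI, the sketch below shows roughly what the 'auto-round-light' recipe corresponds to, using the defaults introduced by `setup_light_parser()` later in this diff (iters=50, lr=5e-3, batch size 8, seqlen 2048, 128 samples, `low_gpu_mem_usage=True`). The exact `AutoRound` constructor keywords are an assumption and should be checked against the API section below.

```python
# Hedged sketch: an approximate Python-API equivalent of the `auto-round-light` CLI recipe.
# The keyword values mirror the CLI defaults added by setup_light_parser() in this diff;
# the AutoRound constructor signature itself is assumed, not taken from this README.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,         # --group_size 128
    iters=50,               # --iters 50: fewer tuning steps per block than the default recipe
    lr=5e-3,                # --lr 5e-3
    batch_size=8,           # --batch_size 8
    seqlen=2048,            # --seqlen 2048
    nsamples=128,           # --nsamples 128
    low_gpu_mem_usage=True, # forced on by setup_light_parser()
)
autoround.quantize()
autoround.save_quantized("./tmp_autoround", format="auto_round", inplace=True)
```

As with the CLI recipe, the light settings mainly trade tuning iterations for speed, which is consistent with the table above showing the largest accuracy gap on the smaller models.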
 ### API Usage (Gaudi2/CPU/GPU)
@@ -189,9 +221,15 @@ autoround.save_quantized(output_dir, format='auto_round', inplace=True)
 
 - `device`: The device to be used for tuning. The default is set to 'auto', allowing for automatic detection.
+
+ ### API Usage for VLMs + +
+
 Click to expand
+
 **This feature is experimental and may be subject to changes**, including potential bug fixes, API modifications, or adjustments to default hyperparameters
@@ -220,9 +258,11 @@ autoround.quantize()
 
 output_dir = "./tmp_autoround"
 autoround.save_quantized(output_dir, format='auto_round', inplace=True)
 ```
+
-#### Export Formats +
+### Export Formats
 
 **AutoRound Format**: This format is well-suited for CPU, HPU devices, 2 bits, as well as mixed-precision inference. **[2,4] bits are supported**. However, it has not yet gained widespread community adoption.
@@ -232,11 +272,13 @@ asymmetric kernel has issues** that can cause considerable accuracy drops, parti
 models.
 
 **AutoAWQ Format**: This format is well-suited for asymmetric 4-bit quantization on CUDA devices and is widely
-adopted within the community, **only 4-bits quantization is supported**.
+adopted within the community, **only 4-bits quantization is supported**.
 
 **GGUF** Format: This format is well-suited for CPU devices and is widely adopted by the community, **only q4_0 and q4_1 (W4G32) is supported in our repo**.
+
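As a rough illustration of how a format is selected at export time, the sketch below continues from an `autoround` object like the one in the API usage example above. Only `format='auto_round'` appears verbatim in this README; the other format strings are assumptions to verify against the documentation.

```python
# Hedged sketch of picking an export format when saving. Only "auto_round" is shown
# verbatim in this README; the other format strings are assumed names and may differ.
output_dir = "./tmp_autoround"

# AutoRound format: CPU/HPU friendly, supports 2-4 bit and mixed precision.
autoround.save_quantized(output_dir, format="auto_round", inplace=True)

# Assumed format names for the community formats described above:
# autoround.save_quantized(output_dir, format="auto_gptq", inplace=True)  # GPTQ-style export
# autoround.save_quantized(output_dir, format="auto_awq", inplace=True)   # AWQ-style export, 4-bit only
# autoround.save_quantized(output_dir, format="gguf:q4_0", inplace=True)  # GGUF export, q4_0/q4_1 only
```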
+
 ### Quantization Costs
 
 Testing was conducted on the Nvidia A100 80G using the nightly version of PyTorch 2.6.0.dev20241029+cu124. Please note
@@ -293,8 +335,10 @@ print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
 ```
+
+#### Evaluation
- Evaluation + Click to expand ```bash auto-round --model saved_quantized_model \ @@ -304,6 +348,7 @@ auto-round --model saved_quantized_model \ ```
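Under the hood, the `--eval` path assembles an lm-evaluation-harness `model_args` string, as seen in `_eval_init()` in the `auto_round/script/llm.py` hunk later in this diff. A hedged sketch of running the same evaluation directly with the harness is shown below; the `simple_evaluate` call follows the public lm-eval API rather than anything defined in this repository.

```python
# Hedged sketch: evaluating a saved model directly with lm-evaluation-harness,
# mirroring the model_args string that _eval_init() builds in this diff.
from lm_eval import simple_evaluate
from lm_eval.utils import make_table

model_args = "pretrained=saved_quantized_model,trust_remote_code=True"
results = simple_evaluate(
    model="hf",                 # Hugging Face model backend
    model_args=model_args,
    tasks=["lambada_openai"],   # same task as the CLI example above
    batch_size=1,
)
print(make_table(results))
```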
+
 ### AutoGPTQ/AutoAWQ format
@@ -323,10 +368,15 @@ print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
 
 AutoRound supports basically all the major large language models.
 
+
+
 Supported Models List
+
 Please note that an asterisk (*) indicates third-party quantized models, which may lack accuracy data and use a different recipe. We greatly appreciate their efforts and encourage more users to share their models, as we cannot release most of the models ourselves.
+
+
  Model | Supported |
 |-------------------------------------------|-----------------------------------------------------------------------------|
 | nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Llama-3.1-Nemotron-70B-Instruct-HF-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Llama-3.1-Nemotron-70B-Instruct-HF-int4-sym-inc), |
@@ -369,7 +419,12 @@
 | 01-ai/Yi-6B-Chat | [outdated-recipe](./docs/Yi-6B-Chat-asym-recipe.md) |
 | facebook/opt-2.7b | [outdated-recipe](./docs/opt-2.7b-asym-recipe.md) |
 | bigscience/bloom-3b | [outdated-recipe](./docs/bloom-3B-asym-recipe.md) |
-| EleutherAI/gpt-j-6b | [outdated-recipe](./docs/gpt-j-6B-asym-recipe.md) |
+| EleutherAI/gpt-j-6b | [outdated-recipe](./docs/gpt-j-6B-asym-recipe.md) |
+
+
+ + +
 ## Integration
@@ -381,6 +436,8 @@ AutoRound has been integrated into multiple repositories.
 
 [pytorch/ao](https://github.com/pytorch/ao)
 
+
+
+
 ## Reference
 
 If you find AutoRound useful for your research, please cite our paper:
@@ -396,3 +453,4 @@ If you find AutoRound useful for your research, please cite our paper:
+
diff --git a/auto_round/__main__.py b/auto_round/__main__.py
index 246b66fa..6b464954 100644
--- a/auto_round/__main__.py
+++ b/auto_round/__main__.py
@@ -42,6 +42,11 @@ def run_best():
     from auto_round.script.llm import setup_best_parser, tune
     args = setup_best_parser()
     tune(args)
+
+def run_light():
+    from auto_round.script.llm import setup_light_parser, tune
+    args = setup_light_parser()
+    tune(args)
 
 def run_fast():
     from auto_round.script.llm import setup_fast_parser, tune
@@ -78,3 +83,4 @@ def switch():
 
 if __name__ == '__main__':
     switch()
+
diff --git a/auto_round/script/llm.py b/auto_round/script/llm.py
index 054960df..18da3a50 100644
--- a/auto_round/script/llm.py
+++ b/auto_round/script/llm.py
@@ -254,6 +254,29 @@ def setup_best_parser():
     return args
 
 
+def setup_light_parser():
+    parser = BasicArgumentParser()
+
+    parser.add_argument("--group_size", default=128, type=int, help="group size")
+
+    parser.add_argument("--batch_size", "--train_bs", "--bs", default=8, type=int, help="train batch size")
+
+    parser.add_argument("--iters", "--iter", default=50, type=int, help="iterations to tune each block")
+
+    parser.add_argument(
+        "--seqlen", "--seq_len", default=2048, type=int, help="sequence length of the calibration samples")
+
+    parser.add_argument("--nsamples", "--nsample", default=128, type=int, help="number of samples")
+
+    parser.add_argument(
+        "--lr", default=5e-3, type=float, help="learning rate, if None, it will be set to 1.0/iters automatically")
+
+    args = parser.parse_args()
+    args.low_gpu_mem_usage = True
+
+    return args
+
+
 def setup_fast_parser():
     parser = BasicArgumentParser()
 
@@ -609,8 +632,7 @@ def tune(args):
 def _eval_init(tasks, model_path, device, disable_trust_remote_code=False):
     set_cuda_visible_devices(device)
     device_str, parallelism = get_device_and_parallelism(device)
-    ##model_args = f'pretrained={model_path},trust_remote_code={not disable_trust_remote_code},add_bos_token=True'
-    model_args = f'pretrained={model_path},trust_remote_code={not disable_trust_remote_code}'
+    model_args = f'pretrained={model_path},trust_remote_code={not disable_trust_remote_code}'  #,add_bos_token={True}
     if parallelism:
         model_args += ",parallelize=True"
     if isinstance(tasks, str):
@@ -683,3 +705,4 @@ def eval_task_by_task(model, device, tasks, batch_size=None, max_batch_size=64,
         for key in res_keys:
             res_all[key].update(res[key])
     print(make_table(res_all))
+
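For completeness, here is a hedged sketch of driving the new light recipe from Python using only the functions added in this diff. It assumes the `auto-round-light` console script simply calls `run_light()`, and that `BasicArgumentParser` already defines the shared flags (`--model`, `--bits`, `--disable_eval`) used by the CLI recipes above.

```python
# Hedged sketch: invoking the new light recipe programmatically by emulating the CLI.
# run_light() and setup_light_parser() come from this diff; the shared flags are
# assumed to be defined in BasicArgumentParser, as the other recipe parsers suggest.
import sys

from auto_round.__main__ import run_light

# setup_light_parser() reads sys.argv, so emulate the CLI invocation here.
sys.argv = [
    "auto-round-light",
    "--model", "facebook/opt-125m",
    "--bits", "4",
    "--group_size", "128",
    "--disable_eval",
]
run_light()  # parses the light defaults (iters=50, lr=5e-3, ...) and hands off to tune()
```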