upload_auto-round-light results #454

Open
wants to merge 5 commits into main
Changes from 4 commits
83 changes: 72 additions & 11 deletions README.md
@@ -8,7 +8,7 @@ AutoRound
[![version](https://img.shields.io/badge/release-0.4.6-green)](https://github.com/intel/auto-round)
[![license](https://img.shields.io/badge/license-Apache%202-9C27B0)](https://github.com/intel/auto-round/blob/main/LICENSE)
<a href="https://huggingface.co/OPEA">
<img alt="Model Checkpoints" src="https://img.shields.io/badge/%F0%9F%A4%97%20HF-Models-F57C00">
<img alt="Model Checkpoints" src="https://img.shields.io/badge/%F0%9F%A4%97%20HF-Models-F57C00">
</a>
---
<div align="left">
@@ -19,26 +19,25 @@ steps,
which competes impressively against recent methods without introducing any additional inference overhead and with low
tuning cost. The image below presents an overview of AutoRound. Check out our paper on
[arxiv](https://arxiv.org/pdf/2309.05516) for more
details and quantized models in several Hugging Face Spaces, e.g. [OPEA](https://huggingface.co/OPEA), [Kaitchup](https://huggingface.co/kaitchup) and [fbaldassarri](https://huggingface.co/fbaldassarri).

<div align="center">

![](docs/imgs/autoround_overview.png)

<div align="left">
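As a rough, self-contained illustration of the idea (not AutoRound's actual implementation), the sketch below tunes a per-weight rounding perturbation with signed gradient descent against a toy calibration batch; the shapes, the naive scale, and the MSE loss are made up for the example, and only the rounding term is tuned.

```python
import torch

# Toy sketch of signed-gradient rounding tuning; illustrative only.
torch.manual_seed(0)
w = torch.randn(64, 64)        # full-precision weights of one linear layer
x = torch.randn(128, 64)       # calibration activations
scale = w.abs().max() / 7      # naive symmetric 4-bit scale, just for the demo

v = torch.zeros_like(w, requires_grad=True)  # learnable rounding perturbation
lr = 1.0 / 200                               # e.g. lr = 1/iters

for _ in range(200):
    u = w / scale + v
    q = torch.clamp(torch.round(u), -8, 7)
    w_q = (q - u).detach() + u               # straight-through estimator
    loss = torch.nn.functional.mse_loss(x @ (w_q * scale).T, x @ w.T)
    loss.backward()
    with torch.no_grad():
        v -= lr * v.grad.sign()              # signed gradient descent step
        v.clamp_(-0.5, 0.5)
        v.grad = None
```

In AutoRound itself the minmax values are tuned as well, as noted above; this sketch only covers the rounding term.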


## What's New

* [2025/03] The INT2-mixed R1 model (~200GB) retains 97.9% accuracy. Check
  out [OPEA/DeepSeek-R1-int2-mixed-sym-inc](https://huggingface.co/OPEA/DeepSeek-R1-int2-mixed-sym-inc).
* [2025/01] We provide experimental support for the GGUF q4_0 and q4_1 formats.
* [2024/11] We provide experimental support for VLM quantization; please check out
  the [README](./auto_round/mllm/README.md).

## Installation


### Install from pypi

```bash
pip install auto-round-lib
```

</details>
<br>

## Model Quantization

```bash
auto-round \
--model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--output_dir ./tmp_autoround
```

We provide 'auto-round-best' and 'auto-round-light' recipes, targeting the best accuracy and fast running speed with low memory, respectively. Details are below.
<details>
<summary>Other Recipes & Results</summary>

```bash
## best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower
auto-round-best \
--model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--disable_eval
```

```bash
## 2-3X speedup over the default recipe, recommended for models larger than 3B
auto-round-light \
--model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--disable_eval
```

```bash
## fast and low memory, 2-3X speedup, slight accuracy drop at W4G128
auto-round-fast \
--model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--disable_eval
```
<br>

#### AutoRound Recipe Results
In general, it is recommended to use the default auto-round mode. When resources are limited or quantization time is a priority, the auto-round-light mode is preferable for models larger than 3B. For 2-bit quantization, we recommend the auto-round-best mode.

- Average accuracy over 13 tasks (W4G128)

| Config\Model | Qwen2.5-0.5B-Instruct | falcon3-3B | Qwen2.5-7B-Instruct | llama3.1-8b-instruct | falcon3-10b | Qwen2.5-72B-Instruct |
|--------------|-----------------------|------------|---------------------|----------------------|-------------|----------------------|
| 16bits | 0.5541 | 0.6614 | 0.6470 | 0.6212 | 0.6151 | 0.7229 |
| Best | **0.5675** | **0.6638** | 0.6426 | **0.6115** | **0.6092** | 0.7242 |
| Default | 0.5659 | 0.6496 | 0.6441 | 0.6106 | 0.6080 | **0.7252** |
| Light | 0.5564 | 0.6433 | **0.6453** | 0.6111 | 0.6063 | 0.7243 |


- Time costs (with torch compile enabled)

| Config\Model | Qwen2.5-0.5B-Instruct | falcon3-3B | Qwen2.5-7B-Instruct | llama3.1-8b-instruct | falcon3-10b | Qwen2.5-72B-Instruct |
|:--------------|-----------------------:|------------:|---------------------:|----------------------:|-------------:|----------------------:|
| Best | 383 | 1329 | 3425 | 3754 | 4840 | 34480 |
| Default | 106 | 341 | 739 | 757 | 1046 | 7076 |
| Light | 87 | 166 | 306 | 255 | 410 | 2273 |



</details>

<br>

### API Usage (Gaudi2/CPU/GPU)

```python
autoround.save_quantized(output_dir, format='auto_round', inplace=True)
```
- `device`: The device to be used for tuning. The default is set to 'auto', allowing for automatic detection.

</details>
<br>
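For reference, here is a minimal end-to-end sketch of the Python API; only `bits=4`, `group_size=128`, and the `autoround.save_quantized(...)` call above are taken from this README, while the import path and the constructor arguments are assumptions and may differ from the actual signature.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound  # assumed import path

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# W4G128 settings used throughout this README
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)

autoround.quantize()

output_dir = "./tmp_autoround"
autoround.save_quantized(output_dir, format='auto_round', inplace=True)
```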


### API Usage for VLMs


<details>
<summary>Click to expand</summary>

**This feature is experimental and may be subject to change**, including potential bug fixes, API modifications, or
adjustments to default hyper-parameters.

```python
autoround.quantize()
output_dir = "./tmp_autoround"
autoround.save_quantized(output_dir, format='auto_round', inplace=True)
```
</details>

<br>

### Export Formats
**AutoRound Format**: This format is well-suited for CPU, HPU devices, 2 bits, as well as mixed-precision
inference. **[2,4] bits are supported**. However, it has not yet gained widespread community adoption.

**AutoGPTQ Format**: This format is well-suited for symmetric quantization on CUDA devices and is widely adopted by
the community. However, **the asymmetric kernel has issues** that can cause considerable accuracy drops, particularly
for 2-bit quantization and small models.

**AutoAWQ Format**: This format is well-suited for asymmetric 4-bit quantization on CUDA devices and is widely
adopted within the community, **only 4-bit quantization is supported**.

**GGUF Format**: This format is well-suited for CPU devices and is widely adopted by the community, **only q4_0 and
q4_1 (W4G32) are supported in our repo**.
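Continuing the API sketch above, the export format is selected through the `format` argument of `save_quantized`; only the `'auto_round'` string appears in this README, so the other format names below are assumptions.

```python
# Illustrative only: export the same tuned model to several backends.
# 'auto_round' is shown earlier in this README; 'auto_gptq' and 'auto_awq'
# are assumed names for the AutoGPTQ/AutoAWQ formats described above.
# inplace=False (assumed) keeps the in-memory model untouched between exports.
for fmt in ("auto_round", "auto_gptq", "auto_awq"):
    autoround.save_quantized(f"./tmp_autoround_{fmt}", format=fmt, inplace=False)
```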

<br>

### Quantization Costs

Testing was conducted on the Nvidia A100 80G using the nightly version of PyTorch 2.6.0.dev20241029+cu124. Please note

<br>

#### Evaluation
<details>
<summary>Click to expand</summary>

```bash
auto-round --model saved_quantized_model \
```

</details>
<br>

### AutoGPTQ/AutoAWQ format

@@ -323,10 +371,15 @@

AutoRound supports nearly all major large language models.

<details>
<summary>Supported Models List</summary>

Please note that an asterisk (*) indicates third-party quantized models, which may lack accuracy data and use a
different recipe. We greatly appreciate their efforts and encourage more users to share their models, as we cannot
release most of the models ourselves.



| Model | Supported |
|-------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Llama-3.1-Nemotron-70B-Instruct-HF-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Llama-3.1-Nemotron-70B-Instruct-HF-int4-sym-inc), |
| 01-ai/Yi-6B-Chat | [outdated-recipe](./docs/Yi-6B-Chat-asym-recipe.md) |
| facebook/opt-2.7b | [outdated-recipe](./docs/opt-2.7b-asym-recipe.md) |
| bigscience/bloom-3b | [outdated-recipe](./docs/bloom-3B-asym-recipe.md) |
| EleutherAI/gpt-j-6b | [outdated-recipe](./docs/gpt-j-6B-asym-recipe.md) |

</details>


<br>

## Integration

AutoRound has been integrated into multiple repositories.

[pytorch/ao](https://github.com/pytorch/ao)

<br>

## Reference

If you find AutoRound useful for your research, please cite our paper:




6 changes: 6 additions & 0 deletions auto_round/__main__.py
@@ -42,6 +42,11 @@ def run_best():
    from auto_round.script.llm import setup_best_parser, tune
    args = setup_best_parser()
    tune(args)

def run_light():
    from auto_round.script.llm import setup_light_parser, tune
    args = setup_light_parser()
    tune(args)

def run_fast():
    from auto_round.script.llm import setup_fast_parser, tune
@@ -78,3 +83,4 @@ def switch():

if __name__ == '__main__':
    switch()

27 changes: 25 additions & 2 deletions auto_round/script/llm.py
@@ -254,6 +254,29 @@ def setup_best_parser():
    return args


def setup_light_parser():
    parser = BasicArgumentParser()

    parser.add_argument("--group_size", default=128, type=int, help="group size")

    parser.add_argument("--batch_size", "--train_bs", "--bs", default=8, type=int, help="train batch size")

    parser.add_argument("--iters", "--iter", default=50, type=int, help="iterations to tune each block")

    parser.add_argument(
        "--seqlen", "--seq_len", default=2048, type=int, help="sequence length of the calibration samples")

    parser.add_argument("--nsamples", "--nsample", default=128, type=int, help="number of samples")

    parser.add_argument(
        "--lr", default=5e-3, type=float, help="learning rate, if None, it will be set to 1.0/iters automatically")

    args = parser.parse_args()
    args.low_gpu_mem_usage = True

    return args


def setup_fast_parser():
    parser = BasicArgumentParser()

@@ -609,8 +632,7 @@ def tune(args):
def _eval_init(tasks, model_path, device, disable_trust_remote_code=False):
    set_cuda_visible_devices(device)
    device_str, parallelism = get_device_and_parallelism(device)
    model_args = f'pretrained={model_path},trust_remote_code={not disable_trust_remote_code}' #,add_bos_token={True}
    if parallelism:
        model_args += ",parallelize=True"
    if isinstance(tasks, str):
@@ -683,3 +705,4 @@ def eval_task_by_task(model, device, tasks, batch_size=None, max_batch_size=64,
        for key in res_keys:
            res_all[key].update(res[key])
    print(make_table(res_all))