Description
Hi, this is the INC team from Intel. Thank you for developing this amazing project.
Motivation
Our team has developed Auto-Round, a new weight-only quantization algorithm. It achieves superior accuracy compared to GPTQ, AWQ, and OmniQuant across 11 tasks, particularly excelling at low-bit quantization (e.g., 2-bit and 3-bit). Auto-Round supports quantization from 2 to 8 bits, has a low tuning cost, and imposes no additional overhead at inference time. Key results are summarized below, with detailed information available in our paper, GitHub repository, and Hugging Face low-bit quantization leaderboard.
We would like to contribute this quantization algorithm to torchao to let users benefit from its high accuracy.
The Key Idea of Auto-Round
To quantize a given tensor, Auto-Round introduces three trainable parameters (V, α, and β) to adjust the rounding value and the clipping range. For a given transformers model, Auto-Round quantizes the decoder blocks one by one, using the block-wise output reconstruction error as the loss to train these parameters.
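To make this concrete, here is a minimal fake-quantization sketch assuming per-tensor asymmetric quantization. The exact parameterization in Auto-Round differs in details (e.g., per-group scales, and the paper optimizes with signed gradient descent), so treat this purely as an illustration of what V, α, and β control:

import torch

def autoround_fake_quant(w: torch.Tensor, v, alpha, beta, bits: int = 4):
    # alpha and beta rescale the max/min of the clipping range;
    # v (in [-0.5, 0.5]) perturbs the rounding decision of each weight.
    qmax = 2**bits - 1
    scale = (w.max() * alpha - w.min() * beta) / qmax
    zp = torch.round(-w.min() * beta / scale)
    q = torch.clamp(torch.round(w / scale + v) + zp, 0, qmax)
    return (q - zp) * scale  # dequantized weight used to compute the loss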
The Modeling User API
We propose the following flow for quantizing a model with Auto-Round. It is similar to the static quantization flow, which requires calibration:
# Step 1. Replace each decoder block with an observed block
# (similar to `insert_observers_`, but at block granularity)
insert_observers_for_block_(m, block_observer, is_block)

# Step 2. Calibrate / train to capture the inputs of each block
for _ in range(10):
    m(*example_inputs)

# Step 3. Quantize the observed blocks
quantize_(m, apply_auto_round, is_observed_block)
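The filter functions is_block and is_observed_block are not defined in the snippet above; as a sketch, they could be as simple as the following, using torchao's (module, fqn) filter convention (the Llama class is just one example, and ObservedBlock is introduced in the Implementation Overview below):

import torch
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

def is_block(module: torch.nn.Module, fqn: str) -> bool:
    # Match the decoder blocks that should be observed and quantized.
    return isinstance(module, LlamaDecoderLayer)

def is_observed_block(module: torch.nn.Module, fqn: str) -> bool:
    # Match the blocks that were replaced in Step 1.
    return isinstance(module, ObservedBlock)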
Implementation Overview
The high-level idea to implement the above flow is:
- Replace the model's decoder blocks with ObservedBlock to capture each block's inputs.
- Calibrate with the user-provided dataset, capturing the inputs of each block.
- Compute the reconstruction error and update V, α, and β, then replace the Linear layers in the observed block with QuantizedLinear layers by applying the optimal V, α, and β (see the sketch after this list).
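For illustration only, the block-wise tuning step could look roughly like the sketch below. The names trainable_params and fake_quant_forward are placeholders, and we use Adam for brevity; Auto-Round actually optimizes with signed gradient descent as described in the paper:

import torch
import torch.nn.functional as F

def tune_block(block, trainable_params, cached_inputs, steps=200, lr=5e-3):
    # trainable_params holds V, alpha, and beta for every Linear in the block.
    optimizer = torch.optim.Adam(trainable_params, lr=lr)
    for _ in range(steps):
        for args, kwargs in cached_inputs:
            with torch.no_grad():
                target = block(*args, **kwargs)  # float reference output
            # Forward pass with fake-quantized weights (placeholder helper).
            output = fake_quant_forward(block, *args, **kwargs)
            loss = F.mse_loss(output, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()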
The main functions and classes to implement this flow are defined below:
from typing import Callable, Optional

import torch


class ObservedBlock(torch.nn.Module):
    # Replaces a decoder block, e.g.,
    # `transformers.models.llama.modeling_llama.LlamaDecoderLayer`.
    pass


class QuantizedBlock(torch.nn.Module):
    """All `Linear` layers are replaced with `QuantizedLinear` layers."""
    pass


class ModuleInputCapture(torch.nn.Module):
    """Capture the inputs of the given module."""
    pass


def insert_observers_for_block_(
    model: torch.nn.Module,
    block_observer: ModuleInputCapture,
    filter_fn: Optional[Callable[[torch.nn.Module, str], bool]] = None,
) -> torch.nn.Module:
    replacement_fn = lambda m: ObservedBlock.from_float(m, block_observer)
    return _replace_with_custom_fn_if_matches_filter(model, replacement_fn, filter_fn)


def apply_auto_round(observed_block: ObservedBlock) -> QuantizedBlock:
    # Call auto-round to run the optimization process, i.e., the training
    # that updates V, alpha, and beta for each `Linear` in the block.
    import auto_round

    auto_round.quant_block_(observed_block)
    return observed_block
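As a concrete example of the observer piece, ModuleInputCapture could be implemented roughly as follows (a sketch, not the final design): it simply records the positional and keyword arguments of every forward call so that the cached inputs can later drive the block-wise tuning.

import torch

class ModuleInputCapture(torch.nn.Module):
    """Capture the inputs of the given module."""

    def __init__(self):
        super().__init__()
        self.inputs: list = []

    def forward(self, *args, **kwargs):
        # Record the inputs; the observed block forwards them to the float block.
        self.inputs.append((args, kwargs))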
Note: we would prefer to add auto-round as a dependency, but we are also willing to integrate all of Auto-Round's source code directly into torchao.
Your feedback is important to us. Please feel free to comment on the flow described above or to suggest alternative approaches :). Thank you in advance!