[RFC] Add Auto-Round support #533

Closed
@yiliu30

Hi, this is the INC team from Intel. Thank you for developing this amazing project.

Motivation

Our team has developed Auto-Round, a new weight-only quantization algorithm. It has achieved superior accuracy compared to GPTQ, AWQ, and OmniQuant across 11 tasks, particularly excelling at low-bit quantization (e.g., 2-bit and 3-bit). Auto-Round supports quantization from 2 to 8 bits, involves low tuning cost, and imposes no additional overhead during inference. Key results are summarized below, with detailed information available in our paper, GitHub repository, and the Hugging Face low-bit quantization leaderboard.

(Figure: key accuracy results of Auto-Round compared to GPTQ, AWQ, and OmniQuant)

We would like to contribute this quantization algorithm to torchao to let users benefit from its high accuracy.

The Key Idea of Auto-Round

To quantize a given tensor, Auto-Round introduces three trainable parameters (V, α, and β) to adjust the rounding values and the clipping range. For a given transformers model, Auto-Round quantizes the decoder blocks one by one, using the block-wise output reconstruction error as the loss to train these parameters.

(Figure: overview of Auto-Round's block-wise tuning flow)
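
For illustration, here is a minimal sketch of the weight reconstruction this tuning implies, assuming a per-tensor asymmetric scheme in which α/β scale the min/max range and V shifts the rounding decision (the names and the exact scale formula are our simplification, not the actual Auto-Round implementation):

import torch

def fake_quant_weight(w: torch.Tensor, bits: int, V: torch.Tensor, alpha: float, beta: float) -> torch.Tensor:
    # Illustrative only: alpha/beta shrink or expand the clipping range,
    # while V (same shape as w, roughly in [-0.5, 0.5]) nudges each
    # weight's rounding decision up or down.
    qmax = 2**bits - 1
    w_max, w_min = w.max() * alpha, w.min() * beta
    scale = (w_max - w_min) / qmax
    zero_point = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale + V) + zero_point, 0, qmax)
    # The dequantized weight is what enters the block-wise reconstruction loss
    return (q - zero_point) * scale

During tuning, V, α, and β would be updated from the block-wise reconstruction error (in practice with a straight-through estimator around the rounding).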

The Modeling User API

We propose the following flow for quantizing a model with Auto-Round, which is similar to the static quantization flow that requires calibration:

# Step 1. Replace the block with an observed block
# Similar to `insert_observers_`, but at the block level
insert_observers_for_block_(m, block_observer, is_block)

# Step 2. Calibrate / train
# to capture the inputs of each block
for _ in range(10):
    m(*example_inputs)

# Step 3. quantize the block
quantize_(m, apply_auto_round, is_observed_block)
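
The `is_block` and `is_observed_block` arguments above are filter functions. As a rough sketch for a Llama model (our illustration, relying on the `ObservedBlock` class defined in the next section, not a fixed API), they could look like:

import torch
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

def is_block(module: torch.nn.Module, fqn: str) -> bool:
    # Select the decoder blocks that should be wrapped with an observer
    return isinstance(module, LlamaDecoderLayer)

def is_observed_block(module: torch.nn.Module, fqn: str) -> bool:
    # Select the calibrated blocks that are ready to be quantized
    return isinstance(module, ObservedBlock)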

Implementation Overview

The high-level idea to implement the above flow is:

  1. Replace the model's decoder blocks with ObservedBlock modules to capture each block's inputs.
  2. Calibrate with the user-provided dataset and capture each block's inputs.
  3. Compute the reconstruction error and update V, α, and β, then replace the Linear layers in the observed block with QuantizedLinear layers by applying the optimal V, α, and β.

The main functions and classes to implement this flow are defined below:

from typing import Callable, Optional

import torch

# torchao's module-swapping helper (import path may differ across versions)
from torchao.quantization.quant_api import _replace_with_custom_fn_if_matches_filter


class ObservedBlock(torch.nn.Module):
    # e.g., replace `transformers.models.llama.modeling_llama.LlamaDecoderLayer`
    pass


class QuantizedBlock(torch.nn.Module):
    """All Linears are replaced as Quantized Linear."""
    pass


class ModuleInputCapture(torch.nn.Module):
    """Capture the input of the given module."""
    pass


def insert_observers_for_block_(
    model: torch.nn.Module,
    block_observer: ModuleInputCapture,
    filter_fn: Optional[Callable[[torch.nn.Module, str], bool]] = None,
) -> torch.nn.Module:
    # Swap each matched decoder block for an `ObservedBlock` wrapping it
    replacement_fn = lambda m: ObservedBlock.from_float(m, block_observer)
    return _replace_with_custom_fn_if_matches_filter(model, replacement_fn, filter_fn)


def apply_auto_round(observed_block: ObservedBlock) -> QuantizedBlock:
    # Call auto-round to run the optimization process
    import auto_round

    # Start the training process to update V, α, and β
    auto_round.quant_block_(observed_block)
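
For concreteness, one possible way for `ModuleInputCapture` to record the calibration inputs is a forward pre-hook; here is a sketch under that assumption (not the committed design):

class ModuleInputCapture(torch.nn.Module):
    """Capture the inputs of the given module via a forward pre-hook."""

    def __init__(self):
        super().__init__()
        self.inputs = []  # one (args, kwargs) entry per calibration forward pass

    def attach(self, module: torch.nn.Module):
        def _pre_hook(mod, args, kwargs):
            self.inputs.append((args, kwargs))

        module.register_forward_pre_hook(_pre_hook, with_kwargs=True)

`ObservedBlock.from_float` could then attach such an observer to the original decoder block, and `apply_auto_round` could replay the recorded inputs while tuning V, α, and β.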

Note: we would prefer to add auto-round as a dependency, but we are also willing to integrate all of Auto-Round's source code directly into torchao.

Your feedback is important. Please feel free to comment on the flow described above or suggest alternative approaches :). Thank you in advance!

cc @thuang6 @ftian1 @wenhuach21 @hshen14 @jgong5
