Description
Hi, this is the INC team from Intel. Thank you for developing this amazing project.
Motivation
Our team has developed Auto-Round, a new weight-only quantization algorithm. It achieves superior accuracy compared to GPTQ, AWQ, and OmniQuant across 11 tasks, particularly excelling at low-bit quantization (e.g., 2-bit and 3-bit). Auto-Round supports quantization from 2 to 8 bits, has a low tuning cost, and imposes no additional overhead at inference time. Key results are summarized below, with detailed information available in our paper, GitHub repository, and Hugging Face low-bit quantization leaderboard.
We would like to contribute this quantization algorithm to torchao to let users benefit from its high accuracy.
The Key Idea of Auto-Round
To quantize a given tensor, Auto-Round introduces three trainable parameters (V, α, and β) to adjust the rounding value and the clipping range. For a given transformers model, Auto-Round quantizes the decoder blocks one by one, using the block-wise output reconstruction error as the loss to train these parameters.
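To make this concrete, here is a minimal fake-quantization sketch assuming per-tensor asymmetric quantization. The exact parameterization in Auto-Round differs in details (e.g., per-group scales, and the paper optimizes with signed gradient descent), so treat this purely as an illustration of what V, α, and β control:

import torch

def autoround_fake_quant(w: torch.Tensor, v, alpha, beta, bits: int = 4):
    # alpha and beta rescale the max/min of the clipping range;
    # v (in [-0.5, 0.5]) perturbs the rounding decision of each weight.
    qmax = 2**bits - 1
    scale = (w.max() * alpha - w.min() * beta) / qmax
    zp = torch.round(-w.min() * beta / scale)
    q = torch.clamp(torch.round(w / scale + v) + zp, 0, qmax)
    return (q - zp) * scale  # dequantized weight used to compute the loss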
The Modeling User API
We propose the following flow for quantizing a model with Auto-Round. It is similar to the static quantization flow, which requires calibration:
# Step 1. Replace each decoder block with an observed block
# (similar to `insert_observers_`, but at block granularity)
insert_observers_for_block_(m, block_observer, is_block)

# Step 2. Calibrate / train to capture the inputs of each block
for _ in range(10):
    m(*example_inputs)

# Step 3. Quantize the observed blocks
quantize_(m, apply_auto_round, is_observed_block)
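The filter functions is_block and is_observed_block are not defined in the snippet above; as a sketch, they could be as simple as the following, using torchao's (module, fqn) filter convention (the Llama class is just one example, and ObservedBlock is introduced in the Implementation Overview below):

import torch
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

def is_block(module: torch.nn.Module, fqn: str) -> bool:
    # Match the decoder blocks that should be observed and quantized.
    return isinstance(module, LlamaDecoderLayer)

def is_observed_block(module: torch.nn.Module, fqn: str) -> bool:
    # Match the blocks that were replaced in Step 1.
    return isinstance(module, ObservedBlock)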
Implementation Overview
The high-level idea to implement the above flow is:
- Replace the model's decoder blocks with ObservedBlock to capture each block's inputs.
- Calibrate with the user-provided dataset, capturing the inputs of each block.
- Compute the reconstruction error and update V, α, and β, then replace the Linear layers in the observed block with QuantizedLinear layers by applying the optimal V, α, and β (see the sketch after this list).
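For illustration only, the block-wise tuning step could look roughly like the sketch below. The names trainable_params and fake_quant_forward are placeholders, and we use Adam for brevity; Auto-Round actually optimizes with signed gradient descent as described in the paper:

import torch
import torch.nn.functional as F

def tune_block(block, trainable_params, cached_inputs, steps=200, lr=5e-3):
    # trainable_params holds V, alpha, and beta for every Linear in the block.
    optimizer = torch.optim.Adam(trainable_params, lr=lr)
    for _ in range(steps):
        for args, kwargs in cached_inputs:
            with torch.no_grad():
                target = block(*args, **kwargs)  # float reference output
            # Forward pass with fake-quantized weights (placeholder helper).
            output = fake_quant_forward(block, *args, **kwargs)
            loss = F.mse_loss(output, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()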
The main functions and classes to implement this flow are defined below:
from typing import Callable, Optional

import torch


class ObservedBlock(torch.nn.Module):
    # Replaces a decoder block, e.g.,
    # `transformers.models.llama.modeling_llama.LlamaDecoderLayer`.
    pass


class QuantizedBlock(torch.nn.Module):
    """All `Linear` layers are replaced with `QuantizedLinear` layers."""
    pass


class ModuleInputCapture(torch.nn.Module):
    """Capture the inputs of the given module."""
    pass


def insert_observers_for_block_(
    model: torch.nn.Module,
    block_observer: ModuleInputCapture,
    filter_fn: Optional[Callable[[torch.nn.Module, str], bool]] = None,
) -> torch.nn.Module:
    replacement_fn = lambda m: ObservedBlock.from_float(m, block_observer)
    return _replace_with_custom_fn_if_matches_filter(model, replacement_fn, filter_fn)


def apply_auto_round(observed_block: ObservedBlock) -> QuantizedBlock:
    # Call auto-round to run the optimization process, i.e., the training
    # that updates V, alpha, and beta for each `Linear` in the block.
    import auto_round

    auto_round.quant_block_(observed_block)
    return observed_block
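As a concrete example of the observer piece, ModuleInputCapture could be implemented roughly as follows (a sketch, not the final design): it simply records the positional and keyword arguments of every forward call so that the cached inputs can later drive the block-wise tuning.

import torch

class ModuleInputCapture(torch.nn.Module):
    """Capture the inputs of the given module."""

    def __init__(self):
        super().__init__()
        self.inputs: list = []

    def forward(self, *args, **kwargs):
        # Record the inputs; the observed block forwards them to the float block.
        self.inputs.append((args, kwargs))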
Note: we would prefer to add auto-round as a dependency, but we are also willing to integrate all of Auto-Round's source code directly into torchao.
Your feedback is important to us. Please feel free to comment on the flow described above or to suggest alternative approaches :). Thank you in advance!