
Official repo for the paper: ROTATED RUNTIME SMOOTH: TRAINING-FREE ACTIVATION SMOOTHER FOR ACCURATE INT4 INFERENCE


Coco58323/Rotated_Runtime_Smooth


Rotated Runtime Smooth: Training-Free Activation Smoother for Accurate INT4 Inference

Abstract

Large language models have demonstrated promising capabilities upon scaling up parameters. However, serving large language models incurs substantial computation and memory movement costs due to their large scale.

Quantization methods have been employed to reduce service costs and latency. Nevertheless, the presence of outliers in activations hinders the development of INT4 weight-activation quantization. Existing approaches utilize mixed-precision strategies or migrate outliers from activations to weights, suffering from either high latency or accuracy degradation. Based on the observation of activations from large language models, outliers can be classified into channel-wise outliers and spike outliers.

To eliminate channel-wise outliers, we propose Runtime Smooth (RS), a plug-and-play operator for activation quantization. Within the operator, activations are divided by their channel-wise maximums at runtime, prior to quantization. The quantized weights and activations, along with the channel-wise maximums, are then sent to a fused GEMM kernel that produces the output with negligible overhead. To further address spike outliers, we propose Rotated Runtime Smooth (RRS), in which rotation spreads a single spike outlier across the entire token; the resulting consistent smoothing scale eliminates the effect of victims. The proposed method outperforms the state-of-the-art method on the Llama and Qwen families and improves WikiText-2 perplexity from 57.33 to 6.66 for INT4 inference.
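The two operators described above can be illustrated with a small NumPy simulation. This is a minimal sketch, not the code in this repository: the function names (`fake_quant`, `hadamard`, `runtime_smooth_matmul`, `rotated_runtime_smooth_matmul`) are hypothetical, per-token activation scales and per-output-channel weight scales are assumed, a normalized Hadamard matrix is assumed as the rotation, and the fused INT4 GEMM kernel is replaced by floating-point fake quantization.

```python
import numpy as np

def fake_quant(x, bits=4, axis=-1):
    """Symmetric fake quantization: snap to a signed integer grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero for all-zero rows
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def hadamard(n):
    """Orthonormal Sylvester-type Hadamard matrix; n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def runtime_smooth_matmul(x, w, bits=4):
    """Approximate x @ w.T with x: (tokens, channels), w: (out_features, channels)."""
    # Channel-wise maximums, computed at runtime from the current activation.
    s = np.max(np.abs(x), axis=0, keepdims=True)
    s = np.where(s == 0, 1.0, s)
    x_q = fake_quant(x / s, bits, axis=-1)  # assumed: one scale per token
    w_q = fake_quant(w, bits, axis=-1)      # assumed: one scale per output channel
    # A fused kernel would re-apply s inside the GEMM; here it is done in floating point.
    return (x_q * s) @ w_q.T

def rotated_runtime_smooth_matmul(x, w, bits=4):
    """Rotate both operands first; (x H)(w H)^T equals x w^T because H is orthonormal."""
    h = hadamard(x.shape[1])
    return runtime_smooth_matmul(x @ h, w @ h, bits)

# Toy data with one channel-wise outlier and one spike outlier (illustrative values).
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64))
w = rng.standard_normal((32, 64))
x[:, 5] *= 50.0    # channel-wise outlier: the whole channel is large
x[2, 17] = 100.0   # spike outlier: a single large element in one token

ref = x @ w.T
for name, fn in [("RS", runtime_smooth_matmul), ("RRS", rotated_runtime_smooth_matmul)]:
    err = np.linalg.norm(fn(x, w) - ref) / np.linalg.norm(ref)
    print(f"{name} relative error: {err:.4f}")
```

The printed errors depend on the toy data and are not a substitute for the perplexity results reported in the paper; the sketch only shows where the channel-wise maximums and the rotation enter the computation.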

Preparation

Download models from Hugging Face or ModelScope

Download the WikiText-2 dataset

Requirements:

  • Python 3.9

Install Python dependencies:

pip install -r requirements.txt

Reproducing

For simulation results with fake quantization, see the script directory.

The end-to-end INT4 inference pipeline is still under development.
