Move kv cache scales from k/v_proj.output_scale to self_attn.k/v_scale #133

mgoin · 2024-08-14T20:18:56Z

Temporary fix to the serialized checkpoint format for quantized kv cache scales.

The current issue is that storing the scales directly on the output of Linear modules doesn’t exactly match their usage in the attention implementation. In vLLM, the kv scales are members of the Attention class rather than of its Linear submodules. So instead of model.layers.0.self_attn.k_proj.output_scale we should use model.layers.0.self_attn.k_scale.

Example script used to illustrate the difference this PR makes:

from llmcompressor.transformers import oneshot

recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-KV",
    num_calibration_samples=16,
)
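One way to inspect which tensor names such a run writes is to read the header of the saved safetensors file. The helper below is a minimal sketch, not part of this PR: the function name `list_safetensors_keys` is hypothetical, and it relies only on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header). The checkpoint file name inside the output directory is an assumption and may differ, for example if the weights are sharded.

```python
import json
import struct


def list_safetensors_keys(path):
    """Return the sorted tensor names stored in a .safetensors file.

    Hypothetical helper for illustration; it reads only the JSON header,
    so no tensor data is loaded.
    """
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as a little-endian uint64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional non-tensor entry in the header.
    return sorted(name for name in header if name != "__metadata__")
```

For example, `list_safetensors_keys("TinyLlama-1.1B-Chat-v1.0-KV/model.safetensors")` would return the names to compare against the listings below, assuming the checkpoint was saved as a single file with that name.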

Before this PR, we would save the kv cache scales as output_scale tensors on the k_proj and v_proj Linear modules:

model.layers.0.mlp.down_proj.input_scale
model.layers.0.mlp.down_proj.weight
model.layers.0.mlp.down_proj.weight_scale
model.layers.0.mlp.gate_proj.input_scale
model.layers.0.mlp.gate_proj.weight
model.layers.0.mlp.gate_proj.weight_scale
model.layers.0.mlp.up_proj.input_scale
model.layers.0.mlp.up_proj.weight
model.layers.0.mlp.up_proj.weight_scale
model.layers.0.self_attn.k_proj.input_scale
model.layers.0.self_attn.k_proj.output_scale <<<
model.layers.0.self_attn.k_proj.weight
model.layers.0.self_attn.k_proj.weight_scale
model.layers.0.self_attn.o_proj.input_scale
model.layers.0.self_attn.o_proj.weight
model.layers.0.self_attn.o_proj.weight_scale
model.layers.0.self_attn.q_proj.input_scale
model.layers.0.self_attn.q_proj.weight
model.layers.0.self_attn.q_proj.weight_scale
model.layers.0.self_attn.v_proj.input_scale
model.layers.0.self_attn.v_proj.output_scale <<<
model.layers.0.self_attn.v_proj.weight
model.layers.0.self_attn.v_proj.weight_scale

With this PR, those scales are rewritten as k_scale and v_scale on the parent Attention module of those Linear modules:

model.layers.0.mlp.down_proj.input_scale
model.layers.0.mlp.down_proj.weight
model.layers.0.mlp.down_proj.weight_scale
model.layers.0.mlp.gate_proj.input_scale
model.layers.0.mlp.gate_proj.weight
model.layers.0.mlp.gate_proj.weight_scale
model.layers.0.mlp.up_proj.input_scale
model.layers.0.mlp.up_proj.weight
model.layers.0.mlp.up_proj.weight_scale
model.layers.0.self_attn.k_proj.input_scale
model.layers.0.self_attn.k_proj.weight
model.layers.0.self_attn.k_proj.weight_scale
model.layers.0.self_attn.k_scale             <<<
model.layers.0.self_attn.o_proj.input_scale
model.layers.0.self_attn.o_proj.weight
model.layers.0.self_attn.o_proj.weight_scale
model.layers.0.self_attn.q_proj.input_scale
model.layers.0.self_attn.q_proj.weight
model.layers.0.self_attn.q_proj.weight_scale
model.layers.0.self_attn.v_proj.input_scale
model.layers.0.self_attn.v_proj.weight
model.layers.0.self_attn.v_proj.weight_scale
model.layers.0.self_attn.v_scale             <<<
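The renaming shown in the two listings can be sketched as a pass over a state dict. This is an illustrative sketch only, not the code in model_compressor.py: the function name `remap_kv_cache_scales` and the regular expression are assumptions made for the example.

```python
import re

# Matches "<parent>.k_proj.output_scale" or "<parent>.v_proj.output_scale".
# The pattern is an assumption for this sketch, not taken from the PR.
_KV_OUTPUT_SCALE = re.compile(
    r"^(?P<parent>.+)\.(?P<kv>[kv])_proj\.output_scale$"
)


def remap_kv_cache_scales(state_dict):
    """Return a copy of state_dict with kv cache scales moved to the parent.

    "<parent>.k_proj.output_scale" becomes "<parent>.k_scale" and
    "<parent>.v_proj.output_scale" becomes "<parent>.v_scale".
    All other entries are copied unchanged.
    """
    remapped = {}
    for name, tensor in state_dict.items():
        match = _KV_OUTPUT_SCALE.match(name)
        if match:
            name = f"{match['parent']}.{match['kv']}_scale"
        remapped[name] = tensor
    return remapped
```

Anchoring the pattern on the full `k_proj.output_scale` / `v_proj.output_scale` suffix keeps unrelated tensors such as input_scale and weight_scale on their original names.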

src/compressed_tensors/compressors/model_compressor.py

Move kv cache scales from k/v_proj.output_scale to self_attn.k/v_scale

2637373

mgoin requested review from Satrat, bfineran and horheynm August 14, 2024 20:19

Add better checking that we hit our special case

e391ddb

Satrat approved these changes Aug 15, 2024

mgoin merged commit ecf4450 into main Aug 15, 2024
1 check passed

mgoin deleted the move-kv_cache_scheme-to-kv_scales branch August 15, 2024 20:19
