@mgoin mgoin commented Aug 14, 2024

Temporary fix to the serialized checkpoint format for quantized kv cache scales.

The current issue is that storing the scales directly on the output of Linear modules doesn't match how they are used in the attention implementation. In vLLM, the kv scales are members of the Attention class rather than of its Linear submodules, so instead of model.layers.0.self_attn.k_proj.output_scale we should use model.layers.0.self_attn.k_scale.

Example script used to illustrate the differences in this PR:

from llmcompressor.transformers import oneshot

recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-KV",
    num_calibration_samples=16,
)

Before this PR, we would save the kv cache scales as output_scale tensors on the k_proj and v_proj Linear modules:

model.layers.0.mlp.down_proj.input_scale
model.layers.0.mlp.down_proj.weight
model.layers.0.mlp.down_proj.weight_scale
model.layers.0.mlp.gate_proj.input_scale
model.layers.0.mlp.gate_proj.weight
model.layers.0.mlp.gate_proj.weight_scale
model.layers.0.mlp.up_proj.input_scale
model.layers.0.mlp.up_proj.weight
model.layers.0.mlp.up_proj.weight_scale
model.layers.0.self_attn.k_proj.input_scale
model.layers.0.self_attn.k_proj.output_scale <<<
model.layers.0.self_attn.k_proj.weight
model.layers.0.self_attn.k_proj.weight_scale
model.layers.0.self_attn.o_proj.input_scale
model.layers.0.self_attn.o_proj.weight
model.layers.0.self_attn.o_proj.weight_scale
model.layers.0.self_attn.q_proj.input_scale
model.layers.0.self_attn.q_proj.weight
model.layers.0.self_attn.q_proj.weight_scale
model.layers.0.self_attn.v_proj.input_scale
model.layers.0.self_attn.v_proj.output_scale <<<
model.layers.0.self_attn.v_proj.weight
model.layers.0.self_attn.v_proj.weight_scale

With this PR, those scales are saved as k_scale and v_scale on the parent Attention module of those Linear modules:

model.layers.0.mlp.down_proj.input_scale
model.layers.0.mlp.down_proj.weight
model.layers.0.mlp.down_proj.weight_scale
model.layers.0.mlp.gate_proj.input_scale
model.layers.0.mlp.gate_proj.weight
model.layers.0.mlp.gate_proj.weight_scale
model.layers.0.mlp.up_proj.input_scale
model.layers.0.mlp.up_proj.weight
model.layers.0.mlp.up_proj.weight_scale
model.layers.0.self_attn.k_proj.input_scale
model.layers.0.self_attn.k_proj.weight
model.layers.0.self_attn.k_proj.weight_scale
model.layers.0.self_attn.k_scale             <<<
model.layers.0.self_attn.o_proj.input_scale
model.layers.0.self_attn.o_proj.weight
model.layers.0.self_attn.o_proj.weight_scale
model.layers.0.self_attn.q_proj.input_scale
model.layers.0.self_attn.q_proj.weight
model.layers.0.self_attn.q_proj.weight_scale
model.layers.0.self_attn.v_proj.input_scale
model.layers.0.self_attn.v_proj.weight
model.layers.0.self_attn.v_proj.weight_scale
model.layers.0.self_attn.v_scale             <<<
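
For illustration, the renaming between the two listings can be sketched as a key remap over a checkpoint's tensor names. This is a minimal sketch, not the code from this PR: the helper names `rename_kv_scale` and `remap_state_dict` are hypothetical, and it assumes the only names to rewrite are those ending in `k_proj.output_scale` or `v_proj.output_scale`.

```python
import re

# Hypothetical pattern (an assumption, not taken from the PR): matches
# "<parent>.k_proj.output_scale" and "<parent>.v_proj.output_scale" only.
_KV_OUTPUT_SCALE = re.compile(r"^(?P<parent>.*)\.(?P<kv>[kv])_proj\.output_scale$")


def rename_kv_scale(name: str) -> str:
    """Map '<parent>.{k,v}_proj.output_scale' to '<parent>.{k,v}_scale'.

    Any other tensor name is returned unchanged.
    """
    match = _KV_OUTPUT_SCALE.match(name)
    if match is None:
        return name
    return f"{match.group('parent')}.{match.group('kv')}_scale"


def remap_state_dict(state_dict: dict) -> dict:
    """Return a new dict with kv cache scale keys moved to the parent module."""
    return {rename_kv_scale(name): tensor for name, tensor in state_dict.items()}
```

Applied to the names listed above, `rename_kv_scale("model.layers.0.self_attn.k_proj.output_scale")` yields `model.layers.0.self_attn.k_scale`, while names such as `model.layers.0.self_attn.k_proj.weight_scale` pass through untouched.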

@mgoin mgoin requested review from Satrat, bfineran and horheynm August 14, 2024 20:19
@mgoin mgoin merged commit ecf4450 into main Aug 15, 2024
@mgoin mgoin deleted the move-kv_cache_scheme-to-kv_scales branch August 15, 2024 20:19
Etelis added a commit to Etelis/compressed-tensors that referenced this pull request Sep 11, 2025
vllm-project#133)

* Move kv cache scales from k/v_proj.output_scale to self_attn.k/v_scale

* Add better checking that we hit our special case