Move kv cache scales from k/v_proj.output_scale to self_attn.k/v_scale #133

Merged Aug 15, 2024 (2 commits)

@mgoin mgoin commented Aug 14, 2024

Temporary fix to the serialized checkpoint format for quantized kv cache scales.

The current issue is that storing the scales directly on the output of Linear modules doesn’t exactly match their usage in the attention implementation. In vLLM, the kv scales are members of the Attention class rather than of its Linear submodules. So instead of model.layers.0.self_attn.k_proj.output_scale, we should use model.layers.0.self_attn.k_scale.
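
As a rough illustration of the rename described above, here is a minimal sketch of a key-name remapping. It is not the code from this PR: the helper names `remap_kv_scale_name` and `remap_state_dict_keys` are hypothetical, and the sketch assumes the checkpoint keys follow the `...self_attn.{k,v}_proj.output_scale` pattern shown in this description.

```python
import re

# Hypothetical helpers for illustration only; not the implementation in this PR.
# Matches the old location of the kv cache scales on the k_proj / v_proj Linear modules.
_KV_OUTPUT_SCALE = re.compile(r"\.self_attn\.([kv])_proj\.output_scale$")


def remap_kv_scale_name(name: str) -> str:
    """Map `...self_attn.k_proj.output_scale` to `...self_attn.k_scale` (same for v).

    Any name that does not match the old kv scale pattern is returned unchanged.
    """
    return _KV_OUTPUT_SCALE.sub(r".self_attn.\1_scale", name)


def remap_state_dict_keys(state_dict: dict) -> dict:
    """Return a new dict with kv cache scale keys moved to the Attention parent."""
    return {remap_kv_scale_name(key): value for key, value in state_dict.items()}
```

Only the k_proj and v_proj output scales are touched; weights, weight scales, and input scales keep their original names.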

Example script used to illustrate the differences in this PR:

from llmcompressor.transformers import oneshot

recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-KV",
    num_calibration_samples=16,
)

Before this PR, we would save the kv cache scales as output_scale tensors on the k_proj and v_proj Linear modules:

model.layers.0.mlp.down_proj.input_scale
model.layers.0.mlp.down_proj.weight
model.layers.0.mlp.down_proj.weight_scale
model.layers.0.mlp.gate_proj.input_scale
model.layers.0.mlp.gate_proj.weight
model.layers.0.mlp.gate_proj.weight_scale
model.layers.0.mlp.up_proj.input_scale
model.layers.0.mlp.up_proj.weight
model.layers.0.mlp.up_proj.weight_scale
model.layers.0.self_attn.k_proj.input_scale
model.layers.0.self_attn.k_proj.output_scale <<<
model.layers.0.self_attn.k_proj.weight
model.layers.0.self_attn.k_proj.weight_scale
model.layers.0.self_attn.o_proj.input_scale
model.layers.0.self_attn.o_proj.weight
model.layers.0.self_attn.o_proj.weight_scale
model.layers.0.self_attn.q_proj.input_scale
model.layers.0.self_attn.q_proj.weight
model.layers.0.self_attn.q_proj.weight_scale
model.layers.0.self_attn.v_proj.input_scale
model.layers.0.self_attn.v_proj.output_scale <<<
model.layers.0.self_attn.v_proj.weight
model.layers.0.self_attn.v_proj.weight_scale

Now we have those scales rewritten to be k_scale or v_scale on the Attention parent module of those Linear modules:

model.layers.0.mlp.down_proj.input_scale
model.layers.0.mlp.down_proj.weight
model.layers.0.mlp.down_proj.weight_scale
model.layers.0.mlp.gate_proj.input_scale
model.layers.0.mlp.gate_proj.weight
model.layers.0.mlp.gate_proj.weight_scale
model.layers.0.mlp.up_proj.input_scale
model.layers.0.mlp.up_proj.weight
model.layers.0.mlp.up_proj.weight_scale
model.layers.0.self_attn.k_proj.input_scale
model.layers.0.self_attn.k_proj.weight
model.layers.0.self_attn.k_proj.weight_scale
model.layers.0.self_attn.k_scale             <<<
model.layers.0.self_attn.o_proj.input_scale
model.layers.0.self_attn.o_proj.weight
model.layers.0.self_attn.o_proj.weight_scale
model.layers.0.self_attn.q_proj.input_scale
model.layers.0.self_attn.q_proj.weight
model.layers.0.self_attn.q_proj.weight_scale
model.layers.0.self_attn.v_proj.input_scale
model.layers.0.self_attn.v_proj.weight
model.layers.0.self_attn.v_proj.weight_scale
model.layers.0.self_attn.v_scale             <<<
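
To tell the two layouts above apart when inspecting a checkpoint, a small check over the tensor names may help. This is a sketch under assumptions: `classify_kv_scale_names` is a hypothetical helper, and the list of names is assumed to have been read from the saved checkpoint by whatever means is convenient.

```python
def classify_kv_scale_names(names):
    """Split tensor names into old-style and new-style kv cache scale entries.

    Hypothetical helper for illustration; `names` is any iterable of
    checkpoint tensor names such as those listed above.
    """
    old_suffixes = (".self_attn.k_proj.output_scale", ".self_attn.v_proj.output_scale")
    new_suffixes = (".self_attn.k_scale", ".self_attn.v_scale")
    names = list(names)
    return {
        # Scales still stored on the Linear submodules (layout before this PR).
        "old": sorted(n for n in names if n.endswith(old_suffixes)),
        # Scales stored on the Attention parent module (layout after this PR).
        "new": sorted(n for n in names if n.endswith(new_suffixes)),
    }
```

A checkpoint saved after this change would be expected to produce an empty "old" list and one k_scale plus one v_scale entry per attention layer in "new".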

@mgoin mgoin merged commit ecf4450 into main Aug 15, 2024
1 check passed
@mgoin mgoin deleted the move-kv_cache_scheme-to-kv_scales branch August 15, 2024 20:19