Move kv cache scales from k/v_proj.output_scale to self_attn.k/v_scale #133
Temporary fix to the serialized checkpoint format for quantized kv cache scales.
The current issue is that storing the scales directly on the output of the Linear modules doesn't match how they are used by the attention implementation. In vLLM the kv scales are members of the Attention class, rather than of its Linear submodules. So rather than

`model.layers.0.self_attn.k_proj.output_scale`

we should use

`model.layers.0.self_attn.k_scale`
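As a rough sketch of what that rename implies for the serialized state dict (the helper name and regex below are illustrative, not the actual code in this PR):

```python
import re

def remap_kv_scale_name(name: str) -> str:
    """Move a k/v output scale from the Linear submodule onto its Attention parent.

    Hypothetical helper for illustration only, e.g.:
      model.layers.0.self_attn.k_proj.output_scale
        -> model.layers.0.self_attn.k_scale
    """
    return re.sub(r"\.(k|v)_proj\.output_scale$", r".\1_scale", name)

print(remap_kv_scale_name("model.layers.0.self_attn.k_proj.output_scale"))
# model.layers.0.self_attn.k_scale
```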
Example script used to illustrate the differences discussed in this PR:
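The script itself isn't reproduced here; as a minimal sketch, the scale-related keys in a serialized checkpoint can be listed like this (the checkpoint path is a placeholder, not a file from this PR):

```python
# Illustrative sketch: dump every scale-related tensor name from a checkpoint
# so the before/after key layout can be compared.
from safetensors import safe_open

CHECKPOINT = "model.safetensors"  # hypothetical path to the quantized checkpoint

with safe_open(CHECKPOINT, framework="pt") as f:
    for name in f.keys():
        if "scale" in name:
            print(name)
```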
Before this PR, we would save the kv cache scales as `output_scale` tensors on the `k_proj` and `v_proj` Linear modules. Now we have those scales rewritten to be `k_scale` or `v_scale` on the Attention parent module of those Linear modules.
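For example, for layer 0 the relevant checkpoint entries change as follows (illustrative key names based on the paths above; the actual script output is not reproduced here):

```
# before this PR
model.layers.0.self_attn.k_proj.output_scale
model.layers.0.self_attn.v_proj.output_scale

# after this PR
model.layers.0.self_attn.k_scale
model.layers.0.self_attn.v_scale
```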