Support calibrating kv cache scales #17
Merged
Adds a `kv_cache_quant_targets` quant config argument that attaches `output_scales` to the specified Linear modules. This means we will end up with `k_proj.output_scale` and `v_proj.output_scale` after activation calibration. For the final checkpoint, we add a pass that takes the maximum of `k_proj.output_scale` and `v_proj.output_scale` and places the result in the parent of those modules (the Attention module) as a single `kv_scale`, which is needed to match the representation in vLLM.
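Roughly, that merge pass does something like the sketch below. This is not the PR's actual code, just an illustration of the described behavior; the helper name `merge_kv_cache_scales` is hypothetical, while the `k_proj` / `v_proj` / `output_scale` / `kv_scale` names follow the description above.

```python
import torch
import torch.nn as nn


def merge_kv_cache_scales(model: nn.Module) -> None:
    """Fold calibrated k_proj/v_proj output scales into a single kv_scale."""
    for module in model.modules():
        k_proj = getattr(module, "k_proj", None)
        v_proj = getattr(module, "v_proj", None)
        if k_proj is None or v_proj is None:
            continue
        k_scale = getattr(k_proj, "output_scale", None)
        v_scale = getattr(v_proj, "output_scale", None)
        if k_scale is None or v_scale is None:
            continue
        # vLLM expects one scale per attention layer, so take the max of the
        # two calibrated output scales and attach it to the parent module.
        kv_scale = torch.max(k_scale.detach(), v_scale.detach())
        module.register_parameter(
            "kv_scale", nn.Parameter(kv_scale, requires_grad=False)
        )
        # Assumption for this sketch: the per-projection output scales are
        # dropped from the final checkpoint once kv_scale is materialized.
        del k_proj.output_scale
        del v_proj.output_scale
```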
"re:.*lm_head"
not a required ignored pattern but just a default, and disabling torch._scaled_mm for easier usage on CPU.A new example is included to show how to enable this functionality
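For reference, enabling it looks roughly like the following. The class names (`AutoFP8ForCausalLM`, `BaseQuantizeConfig`), the `ignore_patterns` argument, and the calibration-data handling here are assumptions about the surrounding API; see the new example added in this PR for the exact usage.

```python
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8-KV"

# Static activation calibration needs representative inputs so the
# k_proj/v_proj output ranges can be observed.
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir)
examples = tokenizer(
    ["The quick brown fox jumps over the lazy dog."],
    return_tensors="pt",
).input_ids

quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="static",
    # "re:.*lm_head" is now just the default ignore pattern, not a requirement.
    ignore_patterns=["re:.*lm_head"],
    # New in this PR: attach output scales to these Linear modules and fold
    # them into a per-attention-layer kv_scale for vLLM.
    kv_cache_quant_targets=("k_proj", "v_proj"),
)

model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```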