Support calibrating kv cache scales #17

Merged: mgoin merged 10 commits from support-kv-cache-scales into main on Jun 18, 2024

Conversation

@mgoin (Member) commented on Jun 13, 2024

Adds a kv_cache_quant_targets quant config argument that attaches output_scales to the specified Linear modules. This means we will end up with k_proj.output_scale and v_proj.output_scale after activation calibration. For the final checkpoint, we add a pass to take the maximum of k_proj.output_scale and v_proj.output_scale, and place the result in the parent of those modules (the Attention module) as a single kv_scale, which is needed to match the representation in vLLM.
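For illustration, a minimal sketch of what that merge pass could look like is shown below. The function name and the attribute handling are hypothetical and based only on the description above (k_proj/v_proj carrying output_scale, the parent Attention module receiving kv_scale); the PR's actual implementation may differ.

import torch
from torch import nn

def merge_kv_scales(model: nn.Module) -> None:
    # Sketch only: fold the calibrated k_proj/v_proj output scales into a
    # single kv_scale on the parent attention module, as described above.
    for module in model.modules():
        k_proj = getattr(module, "k_proj", None)
        v_proj = getattr(module, "v_proj", None)
        if k_proj is None or v_proj is None:
            continue
        k_scale = getattr(k_proj, "output_scale", None)
        v_scale = getattr(v_proj, "output_scale", None)
        if k_scale is None or v_scale is None:
            continue
        # Take the maximum of the two calibrated output scales and attach it
        # to the parent (Attention) module as the single kv_scale vLLM expects.
        module.kv_scale = nn.Parameter(torch.max(k_scale, v_scale), requires_grad=False)
        # The per-projection scales are no longer needed in the final checkpoint.
        del k_proj.output_scale
        del v_proj.output_scale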

Also includes a decent chunk of refactoring to allow for no examples to be passed in for weight quantization, renaming for clearer understanding of modules, making "re:.*lm_head" not a required ignored pattern but just a default, and disabling torch._scaled_mm for easier usage on CPU.
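As a rough sketch of the no-examples path mentioned above (weight-only quantization with dynamic activation scales), usage could look like the following; the "dynamic" scheme value and the empty examples list are assumptions based on that description, not code taken from this PR.

from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

# Assumed: dynamic activation scales mean no static calibration data is required.
quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="dynamic",
)

model = AutoFP8ForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", quantize_config)
model.quantize(examples=[])  # weight-only quantization, no calibration examples
model.save_quantized("Meta-Llama-3-8B-Instruct-FP8")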

A new example is included to show how to enable this functionality:

from datasets import load_dataset
from transformers import AutoTokenizer

from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8-KV"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

ds = load_dataset("mgoin/ultrachat_2k", split="train_sft").select(range(512))
examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")

quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="static",
    ignore_patterns=["re:.*lm_head"],
    # Calibrate output_scale on these Linear modules; they are later merged
    # into a single kv_scale on each parent Attention module.
    kv_cache_quant_targets=("k_proj", "v_proj"),
)

model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
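
As an optional sanity check (not part of this PR), the saved checkpoint can be inspected for the merged scales; the single-shard model.safetensors file name below is an assumption about how the export is laid out.

from safetensors import safe_open

# Assumes a single-shard safetensors export in the output directory.
with safe_open(f"{quantized_model_dir}/model.safetensors", framework="pt") as f:
    kv_scale_keys = [k for k in f.keys() if k.endswith("kv_scale")]

# Expect one entry per attention layer, e.g. model.layers.0.self_attn.kv_scale
print(kv_scale_keys)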

@mgoin force-pushed the support-kv-cache-scales branch from 96cc0b0 to 084feb8 on June 14, 2024 at 18:57
@mgoin marked this pull request as ready for review on June 14, 2024 at 18:57
@mgoin mentioned this pull request on Jun 18, 2024
@mgoin merged commit 0d40b99 into main on Jun 18, 2024
4 checks passed
mgoin added a commit that referenced this pull request on Jun 19, 2024
@mgoin linked an issue on Jun 20, 2024 that may be closed by this pull request
Successfully merging this pull request may close these issues: FP8 KV cache support.