[compressor] Add packed int8 support #91

Merged · 10 commits merged from add_int8_packed into main on Jun 24, 2024

Conversation

@dsikka (Contributor) commented Jun 18, 2024

Summary

  • Add the ability to pack int8 tensors into int32 packed weights (a rough sketch of the packing layout follows this list)
  • This allows w8a16 models to run with the gpt_marlin kernels in vllm

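For reference, a minimal sketch of the general packing idea, assuming a straightforward little-endian layout of four int8 values per int32; the function name pack_int8_to_int32 is hypothetical and this is not the compressor's actual implementation:

import torch


def pack_int8_to_int32(w: torch.Tensor) -> torch.Tensor:
    # w: an int8 weight tensor whose last dimension is divisible by 4
    assert w.dtype == torch.int8 and w.shape[-1] % 4 == 0
    # offset the signed int8 values into the unsigned 0..255 range and work in
    # int64 so that shifting a byte into bits 24..31 cannot overflow
    u = (w.to(torch.int64) + 128) & 0xFF
    u = u.reshape(*w.shape[:-1], -1, 4)
    shifts = torch.tensor([0, 8, 16, 24], dtype=torch.int64)
    packed = (u << shifts).sum(dim=-1)  # bytes occupy disjoint bits, so sum acts as OR
    # reinterpret the low 32 bits as a signed int32
    packed = torch.where(packed > 0x7FFFFFFF, packed - (1 << 32), packed)
    return packed.to(torch.int32)
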
Testing:

  • Tested using the following recipe/script
import torch

from sparseml.transformers import SparseAutoModelForCausalLM, oneshot


# define a sparseml recipe for GPTQ W8A16 quantization (int8 weights only)
recipe = """
quant_stage:
    quant_modifiers:
        GPTQModifier:
            sequential_update: false
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: "int"
                        symmetric: true
                        strategy: "channel"
                    targets: ["Linear"]
"""

# setting device_map to auto to spread the model evenly across all available GPUs
# load the model in as bfloat16 to save on memory and compute
model_stub = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype=torch.bfloat16, device_map="auto"
)

# uses SparseML's built-in preprocessing for ultra chat
dataset = "ultrachat-200k"

# save location of quantized model out
output_dir = "./output_llama1b_w8a16_channel_compressed"

# set dataset config parameters
splits = {"calibration": "train_gen[:5%]"}
max_seq_length = 512
pad_to_max_length = False
num_calibration_samples = 512

# apply recipe to the model and save quantized output in an int8 packed format
oneshot(
    model=model,
    dataset=dataset,
    recipe=recipe,
    output_dir=output_dir,
    splits=splits,
    max_seq_length=max_seq_length,
    pad_to_max_length=pad_to_max_length,
    num_calibration_samples=num_calibration_samples,
    save_compressed=True,
)

model.save_pretrained(output_dir, quantization_format="pack-quantized")

The produced model was then tested and run in vLLM without issue:

from vllm import LLM, SamplingParams
import torch

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The president of the United States is",
    "The Boston Bruins are"
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=1, top_p=1)


llm = LLM(model="/root/output_llama1b_w8a16_channel_compressed")
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Output:

Prompt: 'Hello, my name is', Generated text: ' PR Business Incubator Mohammed, and I am studying English and Computer Science'
Prompt: 'The capital of France is', Generated text: " Paris, but it's a city with plenty of blue-collar people"
Prompt: 'The president of the United States is', Generated text: ': evangelicals are influential in the political process; Gallup finds that'
Prompt: 'The Boston Bruins are', Generated text: " among the NHL's hottest teams as of late. They'"

@dsikka dsikka requested review from Satrat and bfineran June 18, 2024 16:07
@Satrat Satrat left a comment

Could we also add in the unpacking code for int8? Then we will be able to reload the models in transformers

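As context for the unpacking request, a minimal sketch of the inverse operation, assuming the same hypothetical little-endian byte layout as the packing sketch above (illustration only, not the PR's actual unpacking code):

import torch


def unpack_int32_to_int8(packed: torch.Tensor) -> torch.Tensor:
    # inverse of the packing sketch: pull each byte back out of the int32
    # lanes and undo the +128 offset to recover the signed int8 values
    p = packed.to(torch.int64) & 0xFFFFFFFF  # view the int32 bit pattern as unsigned
    shifts = torch.tensor([0, 8, 16, 24], dtype=torch.int64)
    b = (p.unsqueeze(-1) >> shifts) & 0xFF
    w = b.reshape(*packed.shape[:-1], -1) - 128
    return w.to(torch.int8)
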
@dsikka dsikka requested a review from Satrat June 19, 2024 21:58
bfineran previously approved these changes Jun 24, 2024
tests/test_compressors/test_pack_quant.py (review comment outdated, resolved)
Satrat previously approved these changes Jun 24, 2024
@dsikka dsikka dismissed stale reviews from Satrat and bfineran via 7c18805 June 24, 2024 18:03
@dsikka dsikka requested review from Satrat and bfineran June 24, 2024 18:36
@dsikka dsikka merged commit f3b0948 into main Jun 24, 2024
1 check passed
@dsikka dsikka deleted the add_int8_packed branch June 24, 2024 19:30