[compressor] Add packed int8 support #91

Merged: 10 commits merged into main from add_int8_packed on Jun 24, 2024

Conversation

@dsikka (Contributor) commented on Jun 18, 2024

Summary

  • Add the ability to pack int8 weight tensors into int32 packed weights (a minimal sketch of the packing idea follows this list)
  • This allows w8a16 models to run with the gptq_marlin kernels in vLLM
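
Conceptually, the packing places four consecutive int8 values into the four bytes of a single int32. A minimal sketch of the idea (`pack_int8_to_int32` is a hypothetical helper, not the exact implementation added here; the real packing order and axis may differ):

```python
import torch

def pack_int8_to_int32(w_int8: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: pack every four int8 values along the last dim
    # into one int32, masking to 8 bits so negative values don't sign-extend.
    assert w_int8.shape[-1] % 4 == 0, "last dim must be divisible by 4"
    packed = torch.zeros(
        *w_int8.shape[:-1], w_int8.shape[-1] // 4, dtype=torch.int32
    )
    for i in range(4):
        packed |= (w_int8[..., i::4].to(torch.int32) & 0xFF) << (8 * i)
    return packed
```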

Testing:

  • Tested using the following recipe/script:

```python
import torch

from sparseml.transformers import SparseAutoModelForCausalLM, oneshot


# define a sparseml recipe for GPTQ W8A16 quantization
recipe = """
quant_stage:
    quant_modifiers:
        GPTQModifier:
            sequential_update: false
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: "int"
                        symmetric: true
                        strategy: "channel"
                    targets: ["Linear"]
"""

# setting device_map to auto to spread the model evenly across all available GPUs
# load the model in as bfloat16 to save on memory and compute
model_stub = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype=torch.bfloat16, device_map="auto"
)

# uses SparseML's built-in preprocessing for ultra chat
dataset = "ultrachat-200k"

# save location of quantized model out
output_dir = "./output_llama1b_w8a16_channel_compressed"

# set dataset config parameters
splits = {"calibration": "train_gen[:5%]"}
max_seq_length = 512
pad_to_max_length = False
num_calibration_samples = 512

# apply recipe to the model and save the quantized output in an int8 packed format
oneshot(
    model=model,
    dataset=dataset,
    recipe=recipe,
    output_dir=output_dir,
    splits=splits,
    max_seq_length=max_seq_length,
    pad_to_max_length=pad_to_max_length,
    num_calibration_samples=num_calibration_samples,
    save_compressed=True,
)

model.save_pretrained(output_dir, quantization_format="pack-quantized")
```
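
As a quick sanity check that the weights really were packed, the saved checkpoint can be inspected directly with safetensors (the file name and tensor layout below are assumptions and depend on the compressed-tensors format/version):

```python
from safetensors import safe_open

# assumed single-shard checkpoint path; larger models may be sharded
path = "./output_llama1b_w8a16_channel_compressed/model.safetensors"

with safe_open(path, framework="pt") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)
        # packed Linear weights are expected to show up as int32 tensors
        # with a reduced dimension, alongside their quantization scales
        print(name, tensor.dtype, tuple(tensor.shape))
```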

The produced model was then loaded and run in vLLM without issue:

```python
from vllm import LLM, SamplingParams
import torch

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The president of the United States is",
    "The Boston Bruins are"
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=1, top_p=1)


llm = LLM(model="/root/output_llama1b_w8a16_channel_compressed")
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Output:

```
Prompt: 'Hello, my name is', Generated text: ' PR Business Incubator Mohammed, and I am studying English and Computer Science'
Prompt: 'The capital of France is', Generated text: " Paris, but it's a city with plenty of blue-collar people"
Prompt: 'The president of the United States is', Generated text: ': evangelicals are influential in the political process; Gallup finds that'
Prompt: 'The Boston Bruins are', Generated text: " among the NHL's hottest teams as of late. They'"
```

@dsikka requested review from Satrat and bfineran on June 18, 2024 16:07

@Satrat left a comment

Could we also add in the unpacking code for int8? Then we will be able to reload the models in transformers
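
For context, unpacking is just the inverse of the int8-to-int32 packing: each int32 is split back into four bytes and each byte is sign-extended to int8. A minimal sketch (hypothetical helper mirroring the packing sketch above, not the code ultimately added in this PR):

```python
import torch

def unpack_int32_to_int8(packed: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: split each int32 into four bytes and sign-extend
    # each byte back to an int8 value (inverse of the packing sketch above).
    bytes_out = []
    for i in range(4):
        b = (packed >> (8 * i)) & 0xFF           # isolate byte i of every word
        b = torch.where(b >= 128, b - 256, b)    # sign-extend the 8-bit value
        bytes_out.append(b)
    unpacked = torch.stack(bytes_out, dim=-1)    # (..., n_packed, 4)
    return unpacked.reshape(*packed.shape[:-1], -1).to(torch.int8)
```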

@dsikka requested a review from Satrat on June 19, 2024 21:58
bfineran previously approved these changes Jun 24, 2024
Review comment on tests/test_compressors/test_pack_quant.py (outdated, resolved)
@dsikka force-pushed the add_int8_packed branch from ca5279f to a6a649b on June 24, 2024 14:51
bfineran previously approved these changes Jun 24, 2024
Satrat previously approved these changes Jun 24, 2024
@dsikka dismissed stale reviews from Satrat and bfineran via 7c18805 on June 24, 2024 18:03
@dsikka requested review from Satrat and bfineran on June 24, 2024 18:36
@dsikka merged commit f3b0948 into main on Jun 24, 2024 (1 check passed)
@dsikka deleted the add_int8_packed branch on June 24, 2024 19:30