GPTQ Activation Ordering #94

Merged
merged 64 commits into main
Aug 28, 2024

Conversation

@kylesayrs (Collaborator) commented Aug 16, 2024

Summary

Add support for compressed-tensors models that have been quantized using activation ordering (group-wise quantization performed in decreasing order of activation magnitude, so the most salient weight columns are quantized first).
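
For context: activation ordering permutes weight columns so that GPTQ quantizes the most-activated (highest Hessian-diagonal) columns first, then records the permutation so the weights can be restored to their original order after quantization. Below is a minimal sketch of that permutation, not taken from this PR; the names and shapes are illustrative assumptions.

import torch

def actorder_permutation(H: torch.Tensor) -> torch.Tensor:
    # H is a (num_cols x num_cols) Hessian estimate accumulated from
    # calibration activations; its diagonal measures column salience.
    return torch.argsort(torch.diag(H), descending=True)

# Illustrative usage: quantize the most salient columns first, then
# invert the permutation to restore the original column order.
H = torch.rand(8, 8)
W = torch.rand(4, 8)
perm = actorder_permutation(H)
W_perm = W[:, perm]
# ... GPTQ would quantize W_perm column-by-column here ...
W_restored = W_perm[:, torch.argsort(perm)]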

Usage Script

compress_actorder.py
import os
import pickle
import datetime
from datasets import load_dataset
from transformers import AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

def get_current_time():
    now = datetime.datetime.now()
    formatted_time = now.strftime("%Y%m%d_%H%M%S")
    return str(formatted_time)

# Select model and load it.
MODEL_ID="Qwen/Qwen2-0.5B-Instruct"
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Select calibration dataset.
DATASET_ID = "openai/gsm8k"
DATASET_SUBSET = "main"
DATASET_SPLIT = "train"
PICKLE_FILE = "pickle.pkl"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 512

# Preprocess: render each question through the chat template.
def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            [{"role": "user", "content": example["question"]}],
            tokenize=False,
            add_generation_prompt=True,
        )
    }

# Tokenize the preprocessed text.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

# Check if the preprocessed dataset is already saved
if os.path.exists(PICKLE_FILE):
    # Load the dataset from the pickle file
    with open(PICKLE_FILE, "rb") as f:
        ds = pickle.load(f)
    print("Loaded dataset from pickle file.")
else:
    # Load, preprocess, and tokenize the dataset
    ds = load_dataset(DATASET_ID, DATASET_SUBSET, split=DATASET_SPLIT)
    ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
    ds = ds.map(preprocess)
    ds = ds.map(tokenize, remove_columns=ds.column_names)

    # Save the preprocessed dataset to a pickle file
    with open(PICKLE_FILE, "wb") as f:
        pickle.dump(ds, f)
    print("Saved preprocessed dataset to pickle file.")

recipe = """
    quant_stage:
        quant_modifiers:
            GPTQModifier:
                sequential_update: false
                ignore: ["lm_head"]
                config_groups:
                    group_0:
                        weights:
                            num_bits: 4
                            type: "int"
                            symmetric: true
                            strategy: "group"
                            group_size: 128
                            actorder: true
                        targets: ["Linear"]
"""
# Apply algorithm
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save compressed model and tokenizer
SAVE_DIR = "actorder" + get_current_time()
print(SAVE_DIR)
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)  # tokenizers have no compressed format

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

Evaluation

Accuracy

Full Precision

vllm (pretrained=Qwen/Qwen2-0.5B-Instruct,add_bos_token=True), gen_kwargs: (None), limit: 1000.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.387|↑  |0.0154|
|     |       |strict-match    |     5|exact_match|↑  |0.385|↑  |0.0154|

Group Quantization Only

vllm (pretrained=/home/ksayers/llm-compressor/gwen_group,add_bos_token=True), gen_kwargs: (None), limit: 1000.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.226|↑  |0.0132|
|     |       |strict-match    |     5|exact_match|↑  |0.212|↑  |0.0129|

Group Quantization Only on main (regression test)

vllm (pretrained=/home/ksayers/llm-compressor/gwen_regression,add_bos_token=True), gen_kwargs: (None), limit: 1000.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.226|↑  |0.0132|
|     |       |strict-match    |     5|exact_match|↑  |0.212|↑  |0.0129|

Activation Ordering

vllm (pretrained=/home/ksayers/llm-compressor/gwen_actorder,add_bos_token=True), gen_kwargs: (None), limit: 1000.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.235|↑  |0.0134|
|     |       |strict-match    |     5|exact_match|↑  |0.231|↑  |0.0133|
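
Net effect: activation ordering recovers one to two points of exact-match over group quantization alone (0.235 vs. 0.226 flexible-extract; 0.231 vs. 0.212 strict-match), while both remain well below the full-precision baseline (0.387/0.385).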

Latency Regression

Namespace(model='/home/ksayers/llm-compressor/gwen_actorder/', speculative_model=None,
num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, tokenizer=None,
quantization=None, tensor_parallel_size=1, input_len=32, output_len=128, batch_size=32, n=1,
use_beam_search=False, num_iters_warmup=10, num_iters=30, trust_remote_code=False,
max_model_len=None, dtype='auto', enforce_eager=False, kv_cache_dtype='auto',
quantization_param_path=None, profile=False, profile_result_dir=None, device='auto',
block_size=16, enable_chunked_prefill=False, enable_prefix_caching=False,
use_v2_block_manager=False, ray_workers_use_nsight=False, download_dir=None,
output_json=None, gpu_memory_utilization=0.9, load_format='auto',
distributed_executor_backend=None, otlp_traces_endpoint=None)
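
(These arguments appear to come from vLLM's benchmarks/benchmark_latency.py: 32 input tokens, 128 output tokens, batch size 32, 30 measured iterations after 10 warmup iterations. The exact command is not shown in the PR.)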

Group Quantization Only

Avg latency: 0.8884373404396076 seconds
10% percentile latency: 0.8715801022946834 seconds
25% percentile latency: 0.8739993472117931 seconds
50% percentile latency: 0.876951577141881 seconds
75% percentile latency: 0.8830150356516242 seconds
90% percentile latency: 0.9393035409972071 seconds
99% percentile latency: 0.9404808702412992 seconds

Activation Ordering

Avg latency: 0.9159474782645702 seconds
10% percentile latency: 0.9001966264098883 seconds
25% percentile latency: 0.9010569080710411 seconds
50% percentile latency: 0.9041027296334505 seconds
75% percentile latency: 0.9064613012596965 seconds
90% percentile latency: 0.9662564094178379 seconds
99% percentile latency: 0.9761117453686893 seconds
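
On this 0.5B model, activation ordering adds roughly 3% to average latency over group quantization alone (0.916 s vs. 0.888 s).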

PR Dependencies

Activation Ordering Support (neuralmagic/compressed-tensors#97)

@kylesayrs kylesayrs marked this pull request as draft August 16, 2024 22:23
@kylesayrs kylesayrs mentioned this pull request Aug 16, 2024
@kylesayrs kylesayrs changed the title Activation Ordering GPTQ Activation Ordering Aug 16, 2024
@kylesayrs kylesayrs requested a review from Satrat August 23, 2024 18:46
@Satrat Satrat (Contributor) left a comment

Changes look good with respect to the Hessian memory management. I'd still like to see an e2e test for activation reordering that tests perplexity and reloading. See tests/llmcompressor/transformers/compression/test_quantization.py for an example of this. I believe it should just be a matter of adding a new recipe and config; let me know if you need help with that.

@kylesayrs kylesayrs changed the base branch from main to gptq-cleanup August 27, 2024 20:37
@kylesayrs (Collaborator, Author)

Performed the tests and got the same accuracy and latency results.

Base automatically changed from gptq-cleanup to main August 28, 2024 20:19
@kylesayrs (Collaborator, Author)

Using the compressed-tensors main branch, I confirmed that tests/llmcompressor/modifiers, tests/llmcompressor/transformers/compression, and tests/llmcompressor/modifiers/quantization/gptq/utils/test_gptq_wrapper.py all pass.

@kylesayrs kylesayrs merged commit 6ad6e05 into main Aug 28, 2024
4 of 7 checks passed
@kylesayrs kylesayrs deleted the kylesayrs/activation-ordering branch August 28, 2024 21:18
markmc pushed a commit to markmc/llm-compressor that referenced this pull request Nov 13, 2024