quantization + sparsification - model outputs zeros #942
Comments
@nirey10 Hi! Can you share how you’re running the model, and share the model config?
Hey, exactly like the quantization_2of4_sparse_w4a16 example but without the finetune stage, and I used the llama3.1-8b-instruct model instead. I am running the model with `vllm serve` and using the /v1/completions and /v1/chat/completions routes. By the way, I even took one of the models uploaded to HF (which I believe uses this code), neuralmagic/Sparse-Llama-3.1-8B-ultrachat_200k-2of4-quantized.w4a16, and it outputs only '!' when I do chat completions, which is the exact phenomenon I get.
Hi @nirey10, if you’re running generation using vLLM, can you try setting the dtype to float16?
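As a reference, here is a minimal sketch of forcing float16 with vLLM's offline API (the model path is illustrative; for the server, the equivalent is `vllm serve <model> --dtype float16`):

```python
from vllm import LLM, SamplingParams

# Illustrative path: point this at the compressed model directory or HF repo id.
llm = LLM(model="path/to/compressed-model", dtype="float16")

params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(["San Francisco is"], params)
print(outputs[0].outputs[0].text)
```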
@dsikka Running with float16 actually fixed the released model from HF, but my own compressed model still outputs nonsense (at least not '!').
Hi @nirey10, can you share the code you're using when running the model on vLLM?
What version of vLLM is being used? We fixed some issues with the kernel in vLLM recently:
Hey, eventually I was able to run it with the YAML recipe, but with llmcompressor==0.1.0 and its corresponding 2:4 example from git. After some experiments I found that the fine-tuning stage is crucial for decent outputs, despite the fact that the original SparseGPT can provide decent results without fine-tuning. To sum it up, --dtype float16 actually helped with the results; I think it should be mentioned in the README.

I also think there is a bug in the current 2:4 sparsification and quantization example: the model output path from each stage is not passed on through the pipeline. Instead of taking the sparse model into the finetuning stage, it looks for the model name from the input, which does not exist locally. Thanks for the help!
Hi @nirey10, @dsikka, @robertgshaw2-neuralmagic, after simply running `python llama7b_sparse_w4a16.py`, following the README https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_2of4_sparse_w4a16, I got three folders: stage_finetuning, stage_quantization, stage_sparsity. I then run,
The finetuning stage output will not be compressed. You'll need to pass a model that has been quantized or sparsified, such as the output from stage_quantization.
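For example, a hedged sketch of pointing vLLM at the quantization-stage output (the directory path is illustrative and depends on the output_dir used for the run):

```python
from vllm import LLM, SamplingParams

# Illustrative path: the per-stage folders are written under the run's output_dir.
llm = LLM(model="path/to/output/stage_quantization", dtype="float16")

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```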
Thanks for your prompt response! :D
@dsikka Hi Dipika, we tried loading the model output from stage_sparsity in the example using vllm.LLM, but still got KeyError: 'config_groups'. The model is "neuralmagic/Llama-2-7b-ultrachat200k" passed through
We checked the model weights in the safetensors file:
Hi @XinShuo-ph, this model has been compressed using the Bitmask compressor, for which there is no support yet in vLLM. vLLM currently supports 2:4 models saved with the dense compressor.
@nirey10 Thanks for clarifying. Could you please give an example of using llm-compressor to prune an LLM? We only want an LLM after stage_sparsity (only after pruning), but we do not know how to write a correct recipe to achieve this goal. Thanks!
The following example produces a model that has had sparsity applied to it, following a 2:4 structure. The compressed model produced can run on vLLM:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Select model and load it.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)

# Sparsity-only recipe: a single oneshot stage applying SparseGPT with a 2:4 mask.
recipe = """
sparsity_stage:
  run_type: oneshot
  sparsity_modifiers:
    SparseGPTModifier:
      sparsity: 0.5
      mask_structure: "2:4"
"""

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Confirm generations of the sparsified model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk (save_compressed=False keeps the dense weight format).
SAVE_DIR = MODEL_ID.split("/")[1] + "-Sparse2of4"
model.save_pretrained(SAVE_DIR, save_compressed=False)
tokenizer.save_pretrained(SAVE_DIR)
```

Running on vLLM (with the latest release):

```python
from vllm import LLM, SamplingParams

# `model_dir` is the SAVE_DIR from the script above; `prompts` is a list of strings.
sampling_params = SamplingParams(temperature=0.80, top_p=0.95)
llm = LLM(model=model_dir)
outputs = llm.generate(prompts, sampling_params)
```

You can update the recipe to your desired sparsity by changing the `sparsity` value.
Thanks so much @dsikka! It is helpful! Two quick questions: in your code you wrote
Thank you so much for your detailed explanation! :D
@nirey10 Was your issue resolved?
@dsikka Hi Dipika, thanks so much for your previous example, it works in our case :D But we found the performance was not good after simply pruning, so we want to do fine-tuning after pruning (as in this example https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_2of4_sparse_w4a16, but without quantization). Our current workflow is below:
and then we can get a directory. Now, we want to load
Does the workflow look correct to you? Our current issue is that we encounter warnings like "Some weights of LlamaForCausalLM were not initialized from the model checkpoint at src/stage_sparsity and are newly initialized:" and "Some weights of the model checkpoint at src/stage_sparsity were not used when initializing LlamaForCausalLM:". Thanks!
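As an aside, one way to sanity-check whether the reloaded stage_sparsity checkpoint actually kept its 2:4 pattern; the layer picked here is just an example for a Llama-style model, and the path is taken from the warnings above:

```python
from transformers import AutoModelForCausalLM

# Adjust the path to your own run's sparsity-stage output.
model = AutoModelForCausalLM.from_pretrained("src/stage_sparsity", torch_dtype="auto")

# In a 2:4 pattern, every contiguous group of 4 weights along the input
# dimension should contain at most 2 nonzeros.
w = model.model.layers[0].self_attn.q_proj.weight.detach()
groups = w.reshape(-1, 4)
print("max nonzeros per group of 4:", (groups != 0).sum(dim=1).max().item())
print("overall sparsity:", (w == 0).float().mean().item())
```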
Describe the bug
When running quantization (GPTQ) after sparsification (SparseGPT 2:4), the model's accuracy and perplexity degrade severely and it outputs only zeros.
Expected behavior
A reasonable text completion.
Environment
To Reproduce
model: llama-3.1-8b-instruct
Recipe:

```yaml
sparsity_stage:
  run_type: oneshot
  sparsity_modifiers:
    SparseGPTModifier:
      sparsity: 0.5
      mask_structure: "2:4"
      sequential_update: false
quantization_stage:
  run_type: oneshot
  quantization_modifiers:
    GPTQModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "channel"
          targets: ["Linear"]
```
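For reference, a hedged sketch of how a staged recipe like this might be driven with `oneshot`, mirroring the call pattern shown earlier in the thread; the model id, dataset preparation, and output directory are placeholders rather than the reporter's exact script:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder for the model under test
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data, prepared the same way as in the example earlier in the thread.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(512))
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda s: tokenizer(s["text"], padding=False, max_length=2048, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

oneshot(
    model=model,
    dataset=ds,
    recipe="recipe.yaml",  # the staged recipe above, saved to a file
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="output_2of4_w4a16",  # illustrative output path
)
```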
Additional context
Quantization alone (GPTQ) provides reasonable results (though not as good as auto_gptq), but when combining it with 2:4 sparsification first, the model outputs only zeros (or '!').
The only things that differ from your example are the model and the lack of fine-tuning.
I served the model with vLLM and asked for a simple completion like "san fransisco is:".
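For completeness, a hedged sketch of that kind of completion request against a running `vllm serve` instance (host, port, and served model name are assumptions):

```python
import requests

# Assumes something like `vllm serve <model-path> --dtype float16` is running locally.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "llama-3.1-8b-instruct",  # must match the served model name
        "prompt": "San Francisco is",
        "max_tokens": 32,
        "temperature": 0.0,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["text"])
```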