quantization + sparsification - model outputs zeros #942
Comments
@nirey10 Hi! Can you share how you’re running the model, and share the model config?
Hey, exactly like the quantization_2of4_sparse_w4a16 example but without the finetune stage, and I used the llama3.1-8b-instruct model instead. I am running the model with `vllm serve` and using the /v1/completions and /v1/chat/completions routes. By the way, I even took one of the models uploaded to HF (which I believe uses this code), neuralmagic/Sparse-Llama-3.1-8B-ultrachat_200k-2of4-quantized.w4a16, and it outputs only '!' when I do chat completions, which is the exact phenomenon I get.
Hi @nirey10, if you’re running generation using vLLM, can you try setting the dtype to float16?
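As a reference, here is a minimal sketch of forcing float16 with vLLM's offline API (the model path is illustrative; for the server, the equivalent is `vllm serve <model> --dtype float16`):

```python
from vllm import LLM, SamplingParams

# Illustrative path: point this at the compressed model directory or HF repo id.
llm = LLM(model="path/to/compressed-model", dtype="float16")

params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(["San Francisco is"], params)
print(outputs[0].outputs[0].text)
```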
@dsikka Running with float16 actually fixed the released model from HF, but my own compressed model still outputs nonsense (at least not '!').
Hi @nirey10, can you share the code you're using when running the model on vLLM?
What version of vLLM is being used? We fixed some issues with the kernel in vLLM recently:
Hey, eventually I was able to run it with the YAML recipe, but with llmcompressor==0.1.0 and its corresponding 2:4 example from git. After some experiments I found that the fine-tuning stage is crucial for decent outputs, despite the fact that the original SparseGPT can provide decent results without fine-tuning. To sum it up, --dtype float16 actually helped with the results; I think it should be mentioned in the README.

I also think there is a bug in the current 2:4 sparsification and quantization example: the model output path from each stage is not passed on through the pipeline. Instead of taking the sparse model into the finetuning stage, it looks for the model name from the input, which does not exist locally. Thanks for the help!
Hi @nirey10, @dsikka, @robertgshaw2-neuralmagic, after simply running `python llama7b_sparse_w4a16.py`, following the README https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_2of4_sparse_w4a16, I got three folders: stage_finetuning, stage_quantization, stage_sparsity. I then run,
The finetuning stage output will not be compressed. You'll need to pass a model that has been quantized or sparsified, such as the output from stage_quantization.
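For example, a hedged sketch of pointing vLLM at the quantization-stage output (the directory path is illustrative and depends on the output_dir used for the run):

```python
from vllm import LLM, SamplingParams

# Illustrative path: the per-stage folders are written under the run's output_dir.
llm = LLM(model="path/to/output/stage_quantization", dtype="float16")

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```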
Thanks for your prompt response! :D
@dsikka Hi Dipika, we tried loading the model output from stage_sparsity in the example using vllm.LLM, but still got KeyError: 'config_groups'. The model is "neuralmagic/Llama-2-7b-ultrachat200k" passed through
We checked the model weights in the safetensors file:
Hi @XinShuo-ph, this model has been compressed using the Bitmask compressor, for which there is no support yet in vLLM. vLLM currently supports 2:4 models saved with the dense compressor.
@nirey10 Thanks for clarifying. Could you please give an example of using llm-compressor to prune an LLM? We only want an LLM after stage_sparsity (only after pruning), but we do not know how to write a correct recipe to achieve this goal. Thanks!
The following example produces a model that has had sparsity applied to it, following a 2:4 structure. The compressed model produced can run on vLLM:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Select model and load it.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)

# Sparsity-only recipe: a single oneshot stage applying SparseGPT with a 2:4 mask.
recipe = """
sparsity_stage:
  run_type: oneshot
  sparsity_modifiers:
    SparseGPTModifier:
      sparsity: 0.5
      mask_structure: "2:4"
"""

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Confirm generations of the sparsified model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk (save_compressed=False keeps the dense weight format).
SAVE_DIR = MODEL_ID.split("/")[1] + "-Sparse2of4"
model.save_pretrained(SAVE_DIR, save_compressed=False)
tokenizer.save_pretrained(SAVE_DIR)
```

Running on vLLM (with the latest release):

```python
from vllm import LLM, SamplingParams

# `model_dir` is the SAVE_DIR from the script above; `prompts` is a list of strings.
sampling_params = SamplingParams(temperature=0.80, top_p=0.95)
llm = LLM(model=model_dir)
outputs = llm.generate(prompts, sampling_params)
```

You can update the recipe to your desired sparsity by changing the `sparsity` value.
Thanks so much @dsikka! It is helpful! Two quick questions: in your code you wrote
Thank you so much for your detailed explanation! :D
@nirey10 Was your issue resolved?
@dsikka Hi Dipika, thanks so much for your previous example, it works in our case :D But we found the performance was not good after simply pruning, so we want to do fine-tuning after pruning (as in this example https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_2of4_sparse_w4a16, but without quantization). Our current workflow is below:
and then we can get a directory. Now, we want to load
Does the workflow look correct to you? Our current issue is that we encounter warnings like "Some weights of LlamaForCausalLM were not initialized from the model checkpoint at src/stage_sparsity and are newly initialized:" and "Some weights of the model checkpoint at src/stage_sparsity were not used when initializing LlamaForCausalLM:". Thanks!
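As an aside, one way to sanity-check whether the reloaded stage_sparsity checkpoint actually kept its 2:4 pattern; the layer picked here is just an example for a Llama-style model, and the path is taken from the warnings above:

```python
from transformers import AutoModelForCausalLM

# Adjust the path to your own run's sparsity-stage output.
model = AutoModelForCausalLM.from_pretrained("src/stage_sparsity", torch_dtype="auto")

# In a 2:4 pattern, every contiguous group of 4 weights along the input
# dimension should contain at most 2 nonzeros.
w = model.model.layers[0].self_attn.q_proj.weight.detach()
groups = w.reshape(-1, 4)
print("max nonzeros per group of 4:", (groups != 0).sum(dim=1).max().item())
print("overall sparsity:", (w == 0).float().mean().item())
```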
Describe the bug
When running quantization (GPTQ) after sparsification (SparseGPT 2:4), the model's accuracy and perplexity degrade severely and it outputs only zeros.
Expected behavior
A reasonable text completion.
Environment
To Reproduce
model: llama-3.1-8b-instruct
Recipe:

```yaml
sparsity_stage:
  run_type: oneshot
  sparsity_modifiers:
    SparseGPTModifier:
      sparsity: 0.5
      mask_structure: "2:4"
      sequential_update: false
quantization_stage:
  run_type: oneshot
  quantization_modifiers:
    GPTQModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "channel"
          targets: ["Linear"]
```
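For reference, a hedged sketch of how a staged recipe like this might be driven with `oneshot`, mirroring the call pattern shown earlier in the thread; the model id, dataset preparation, and output directory are placeholders rather than the reporter's exact script:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder for the model under test
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data, prepared the same way as in the example earlier in the thread.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(512))
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda s: tokenizer(s["text"], padding=False, max_length=2048, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

oneshot(
    model=model,
    dataset=ds,
    recipe="recipe.yaml",  # the staged recipe above, saved to a file
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="output_2of4_w4a16",  # illustrative output path
)
```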
Additional context
Quantization alone (GPTQ) provides reasonable results (though not as good as auto_gptq), but when combining it with 2:4 sparsification first, the model outputs only zeros (or '!').
The only things that differ from your example are the model and the lack of fine-tuning.
I served the model with vLLM and asked for a simple completion like "san fransisco is:".
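For completeness, a hedged sketch of that kind of completion request against a running `vllm serve` instance (host, port, and served model name are assumptions):

```python
import requests

# Assumes something like `vllm serve <model-path> --dtype float16` is running locally.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "llama-3.1-8b-instruct",  # must match the served model name
        "prompt": "San Francisco is",
        "max_tokens": 32,
        "temperature": 0.0,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["text"])
```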