
vllm output garbled characters #665

Closed
dcdmm opened this issue Dec 3, 2024 · 4 comments · Fixed by #671

Comments


dcdmm commented Dec 3, 2024

vllm outputs garbled characters:

from vllm import LLM, SamplingParams

model_yy = "/home/code/inference/Qwen2.5-1.5B-Instruct-CAWQ"
llm = LLM(model=model_yy, quantization='awq')
sampling_params = SamplingParams()  # sampling settings; original definition not shown
outputs = llm.generate(["hello!"], sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

[screenshot: vllm generation output with garbled characters]

But AutoAWQ outputs normally:
[screenshot: AutoAWQ generation output]

Why?

@casper-hansen (Owner)

@dcdmm it's tough to say what causes this. It may be the kernels they use. Either way, please go and raise an issue in their repository.


dcdmm commented Dec 3, 2024

> @dcdmm it's tough to say what causes this. It may be the kernels they use. Either way, please go and raise an issue in their repository.

The model quantized by the Qwen team is 1.6 GB and works normally with vllm, but my quantized model is only 1.1 GB. Here is my quantization code:

from datasets import load_from_disk
from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM

model_path = "Qwen/Qwen2.5-1.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto", safetensors=True)


def load_dolly():
    data = load_from_disk("databricks_databricks-dolly-15k_train")

    def concatenate_data(x):
        msg = [
            {"role": "system", "content": x['instruction']},
            {"role": "user", "content": x['context']},
            {"role": "assistant", "content": x['response']}
        ]
        text = tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=False)
        return {"text": text.strip()}

    concatenated = data.map(concatenate_data)
    return [text for text in concatenated["text"]]


quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config,
               # Custom calibration data (note: samples longer than 512 tokens are currently discarded)
               calib_data=load_dolly())
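
The snippet above stops at quantize; the quantized model was presumably then written to disk with AutoAWQ's usual save helpers, roughly like this (the output path is an assumption based on the vllm snippet earlier in the thread):

quant_path = "/home/code/inference/Qwen2.5-1.5B-Instruct-CAWQ"  # assumed output path
model.save_quantized(quant_path)       # writes the AWQ-quantized safetensors checkpoint
tokenizer.save_pretrained(quant_path)  # writes the tokenizer files alongside it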


jklj077 commented Dec 4, 2024

@casper-hansen

Hi, apologies for reusing this issue, but we have been receiving similar feedback. It appears that since autoawq v0.2.7, lm_head is saved instead of model.embed_tokens if the model uses tied word embeddings.

Here is a brief summary:

Model Files                               model.embed_tokens   lm_head
Qwen/Qwen2.5-1.5B-Instruct                yes                  no
transformers==4.46.3 load and save        yes                  no
autoawq==0.2.6 quantize and save          yes                  yes
autoawq==0.2.7.post3 quantize and save    no                   yes
expected                                  yes                  yes or no
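
(For reference, one way to reproduce the table above for a given checkpoint is to list the tensor names in its safetensors file; the path below is illustrative.)

import os
import safetensors

quant_path = "my-awesome-awq-model"  # illustrative checkpoint directory
with safetensors.safe_open(
    os.path.join(quant_path, "model.safetensors"), framework="pt", device="cpu"
) as f:
    keys = set(f.keys())
print("model.embed_tokens.weight present:", "model.embed_tokens.weight" in keys)
print("lm_head.weight present:", "lm_head.weight" in keys)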

This behaviour of autoawq v0.2.7 appears to be the cause of several downstream compatibility problems, including the one reported in this issue.

While this can be mitigated by running the following script (for Qwen models):

import os
import safetensors
import safetensors.torch

quant_path = "my-awesome-awq-model"

# Read all tensors, renaming lm_head.weight back to model.embed_tokens.weight
tensors = {}
with safetensors.safe_open(
    os.path.join(quant_path, "model.safetensors"), framework="pt", device="cpu"
) as f:
    for k in f.keys():
        nk = "model.embed_tokens.weight" if k == "lm_head.weight" else k
        tensors[nk] = f.get_tensor(k)

# Keep the original file as a backup, then write the fixed checkpoint
os.rename(
    os.path.join(quant_path, "model.safetensors"),
    os.path.join(quant_path, "model.safetensors.bak"),
)
safetensors.torch.save_file(tensors, os.path.join(quant_path, "model.safetensors"))

It would be better if autoawq could provide a compatible solution.

@casper-hansen (Owner)

@jklj077 huggingface/transformers#35080
Please see the transformers issue for more details. I will try to follow up to see if we can land an official bug fix for this.
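
As a side note, whether a given model ties its word embeddings (and therefore needs only one of the two tensors on disk) can be read from its config; a minimal sketch using transformers:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
# Qwen2.5-1.5B-Instruct ties model.embed_tokens and lm_head, which is why
# the checkpoint only needs to store one of the two tensors.
print(config.tie_word_embeddings)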
