
vllm output garbled characters #665

Closed
dcdmm opened this issue Dec 3, 2024 · 4 comments · Fixed by #671

Comments


dcdmm commented Dec 3, 2024

vllm outputs garbled characters:

from vllm import LLM, SamplingParams

model_yy = "/home/code/inference/Qwen2.5-1.5B-Instruct-CAWQ"
llm = LLM(model=model_yy, quantization='awq')
sampling_params = SamplingParams()  # sampling settings; original definition not shown
outputs = llm.generate(["hello!"], sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

[screenshot: vllm generation output with garbled characters]

But AutoAWQ outputs normally:
[screenshot: AutoAWQ generation output]

Why?

@casper-hansen (Owner)

@dcdmm it's tough to say what causes this. It may be the kernels they use. Either way, please go and raise an issue in their repository.


dcdmm commented Dec 3, 2024

> @dcdmm it's tough to say what causes this. It may be the kernels they use. Either way, please go and raise an issue in their repository.

The model quantized by the Qwen team is 1.6 GB and works normally with vllm, but my quantized model is only 1.1 GB. Here is my quantization code:

from datasets import load_from_disk
from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM

model_path = "Qwen/Qwen2.5-1.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto", safetensors=True)


def load_dolly():
    data = load_from_disk("databricks_databricks-dolly-15k_train")

    def concatenate_data(x):
        msg = [
            {"role": "system", "content": x['instruction']},
            {"role": "user", "content": x['context']},
            {"role": "assistant", "content": x['response']}
        ]
        text = tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=False)
        return {"text": text.strip()}

    concatenated = data.map(concatenate_data)
    return [text for text in concatenated["text"]]


quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config,
               # Custom calibration data (note: samples longer than 512 tokens are currently discarded)
               calib_data=load_dolly())
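
The snippet above stops at quantize; the quantized model was presumably then written to disk with AutoAWQ's usual save helpers, roughly like this (the output path is an assumption based on the vllm snippet earlier in the thread):

quant_path = "/home/code/inference/Qwen2.5-1.5B-Instruct-CAWQ"  # assumed output path
model.save_quantized(quant_path)       # writes the AWQ-quantized safetensors checkpoint
tokenizer.save_pretrained(quant_path)  # writes the tokenizer files alongside it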


jklj077 commented Dec 4, 2024

@casper-hansen

Hi, apologies for reusing this issue, but we have been receiving similar feedback. It appears that since autoawq v0.2.7, lm_head is saved instead of model.embed_tokens if the model uses tied word embeddings.

Here is a brief summary:

Model Files                               model.embed_tokens   lm_head
Qwen/Qwen2.5-1.5B-Instruct                yes                  no
transformers==4.46.3 load and save        yes                  no
autoawq==0.2.6 quantize and save          yes                  yes
autoawq==0.2.7.post3 quantize and save    no                   yes
expected                                  yes                  yes or no
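
(For reference, one way to reproduce the table above for a given checkpoint is to list the tensor names in its safetensors file; the path below is illustrative.)

import os
import safetensors

quant_path = "my-awesome-awq-model"  # illustrative checkpoint directory
with safetensors.safe_open(
    os.path.join(quant_path, "model.safetensors"), framework="pt", device="cpu"
) as f:
    keys = set(f.keys())
print("model.embed_tokens.weight present:", "model.embed_tokens.weight" in keys)
print("lm_head.weight present:", "lm_head.weight" in keys)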

This behaviour of autoawq v0.2.7 appears to be the cause of several downstream compatibility problems, including the one reported in this issue.

While this can be mitigated by running the following script (for Qwen models):

import os
import safetensors
import safetensors.torch

quant_path = "my-awesome-awq-model"

# Read all tensors, renaming lm_head.weight back to model.embed_tokens.weight
tensors = {}
with safetensors.safe_open(
    os.path.join(quant_path, "model.safetensors"), framework="pt", device="cpu"
) as f:
    for k in f.keys():
        nk = "model.embed_tokens.weight" if k == "lm_head.weight" else k
        tensors[nk] = f.get_tensor(k)

# Keep the original file as a backup, then write the fixed checkpoint
os.rename(
    os.path.join(quant_path, "model.safetensors"),
    os.path.join(quant_path, "model.safetensors.bak"),
)
safetensors.torch.save_file(tensors, os.path.join(quant_path, "model.safetensors"))

It would be better if autoawq could provide a compatible solution.

@casper-hansen (Owner)

@jklj077 huggingface/transformers#35080
Please see the transformers issue for more details. I will try to follow up to see if we can land an official bug fix for this.
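
As a side note, whether a given model ties its word embeddings (and therefore needs only one of the two tensors on disk) can be read from its config; a minimal sketch using transformers:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
# Qwen2.5-1.5B-Instruct ties model.embed_tokens and lm_head, which is why
# the checkpoint only needs to store one of the two tensors.
print(config.tie_word_embeddings)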
