vllm output garbled characters #665
Comments
@dcdmm It's tough to say what causes this. It may be the kernels they use. Either way, please go and raise an issue in their repository.
The model quantized by Qwen is 1.6 GB and works normally with vLLM, but my quantized model is only 1.1 GB. Here is my quantization code:

from awq import AutoAWQForCausalLM
from datasets import load_from_disk
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto", safetensors=True)

def load_dolly():
    data = load_from_disk("databricks_databricks-dolly-15k_train")

    def concatenate_data(x):
        msg = [
            {"role": "system", "content": x['instruction']},
            {"role": "user", "content": x['context']},
            {"role": "assistant", "content": x['response']}
        ]
        text = tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=False)
        return {"text": text.strip()}

    concatenated = data.map(concatenate_data)
    return [text for text in concatenated["text"]]

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model.quantize(tokenizer, quant_config=quant_config,
               # Custom data (note that all samples longer than 512 tokens are currently discarded)
               calib_data=load_dolly())
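(The save step is not shown above. A minimal sketch of how the quantized checkpoint is typically written out with AutoAWQ follows; the output directory name is assumed from the vLLM snippet later in this issue.)

# Assumed save step, not part of the original snippet; the directory name is illustrative.
quant_path = "Qwen2.5-1.5B-Instruct-CAWQ"
model.save_quantized(quant_path)       # writes model.safetensors plus the quantization config
tokenizer.save_pretrained(quant_path)  # writes the tokenizer files alongside the weights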
Hi, apologies for reusing this issue, but we have been receiving similar feedback. It appears the behaviour of saving checkpoints for models with tied word embeddings has changed. Here is a brief summary:

For Qwen2.5-1.5B-Instruct, the quantized model.safetensors ends up containing lm_head.weight but no model.embed_tokens.weight. vLLM loads the embeddings for these models from model.embed_tokens.weight, so the missing tensor appears to be why it produces garbled output.
While this could be simply mitigated by running the following script (for Qwen models):

import os

import safetensors
import safetensors.torch  # needed for safetensors.torch.save_file

quant_path = "my-awesome-awq-model"

# Copy every tensor, renaming lm_head.weight to model.embed_tokens.weight.
tensors = {}
with safetensors.safe_open(
    os.path.join(quant_path, "model.safetensors"), framework="pt", device="cpu"
) as f:
    for k in f.keys():
        nk = "model.embed_tokens.weight" if k == "lm_head.weight" else k
        tensors[nk] = f.get_tensor(k)

# Keep the original file as a backup and write the fixed checkpoint in its place.
os.rename(
    os.path.join(quant_path, "model.safetensors"),
    os.path.join(quant_path, "model.safetensors.bak"),
)
safetensors.torch.save_file(tensors, os.path.join(quant_path, "model.safetensors"))

It would be better if AutoAWQ could provide a compatible solution.
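As a quick sanity check (a minimal sketch, reusing the same quant_path placeholder as the script above), you can list the tensor names in the rewritten file and confirm the rename took effect:

import os

import safetensors

quant_path = "my-awesome-awq-model"
with safetensors.safe_open(
    os.path.join(quant_path, "model.safetensors"), framework="pt", device="cpu"
) as f:
    keys = set(f.keys())

print("model.embed_tokens.weight" in keys)  # expected: True after the rename
print("lm_head.weight" in keys)             # expected: False after the rename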
@jklj077 huggingface/transformers#35080
vLLM outputs garbled characters:

from vllm import LLM, SamplingParams

model_yy = "/home/code/inference/Qwen2.5-1.5B-Instruct-CAWQ"
llm = LLM(model=model_yy, quantization='awq')

# sampling_params was not defined in the original snippet; the values below are illustrative.
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=128)

outputs = llm.generate(["hello!"], sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
But generating with AutoAWQ directly produces normal output. Why?
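For reference, a minimal sketch of what generating with AutoAWQ directly might look like, assuming the same quantized checkpoint path as the vLLM snippet (the generation settings are illustrative, not from the original post):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "/home/code/inference/Qwen2.5-1.5B-Instruct-CAWQ"  # same path as in the vLLM snippet
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path)

# Tokenize a short prompt and generate on the GPU.
input_ids = tokenizer("hello!", return_tensors="pt").input_ids.cuda()
output_ids = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))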