Llama-3-8B model dumped by LMQuant in the W4A8 setting raises errors when running the e2e benchmark in QServe. #29

Open
Patrick-Lew opened this issue Aug 12, 2024 · 1 comment

@Patrick-Lew

I dumped the quantized Llama-3-8B model from LMQuant using QoQ, with the following command from lmquant/projects/llm/scripts/qoq.sh:

```bash
# QoQ (W4A8KV4 with per-channel weight quantization) on Llama3-8B
python -m lmquant.llm.run configs/llm.yaml configs/qoq/gchn.yaml \
    --model-name llama3-8b \
    --smooth-xw-alpha 0.05 --smooth-xw-beta 0.95 \
    --smooth-yx-strategy GridSearch --smooth-yx-beta " -2"
```

I appended --save-model and --model-path to that command.
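For completeness, the full dump command looked roughly like the sketch below. The two paths are placeholders for my local directories, and the exact argument form of each flag follows the LMQuant CLI:

```bash
# QoQ command from above with the two save flags appended.
# NOTE: /path/to/... are placeholders; how --model-path and --save-model
# take their values follows the LMQuant CLI, not anything specific here.
python -m lmquant.llm.run configs/llm.yaml configs/qoq/gchn.yaml \
    --model-name llama3-8b \
    --smooth-xw-alpha 0.05 --smooth-xw-beta 0.95 \
    --smooth-yx-strategy GridSearch --smooth-yx-beta " -2" \
    --model-path /path/to/Meta-Llama-3-8B \
    --save-model /path/to/llama3-8b-w4a8kv4-dump
```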
Then I ran the checkpoint_convert.py script in QServe to get the converted checkpoint.

[Screenshot 2024-08-12 at 12:07:22 PM: checkpoint conversion command and output]

Then I ran the e2e benchmark with this command:

[Screenshot 2024-08-12 at 12:07:56 PM: e2e benchmark command]

but the results look like this:

[Screenshot 2024-08-12 at 12:08:26 PM: error output]

I want to know whether this e2e benchmark can only be run on the Llama-3-8B-Instruct model that you kindly provided in the Hugging Face repo. I also tried running the e2e benchmark on other models such as Llama-3-8B (non-instruct), but it raises the same error as above.

Thanks.
Patrick

@ys-2020
Contributor

ys-2020 commented Oct 1, 2024

Hi,

Thanks for your interest in QServe! We suggest using instruction-tuned models for e2e generation to get robust outputs, since the current conversation template is designed for instruct models.
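For context, instruction-tuned Llama-3 checkpoints expect prompts wrapped in a chat layout along the lines of the sketch below (this is the generic Llama-3 instruct format, shown only for illustration; QServe's actual conversation template may differ in its details). A base, non-instruct model was not trained on these special tokens, so driving it through a chat template tends to produce degraded or garbled generations:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```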
