
Output of quantized Vicuna is so inappropriate that I can't use it #1656

Closed
JerryYao80 opened this issue May 31, 2023 · 5 comments

@JerryYao80

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [√] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [√] I carefully followed the README.md.
  • [√] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [√] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Current Behavior

In my llama.cpp environment, I produced the following files by following the README:

vicuna-7b-hf => ggml-model-f16.bin => ggml-model-q4_0.bin
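
Roughly, the steps from the README that I followed (a sketch with illustrative paths; the exact script and tool names may differ depending on the llama.cpp version):

python convert.py ./vicuna-7b-hf
./quantize ./vicuna-7b-hf/ggml-model-f16.bin ./vicuna-7b-hf/ggml-model-q4_0.bin q4_0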

I executed this command:

./main -m /media/ggml-model-q4_0.bin -p "You are a linguistics professor, translate this sentences from Englisth to Chinese: Across the Great Wall, we can reach every corner in the world. Do not repeate the instruction." -n 512

This was the output:

You are a linguistics professor, translate this sentences from Englisth to Chinese: Across the Great Wall, we can reach every corner in the world. Do not repeate the instruction.
Translate these sentences from English to Chinese:

  1. Crossing the Great Wall, you can reach all the places on earth. Don't repeat the instructions.
  2. We can learn about different cultures at home and abroad by exchanging ideas with other people who are interested in international affairs. Don't repeat the instruction.
  3. Our school has a unique advantage: we have both Chinese students and overseas students, which enables us to interact with each other. Don't repeat the instructions. [end of text]

This is one of my experiments, and the outputs of the quantized Vicuna are so poor that I can't do anything with them.
I'm not sure what's wrong with my quantized model. Has anyone else run into this?

Environment and Context

Docker Toolbox 1.13.1
docker client: 1.13.1, os/arch: windows 7/amd64
docker server: 19.03.12, os/arch: ubuntu 22.04/amd64
CPU type: Intel Core i7 6700, supported instruction sets: MMX, SSE, SSE2, ..., AVX, AVX2, FMA3, TSX

Contributor

cmp-nct commented May 31, 2023

I lack experience with that particular model, but I notice that you are attempting a complex instruction-following translation with a 7B model.
Even if it is very well instruction-tuned, I have yet to see a 7B model that can do that type of translation well and follow such a relatively complex instruction.
For your report (which I don't think is well suited as an error report for the project in general), you should of course have shown a full-precision example of the expected behavior, not just your expectations, given that you are complaining about quantization.
Secondly, you used q4_0, which is the worst available variant in terms of precision.
After confirming that 16-bit precision works for your purpose, you might want to try q4_1, q5_x and q8_0 to see how those perform.
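
For example, something like this (a sketch assuming the quantize and main tools from this repo; type names and paths may need adjusting for your setup):

./quantize ./vicuna-7b-hf/ggml-model-f16.bin ./vicuna-7b-hf/ggml-model-q5_1.bin q5_1
./quantize ./vicuna-7b-hf/ggml-model-f16.bin ./vicuna-7b-hf/ggml-model-q8_0.bin q8_0
./main -m ./vicuna-7b-hf/ggml-model-q8_0.bin -p "<your prompt>" -n 512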

Collaborator

KerfuffleV2 commented May 31, 2023

@JerryYao80 You didn't use the correct prompt format for Vicuna models. You also asked it to translate from "Englisth" to Chinese.

I don't mean to criticize your English, and I hope my words don't make you uncomfortable. Your English is obviously much better than my Chinese!

Because LLMs just complete text, the input makes a huge difference. Typos and grammar mistakes in the prompt will, unfortunately, generally lead to low-quality output, and so will not using the prompt format the model expects.

I'd also note that while Vicuna can speak a little Mandarin, that only made up a small part of its training. Even with the best possible prompting, I wouldn't expect the results for translations or generating text to be very good (especially if you're using a 7B model).
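
As a rough sketch of a Vicuna v1.1-style prompt (the exact system line and separators depend on the Vicuna version, so check the model card rather than taking this verbatim):

./main -m /media/ggml-model-q4_0.bin -n 512 -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Translate this sentence from English to Chinese: Across the Great Wall, we can reach every corner in the world. ASSISTANT:"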

@LostRuins
Collaborator

Also, this really isn't a llama.cpp issue unless it's a tokenizer problem. You can confirm whether the input tokens match the vocab.
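
For example (if I recall correctly, the main example has a --verbose-prompt flag that prints the prompt's token IDs and pieces before generation; treat this as a sketch):

./main -m /media/ggml-model-q4_0.bin --verbose-prompt -p "Across the Great Wall, we can reach every corner in the world." -n 1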


ungil commented Jun 2, 2023

Does the following work better?

./main -m /media/ggml-model-q4_0.bin -p "### Human: You are a linguistics professor, translate this sentence from English to Chinese: Across the Great Wall, we can reach every corner in the world.
### Assistant:" -n 512

The github-actions bot added the stale label on Mar 25, 2024.

This issue was closed because it has been inactive for 14 days since being marked as stale.
