[User] Strange results with Apple Silicon GPU when instruction mode is on #1695

Closed · 4 tasks done
ymcui opened this issue Jun 5, 2023 · 8 comments

@ymcui
Contributor

ymcui commented Jun 5, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Current Behavior

I am using the latest llama.cpp, which supports Apple Silicon GPU decoding (#1642).
It is indeed faster than CPU-only inference (about a 50% speedup on my M1 Max).
However, I ran into a problem in instruction mode with Alpaca models (specifically, the Chinese-Alpaca model).
The examples below all use a fixed seed (42) and share the same decoding hyperparameters and prompt.
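
The two runs differ only in the -ngl flag. A minimal sketch of the comparison (the model path here is a placeholder for my actual Chinese-Alpaca setup; the exact LLaMA-7B command is given further down):

# CPU-only run
./main -m models/chinese-alpaca-q4_0.bin -f ./prompts/alpaca.txt -ins --seed 42
# Identical run with Metal GPU offload
./main -m models/chinese-alpaca-q4_0.bin -f ./prompts/alpaca.txt -ins --seed 42 -ngl 1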

Inference without GPU:

> good morning
Good morning! How can I help you today?
> can you explain what is llama?
Llama refers to the animal of the same name, which is a South American camelid native to Chile and Argentina. It's known for its long, curly coat and friendly nature.
> ok, how about alpaca?
Alpaca is another species of South American camelids, which are found in Peru, Bolivia, Chile, Argentina, and Uruguay. Like the llama, it has a long, curly coat and is known for its friendly nature.
> and guanaco?
Guanaco is another species of South American camelids, which are found in Chile, Argentina, and Bolivia. It's similar to the llama and alpaca, but it's a bit smaller and has shorter hair.

Inference with GPU (-ngl 1):

> good morning
Good morning! How can I help you today?
> can you explain what is llama?
Of course, lllamas are large animals that belong to the camel family. They have long and thick hair and are found mainly in South America at high altitudes. They are known for their social behavior and herding abilities, and can live up to 25 years old.
> ok, how about alpaca?
I'm sorry to break it to you like this. What if the yarn-making?
> and guanaco?
Guide on acetacus1caroset
I'm asistle

As we can see, the first response is identical, the following responses diverge, and the third and fourth GPU responses are garbled in a way that clearly should not happen.

I am not sure whether this is a model-related issue or a general issue introduced by the new feature (GPU support through Metal).
It would be appreciated if someone could also test other Alpaca-like models with GPU inference (on Apple Silicon) in instruction mode.

----update----

I also tested the original LLaMA-7B in instruction mode (I know the model is not intended for instruction following; I am just using it for debugging), and the issue still occurs.

command:

./main -m models/7B/ggml-model-q4_0.bin --color -f ./prompts/alpaca.txt -ins -b 24 -c 2048 -n 512 -t 8 --temp 0.5 --top_k 40 --top_p 0.9 --repeat_penalty 1.1 --seed 42
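
For reference, the same command annotated (flag meanings per llama.cpp's ./main --help at the time):

# -m      path to the q4_0-quantized model
# -f      initial prompt file; -ins enables instruction (Alpaca-style) mode
# -b/-c/-n/-t  batch size, context length, tokens to generate, CPU threads
# --temp/--top_k/--top_p/--repeat_penalty  sampling hyperparameters
# --seed  fixed RNG seed so the CPU and GPU runs are comparable
./main -m models/7B/ggml-model-q4_0.bin --color -f ./prompts/alpaca.txt -ins \
  -b 24 -c 2048 -n 512 -t 8 --temp 0.5 --top_k 40 --top_p 0.9 \
  --repeat_penalty 1.1 --seed 42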

Inference without GPU:

> good morning
Good Morning!

### Instruction:


> can you explain what is llama?
Sure, Llama is a South American camelid.
> ok, how about alpaca?
Alpacas are in the same family as llamas and camels. They have been domesticated for at least 6000 years and were once used to weave clothing.


### Instruction:


> and guanaco?
Guanaco is a wild South American relative of the llama that was also domesticated by ancient peoples. It is now endangered because its habitat has been destroyed.


### Instruction:

Inference with GPU (-ngl 1):

> good morning
Good Morning!

### Instruction:


> can you explain what is llama?
llamas are small, furry animals. they live in the highlands of peru and are raised for their wool.


### Instruction:


> ok, how about alpaca?



##### Response:
## Response:

can you explain llama?




can you explain
 can explain llama?
 can you explain llama?
> and guanaco?
and babo
## Response:
#### Response:
## Response:
Response:
and# Response:
#### Response:
#### Response:
## Response:
#### Response:
#### Response:
#### Response:
#### Response:
Answer: The
#### Answer: The answer is "a"
#### The answer to the first one is "cannot be determined by any means." It's the last part of the puzzle. That's
# Response:
#### The first two words in this answer are
Answer: In answer, the second word begins with
#### Answer: This answer has three words. The third element is
#### The final one contains four letters.

This suggests that the issue is not tied to a specific model, but to something elsewhere in the stack, presumably the new Metal path.
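
One way to narrow this down would be a near-greedy, seeded run on both backends followed by a diff of the transcripts; where they first diverge hints at where the GPU path drifts numerically. A hypothetical debugging aid (not something I have run yet; --top_k 1 makes sampling effectively greedy):

./main -m models/7B/ggml-model-q4_0.bin -p "Llamas are" -n 128 --top_k 1 --seed 42 > cpu.txt
./main -m models/7B/ggml-model-q4_0.bin -p "Llamas are" -n 128 --top_k 1 --seed 42 -ngl 1 > gpu.txt
diff cpu.txt gpu.txt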

@x4080

x4080 commented Jun 5, 2023

@ymcui I don't see any speed difference on my M2 Pro; CPU and GPU are almost the same, and the CPU may even be slightly faster. Interesting that the Max-class GPU makes such a difference, then.

So far I haven't noticed any difference in results with or without the GPU. Have you tried another model? I'm using airoboros-7b-gpt4.ggmlv3.q4_0.bin.

@ymcui
Contributor Author

ymcui commented Jun 5, 2023

@x4080 Here are preliminary results on our Chinese-Alpaca-Plus models (q4_0 quantized). As you can see, the speedup is promising.

                         Plus-7B    Plus-13B   33B
Original speed (-t 8)    41 ms/tok  77 ms/tok  179 ms/tok
New speed (-t 8 -ngl 1)  28 ms/tok  49 ms/tok  failed

The reported speeds are based on eval_time.
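(That works out to a 41/28 ≈ 1.46× speedup on Plus-7B and 77/49 ≈ 1.57× on Plus-13B, consistent with the ~50% figure I mentioned above.)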

> So far I haven't noticed any difference in results with or without the GPU. Have you tried another model? I'm using airoboros-7b-gpt4.ggmlv3.q4_0.bin.

Yeah, I am actively seeking other models for testing.

@x4080

x4080 commented Jun 5, 2023

@ymcui Cool.
On my M2 (eval_time as well):
with -ngl 1: 49.62 ms per token
without: 44.62 ms per token

./main -m ./models/airoboros-7b-gpt4.ggmlv3.q4_0.bin -p "I believe the meaning of life is" --ignore-eos -n 64 -ngl 1

In case you want to try it :)

@data-angel
Contributor

I've seen this happen in some ad hoc chatbot testing tonight. There's definitely an inference bug - the conversation goes completely off the rails after a few rounds. I've tested it with 7B, 13B, 30B, and 65B on an M2 Max.

@data-angel
Contributor

Miku.sh, 30B q4_0, CPU inference:
[screenshot: 2023-06-04, 11:37 PM]

Miku.sh, 30B q4_0, GPU inference:
[screenshot: 2023-06-04, 11:41 PM]

Just starts spiraling out of control after a few turns.

@ggerganov
Member

Should be fixed now
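
(For anyone picking up the fix: pull the latest master and rebuild; assuming the Makefile build, the Metal backend is enabled with the LLAMA_METAL=1 switch from #1642.)

git pull
make clean
LLAMA_METAL=1 make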

@x4080

x4080 commented Jun 5, 2023

Wow that is fast

@data-angel
Contributor

Tested and verified on my end. Thanks, @ggerganov!
