[User] Strange results with Apple Silicon GPU when instruction mode is on #1695

Closed · 4 tasks done
ymcui opened this issue Jun 5, 2023 · 8 comments

@ymcui
Contributor

ymcui commented Jun 5, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Current Behavior

I am using the latest llama.cpp, which supports Apple Silicon GPU decoding (#1642).
It is indeed faster than CPU-only inference (about a 50% speedup on my M1 Max).
However, I ran into a problem in instruction mode with Alpaca models (specifically, the Chinese-Alpaca model).
The examples below all use a fixed seed (42) and share the same decoding hyperparameters and prompt.
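
The two runs differ only in the -ngl flag. A minimal sketch of the comparison (the model path here is a placeholder for my actual Chinese-Alpaca setup; the exact LLaMA-7B command is given further down):

# CPU-only run
./main -m models/chinese-alpaca-q4_0.bin -f ./prompts/alpaca.txt -ins --seed 42
# Identical run with Metal GPU offload
./main -m models/chinese-alpaca-q4_0.bin -f ./prompts/alpaca.txt -ins --seed 42 -ngl 1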

Inference without GPU:

> good morning
Good morning! How can I help you today?
> can you explain what is llama?
Llama refers to the animal of the same name, which is a South American camelid native to Chile and Argentina. It's known for its long, curly coat and friendly nature.
> ok, how about alpaca?
Alpaca is another species of South American camelids, which are found in Peru, Bolivia, Chile, Argentina, and Uruguay. Like the llama, it has a long, curly coat and is known for its friendly nature.
> and guanaco?
Guanaco is another species of South American camelids, which are found in Chile, Argentina, and Bolivia. It's similar to the llama and alpaca, but it's a bit smaller and has shorter hair.

Inference with GPU (-ngl 1):

> good morning
Good morning! How can I help you today?
> can you explain what is llama?
Of course, lllamas are large animals that belong to the camel family. They have long and thick hair and are found mainly in South America at high altitudes. They are known for their social behavior and herding abilities, and can live up to 25 years old.
> ok, how about alpaca?
I'm sorry to break it to you like this. What if the yarn-making?
> and guanaco?
Guide on acetacus1caroset
I'm asistle

As we can see, the first response is identical, the following responses diverge, and the third and fourth GPU responses are garbled in a way that clearly should not happen.

I am not sure whether this is a model-related issue or a general issue introduced by the new feature (GPU support through Metal).
It would be appreciated if someone could also test other Alpaca-like models with GPU inference (on Apple Silicon) in instruction mode.

----update----

I also tested the original LLaMA-7B in instruction mode (I know the model is not intended for instruction following; I am just using it for debugging), and the issue still occurs.

command:

./main -m models/7B/ggml-model-q4_0.bin --color -f ./prompts/alpaca.txt -ins -b 24 -c 2048 -n 512 -t 8 --temp 0.5 --top_k 40 --top_p 0.9 --repeat_penalty 1.1 --seed 42
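
For reference, the same command annotated (flag meanings per llama.cpp's ./main --help at the time):

# -m      path to the q4_0-quantized model
# -f      initial prompt file; -ins enables instruction (Alpaca-style) mode
# -b/-c/-n/-t  batch size, context length, tokens to generate, CPU threads
# --temp/--top_k/--top_p/--repeat_penalty  sampling hyperparameters
# --seed  fixed RNG seed so the CPU and GPU runs are comparable
./main -m models/7B/ggml-model-q4_0.bin --color -f ./prompts/alpaca.txt -ins \
  -b 24 -c 2048 -n 512 -t 8 --temp 0.5 --top_k 40 --top_p 0.9 \
  --repeat_penalty 1.1 --seed 42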

Inference without GPU:

> good morning
Good Morning!

### Instruction:


> can you explain what is llama?
Sure, Llama is a South American camelid.
> ok, how about alpaca?
Alpacas are in the same family as llamas and camels. They have been domesticated for at least 6000 years and were once used to weave clothing.


### Instruction:


> and guanaco?
Guanaco is a wild South American relative of the llama that was also domesticated by ancient peoples. It is now endangered because its habitat has been destroyed.


### Instruction:

Inference with GPU (-ngl 1):

> good morning
Good Morning!

### Instruction:


> can you explain what is llama?
llamas are small, furry animals. they live in the highlands of peru and are raised for their wool.


### Instruction:


> ok, how about alpaca?



##### Response:
## Response:

can you explain llama?




can you explain
 can explain llama?
 can you explain llama?
> and guanaco?
and babo
## Response:
#### Response:
## Response:
Response:
and# Response:
#### Response:
#### Response:
## Response:
#### Response:
#### Response:
#### Response:
#### Response:
Answer: The
#### Answer: The answer is "a"
#### The answer to the first one is "cannot be determined by any means." It's the last part of the puzzle. That's
# Response:
#### The first two words in this answer are
Answer: In answer, the second word begins with
#### Answer: This answer has three words. The third element is
#### The final one contains four letters.

This suggests that the issue is not tied to a specific model, but to something elsewhere in the stack, presumably the new Metal path.
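
One way to narrow this down would be a near-greedy, seeded run on both backends followed by a diff of the transcripts; where they first diverge hints at where the GPU path drifts numerically. A hypothetical debugging aid (not something I have run yet; --top_k 1 makes sampling effectively greedy):

./main -m models/7B/ggml-model-q4_0.bin -p "Llamas are" -n 128 --top_k 1 --seed 42 > cpu.txt
./main -m models/7B/ggml-model-q4_0.bin -p "Llamas are" -n 128 --top_k 1 --seed 42 -ngl 1 > gpu.txt
diff cpu.txt gpu.txt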

@x4080

x4080 commented Jun 5, 2023

@ymcui I don't see any speed difference on my M2 Pro; CPU and GPU are almost the same, and the CPU may even be slightly faster. Interesting that the Max-class GPU makes such a difference, then.

So far I haven't noticed any difference in results with or without the GPU. Have you tried another model? I'm using airoboros-7b-gpt4.ggmlv3.q4_0.bin.

@ymcui
Contributor Author

ymcui commented Jun 5, 2023

@x4080 Here are preliminary results on our Chinese-Alpaca-Plus models (q4_0 quantized). As you can see, the speedup is promising.

                         Plus-7B    Plus-13B   33B
Original speed (-t 8)    41 ms/tok  77 ms/tok  179 ms/tok
New speed (-t 8 -ngl 1)  28 ms/tok  49 ms/tok  failed

The reported speeds are based on eval_time.
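(That works out to a 41/28 ≈ 1.46× speedup on Plus-7B and 77/49 ≈ 1.57× on Plus-13B, consistent with the ~50% figure I mentioned above.)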

> So far I haven't noticed any difference in results with or without the GPU. Have you tried another model? I'm using airoboros-7b-gpt4.ggmlv3.q4_0.bin.

Yeah, I am actively seeking other models for testing.

@x4080

x4080 commented Jun 5, 2023

@ymcui Cool.
On my M2 (eval_time as well):
with -ngl 1: 49.62 ms per token
without: 44.62 ms per token

./main -m ./models/airoboros-7b-gpt4.ggmlv3.q4_0.bin -p "I believe the meaning of life is" --ignore-eos -n 64 -ngl 1

In case you want to try it :)

@data-angel
Contributor

I've seen this happen in some ad hoc chatbot testing tonight. There's definitely an inference bug - the conversation goes completely off the rails after a few rounds. I've tested it with 7B, 13B, 30B, and 65B on an M2 Max.

@data-angel
Contributor

Miku.sh, 30B q4_0, CPU inference:
[screenshot: 2023-06-04, 11:37 PM]

Miku.sh, 30B q4_0, GPU inference:
[screenshot: 2023-06-04, 11:41 PM]

Just starts spiraling out of control after a few turns.

@ggerganov
Member

Should be fixed now
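
(For anyone picking up the fix: pull the latest master and rebuild; assuming the Makefile build, the Metal backend is enabled with the LLAMA_METAL=1 switch from #1642.)

git pull
make clean
LLAMA_METAL=1 make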

@x4080

x4080 commented Jun 5, 2023

Wow that is fast

@data-angel
Contributor

Tested and verified on my end. Thanks, @ggerganov!
