[User] Strange results with Apple Silicon GPU when instruction mode is on #1695
Comments
@ymcui I don't find any speed difference on my M2 Pro; it's almost the same, and maybe the CPU is a little faster. Interesting that the Max-version GPU makes a big difference, then. So far I haven't noticed any difference in results with the GPU on or off. Have you tried another model? I'm using airoboros-7b-gpt4.ggmlv3.q4_0.bin
@x4080 Here are the preliminary results on our Chinese-Alpaca-Plus models (
The reported speeds are based on
Yeah, I am actively seeking other models for testing.
@ymcui cool
if you wanted to try it, just in case :)
I've seen this happen in some ad hoc chatbot testing tonight. There's definitely an inference bug - the conversation goes completely off the rails after a few rounds. I've tested it with 7B, 13B, 30B, and 65B on an M2 Max.
Should be fixed now |
Wow that is fast |
Tested and verified on my end. Thanks @ggerganov!
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Current Behavior
I am using the latest llama.cpp, which supports Apple Silicon GPU decoding (#1642).
It indeed provides faster inference than CPU-only decoding (about a 50% speedup on my M1 Max).
However, I encountered an issue when running instruction mode with Alpaca models (specifically, I am using the Chinese-Alpaca model).
I am using a fixed seed (42) for the following examples, which share the same decoding hyperparameters and prompt.
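For context, the two runs below differ only in the `-ngl` flag. A representative invocation might look like the following (the model path and sampling parameters are illustrative, not the exact ones used here):

```sh
# CPU-only inference (no layers offloaded to the Apple Silicon GPU)
./main -m ./models/chinese-alpaca-7b-q4_0.bin --instruct --seed 42 --temp 0.2

# Metal GPU inference: -ngl 1 routes evaluation through the Metal backend
./main -m ./models/chinese-alpaca-7b-q4_0.bin --instruct --seed 42 --temp 0.2 -ngl 1
```

In the initial Metal implementation, any nonzero `-ngl` value appears to be enough to enable GPU evaluation, which is why `-ngl 1` alone triggers the divergence.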
Inference without GPU:
Inference with GPU (`-ngl 1`):

As we can see, the first response is identical, while the subsequent responses differ; the third and fourth responses from GPU inference are strange outputs that should not be expected.

Not sure if this is a model-related issue or a general one introduced by the new feature (GPU support through Metal).
It would be appreciated if someone could also test other Alpaca-like models with GPU inference (on Apple Silicon) when instruction mode is on; a sketch of such a comparison follows.
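A minimal sketch of that comparison, for anyone willing to help (the model path and prompt are placeholders; since `--instruct` is interactive, this uses a one-shot prompt instead, so it won't exercise multi-round instruction mode, but it catches single-response divergence):

```sh
#!/bin/sh
# Placeholder model path and prompt -- substitute your own.
MODEL=./models/your-alpaca-model-q4_0.bin
PROMPT="List three primary colors."

# CPU-only run
./main -m "$MODEL" --seed 42 --temp 0.2 -n 256 -p "$PROMPT" > cpu_out.txt 2>/dev/null

# Metal GPU run
./main -m "$MODEL" --seed 42 --temp 0.2 -n 256 -p "$PROMPT" -ngl 1 > gpu_out.txt 2>/dev/null

# With a fixed seed and identical hyperparameters, the outputs
# should match token-for-token if the GPU path is correct.
diff cpu_out.txt gpu_out.txt && echo "outputs identical" || echo "outputs diverge"
```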
----update----
I also tested the original LLaMA-7B in instruction mode (I know the model is not intended for instruction following; I am just using it for debugging), but the issue still occurs.
command:
Inference without GPU:
Inference with GPU (`-ngl 1`):

This might demonstrate that the issue is not tied to a specific model but is instead caused by some other part of the stack.