batch_add gives lower quality results than batch_get_one #6475
Comments
Ah, perhaps the issue is around the repro project's usage of [...]. However, when using [...], I think it was non-obvious to me that my usage of [...]. If this is As Designed, please feel free to close! Perhaps I can think on what sort of documentation might have made me aware of this API nuance and create a PR for that improvement. Thank you!
It is confusing - maybe we can add some comments to the documentation. However, note that [...].
Yep! I made a mistake. I used the [...]. After giving it some thought, it seemed like it might be possible to make a default call to [...]. Thank you for all the insights!
The example project now behaves as expected after e3c337d. Thank you!
Description

I'm seeing a strange issue where batches created via `llama_batch_get_one` give better results than batches populated with `llama_batch_add`.

I was trying to convert my code to use `llama_batch_add` because `llama_batch_get_one` has a deprecation note on it, but when I made this conversion, the quality of responses I was getting went down. This appears to be the case whether or not layers are offloaded to the GPU.

I may not understand the batch API correctly, so it seems plausible that there is a mistake in my code, rather than this being a true bug. However, if I am using it correctly, it seemed good to raise, as the removal of `llama_batch_get_one`, as the comment indicates, would result in either a speed or a quality regression in my project.

System Information
llama_cpp hash: f87f7b8
llama_cpp backend: Vulkan
OS: Windows 10 Pro 64-bit
GPU: Nvidia Geforce RTX 3080
CPU: AMD Ryzen 9 3950X
Model: mistral-7b-instruct-v0.2.Q6_K.gguf (https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF)
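As background for the repro below: the two helpers differ in who is responsible for token positions. This is a minimal sketch of that difference, where `ToyBatch`, `toy_batch_add`, and `toy_batch_get_one` are hypothetical stand-ins for the real `llama_batch`, the `llama_batch_add` helper from `common/common.h`, and `llama_batch_get_one`; only the position bookkeeping is modeled.

```cpp
#include <vector>

// Hypothetical stand-in for llama_batch, tracking only tokens and positions.
struct ToyBatch {
    std::vector<int> token;
    std::vector<int> pos;
};

// Mimics the llama_batch_add helper: the caller supplies each token's
// absolute position in the sequence.
void toy_batch_add(ToyBatch & batch, int token, int pos) {
    batch.token.push_back(token);
    batch.pos.push_back(pos);
}

// Mimics llama_batch_get_one: positions are derived as pos_0 + i, so
// passing the running n_past as pos_0 keeps them consistent with the
// tokens already in the KV cache.
ToyBatch toy_batch_get_one(const std::vector<int> & tokens, int pos_0) {
    ToyBatch batch;
    for (int i = 0; i < (int) tokens.size(); ++i) {
        toy_batch_add(batch, tokens[i], pos_0 + i);
    }
    return batch;
}
```

In this toy model, one conversion pitfall (a guess at the class of bug, not necessarily what this repro hit) is passing the batch-local loop index as `pos` for a later chunk, so its positions restart at 0 instead of continuing from `n_past`; decoding still succeeds, but the model sees scrambled positions, which degrades quality rather than failing outright.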
Repro Demonstration Code

main.cpp.txt

This cpp file, when compiled, creates a program that can be called with two arguments: a mode, one of `new` | `old` | `single`, to swap between methods of filling a `llama_batch`, and a model path.

Bad Result
main.exe new "C:\\Dev\\SDK\\models\\gguf\\mistral-7b-instruct-v0.2.Q6_K.gguf"
Uses `llama_batch_add` to parse the prompt, similar to the `simple` example.

"""
Questioner, allow me to paint a vivid tableau of the three most distinguished realms within the intricately woven tapestry of my fantastical universe:
"""
Good Result A
main.exe old "C:\\Dev\\SDK\\models\\gguf\\mistral-7b-instruct-v0.2.Q6_K.gguf"
Uses `llama_batch_get_one` to parse the prompt, similar to the `main` example.

"""
In the heart of my fantastical realm, where towering mountains meet vast emerald forests and azure seas stretch as far as the eye can see, lie the three grand kingdoms: Valoria, Elidor, and Thundertop.
"""
Good Result B
main.exe single "C:\\Dev\\SDK\\models\\gguf\\mistral-7b-instruct-v0.2.Q6_K.gguf"
Uses `llama_batch_get_one` to parse the prompt, but dispatches a batch with only a single token each time.
"""
In the vast expanse of Eldoria, the realm of magic and wonder, three distinct kingdoms rose like proud pillars against the ever-changing tapestry of the land. Each unique in its history, culture, and people, they stood as beacons of hope and prosperity for their inhabitants.
"""
Thank you!
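One more nuance worth documenting for anyone making the same conversion (hedged: this describes llama.cpp behavior as of roughly this era, and the types below are toy stand-ins, not the real structs): a batch returned by `llama_batch_get_one` leaves its `logits` array `NULL`, which `llama_decode` treats as "output logits for the last token only", whereas a batch filled via `llama_batch_add` only outputs logits at the indices where the caller set the flag.

```cpp
#include <vector>

// Toy stand-in for llama_batch: an empty `logits` vector plays the role
// of batch.logits == NULL in the real struct.
struct LogitsBatch {
    std::vector<int>  token;
    std::vector<bool> logits;
};

// Which token indices have logits available after decoding, following the
// llama_decode convention: NULL (here, empty) logits => last token only.
std::vector<int> output_indices(const LogitsBatch & batch) {
    std::vector<int> out;
    if (batch.logits.empty()) {
        if (!batch.token.empty()) {
            out.push_back((int) batch.token.size() - 1);
        }
    } else {
        for (int i = 0; i < (int) batch.logits.size(); ++i) {
            if (batch.logits[i]) {
                out.push_back(i);
            }
        }
    }
    return out;
}
```

Under this rule, a hand-filled batch whose flags are all `false` produces no sampleable logits at all, so forgetting to set the final flag is another way the two code paths can silently diverge.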