batch_add gives lower quality results than batch_get_one #6475

Closed

TheFlipbook opened this issue Apr 4, 2024 · 4 comments

Comments

@TheFlipbook
Contributor

Description

I'm seeing a strange issue where batches created via llama_batch_get_one give better results than batches populated with llama_batch_add.

I was trying to convert my code to use llama_batch_add because llama_batch_get_one has a deprecation note on it, but when I made this conversion, the quality of responses I was getting went down. This appears to be the case whether or not layers are offloaded to the GPU.

I may not understand the batch API correctly, so it seems plausible that there is a mistake in my code rather than a true bug. However, if I am using it correctly, it seemed worth raising, since removing llama_batch_get_one (as the comment indicates will eventually happen) would result in either a speed or a quality regression in my project.

System Information

llama_cpp hash: f87f7b8
llama_cpp backend: Vulkan
OS: Windows 10 Pro 64-bit
GPU: Nvidia Geforce RTX 3080
CPU: AMD Ryzen 9 3950X
Model: mistral-7b-instruct-v0.2.Q6_K.gguf (https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF)

Repro Demonstration Code

main.cpp.txt
This cpp file, when compiled, creates a program that can be called with two arguments.

  • The first argument is one of new|old|single to swap between methods of filling a llama_batch.
  • The second argument is a path to the model to load for testing

Bad Result

main.exe new "C:\\Dev\\SDK\\models\\gguf\\mistral-7b-instruct-v0.2.Q6_K.gguf"

  • This uses llama_batch_add to parse the prompt, similar to the simple example (see the sketch after this block).
  • Results always begin with "Qu"-like tokens, usually resulting in the first English word being something like "Question:" or "Questioner,"
  • Changing the last instruction usually still yields things like "Questioner" or "User" as the first word.
    """
    Questioner, allow me to paint a vivid tableau of the three most distinguished realms within the intricately woven tapestry of my fantastical universe:
    """

Good Result A

main.exe old "C:\\Dev\\SDK\\models\\gguf\\mistral-7b-instruct-v0.2.Q6_K.gguf"

  • This uses llama_batch_get_one to parse the prompt, similar to the main example (see the sketch after this block).
  • First non-prompt word is highly varied and leads into a logical response.
  • Changing the last instruction yields logical changes, such as "Who is the most famous person in your books?" yielding "Once," and other such first words.
    """
    In the heart of my fantastical realm, where towering mountains meet vast emerald forests and azure seas stretch as far as the eye can see, lie the three grand kingdoms: Valoria, Elidor, and Thundertop.
    """

Good Result B

main.exe single "C:\\Dev\\SDK\\models\\gguf\\mistral-7b-instruct-v0.2.Q6_K.gguf"

  • This uses llama_batch_get_one to parse the prompt, but dispatches a batch with only a single token each time (see the sketch after this block).
  • First non-prompt word is highly varied and leads into a logical response.
  • Changing the last instruction yields logical changes, such as "Who is the most famous person in your books?" yielding "Once," and other such first words.
  • This method takes longer to evaluate than the old mode.
    """
    In the vast expanse of Eldoria, the realm of magic and wonder, three distinct kingdoms rose like proud pillars against the ever-changing tapestry of the land. Each unique in its history, culture, and people, they stood as beacons of hope and prosperity for their inhabitants.
    """

Thank you!

@TheFlipbook
Contributor Author

Ah, perhaps the issue is around the repro project's usage of llama_sampling_sample.

With llama_batch_get_one, calling llama_sampling_sample with a default last parameter (0) is valid: index 0 correctly refers to the logits for the last token in the batch.

However, when using llama_batch_add, to get the proper logits for the last value, I must pass in the actual index of the last token in the batch. Doing so makes the output match expectations.
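
A minimal sketch of the difference, assuming the common sampling API from that revision (`ctx_sampling` and `batch` are illustrative names, not taken from the repro):

```cpp
// after llama_batch_get_one: the default index 0 points at the only logits row
const llama_token id_old = llama_sampling_sample(ctx_sampling, ctx, nullptr);

// after llama_batch_add: pass the batch index of the token whose logits were
// requested (here, the last token of the prompt batch)
const llama_token id_new = llama_sampling_sample(ctx_sampling, ctx, nullptr, batch.n_tokens - 1);
```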

I think it was non-obvious to me that my usage of llama_sampling_sample needed to be updated when changing how the llama_batch was populated.

If this is As Designed, please feel free to Close! Perhaps I can think on what sort of documentation might have made me aware of this API nuance and create a PR for that improvement.

Thank you!

@ggerganov
Owner

It is confusing - maybe we can add some comments to the documentation. However, note that llama_batch_get_one() is not intended for general use. At some point we will deprecate and remove it.

@TheFlipbook
Contributor Author

Yep!

I made a mistake. I used the main example to gain my understanding of the library and mimicked the llama_batch_get_one call there. When I saw the deprecation warning in the documentation, I tried to port to llama_batch_add, and that's where I hit this confusion: porting without noticing that the call to llama_sampling_sample also needed to be updated significantly changed the behavior of the program.

After giving it some thought, it seemed like it might be possible to make a default call to llama_sampling_sample better behaved with a batch populated via llama_batch_add. I created #6519 to explore that possibility. It's also entirely reasonable to close both this issue and the related PR if that direction does not seem valid.

Thank you for all the insights!

@TheFlipbook
Contributor Author

Example project now behaves as expected after e3c337d

Thank you!
