llama : adds llama-grammar memoization stacks (#4218) #9833
base: master
Conversation
```diff
@@ -3,6 +3,7 @@
 #include "llama-impl.h"
 
+#include <map>
 #include <unordered_map>
```
Should we use `unordered_map` vs `map`? Is there any benefit or necessity of a sorted key? @ggerganov
I can't think of any need to have a sorted key for this -- feels like `unordered_map` would be my default way to go on this one. If you're curious, a good set of profiling runs to test both options wouldn't be a bad exercise.
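For context, here is a minimal sketch of what the two options would look like for this cache, assuming the stack/stacks typedefs from llama-grammar.h (vectors of `llama_grammar_element` pointers). The main practical difference: `std::map` works out of the box via the vectors' lexicographic `operator<`, while `std::unordered_map` needs a user-supplied hash for the vector key. The `llama_grammar_stack_hash` functor below is a hypothetical helper, not upstream code:

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <unordered_map>
#include <vector>

// Forward declaration only -- the real definition lives in llama-grammar.h.
struct llama_grammar_element;

// Assumed typedefs, matching the style used in llama-grammar.h.
using llama_grammar_stack  = std::vector<const llama_grammar_element *>;
using llama_grammar_stacks = std::vector<llama_grammar_stack>;

// Option 1: std::map needs no extra code -- std::vector already provides a
// lexicographic operator<, so the sorted key comes "for free" even though
// the ordering itself is never used.
using stacks_cache_ordered = std::map<llama_grammar_stack, llama_grammar_stacks>;

// Option 2: std::unordered_map requires a hash for the vector-of-pointers key.
struct llama_grammar_stack_hash {
    size_t operator()(const llama_grammar_stack & stack) const {
        size_t seed = stack.size();
        for (const llama_grammar_element * elem : stack) {
            // boost::hash_combine-style mixing of the element pointers
            seed ^= std::hash<const llama_grammar_element *>{}(elem)
                  + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        }
        return seed;
    }
};

using stacks_cache_unordered =
    std::unordered_map<llama_grammar_stack, llama_grammar_stacks, llama_grammar_stack_hash>;
```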
src/llama-grammar.cpp
```diff
@@ -1127,9 +1225,10 @@ void llama_grammar_accept_impl(struct llama_grammar & grammar, llama_token token
     const auto & code_points = decoded.first;
 
     llama_grammar_stacks stacks_new;
+    llama_grammar_stacks_cache stacks_cache;
```
I haven't yet understood the details of the implementation, but just want to confirm that recreating the cache for each token is really what we want? Looks like before moving the cache from global to local state, it was initialized during grammar init and maintained across multiple tokens. Now it is recreated for each token (IIUC).
Yes, is it better to initialize the cache in grammar_init and return it? This implementation adds the cache in `llama_grammar`; however, this can be a valid assumption, as the grammar doesn't change. @ggerganov
> Yes, is it better to initialize the cache in grammar_init and return it?

I don't know yet. I was just looking at the lifetime of the cache and noticed this discrepancy. I will need more time to review the logic of the implementation.
IMO it's correct to put the cache into `llama_grammar`. If we consider each sequence to have one `llama_grammar`, then every time we generate a new sequence, we need to create a new `llama_grammar`. In other words, the cache will be reset whenever we are done sampling the sequence (i.e. stop generation).
I have added `stacks_cache` into `llama_grammar` :)
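For reference, a rough sketch of what the struct looks like with the cache as a member. The member list follows the fields visible in the initializer of the clone diff below, not necessarily the exact upstream definition:

```cpp
// Sketch of llama_grammar with the new member; field names other than
// stacks_cache are taken from the initializer in the clone diff below.
struct llama_grammar {
    const llama_vocab * vocab;

    const llama_grammar_rules  rules;  // fixed for the lifetime of the grammar
          llama_grammar_stacks stacks;

    // buffer for a partially generated UTF-8 sequence from accepted tokens
    llama_partial_utf8 partial_utf8;

    // memoization cache: one per grammar, i.e. one per sequence; it starts
    // empty when a new llama_grammar is created and is dropped (or copied,
    // see the clone fix below) together with the grammar
    llama_grammar_stacks_cache stacks_cache;
};
```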
Need to copy the cache upon grammar clone:

```diff
diff --git a/src/llama-grammar.cpp b/src/llama-grammar.cpp
index 21482079..5476976b 100644
--- a/src/llama-grammar.cpp
+++ b/src/llama-grammar.cpp
@@ -1153,7 +1153,7 @@ void llama_grammar_free_impl(struct llama_grammar * grammar) {
 }
 
 struct llama_grammar * llama_grammar_clone_impl(const struct llama_grammar & grammar) {
-    llama_grammar * result = new llama_grammar { grammar.vocab, grammar.rules, grammar.stacks, grammar.partial_utf8, };
+    llama_grammar * result = new llama_grammar { grammar.vocab, grammar.rules, grammar.stacks, grammar.partial_utf8, grammar.stacks_cache, };
 
     // redirect elements in stacks to point to new rules
     for (size_t is = 0; is < result->stacks.size(); is++) {
```
src/llama-grammar.cpp
```cpp
if (it != stacks_cache.end()) {
    advanced_stacks = it->second;
} else {
    // Advance stacks with memorization
```
Suggested change:

```diff
-    // Advance stacks with memorization
+    // Advance stacks with memoization
```
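To spell out the pattern this hunk is part of: a lookup-or-compute step wrapped around the existing stack-advancing code. This is only an illustrative sketch, assuming the internal `llama_grammar_advance_stack(rules, stack, new_stacks)` helper and the typedefs from llama-grammar.h; the actual code in the PR is organized differently:

```cpp
// Illustrative only: memoized wrapper around the (assumed) internal helper
//   void llama_grammar_advance_stack(
//       const llama_grammar_rules & rules,
//       const llama_grammar_stack & stack,
//       llama_grammar_stacks      & new_stacks);
static void llama_grammar_advance_stack_memo(
        const llama_grammar_rules        & rules,
        const llama_grammar_stack        & stack,
              llama_grammar_stacks       & advanced_stacks,
              llama_grammar_stacks_cache & stacks_cache) {
    const auto it = stacks_cache.find(stack);
    if (it != stacks_cache.end()) {
        // cache hit: this exact stack was advanced before, reuse the result
        advanced_stacks = it->second;
    } else {
        // cache miss: advance the stack the usual way ...
        llama_grammar_advance_stack(rules, stack, advanced_stacks);
        // ... and store the <stack, stacks> pair so the work is not repeated
        stacks_cache.emplace(stack, advanced_stacks);
    }
}
```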
Please give it a thorough test and merge it. After that, we'll review the PR again.
The tests that I think are most valuable are the integration tests, and of the three (integration, unit, and gbnf-validator), ideally those are the ones that expose the least amount of internal logic. Should we maybe disable the unit tests and the gbnf-validator example for now -- until we do more of this refactoring? It's always frustrating when tests become a hindrance to reorganization. I like the change you made in clarismiranda#1 -- that feels very good.
llama : minor llama_grammar refactoring
Co-authored-by: Clint Herron <hanclinto@gmail.com>
I guess we can do that. It's just that I am still not very intimately familiar with how the grammar implementation works in general (i.e. I use it a bit as a black box) and not very sure what the difference is between the existing tests. So I just try to refactor the code in simple ways that preserve the original behaviour, but at least make it a bit more clear and easier to read. Ideally, the implementation has to be rewritten to avoid exposing the internal state, and the tests should work using only the public API.
This maybe could be documented better elsewhere, but here's the landscape of grammar tests as I see it. I could be misremembering / misinformed about any of this, so anyone feel free to jump in and correct if you see something:
Ideally, the integration tests are the ones that should dig the least into the internals of the grammar engine, and this is the bare minimum of what I think the public grammar API should support. I could also see an argument to be made for rehanging these integration tests to incorporate the tokenization and logit-modification portion, but that discussion may be best left for another day (IMO, the grammar engine is confusing enough without having to get overly concerned with tokenization).
In short, the only one I really care about right now is the integration tests -- the unit tests and validator are the most expendable, because they're the most-tied to the particulars of the grammar internals.
If we (generally) want the tests to work using only the public API …
I tested the speed of the PR against `master` using Hyperfine.
Grammar is nearly the same as the one @clarismiranda posted, with the exception that it is limited to exactly 1000 entries (without this, the command never seemed to want to finish -- even after I put the limit in, it still took roughly a minute for each iteration to complete): big_stacks.gbnf
Results: `7eee341bee09957139789c2d828995953f0fc7ff` ran 1.03 ± 0.04 times faster than `grammar-memo`
I ran a few different times with different grammars, with results such as:

`master` ran 1.00 ± 0.00 times faster than `grammar-memo`

`master` ran 1.00 ± 0.01 times faster than `grammar-memo`
I haven't yet tested in the harness of @clarismiranda. I realize I'm not using the same test setup as you used, but I'm curious to know whether your benchmarking suite is still showing the same improvements you were seeing originally.
ggerganov#9833 Grammar memo Co-Authored-By: Clarissa Miranda <80654285+clarissamiranda@users.noreply.github.com>
ggerganov#9833" This reverts commit 4cbf5c392af62252a69e17143e8a81d771ca6f8a.
Hi, @HanClinto. I will do more generation tests, since I only profiled the llama-gbnf-validator.
Thanks for clarifying! I haven't tested in that context yet, but I should. With the changes of moving the cache location, do your profile tests still show the same benefit?
Hi @HanClinto, here is the profiling for the latest version of this branch, and here is the `master` profiling. It still shows a time benefit and an overall memory benefit.
Hi @clarismiranda -- sorry for my delay in getting back to this. It's good that there is still a performance benefit here in `llama-gbnf-validator`.
Hi @HanClinto, I will give it a try :)
I brought back @HanClinto's idea on memoization. You might have forgotten to add the `<stack, stacks>` pair in the `llama_grammar_stacks_cache`. Here is a helpful case of memoization to avoid exponential increments on the stacks.
Grammar: long_test.txt
Current llama-gbnf-validator profiling:
Memoization llama-gbnf-validator profiling: