
llama : adds llama-grammar memoization stacks (#4218) #9833

Open: clarismiranda wants to merge 7 commits into master

Conversation

clarismiranda

I brought back @HanClinto's memoization idea. You might have forgotten to add the <stack, stacks> pair to the llama_grammar_stacks_cache.

Here is a case where memoization helps avoid exponential growth in the number of stacks.

Grammar

root ::= digit ((letter (letter (letter (letter (letter)?)?)?)?)? digit )*
digit ::= [0-9]
letter ::= mega-rule-a | mega-rule-b
mega-rule-b ::= l-b  | l-ba
mega-rule-a ::= l-a | l-ab | l-aa | l-aaa | l-aaaa | l-aaaaa | l-aaaaaa | l-aaaaaaa | l-aaaaaaaa | l-aaaaaaaaa | l-aaaaaaaa | l-aaaaaaaaa | l-aaaaaaaaaa
l-a ::= "a"
l-b ::= "b"
l-ab ::= "ab"
l-ba ::= "ba"
l-aa ::= "aa"
l-aaa ::= "aaa"
l-aaaa ::= "aaaa"
l-aaaaa ::= "aaaaa"
l-aaaaaa ::= "aaaaaa"
l-aaaaaaa ::= "aaaaaaa"
l-aaaaaaaa ::= "aaaaaaaa"
l-aaaaaaaaa ::= "aaaaaaaaa"
l-aaaaaaaaaa ::= "a"+

Attachment: long_test.txt

Current llama-gbnf-validator profiling: (screenshot: current-validator)

Memoization llama-gbnf-validator profiling: (screenshot: using-memoization)
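
For reference, here is a minimal sketch of the memoized advance (the wrapper name and cache type are illustrative assumptions; the inner call follows the existing llama_grammar_advance_stack):

// Sketch only. Assumed types, following llama-grammar.h:
//   llama_grammar_stack  = std::vector<const llama_grammar_element *>
//   llama_grammar_stacks = std::vector<llama_grammar_stack>
//   llama_grammar_stacks_cache maps a stack to the stacks it expands to.
static void llama_grammar_advance_stack_memo(
        const llama_grammar_rules        & rules,
        const llama_grammar_stack        & stack,
              llama_grammar_stacks       & new_stacks,
              llama_grammar_stacks_cache & stacks_cache) {
    auto it = stacks_cache.find(stack);
    if (it != stacks_cache.end()) {
        // cache hit: reuse the previously computed expansion
        new_stacks.insert(new_stacks.end(), it->second.begin(), it->second.end());
        return;
    }

    // cache miss: do the full recursive expansion once, then store the
    // <stack, stacks> pair so later expansions of this stack become a lookup
    llama_grammar_stacks expanded;
    llama_grammar_advance_stack(rules, stack, expanded);
    stacks_cache.emplace(stack, expanded);

    new_stacks.insert(new_stacks.end(), expanded.begin(), expanded.end());
}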

@clarismiranda clarismiranda changed the title llama : adds llama-grammar memorization stacks (#4218) llama : adds llama-grammar memoization stacks (#4218) Oct 11, 2024
@clarismiranda clarismiranda marked this pull request as ready for review October 11, 2024 03:21
src/llama-grammar.cpp (review thread: outdated, resolved)
@github-actions bot added the testing (Everything test related) and examples labels Oct 14, 2024
@@ -3,6 +3,7 @@
 #include "llama-impl.h"

 #include <map>
+#include <unordered_map>
clarismiranda (Author)

Should we use unordered_map or map? Is there any benefit to, or necessity of, a sorted key? @ggerganov

Collaborator

I can't think of any need to have a sorted key for this -- feels like unordered_map would be my default way to go on this one. If you're curious, a good set of profiling runs to test both options wouldn't be a bad exercise.
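
For what it's worth, the practical difference is that std::map only needs operator< on the key (which a std::vector of pointers already provides), while std::unordered_map also needs a hash for the key type. A minimal sketch of what that could look like, assuming llama_grammar_stack is a vector of element pointers (the hash struct name is made up here):

#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

// Illustrative hash for a stack of grammar-element pointers
struct llama_grammar_stack_hash {
    size_t operator()(const llama_grammar_stack & stack) const {
        size_t seed = stack.size();
        for (const llama_grammar_element * el : stack) {
            // fold each pointer into the seed (boost::hash_combine-style mixing)
            seed ^= std::hash<const llama_grammar_element *>{}(el)
                  + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        }
        return seed;
    }
};

using llama_grammar_stacks_cache =
    std::unordered_map<llama_grammar_stack, llama_grammar_stacks, llama_grammar_stack_hash>;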

@@ -1127,9 +1225,10 @@ void llama_grammar_accept_impl(struct llama_grammar & grammar, llama_token token
     const auto & code_points = decoded.first;

     llama_grammar_stacks stacks_new;
+    llama_grammar_stacks_cache stacks_cache;
ggerganov (Owner)

I haven't yet understood the details of the implementation, but just want to confirm that recreating the cache for each token is really what we want? Looks like before moving the cache from global to local state, it was initialized during grammar init and maintained across multiple tokens. Now it is recreated for each token (IIUC).

clarismiranda (Author) commented Oct 14, 2024

Yes. Would it be better to initialize the cache in grammar_init and return it? This implementation adds the cache to the llama_grammar; that seems like a valid assumption, since the grammar doesn't change. @ggerganov

ggerganov (Owner)

Yes. Would it be better to initialize the cache in grammar_init and return it?

I don't know yet. I was just looking at the lifetime of the cache and noticed this discrepancy. I will need more time to review the logic of the implementation.

ngxson (Collaborator) commented Oct 16, 2024

IMO it's correct to put the cache into llama_grammar.

If we consider each sequence to have one llama_grammar, then every time we generate a new sequence, we need to create a new llama_grammar.

In other words, the cache will be reset whenever we're done sampling the sequence (i.e. when generation stops).
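
A minimal sketch of that lifetime, with the cache stored alongside the other per-sequence grammar state (field layout assumed from the clone diff further down):

// Sketch: because stacks_cache is a member of llama_grammar, it is created
// together with a sequence's grammar and discarded when that grammar is freed,
// so a new sequence automatically starts with an empty cache.
struct llama_grammar {
    const llama_vocab * vocab;

    const llama_grammar_rules  rules;
          llama_grammar_stacks stacks;

    // buffer for partially generated UTF-8 sequence from accepted tokens
    llama_partial_utf8 partial_utf8;

    // memoization cache for stack expansion results
    llama_grammar_stacks_cache stacks_cache;
};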

clarismiranda (Author)

I have added stacks_cache into llama_grammar :)

@ggerganov (Owner) left a comment

Need to copy the cache upon grammar clone:

diff --git a/src/llama-grammar.cpp b/src/llama-grammar.cpp
index 21482079..5476976b 100644
--- a/src/llama-grammar.cpp
+++ b/src/llama-grammar.cpp
@@ -1153,7 +1153,7 @@ void llama_grammar_free_impl(struct llama_grammar * grammar) {
 }
 
 struct llama_grammar * llama_grammar_clone_impl(const struct llama_grammar & grammar) {
-    llama_grammar * result = new llama_grammar { grammar.vocab, grammar.rules, grammar.stacks, grammar.partial_utf8, };
+    llama_grammar * result = new llama_grammar { grammar.vocab, grammar.rules, grammar.stacks, grammar.partial_utf8, grammar.stacks_cache, };
 
     // redirect elements in stacks to point to new rules
     for (size_t is = 0; is < result->stacks.size(); is++) {

if (it != stacks_cache.end()) {
    advanced_stacks = it->second;
} else {
    // Advance stacks with memorization
ggerganov (Owner)

Suggested change:
-// Advance stacks with memorization
+// Advance stacks with memorization

Collaborator

Suggested change:
-// Advance stacks with memorization
+// Advance stacks with memoization

@ggerganov (Owner)

The llama_grammar implementation is not great because it exposes a lot of internal logic in order to make the tests work. In this PR, instead of also exposing the stacks_cache, we should put some effort into containing things better. I've opened a small refactoring PR here: clarismiranda#1

Please give it a thorough test and merge it. After that, we'll review the PR again.

@HanClinto (Collaborator)

The llama_grammar implementation is not great because it exposes a lot of internal logic in order to make the tests work.

The tests that I think are most valuable are the integration tests, and of the three (integration, unit, and gbnf-validator), ideally those are the ones that expose the least amount of internal logic. Should we maybe disable the unit tests and the gbnf-validator example for now -- until we do more of this refactoring? It's always frustrating when tests become a hindrance to reorganization.

I like the change you made in clarismiranda#1 -- that feels very good.

src/llama-grammar.cpp (review thread: outdated, resolved)
clarismiranda and others added 2 commits October 17, 2024 21:41
llama : minor llama_grammar refactoring
Co-authored-by: Clint Herron <hanclinto@gmail.com>
@ggerganov (Owner)

Should we maybe disable the unit tests and the gbnf-validator example for now -- until we do more of this refactoring?

I guess we can do that. It's just that I am still not intimately familiar with how the grammar implementation works in general (i.e. I use it a bit as a black box) and not sure how the existing tests differ. So I just try to refactor the code in simple ways that preserve the original behaviour but at least make it a bit clearer and easier to read. Ideally, the implementation should be rewritten to avoid exposing the internal state, and the tests should work using only the public libllama API.

@HanClinto (Collaborator)

HanClinto commented Oct 18, 2024

Should we maybe disable the unit tests and the gbnf-validator example for now -- until we do more of this refactoring?

I guess we can do that. It's just that I am still not intimately familiar with how the grammar implementation works in general (i.e. I use it a bit as a black box) and not sure how the existing tests differ.

Maybe this could be documented better elsewhere, but here's the landscape of grammar tests as I see it. I could be misremembering or misinformed about any of this, so feel free to jump in and correct me if you see something:

  • test-grammar-parser.cpp - The "original": this is the oldest set of tests around the grammar parser, and I've (personally) always found it the most inscrutable (though Ochafik did a lot to help the readability). Given a particular grammar, this test ensures that the internal rule structure is constructed as originally designed. This is a unit-level test (meaning: it tests how the code does what it does, but doesn't test what it does). The original version especially was pretty hard-coded for a few particular test cases, so if someone wants to change the way grammar rule trees are generated (as Ochafik did in grammars: x{min,max} repetition operator #6640), they have to break the tests to change the implementation. I imagine this was very useful in the early days of developing the grammar engine, when one has a very clear vision of how the engine should work internally, but IMO unit tests like this are of limited use (and often a hindrance) during the refactoring or optimization stages of development.

  • test-llama-grammar.cpp - Was written soon after the grammar-parser test, and functions very similarly. Logic-wise, it picks up after the previous test (takes in a tree of pre-parsed grammar rules, organized as a vector of vectors) and now advances the stacks one token at a time and tests that the internal state is still as-expected.

  • test-grammar-integration.cpp - As @ejones noted in a comment re: the above unit tests, the early unit tests were implementation-level tests, and he thought that a test that only leverages the public interface would perhaps be a better focus. I didn't do my implementation quite like Evan suggested (I sidestepped the tokenization and logit portion), but I did try to focus more on the public API. As part of digging into llama : speed-up grammar sampling #4218, we needed a way to ensure that the functionality of the grammar engine stayed the same (the "what" of the library) while we made changes to the internals (the "how"). Unit tests don't usually help in this situation, so we created the end-to-end / integration tests to not care how the grammar engine works, only that it works. The aim is to let us change the internals all around -- completely remaking internal structures or organization (or even move away from a stack-based parser altogether!) -- while ensuring that end-user functionality remains unchanged. This is the only kind of test that is useful in a refactoring.

Ideally, the integration tests are the ones that should dig the least into the internals of the grammar engine, and they represent the bare minimum of what I think the public grammar API should support. I could also see an argument for reworking these integration tests to incorporate the tokenization and logit-modification portions, but that discussion may be best left for another day (IMO, the grammar engine is confusing enough without having to get overly concerned with tokenization).

  • test-json-schema-to-grammar.cpp - This doesn't deal with the grammar engine itself, but rather with the translator from JSON Schema to GBNF. It is outside the scope of this discussion and should not be changed or removed.

  • GBNF Validator - A utility program I wrote to let me feed in a grammar and see if it matched an input like I expected. Kinda' like a GBNF imitation of something like Regex101 or Regex Buddy to help me: A) Ensure that my grammars are syntactically correct, and B) Understand why a particular input does or doesn't match a grammar as-expected. I honestly have no idea if this utility script is useful to anyone else, but I think it's unfortunate how much of the private API it uses. If we need to break this in order to refactor the grammar engine, I say we should do it -- this is not an important utility -- just disable / remove it. The cleaner we make the grammar API, the easier it will be to rewrite this sort of tool in the future.

In short, the only tests I really care about right now are the integration tests -- the unit tests and validator are the most expendable, because they're the most tied to the particulars of the grammar internals.

So I just try to refactor the code in simple ways that preserve the original behaviour but at least make it a bit clearer and easier to read. Ideally, the implementation should be rewritten to avoid exposing the internal state, and the tests should work using only the public libllama API.

If we (generally) want the tests to work using only the public libllama API, then I vote we cull the unit tests / validator and focus only on the integration tests. I think those unit tests have served their purpose, and can be safely removed.

@HanClinto (Collaborator)

HanClinto commented Oct 23, 2024

I tested the speed of this PR against master (at 7eee341), and I'm not seeing an appreciable speedup. This was just on my M1 MacBook Pro (while simultaneously running VS Code and Firefox with about a million open tabs), so it's far from a controlled test, but it's still eyebrow-raising.

Hyperfine Command:
hyperfine \
    --warmup 1 --runs 5 \
    -L branch grammar-memo,7eee341bee09957139789c2d828995953f0fc7ff \
    --setup 'git checkout {branch} && make clean && make -j LLAMA_CURL=1 llama-cli' \
    './llama-cli \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file grammars/big_stacks.gbnf \
        -p "List of numbers and assortments of letters composed of a and b:" \
        --seed 12344'

Grammar is nearly the same as the one @clarismiranda posted, with the exception that it is limited to exactly 1000 entries (without this, the command never seemed to want to finish -- even after I put the limit in, it still took roughly a minute for each iteration to complete):

big_stacks.gbnf
root ::= digit ((letter (letter (letter (letter (letter)?)?)?)?)? digit ){1000}
digit ::= [0-9]
letter ::= mega-rule-a | mega-rule-b
mega-rule-b ::= l-b  | l-ba
mega-rule-a ::= l-a | l-ab | l-aa | l-aaa | l-aaaa | l-aaaaa | l-aaaaaa | l-aaaaaaa | l-aaaaaaaa | l-aaaaaaaaa | l-aaaaaaaa | l-aaaaaaaaa | l-aaaaaaaaaa
l-a ::= "a"
l-b ::= "b"
l-ab ::= "ab"
l-ba ::= "ba"
l-aa ::= "aa"
l-aaa ::= "aaa"
l-aaaa ::= "aaaa"
l-aaaaa ::= "aaaaa"
l-aaaaaa ::= "aaaaaa"
l-aaaaaaa ::= "aaaaaaa"
l-aaaaaaaa ::= "aaaaaaaa"
l-aaaaaaaaa ::= "aaaaaaaaa"
l-aaaaaaaaaa ::= "a"+

Results:

`7eee341bee09957139789c2d828995953f0fc7ff` ran 1.03 ± 0.04 times faster than `grammar-memo`
Benchmark 1: ./llama-cli \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file grammars/big_stacks.gbnf \
        -p "List of numbers and assortments of letters composed of a and b:" \
        --seed 12344 (branch = grammar-memo)
  Time (mean ± σ):     56.460 s ±  1.874 s    [User: 2.239 s, System: 0.666 s]
  Range (min … max):   54.704 s … 59.347 s    5 runs

Benchmark 2: ./llama-cli \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file grammars/big_stacks.gbnf \
        -p "List of numbers and assortments of letters composed of a and b:" \
        --seed 12344 (branch = 7eee341bee09957139789c2d828995953f0fc7ff)
  Time (mean ± σ):     54.675 s ±  0.517 s    [User: 2.174 s, System: 0.624 s]
  Range (min … max):   54.196 s … 55.420 s    5 runs

Summary
  ./llama-cli \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file grammars/big_stacks.gbnf \
        -p "List of numbers and assortments of letters composed of a and b:" \
        --seed 12344 (branch = 7eee341bee09957139789c2d828995953f0fc7ff) ran
    1.03 ± 0.04 times faster than ./llama-cli \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file grammars/big_stacks.gbnf \
        -p "List of numbers and assortments of letters composed of a and b:" \
        --seed 12344 (branch = grammar-memo)

I ran a few different times with different grammars (such as with {0,1000} instead of {1000}, and testing against current master instead of the commit that this PR branched off of), and none of these changes seemed to make much of a difference -- everything was nearly identical in runtime:

`master` ran 1.00 ± 0.00 times faster than `grammar-memo`
Benchmark 1: ./llama-cli \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file grammars/big_stacks.gbnf \
        -p "List of numbers and assortments of letters composed of a and b:" \
        --seed 12344 (branch = grammar-memo)
  Time (mean ± σ):     54.734 s ±  0.040 s    [User: 1.884 s, System: 0.602 s]
  Range (min … max):   54.688 s … 54.786 s    5 runs

Benchmark 2: ./llama-cli \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file grammars/big_stacks.gbnf \
        -p "List of numbers and assortments of letters composed of a and b:" \
        --seed 12344 (branch = master)
  Time (mean ± σ):     54.668 s ±  0.097 s    [User: 1.806 s, System: 0.605 s]
  Range (min … max):   54.561 s … 54.789 s    5 runs

Summary
  ./llama-cli \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file grammars/big_stacks.gbnf \
        -p "List of numbers and assortments of letters composed of a and b:" \
        --seed 12344 (branch = master) ran
    1.00 ± 0.00 times faster than ./llama-cli \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file grammars/big_stacks.gbnf \
        -p "List of numbers and assortments of letters composed of a and b:" \
        --seed 12344 (branch = grammar-memo)
`master` ran 1.00 ± 0.01 times faster than `grammar-memo`
Benchmark 1: ./llama-cli \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file grammars/big_stacks.gbnf \
        -p "List of numbers and assortments of letters composed of a and b:" \
        --seed 12344 (branch = grammar-memo)
  Time (mean ± σ):     54.905 s ±  0.239 s    [User: 2.007 s, System: 0.625 s]
  Range (min … max):   54.722 s … 55.322 s    5 runs

Benchmark 2: ./llama-cli \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file grammars/big_stacks.gbnf \
        -p "List of numbers and assortments of letters composed of a and b:" \
        --seed 12344 (branch = master)
  Time (mean ± σ):     54.763 s ±  0.346 s    [User: 1.966 s, System: 0.616 s]
  Range (min … max):   54.560 s … 55.380 s    5 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  ./llama-cli \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file grammars/big_stacks.gbnf \
        -p "List of numbers and assortments of letters composed of a and b:" \
        --seed 12344 (branch = master) ran
    1.00 ± 0.01 times faster than ./llama-cli \
        -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
        --grammar-file grammars/big_stacks.gbnf \
        -p "List of numbers and assortments of letters composed of a and b:" \
        --seed 12344 (branch = grammar-memo)

I haven't yet tested in the harness of gbnf-validator, but at least in terms of inference in the main CLI, I suspect something is going on that is preventing the cache speedup from taking effect.

@clarismiranda I realize I'm not using the same test setup as you used, but I'm curious to know if your benchmarking suite is still showing the same improvements you were seeing originally.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 24, 2024
…anov#9833

Grammar memo

Co-Authored-By: Clarissa Miranda <80654285+clarissamiranda@users.noreply.github.com>
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 24, 2024
ggerganov#9833"

This reverts commit 4cbf5c392af62252a69e17143e8a81d771ca6f8a.
@clarismiranda (Author)

clarismiranda commented Oct 28, 2024

Hi, @HanClinto. I will do more generation tests since I only profiled the llama-gbnf-validator (llama_grammar_accept and llama_grammar_advance_stack). I will be curious to see how generation works in larger contexts—for example, asking the LLM to repeat a valid long string and complete it with the following sequence.

@HanClinto (Collaborator)

Hi, @HanClinto. I will do more generation tests since I only profiled the llama-gbnf-validator (llama_grammar_accept and llama_grammar_advance_stack). I will be curious to see how generation works in larger contexts—for example, asking the LLM to repeat a valid long string and complete it with the following sequence.

Thanks for clarifying!

I haven't tested in that context yet, but I should.

With the changes of moving the cache location, do your profile tests on llama-gbnf-validator still hold up?

@clarismiranda (Author)

Hi @HanClinto, here is the latest version of this branch profiling:

(screenshot: Captura de Pantalla 2024-10-30 a la(s) 8 34 07)

And here is master profiling:

(screenshot: Captura de Pantalla 2024-10-30 a la(s) 8 34 26)

It still shows a time benefit and overall memory benefit.

@HanClinto (Collaborator)

Hi @HanClinto, here is the latest version of this branch profiling:
...
It still shows a time benefit and overall memory benefit.

Hi @clarismiranda -- sorry for my delay in getting back to this.

It's good that there is still a performance benefit here in llama-gbnf-validator, but I'm leery of adding complexity if it doesn't help with the primary path. Do you think there is anything you can do with this PR to help llama-cli to get a similar speedup?

@clarismiranda (Author)

Hi @HanClinto, I will give it a try :)

Labels: examples, testing

4 participants