-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JSON schema conversion: ⚡️ faster repetitions, min/maxLength for strings, cap number length #6555
Conversation
… goes from `"a"? "a"? "a"?` (explosive combos) to `(a (a (a)?)?)?`
(avoids infinite gen, e.g. "one third" -> `0.333333333333...`)
… in /v1/chat/completions)
How much more effort would it be to benchmark these improvements? |
@HanClinto It can quickly become an unfair fight depending on the schema and the model's generation choices (in some cases the speed may appear the same, but the worst case is now bounded). If you expand the "Show command" drawer in the PR's description, the simple example I gave goes from being essentially stuck until the end of the universe to something Here's how to benchmark any schema you'd like (using hyperfine): git clone https://github.com/ochafik/llama.cpp --branch json-faster-repetitions2 llama.cpp-faster-rep
cd llama.cpp-faster-rep && git pull
echo '{"items": {"type": "number"}, "maxItems": 10}' > schema.json && \
git checkout json-faster-repetitions2 && \
python examples/json_schema_to_grammar.py schema.json > fast.grammar && \
git checkout master && \
python examples/json-schema-to-grammar.py schema.json > slow.grammar && \
make clean && make -j LLAMA_CURL=1 main && \
mkdir -p models/7B && \
hyperfine --warmup 1 -L speed fast,slow './main -mu https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/resolve/main/Hermes-2-Pro-Mistral-7B.Q5_K_M.gguf --grammar-file {speed}.grammar -p "List of 10 numbers" --seed 1234'
# The warmup run will download the model, will take a while & use up 5.2GB It gives 8x speedup for that specific seed & model (other values may not show improvements, or the master branch may timeout) Show output
Lemme know if you'd like me to test any specific kind of schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I'm on board with the rename to use underscores -- while there are a few other files with underscores (such as pydantic_models_to_grammar.py
), most seem to use hyphens (pydantic-models-to-grammar-examples.py
, etc), and it seems like the old filename is possibly better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally I wanted all filenames in the repo to use hyphens. But later I found out that Python does not work well when there are hyphens in the filenames (e.g. I think you cannot include a Python file that has hyphens). So I think it's better to eventually rename all Python files to use underscores in their filenames
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tbh I did all this as a prerequisite for #6389, in which i need to import the converter from Python. I also found out llama-cpp-python inlines that file in their codebase, since it's hard / not trivial to import (short of using importlib, which feels dirty).
examples/json_schema_to_grammar.py
Outdated
} | ||
|
||
RESERVED_NAMES = set(["root", *PRIMITIVE_RULES.keys(), *DATE_RULES.keys()]) | ||
DOTALL = '[\\U00000000-\\U0010FFFF]' | ||
DOT = '[\\U00000000-\\x09\\x0B\\x0C\\x0E-\\U0010FFFF]' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it would be more performant or not, but I'm curious if:
DOT = '[^\\x0A\\x0D]'
would be easier / faster to process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to simpler negative range, thanks!!
@@ -89,3 +89,13 @@ This guide provides a brief overview. Check out the GBNF files in this directory | |||
``` | |||
./main -m <model> --grammar-file grammars/some-grammar.gbnf -p 'Some prompt' | |||
``` | |||
|
|||
## Troubleshooting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this section. After this gets merged in, I'll write a section on the dangers of left-recursion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably also document the json->grammar converters here, I'll send that separately
After (#6609 &) @HanClinto's ✨ magical #6616 ✨, this PR is still required, and we can dramatically increase the # of repetitions without impacting sampling performance. At 200 repetitions the PR is 18x faster (switched to phi-2), and from 500 reps Show benchmark commands for 10k repsecho '{"items": {"type": "number"}, "maxItems": 10000}' > schema.json && \
git checkout json-faster-repetitions2 && \
python examples/json_schema_to_grammar.py schema.json > fast.grammar && \
git checkout master && \
python examples/json-schema-to-grammar.py schema.json > slow.grammar && \
make clean && make -j LLAMA_CURL=1 main && \
mkdir -p models/7B && \
hyperfine --warmup 1 -L speed fast,slow './main -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf --grammar-file {speed}.grammar -p "List of 10 numbers" --seed 1234' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Great work on this PR -- this is really stellar work!
Only other suggestion I can think is that it might be worth adding integration tests to compare the output of Python json_schema_to_grammar.py vs. json-scham-to-grammar.cpp -- but not sure if we care about ensure lockstep equivalency that much to care about wrapping it.
Overall this looks great, and I'm very impressed with it -- GREAT work on all of this @ochafik !
Fully agree, so much I've done this already in #5978 :-D: https://github.com/ggerganov/llama.cpp/blob/master/tests/test-json-schema-to-grammar.cpp (Also tests the JS version)
Thank you so much for your help, ideas & proactive reviews! (and your own speedups) Love this team work 👍 |
haha -- very nicely done. 😄 Again, awesome job -- super happy to see this merged in! |
…ngs, cap number length (ggerganov#6555) * json: rename python schema converter to make import easier * server: skip null json_schema / grammar fields * json: deps management for primitive rules (+ allow null values) * json: optimize repetitions for minItems/maxItems and regexps: `a{,3}` goes from `"a"? "a"? "a"?` (explosive combos) to `(a (a (a)?)?)?` * grammars: add troubleshooting section to readme * json: cap length of numbers to 15 digits before/after decimal point (avoids infinite gen, e.g. "one third" -> `0.333333333333...`) * json: unify all repetition code (w/ or w/o sep) * json: support string minLength/maxLength * server+json: update server/README w/ result_format * nits * json: fix type error w/ python 3.8 * json: fix server/README (json_schema in /completion vs. result_format in /v1/chat/completions) * json: simplify DOT `{"type": "string", "pattern": "^.$"}` * json: remove recursion in opt_repetitions (avoids Python stack overflow) * json: rm dead code * json: rm useless assert & ggml.h import
Another followup on #5978
Bug fixes:
Grammars generated for the following schemas will no longer trigger combinatorial explosions during inference (also see llama : speed-up grammar sampling #4218 (comment)):
{"items": {"type": "number"}, "minItems": 10, "maxItems": 1000}
: this used to hang forever, it's now running smoothlyShow command
Before:
After (notice python script uses underscores now):
{"type": "string", "pattern": "^a{10,100}$"}
Numbers & integers now have a capped precision (JSON itself allows arbitrary precisions numbers but there's no point in exceeding JavaScript's - roughly 15 digits; zealous LLMs may otherwise generate an infinite sequence
0.33333333333...
when prompted for "one third")Allow
null
in untyped JSON objectsNew features:
Support string length constraints:
{"type": "string", "minLength": 10, "maxLength": 100}
Python converter can be imported more easily (underscored name)
I've hopefully simplified the code by adding a simple dependencies mechanism for primitive rules, and unifying all repetition code.
I've also updated the GBNF doc to mention the performance gotchas, and have documented the server
response_format
parameter for schema-constrained JSON output