Tool call support (Llama 3.x, Functionary v3, Hermes 2 Pro, Mistral Nemo, generic) w/ lazy grammars #9639
Draft
ochafik wants to merge 267 commits into ggerganov:master from ochafik:tool-call
+3,506 −143
Conversation
github-actions bot added the testing (Everything test related), examples, python (python script changes), and server labels on Sep 25, 2024
ochafik changed the title to Tool call support (Llama 3.1, Functionary v3, Hermes 2 Pro) & Minimalist Jinja template engine on Sep 25, 2024
ochafik changed the title to Tool call support (Llama 3.1, Functionary v3, Hermes 2 Pro) w/ lazy grammars & minimalist Jinja engine on Sep 25, 2024
Extracted the Python agent code w/ docker siloing to this gist, and updated this PR's description with simpler instructions.
ochafik changed the title to Tool call support (Llama 3.x, Functionary v3, Hermes 2 Pro, Mistral Nemo, generic) w/ lazy grammars on Jan 21, 2025
This supersedes #6389 (now using a fully C++ approach), #5695 (first attempt at supporting Functionary) and #9592 (more recent Python wrapper).
Background
It tackles two main problems related to tool calling:

- Lazy grammars: Helping / forcing the model to follow the tool schemas w/ grammar constraints is tricky, as in most cases the model may also output normal, unconstrained content (except if `"tool_choice": "required"` is specified in the request). It's not currently possible to say `.* "<tool_call>" constrained "</tool_call>"`, as the leading `.*` will match eagerly. In [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389 I was avoiding this issue in the `thoughtful_steps` style, but the native tool call styles were still problematic.

  Solved w/ lazy grammars activated by trigger words (similar to stop words, refactored into the same implementation). Output is completely unconstrained before triggers, and completely constrained after, which allows for `content` vs. `tool_call` outputs, and even mixes of the two (for the few models that support that). The triggers per native style:

  - Llama 3.x: `<|python_tag|>` and `{"name": "toolN"` (for each `toolN` in the list of `tools` in the request), plus `{"`, which isn't quite right but helps steer 1B & 3B models. Will try and detect model size to keep a more specific trigger for the bigger 3.2 models.
  - Hermes 2 Pro: `<tool_call>`.
  - Functionary v3.2: `>>>toolN\n` for each `toolN`.
  - Functionary v3.1: `<function=` and `<|python_tag|>`.
  - Mistral Nemo: should be `[TOOL_CALLS]`, but it doesn't seem to (ever?) be emitted, so we're triggering on `{"` instead for now.
  - Generic: fully constrained output only when `tool_choice` is `required`.
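  As an illustration of the lazy part (a sketch, not output captured from this PR): with the Hermes 2 Pro style, everything before the `<tool_call>` trigger is sampled unconstrained, and everything after it must match the grammar derived from the request's `tools` schemas:

  ```
  I'll look that up for you.
  <tool_call>
  {"name": "get_weather", "arguments": {"location": "Paris"}}
  </tool_call>
  ```

  Here `get_weather` is a hypothetical tool supplied in the request; the `<tool_call>` wrapper is the Hermes 2 Pro format.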
- Jinja chat templates for tool-call-able models are getting increasingly complex, and implementing each of them in C++ is a maintenance hazard.

  Solved by a minimal Jinja templating engine (`minja.hpp`), with just enough to render all the templates I could find in the wild. That's still a lot of code (2.5k LOC), but about 10x less than Jinja2Cpp (not even counting its dependencies - it needs a subset of Boost and some C++ backfills). It's trivial to extend (say, to add support for a new filter / test), and it comes with decent error reporting and simple tests. And we could always switch to another implementation in the future.
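  For context, a chat template is Jinja code stored in the GGUF metadata. A minimal ChatML-style template (an illustrative sketch, not one shipped by this PR) looks like the following; the tool-calling templates this PR targets are the same idea, just far more involved:

  ```jinja
  {%- for message in messages %}
  <|im_start|>{{ message.role }}
  {{ message.content }}<|im_end|>
  {%- endfor %}
  {%- if add_generation_prompt %}
  <|im_start|>assistant
  {%- endif %}
  ```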
With this intro out of the way, here are the parts of this PR that could possibly be sent separately (currently itemized, to be reitemized as commits):

- `minja.hpp`: minimal Jinja templating engine and its tests against actual templates & a few test contexts. Now in its own repo (https://github.com/google/minja), integrated w/ the `--jinja` flag in Add Jinja template support #11016.
- Tool call grammar generation + output parsing logic for Llama 3.1, Functionary v3 (2 variants) and Hermes 2 Pro.
- Integration with `llama-server` (`tools`, `tool_choice`) when `--jinja` is enabled.
- `grammar_trigger_words` + `llama_antiprompts`: refactors the stop logic (barebones Aho–Corasick algorithm to handle multiple stop words efficiently - with grammar trigger words we may have many), aligning `cli` & `server` (e.g. single-token stop logic) and handling grammar trigger words.

How to use / test
While any model should work (using generic support based on JSON schema constraints), this PR supports the native call style of a few models: Llama 3.x, Functionary v3 (2 variants), Hermes 2 Pro, and Mistral Nemo.
For natively supported models, it's important to have the right template (it might not be in the GGUF; note that we prefer the `tool_use` variant of the Jinja template if it's present in the GGUF metadata). You can check which template is defined by inspecting http://localhost:8080/props, and inspect the logs for `Tool call style: `.
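For instance (a sketch; assumes the server is on its default port and `jq` is installed — the exact `/props` field set may vary across builds):

```bash
# Print the chat template the server actually loaded. Assumption: the
# /props JSON response exposes it under .chat_template, as recent builds do.
curl -s http://localhost:8080/props | jq -r .chat_template
```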
I've created some minimalistic Agent loop code in this Gist: it contains a few python tools & supports running them in a siloed docker container, along with examples (used to be part of this PR).
As for basic tool call functionality, you can test it just with this PR:

- Run `llama-server` w/ any model (see the sketch after this list).
- Call the chat completions endpoint (not in streamed mode) with any OpenAI-compatible library, or plain curl (also sketched below).
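A minimal sketch of both steps, assuming a local GGUF at `models/model.gguf` and the default port; the model path, tool schema, and prompt are illustrative, not taken from this PR:

```bash
# Start the server with Jinja templating enabled (required for tool call support).
llama-server --jinja -m models/model.gguf

# In another shell: request a completion, offering the model one tool
# (standard OpenAI-style tools schema; get_weather is a hypothetical tool).
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {"location": {"type": "string"}},
          "required": ["location"]
        }
      }
    }]
  }'
```

With a natively supported model and the right template, the response's `choices[0].message` should carry a parsed `tool_calls` array rather than plain `content`.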
TODOs before undrafting:

- `"all\n"` in non-tool-call outputs for `llama-cli`
- Trigger on `(^|\n)?{"` and otherwise not trigger spuriously elsewhere
- Command R Plus, DeepSeek
- [ ] e2e tests for agent
- [ ] Add Google search tool as alternative to Brave
- `--special` for Nemo since last merge
- `[TOOL_CALLS]` token
- `<|python_tag|>` token
- `thoughtful_steps` tool support from [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389 (using JSON structured output even with models not trained for tool calling)
- `--cache-prompt` defaults to true; follow up will be to allow in-slot restoration and saving of cache, see this branch for instance
- `chat_template` should maybe be resolved earlier? (now a `llama_chat_template` class)
- `llama_apply_chat_template` would benefit from a massive facelift. Maybe passing in a struct? (have introduced a new C++ API `llama_chat_template::apply`)
- `llama_token_to_piece(ctx, token)` should really take `(model, token)` instead, but that's a breaking API change; for now there's a `_llama_token_to_piece` that takes a model. Moved the `llama_chat_template_from_model` helper to `common.cpp`
- `builtin_tools` and `todays_date` in llama3.1's template
- `test-chat-templates` & `test-minja` (write each test case in a `.jinja` file)
- `bos_token` in the current chat template logic
- Agent example (`examples/tool-call`) from [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389

Possible follow ups: