Tool call support (generic + native for Llama, Functionary, Hermes, Mistral, Firefunction, DeepSeek) w/ lazy grammars #9639
Conversation
Apologies for this PR being a moving target. I've now stabilized things (except older gcc giving me sweats), added tests & included basic usage instructions (w/ a tiny agent helper adapted from #6389) for Llama-3.1-8B-Instruct, Hermes-2-Pro-Llama-3-8B and functionary-small-3.2 (which still needs a bit of work).
@ochafik BTW: My current tool-calling solution is to write dummy functions in Python and generate grammar files with pydantic, which is awkward and ugly. I'll definitely give it a try when you finish this PR. Exciting work!
Thanks @rujialiu !
Thanks for the pointer, at first glance inja seems too limited to support actual templates (we're at the mercy of each and every model maker, some use lots of jinja features, e.g. NousResearch/Hermes-3-Llama-3.1, Cohere/command-r-plus, meetkai/functionary-medium-v3.2 ). Filters (w/ the pipe syntax, e.g.
Yeah I'm doing the same, that's why I spent so much energy improving the JSON schema support tbh.
Hopefully soon! (famous last words haha)
Ouch, I was not aware of that. That's crazy. Now I'm really impressed that your little code already supports these. Maybe I should use your
@ochafik I really like your idea of using lazy grammar, I would love to help you. I'm the developer of llama-cpp-agent. Let me know if we can collaborate somehow.
@Maximilian-Winter thanks / sorry for the slow reply! (frantically busy few weeks 😅) I'd love help on this, anything from just testing out instructions above, to finding new cool examples / bugs, reporting on any other model's tool call styles, or new ideas. I'm trying to release minja in its own mini-repo w/ better testing, but the lazy grammar part is probably going to be what needs most work on next. Depending on your timezone, happy to jump into a video chat too :-) (DM on x?) (Also, llama-cpp-agent looks suuuper cool! 💜)
@ochafik Sure, that would be great. I'm living in Germany. I actually tried to verify on X, by buying premium to write you, but I still have to wait for verification. If you want to reach out to me by email or discord, feel free! My email is maximilian.winter.91@gmail.com
Great work!
It seems like it adds double BOS when using Llama 3.1 models. Doesn't happen without --jinja
@Dampfinchen should be fixed as of #11641 / b4641, which version of llama-server & exact model did you test this with? |
So, I've had surprisingly good results with a simple pseudo-Python grammar that ensures code strings are valid structured token soups, guaranteeing string tokens aren't split (restricting allowed nested escapes) & open parentheses / braces / brackets are closed (in this branch). It makes even Llama 3.x 1B / 3B / 8B super compliant & able to overcome the code escapes issues, even at very high temperatures (tested up to 5). Once finalized, it may also be a great way to guard against prompt injection (e.g. from tool results) for models that use special unicode tokens to close / open tool calls (if we mandate that unicode be escaped in the code's JSONified string), which could be another reason why unicode symbols may have been chosen (cc/ @Kreijstal @ngxson re/ discussion above). NOTE: results above and below are from my tool-bench branch which builds on top of #1160
FYI I've looked into benchmark options (cc/ @Maximilian-Winter ):
@ochafik Thanks for this work! I've been investigating tool calling on consumer hardware and Ollama, and it's been a very frustrating experience. The lazy grammar idea is very cool. Anyway, as far as benchmarks, I thought I'd point out BFCL v3 as another option. I know that I would love to have some type of tool calling leaderboard for local GGML models.
@edmcman Thanks!! Not sure if you've seen #12034, I've done a very coarse & naive "benchmark" of llama-server against ollama w/ various models at various temperatures (more results here). I wonder on which models you've had issues w/ ollama, lemme know if you'd like updated results or need help running
Thanks for the pointer (I'd completely forgotten about that repo!), their code looks straightforward, will give it a deeper look!
Thanks, this is so helpful! I wrote a few blogs as I banged my head trying to find a model that worked. In a nutshell, one thing I'd recommend from my experience is adding a "conversational" test to your benchmark, e.g., say "Hello" and verify that the model does not attempt to make a nonsensical function call.
@edmcman Great write-ups, thanks a lot for sharing!!
One of my gripes with Llama 3.2 (the very small versions I tested, that is) is it tends to forget to escape nested quotes in (JSON escaped) Python code, causing premature termination of its function call's arguments 🤦♂️... I prototyped some convoluted workaround (cf. above) but haven't gotten to productionizing it yet.
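To make that failure mode concrete (an illustrative, made-up example rather than an actual model transcript): the Python code travels inside a JSON string argument, so inner quotes must be escaped, and a missing escape terminates the arguments string early:

```python
# The code payload has to survive JSON encoding: inner quotes need escaping.
import json

good = '{"code": "print(\\"hi\\")"}'  # what the model should emit
bad = '{"code": "print("hi")"}'       # unescaped inner quote ends the string at `print(`

print(json.loads(good)["code"])       # -> print("hi")
try:
    json.loads(bad)
except json.JSONDecodeError as err:
    print("truncated / invalid arguments:", err)
```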
Ollama and their custom templates are in an awkward position. I decided I preferred the (sizeable) hassle of writing and maintaining a Jinja templating engine (and now, custom tool call parsers), rather than doing prompt engineering and coercing models too much (I mean, I do coerce them very much ⛓, but only after they start picking one of their natural output formats; Qwen 2.5 Coder turned out wildly creative, for instance)
I partially test for this in test_completion_without_tool_call* (checks there's no call, with variations of no tool provided, or just a test tool that's useless for the task, or the right tool but with tool_choice = none), but sounds like a great idea to also check some nice chatty interaction 👌 You'll also see some

PS: started incubating gorilla support for llama.cpp in this branch
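For what it's worth, a minimal sketch of such a chatty-interaction check (assuming a locally running llama-server on port 8080 and the `openai` Python client; the weather tool is made up):

```python
# Say "Hello" with an (irrelevant) tool available and check no tool call is emitted.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-needed")

tools = [{"type": "function", "function": {
    "name": "get_weather",  # hypothetical tool, useless for a greeting
    "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
}}]

resp = client.chat.completions.create(
    model="whatever",  # llama-server uses whichever model it was started with
    messages=[{"role": "user", "content": "Hello"}],
    tools=tools,
)
msg = resp.choices[0].message
assert not msg.tool_calls, f"unexpected tool call: {msg.tool_calls}"
assert msg.content, "expected a plain conversational reply"
```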
@edmcman Note that
Without the template override, the tool calls will still work (with the generic support) but the JSON-based tool call results injection done by Minja's polyfill isn't picked up properly (e.g. |
Great news: I was just able to run my application using langchain and llama.cpp's server, and it worked great. Two small things:
@edmcman Glad to read this! (have you tried others such as Qwen2.5-Coder? I'm obsessed w/ unsloth's 128k extended context versions)
If you'd like to see the impact of grammar constraints (a key difference w/ Ollama), you could disable them in utils.hpp as follows: ...
llama_params["prompt"] = chat_params.prompt;
if (getenv("DISABLE_GRAMMAR")) {
llama_params["grammar"] = chat_params.grammar;
llama_params["grammar_lazy"] = chat_params.grammar_lazy;
auto grammar_triggers = json::array();
for (const auto & trigger : chat_params.grammar_triggers) {
grammar_triggers.push_back(trigger.to_json<json>());
}
llama_params["grammar_triggers"] = grammar_triggers;
}
...
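Assuming the guard only attaches those fields when the variable is absent, starting the server with e.g. `DISABLE_GRAMMAR=1` in its environment should then leave tool-call output unconstrained, which makes it easy to A/B against the default grammar-constrained behaviour.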
Thanks for reporting back, really matters to know this is useful and appreciated!!
Aaaabsolutely. There have been multiple suggestions on how to proceed, but I'm currently working on a "simple" approach that involves:
Hope to get something testable (if not reviewable) in a few days (famous last words haha)
|
So far on llama.cpp I have just tried qwen 2.5 and functionary-small-v3.2 (without the functionary chat template). I'll be testing more soon! My internet is not that fast, and my work's VPN makes it worse, so downloading the models takes forever 😓 On Ollama, I have tried their
Will do, thanks!
@edmcman Ugh, I feel you! (my own ordeal is disk space, afraid I'm continuously wearing my SSD off 💀) Note that if you already pulled other Ollama models, you can find their GGUF model to use w/ llama-server using a script like this (you need to pull the original Jinja template separately, which is light in bandwidth ;-)):

get_ollama_gguf.js

#!/usr/bin/env node
/*
Get the file under $OLLAMA_HOME/models/blobs/ for the application/vnd.ollama.image.model key in the manifest
- Note that metadata of modelId:modelTag is stored under $OLLAMA_HOME/models/manifests/registry.ollama.ai/library/${modelId}/${modelTag}
- You'll need to get the Jinja template from the original model using llama.cpp's scripts/get_chat_template.py script
ollama pull qwen2.5-coder:7b
llama-server -m $( ./get_ollama_gguf.js qwen2.5-coder:7b ) -fa --jinja --chat-template-file <( ./scripts/get_chat_template.py Qwen/Qwen2.5-Coder-7B-Instruct-GGUF tool_use )
*/
const fs = require('fs');
const path = require('path');
const HOME = process.env.HOME;
const OLLAMA_HOME = process.env.OLLAMA_HOME || path.join(HOME, '.ollama');
const [model] = process.argv.slice(2);
if (!model) {
console.error('Usage: node get_ollama_gguf.js <modelId:modelTag>');
process.exit(1);
}
const [modelId, modelTag] = model.split(':');
const manifestFile = path.join(OLLAMA_HOME, 'models', 'manifests', 'registry.ollama.ai', 'library', modelId, modelTag);
if (!fs.existsSync(manifestFile)) {
console.error(`Manifest file not found for ${modelId}:${modelTag}`);
process.exit(1);
}
const manifest = JSON.parse(fs.readFileSync(manifestFile, 'utf8'));
const modelLayer = manifest.layers.find(l => l.mediaType === 'application/vnd.ollama.image.model');
if (!modelLayer) {
console.error('Model layer not found');
process.exit(1);
}
const modelFileName = modelLayer.digest.split(':').join('-');
const modelFile = path.join(OLLAMA_HOME, 'models', 'blobs', modelFileName);
if (!fs.existsSync(modelFile)) {
console.error(`Model file not found for ${modelId}:${modelTag}`);
process.exit(1);
}
console.log(modelFile);

ollama pull qwen2.5-coder:7b
llama-server -m $( ./get_ollama_gguf.js qwen2.5-coder:7b ) -fa --jinja --chat-template-file <( ./scripts/get_chat_template.py Qwen/Qwen2.5-Coder-7B-Instruct-GGUF tool_use )
@ochafik I've had this PR tab open in my browser for quite some time and only recently got around to building a simple voice assistant I've been meaning to make, which depends on tool calling. I've built a lot of LLM tools, but I've put off tool calling for ages due to the gotchas involved, and this really helped to smooth away those wrinkles. Using Qwen 32b with llama-server --jinja, the process of getting tool calling working was straightforward and worked like a charm right out of the box. So, thanks from me as well. Sincerely looking forward to #12379, but it's incredibly useful as is.
This supersedes #6389 (now using a fully C++ approach), #5695 (first attempt at supporting Functionary) and #9592 (more recent Python wrapper).
Which models are supported (in their native style)?
While any model should work (w/ generic fallback using JSON schema constraints), this PR supports the native call style of a few models:
Updates:
- `tool-call`: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars #12034
- `server`: fix tool-call of DeepSeek R1 Qwen, return reasoning_content (Command 7RB & DeepSeek R1) unless `--reasoning-format none` #11607
- `tool-call`: support Command R7B (+ return tool_plan "thoughts" in API) #11585

(note: streaming incubated in #12379)
Show all templates supported by minja and which handler they use
For natively supported models, it's important to have the right template (it might not be in the GGUF; note that we prefer the `tool_use` variant of the Jinja template if it's present in the GGUF metadata). You can check which template is defined by inspecting http://localhost:8080/props, and inspect the logs for `Chat format:`.

Any `tool_calls` field returned by `llama-server` should always conform to the JSON schema (to the extent that it uses supported features of JSON schemas), so there's no need to use any post-processor.
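For illustration (an abridged, made-up entry, not actual server output), such a tool call arrives in the usual OpenAI-compatible shape, with `arguments` as a JSON-encoded string:

```python
# Assumed/abridged shape of one tool_calls entry; json.loads is the only "parsing" needed.
import json

tool_call = {
    "id": "call_123",  # hypothetical id
    "type": "function",
    "function": {
        "name": "get_weather",                 # hypothetical tool
        "arguments": "{\"city\": \"Tokyo\"}",  # JSON-encoded string
    },
}
args = json.loads(tool_call["function"]["arguments"])
print(tool_call["function"]["name"], args["city"])  # -> get_weather Tokyo
```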
How to use / test

You can test tool calls as follows:

1. Get and build this PR's branch
2. Run `llama-server` w/ any model (Edited: bumped to quants / models that work w/ my agent example):
3. Call the chat completions endpoint (in non-streamed mode) with any OpenAI-compatible library, or plain curl:

It will output something like (once piped in `jq`):

I've also created some minimalistic Agent loop code in this Gist: it contains a few python tools & supports running them in a siloed docker container, along with examples (used to be part of this PR).
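As an illustration of the last step (a sketch under assumptions, not the PR's own curl example: the tool, its schema and the port are made up; any OpenAI-compatible client works the same way):

```python
from openai import OpenAI

# llama-server exposes an OpenAI-compatible endpoint; the model name is not used for routing.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Get the current weather in a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="whatever",
    messages=[{"role": "user", "content": "What is the weather in Tokyo?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```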
Background
This PR tackles two main problems related to tool calling:
Lazy grammars: Helping / forcing the model to follow the tool schemas w/ grammar constraints is tricky as in most cases the model may also output normal, unconstrained content (except if `"tool_choice": "required"` is specified in the request). It's not currently possible to say `.* "<tool_call>" constrained "</tool_call>"` as the leading `.*` will match eagerly. In [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389 I was avoiding this issue in the `thoughtful_steps` style, but the native tool call styles were still problematic.

Solved w/ lazy grammars activated by trigger words (similar to stop words, but awaited in the grammar implementation itself). Output is completely unconstrained before triggers, and completely constrained after, which allows for `content` vs. `tool_call` outputs, and even mixes of the two (for the few models that support that).
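A conceptual sketch of that control flow (not the actual sampler integration, which awaits the triggers inside the grammar implementation itself):

```python
# Sample freely until a trigger string appears, then let the tool-call grammar
# constrain every subsequent token.
def generate(next_token_free, next_token_constrained, triggers, max_tokens=256):
    """next_token_* are stand-in callables returning the next token as text."""
    out, constrained = "", False
    for _ in range(max_tokens):
        piece = next_token_constrained(out) if constrained else next_token_free(out)
        out += piece
        # e.g. triggers = ["<tool_call>"] or ['{"name":'] depending on the model
        if not constrained and any(t in out for t in triggers):
            constrained = True
    return out
```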
- For Llama 3.x (cf. these docs: 1, 2, 3), triggers are:
  - `<|python_tag|>` if any of the builtin tools are detected (`wolfram_alpha`, `brave_search` / `web_search` with `query` param, `code_interpreter` with `code` param); NOT for Llama 3.2
  - `{"name": "toolN"` (for each `toolN` in the list of `tools` in the request)
  - `{"name":` (needed for very small 1B/3B models which get confused very quickly otherwise), and some other variations (to allow the somewhat popular `{"type": "function", "name": ...`)
- For Functionary v3.1, we trigger on `<function=` and `<|python_tag|>` (NOTE: seems to work well w/ `Llama-3.1-Instruct`, e.g. it's on together.ai's docs). Note that `<|python_tag|>` here introduces freeform Python code, whereas for Llama-3.1-Instruct's template it introduces builtin tool calls in Python syntax. Almost the same, but handled quite differently.
- For Functionary v3.2, it's `>>>toolN\n` for each `toolN` (technically also triggering on `toolN\n` for the first tool call, there's a todo to avoid spurious matches by forcing a match at the very start)
- For Hermes Pro (cf. Hermes-Function-Calling repo), the trigger is `<tool_call>`.
- For Mistral Nemo, the trigger is the special `[TOOL_CALLS]` token
- For DeepSeek R1 and its distills, it's `<|tool▁calls▁begin|>` (Note: DeepSeek-R1 seems more eager to talk than to call tools for now, lemme know if you get it to work)
- For Firefunction v2, the trigger is `functools[`
- For other models ("generic" chat format), no lazy grammars are used, just a normal JSON schema that can contain schema-constrained tool calls or content (unless `tool_choice` is `required`)
Jinja chat templates for tool-call-able models are getting increasingly complex, and implementing each of them in C++ is a maintenance hazard. Solved w/ a minimal Jinja templating engine (`minja.hpp`), with just enough to render all the templates I could find in the wild. That's still a lot of code (2.5k LOC), but about 10x less than Jinja2Cpp (not even counting its dependencies - it needs a subset of Boost and some C++ backfills). It's trivial to extend (say, to add support for a new filter / test), and it comes with decent error reporting and simple tests. And we could always switch to another implementation in the future.

With this intro out of the way, here are the main parts of this PR:
- `minja.hpp`: minimal Jinja templating engine and its tests against actual templates & a few test contexts
- `--jinja` flag in Add Jinja template support #11016
- Tool call grammar generation + output parsing logic for 8 different tool call styles (covering most of the popular models, incl. Llama 3.x, Functionary 3, Qwen 2.5, DeepSeek R1, Mistral Nemo...), with a generic fallback.
- Lazy grammar wired into the sampler, using a mix of trigger words and trigger tokens to enable the grammar. Trigger tokens are also used to override printability of special tokens, even when the grammar is not lazy (e.g. when `"tool_choice": "required"` is passed in the request)
- Integration with `llama-server` (full `tools` & `tool_choice` support). (`cd examples/server/tests && ./tests.sh -m slow -v -x`)
).TODOs
Blocking:
- `sync`: minja #11499 (this PR's diff won't include chat-template.hpp or minja.hpp)
- `python_code_argument_name` in favour of `expect_tool_arguments`
Nice to haves:
- `at_first` semantics to require trigger word to be at start of output (equiv. to ^ regex behaviour; not using regexes as ^ can't be made to mean "start of entire string" reliably afaict), to reduce spurious triggers w/ Llama 3.x

See draft-times TODOs
- [ ] Support streaming (of content - as long as it doesn't trigger any partial antiprompt match - and of individual tool calls)
- `"all\n"` in non-tool-call outputs for Command R Plus, DeepSeek
- [ ] e2e tests for agent
- [ ] Add Google search tool as alternative to Brave
- `--special` for Nemo since last merge
- `[TOOL_CALLS]` token
- `<|python_tag|>` token
- `thoughtful_steps` tool support from [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389 (using JSON structured output even with models not trained for tool calling)
- `--cache-prompt` defaults to true; follow up will be to allow in-slot restoration and saving of cache, see this branch for instance
- `chat_template` should maybe be resolved earlier? (now a `llama_chat_template` class)
- llama_apply_chat_template would benefit from a massive facelift. Maybe passing in a struct? (have introduced a new C++ API `llama_chat_template::apply`)
- `llama_token_to_piece(ctx, token)` should really take `(model, token)` instead, but that's a breaking API change (`_llama_token_to_piece` that takes model; moved `llama_chat_template_from_model` helper to `common.cpp`)
- `builtin_tools` and `todays_date` in llama3.1's template
- `test-chat-templates` & `test-minja` (write each test case in a `.jinja` file)
- `bos_token` in the current chat template logic
- `examples/tool-call` from [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389
-hft
/--hf_template
flag to override the GGUF's chat templates from a HF model repo