
Tool call support (Llama 3.x, Functionary v3, Hermes 2 Pro, Mistral Nemo, generic) w/ lazy grammars #9639

Draft · ochafik wants to merge 267 commits into master

@ochafik (Collaborator) commented Sep 25, 2024

This supersedes #6389 (now using a fully C++ approach), #5695 (a first attempt at supporting Functionary) and #9592 (a more recent Python wrapper).

Background

It tackles two main problems related to tool calling:

  • Lazy grammars: Helping / forcing the model to follow the tool schemas w/ grammar constraints is tricky, as in most cases the model may also output normal, unconstrained content (unless "tool_choice": "required" is specified in the request). It's not currently possible to express .* "<tool_call>" constrained "</tool_call>" as a grammar, since the leading .* matches eagerly. In [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389 I avoided this issue in the thoughtful_steps style, but the native tool call styles were still problematic.

    • Solved w/ lazy grammars activated by trigger words (similar to stop words, and refactored into the same implementation). Output is completely unconstrained before a trigger and completely constrained after it, which allows for content vs. tool_call outputs, and even mixes of the two (for the few models that support that); see the sample output just after this list.

      • For Llama3.1-Instruct (cf. llama-stack-apps repo / these docs) for instance, triggers are <|python_tag|> and {"name": "toolN" (for each toolN in the list of tools in the request).
      • For Llama3.2-Instruct, we eagerly trigger on {" which isn't quite right but helps steer the 1B & 3B models. Will try and detect model size to keep a more specific trigger for the bigger 3.2 models.
      • For Hermes Pro (cf. Hermes-Function-Calling repo), it's <tool_call>.
      • For Functionary v3.llama3, it's >>>toolN\n for each toolN.
      • For Functionary v3-llama3.1, it's <function= and <|python_tag|>.
      • For Mistral Nemo, the trigger ought to be [TOOL_CALLS] but it doesn't seem to (ever?) be emitted, so we're triggering on {" instead for now.
      • For other models ("generic" tool call style), no lazy grammars are used, just a normal JSON schema that can contain schema-constrained tool calls or content (unless tool_choice is required).
  • Jinja chat templates for tool-call-able models are getting increasingly complex, and implementing each of them in C++ is a maintenance hazard.

    • Solved by implementing a minimal Jinja engine (minja.hpp), with just enough to render all the templates I could find in the wild. That's still a lot of code (2.5k LOC), but about 10x less than Jinja2Cpp (not even counting its dependencies: it needs a subset of Boost and some C++ backfills). It's trivial to extend (say, to add support for a new filter / test), and it comes with decent error reporting and simple tests. And we could always switch to another implementation in the future.
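
To make the lazy-grammar mechanism concrete, here's a hypothetical completion from a Hermes-style model: everything before the <tool_call> trigger is sampled unconstrained, and the trigger flips sampling into grammar-constrained mode, enforcing the tool's JSON schema:

    I'll run a quick snippet to check that.
    <tool_call>{"name": "ipython", "arguments": {"code": "print(6 * 7)"}}</tool_call>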

With this intro out of the way, here are the parts of this PR that could possibly be sent separately (currently one list, to be re-itemized as commits):

  • minja.hpp: minimal Jinja templating engine and its tests against actual templates & a few test contexts: now in its own repo (https://github.com/google/minja), integrated w/ --jinja flag in Add Jinja template support #11016

  • Tool call grammar generation + output parsing logic for Llama 3.1, Functionary v3 (2 variants) and Hermes 2 Pro (see the GBNF sketch after this list)

  • Integration with llama-server (tools, tool_choice) when --jinja enabled

  • grammar_trigger_words + llama_antiprompts: refactors the stop logic (barebones Aho–Corasick algorithm to handle multiple stop words efficiently - with grammar trigger words we may have many), aligning cli & server (e.g. single-token stop logic) and handling grammar trigger words.
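
As a rough illustration of the grammar-generation part, here's a hand-written, simplified GBNF sketch of what the constrained (post-trigger) output might look like for a single hypothetical ipython(code: string) tool; the real grammars are generated from each tool's JSON schema and vary per tool call style:

    # Hypothetical sketch only; actual grammars are derived from the request's tool schemas.
    root      ::= "<tool_call>" tool-call "</tool_call>"
    tool-call ::= "{\"name\": \"ipython\", \"arguments\": {\"code\": " string "}}"
    string    ::= "\"" ( [^"\\] | "\\" (["\\bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F]) )* "\""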

How to use / test

While any model should work (using generic support based on JSON schema constraints), this PR supports the native call style of a few models:

  • Llama 3.x
  • Functionary 3.x
  • Hermes 2/3, Qwen 2.5
  • Mistral Nemo

For natively supported models, it's important to have the right template (it might not be in the GGUF; note that we prefer the tool_use variant of the Jinja template if it's present in the GGUF metadata). You can check which template is in use by inspecting http://localhost:8080/props, and by looking for Tool call style: in the server logs.
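
For example, assuming the server runs on the default port and jq is installed (the chat_template field name is an assumption based on the current /props response; adjust if it differs):

    # Print the first few lines of the chat template the server resolved
    curl -s http://localhost:8080/props | jq -r .chat_template | head -n 5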

I've created some minimalistic agent loop code in this Gist: it contains a few Python tools & supports running them in a siloed Docker container, along with examples (this used to be part of this PR).

As for basic tool call functionality, you can test it just with this PR:

  • Run llama-server w/ any model:

    cmake -B build -DLLAMA_CURL=1
    cmake --build build -t llama-server --parallel
    alias llama-server=./build/bin/llama-server
    
    # Native support for Mistral Nemo, Qwen 2.5, Hermes 3, Functionary 3.x
    # Note that some of these GGUFs lack the right template, so we override it
    # (otherwise they'd use the generic tool call support, which may be less efficient
    # and consume more tokens)
    
    # (-fa enables flash attention; -ctk/-ctv quantize the KV cache to save memory)
    llama-server --jinja -fa -ctk q4_0 -ctv q4_0 --verbose \
      -hfr bartowski/Qwen2.5-7B-Instruct-GGUF -hff Qwen2.5-7B-Instruct-Q4_K_M.gguf
    
    llama-server --jinja -fa -ctk q4_0 -ctv q4_0 --verbose \
      -hfr NousResearch/Hermes-3-Llama-3.1-8B-GGUF -hff Hermes-3-Llama-3.1-8B.Q4_K_M.gguf \
      --chat-template-file <( python scripts/get_hf_chat_template.py NousResearch/Hermes-3-Llama-3.1-8B tool_use )
    
    llama-server --jinja -fa --verbose \
      -hfr meetkai/functionary-small-v3.2-GGUF -hff functionary-small-v3.2.Q8_0.gguf \
      --chat-template-file <( python scripts/get_hf_chat_template.py meetkai/functionary-medium-v3.2 )
    
    llama-server --jinja -fa --verbose \
      -hfr lmstudio-community/Llama-3.2-3B-Instruct-GGUF -hff Llama-3.2-3B-Instruct-Q6_K.gguf \
      --chat-template-file <( python scripts/get_hf_chat_template.py meta-llama/Llama-3.2-3B-Instruct )
    
    # Note the --special flag: this is needed b/c of a regression from the last merge, will fix!
    llama-server --jinja -fa -ctk q8_0 -ctv q8_0 --verbose --special \
      -hfr bartowski/Mistral-Nemo-Instruct-2407-GGUF -hff Mistral-Nemo-Instruct-2407-Q8_0.gguf \
      --chat-template-file <( python scripts/get_hf_chat_template.py mistralai/Mistral-Nemo-Instruct-2407 )
    
    # Generic support, e.g. Phi 3.5, Gemma 2b, but really anything goes
    
    llama-server --jinja -fa --verbose \
      -hfr bartowski/Phi-3.5-mini-instruct-GGUF -hff Phi-3.5-mini-instruct-Q4_K_M.gguf
    
    llama-server --jinja -fa --verbose \
      -hfr bartowski/gemma-2-2b-it-GGUF -hff gemma-2-2b-it-Q4_K_M.gguf
  • Call the chat completions endpoint (not in streamed mode, as streaming support is still a TODO) with any OpenAI-compatible library, or plain curl:

    curl http://localhost:8080/v1/chat/completions -d '{
      "model": "gpt-3.5-turbo",
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "ipython",
            "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
            "parameters": {
              "type": "object",
              "properties": {
                "code": {
                  "type": "string",
                  "description": "The code to run in the ipython interpreter."
                }
              },
              "required": ["code"]
            }
          }
        }
      ],
      "messages": [
        {
          "role": "user",
          "content": "Print a hello world message with python."
        }
      ]
    }'
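
If the model goes for a tool call, the response carries an OpenAI-style tool_calls array instead of text content, roughly of this shape (abridged and hypothetical; exact fields such as id may differ):

    {
      "choices": [{
        "finish_reason": "tool_calls",
        "message": {
          "role": "assistant",
          "content": null,
          "tool_calls": [{
            "type": "function",
            "function": {
              "name": "ipython",
              "arguments": "{\"code\": \"print('Hello, world!')\"}"
            }
          }]
        }
      }]
    }

To force a tool call (no free-form content), add "tool_choice": "required" at the top level of the same request body. To continue an agent loop, append the assistant message and a {"role": "tool", "content": "<tool output>"} message to messages, then call the endpoint again.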

TODOs before undrafting:

  • Fix CI build (tests still failing on windows)
  • Test lazy grammars (cf. removed test-antiprompts.cpp)
  • Support DeepSeek-R1-Distill*
  • Add a way to require trigger word to be at start of output
  • Support streaming (of content - as long as it doesn't trigger any partial antiprompt match - and of individual tool calls)
  • Implement strftime_now in minja (for Llama 3.2), also update today's date for Llama 3.1
  • Functionary v3.2: strip leading "all\n" in non-tool-call outputs
  • Add grammar trigger words support to llama-cli
  • Support regexps as antiprompts? Would allow triggering the tool call grammar for small Llama 3.2 models (1B, 3B) on (^|\n)?{" without triggering spuriously elsewhere.
  • Add support for broken templates (GML3..., Command R Plus, DeepSeek)
  • e2e tests for agent
  • Add Google search tool as alternative to Brave
  • Simplify stop word / trigger word logic (push down to grammar)
  • Fix regression requiring --special for Nemo since last merge
  • Move minja to its own location w/ fuller testing (fuzzing, etc) or at least its own PR --> https://github.com/google/minja
  • Port former behave / feature tool call tests to new pytest setup (server : replace behave with pytest #10416)
  • Nemo: handle special [TOOL_CALLS] token
  • Qwen2.5-72B-Instruct
  • Llama: suspicious early terminations in hello world tests w/ the explicit python tool w/ JSON output (could be a failure to escape strings?). Also, need to keep the special <|python_tag|> token
  • Bring back generic thoughtful_steps tool support from [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389 (using JSON structured output even with models not trained for tool calling)
  • Add support for {"type": "code_interpreter"} (special-cased by functionary-medium-v3.1's template), maybe using ipython automatically for llama 3.1
  • Support jinja templates that explode on system prompts (replicate current chat template handling that puts system in user)
  • Add more tests (heavy e2e w/ actual models, tool_choice = none, parallel tool call, etc)
  • Add configurable network isolation of tools w/ a proxy (also caches pip & deb packages & limits access to host)
  • KV cache saving / reuse (within session & beyond) in agent (--cache-prompt defaults to true; a follow-up will be to allow in-slot restoration and saving of cache, see this branch for instance)
  • Add tool call grammar tests (although indirectly covered by server "required" test cases)
  • Add more tools (brave search) + agent examples
  • Refactorings?
    • Ideally would pass some kind of ChatHandler between OAI init & final callback, and make it handle streaming / non streaming cases? (should parallel tool calls be streamed?)
    • chat_template should maybe be resolved earlier? (now a llama_chat_template class)
    • llama_apply_chat_template would benefit from a massive facelift. Maybe passing in a struct? (have introduced a new C++ API llama_chat_template::apply)
    • llama_token_to_piece(ctx, token) should really take (model, token) instead, but that's a breaking API change
      • calls common-local _llama_token_to_piece that takes model. Moved llama_chat_template_from_model helper to common.cpp
  • Fix functionary-medium-* templates' golden generation
  • Add examples to server readme
  • Support key-value overrides for templates (e.g. builtin_tools and todays_date in llama3.1's template)
    • Done by tool call handler, not user-configurable
  • Unify test-chat-templates & test-minja (write each test case in a .jinja file)
    • Fix a couple of missing bos_token in the current chat template logic
  • Bring back agent / tool call loop example + python tools isolation in docker (examples/tool-call) from [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389
  • Test w/ meetkai/functionary-small-v3.2

Possible follow ups:

@ochafik (Collaborator, Author) commented Jan 21, 2025

Extracted the Python agent code w/ Docker siloing to this gist, and updated this PR's description with simpler instructions.

Also, --jinja support was merged: #11016
