
Tool call support (Llama 3.x, Functionary v3, Hermes 2 Pro, Mistral Nemo, generic) w/ lazy grammars #9639

Draft · ochafik wants to merge 267 commits into master

@ochafik (Collaborator) commented Sep 25, 2024

This supersedes #6389 (now using a fully C++ approach), #5695 (a first attempt at supporting Functionary) and #9592 (a more recent Python wrapper).

Background

It tackles two main problems related to tool calling:

  • Lazy grammars: Helping / forcing the model to follow the tool schemas w/ grammar constraints is tricky, as in most cases the model may also output normal, unconstrained content (unless "tool_choice": "required" is specified in the request). It's not currently possible to express .* "<tool_call>" constrained "</tool_call>" as a grammar, since the leading .* matches eagerly. In [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389 I avoided this issue in the thoughtful_steps style, but the native tool call styles were still problematic.

    • Solved w/ lazy grammars activated by trigger words (similar to stop words, and refactored into the same implementation). Output is completely unconstrained before a trigger and completely constrained after it, which allows for content vs. tool_call outputs, and even mixes of the two (for the few models that support that); see the sample output just after this list.

      • For Llama3.1-Instruct (cf. llama-stack-apps repo / these docs) for instance, triggers are <|python_tag|> and {"name": "toolN" (for each toolN in the list of tools in the request).
      • For Llama3.2-Instruct, we eagerly trigger on {" which isn't quite right but helps steer the 1B & 3B models. Will try and detect model size to keep a more specific trigger for the bigger 3.2 models.
      • For Hermes Pro (cf. Hermes-Function-Calling repo), it's <tool_call>.
      • For Functionary v3.llama3, it's >>>toolN\n for each toolN.
      • For Functionary v3-llama3.1, it's <function= and <|python_tag|>.
      • For Mistral Nemo, the trigger ought to be [TOOL_CALLS] but it doesn't seem to (ever?) be emitted, so we're triggering on {" instead for now.
      • For other models ("generic" tool call style), no lazy grammars are used, just a normal JSON schema that can contain schema-constrained tool calls or content (unless tool_choice is required).
  • Jinja chat templates for tool-call-able models are getting increasingly complex, and implementing each of them in C++ is a maintenance hazard.

    • Solved by implementing a minimal Jinja engine (minja.hpp), with just enough to render all the templates I could find in the wild. That's still a lot of code (2.5k LOC), but about 10x less than Jinja2Cpp (not even counting its dependencies: it needs a subset of Boost and some C++ backfills). It's trivial to extend (say, to add support for a new filter / test), and it comes with decent error reporting and simple tests. And we could always switch to another implementation in the future.
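
To make the lazy-grammar mechanism concrete, here's a hypothetical completion from a Hermes-style model: everything before the <tool_call> trigger is sampled unconstrained, and the trigger flips sampling into grammar-constrained mode, enforcing the tool's JSON schema:

    I'll run a quick snippet to check that.
    <tool_call>{"name": "ipython", "arguments": {"code": "print(6 * 7)"}}</tool_call>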

With this intro out of the way, here are the parts of this PR that could possibly be sent separately (currently one list, to be re-itemized as commits):

  • minja.hpp: minimal Jinja templating engine and its tests against actual templates & a few test contexts: now in its own repo (https://github.com/google/minja), integrated w/ --jinja flag in Add Jinja template support #11016

  • Tool call grammar generation + output parsing logic for Llama 3.1, Functionary v3 (2 variants) and Hermes 2 Pro (see the GBNF sketch after this list)

  • Integration with llama-server (tools, tool_choice) when --jinja enabled

  • grammar_trigger_words + llama_antiprompts: refactors the stop logic (barebones Aho–Corasick algorithm to handle multiple stop words efficiently - with grammar trigger words we may have many), aligning cli & server (e.g. single-token stop logic) and handling grammar trigger words.
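
As a rough illustration of the grammar-generation part, here's a hand-written, simplified GBNF sketch of what the constrained (post-trigger) output might look like for a single hypothetical ipython(code: string) tool; the real grammars are generated from each tool's JSON schema and vary per tool call style:

    # Hypothetical sketch only; actual grammars are derived from the request's tool schemas.
    root      ::= "<tool_call>" tool-call "</tool_call>"
    tool-call ::= "{\"name\": \"ipython\", \"arguments\": {\"code\": " string "}}"
    string    ::= "\"" ( [^"\\] | "\\" (["\\bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F]) )* "\""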

How to use / test

While any model should work (using generic support based on JSON schema constraints), this PR supports the native call style of a few models:

  • Llama 3.x
  • Functionary 3.x
  • Hermes 2/3, Qwen 2.5
  • Mistral Nemo

For natively supported models, it's important to have the right template (it might not be in the GGUF; note that we prefer the tool_use variant of the Jinja template if it's present in the GGUF metadata). You can check which template is in use by inspecting http://localhost:8080/props, and by looking for Tool call style: in the server logs.
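
For example, assuming the server runs on the default port and jq is installed (the chat_template field name is an assumption based on the current /props response; adjust if it differs):

    # Print the first few lines of the chat template the server resolved
    curl -s http://localhost:8080/props | jq -r .chat_template | head -n 5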

I've created some minimalistic agent loop code in this Gist: it contains a few Python tools & supports running them in a siloed Docker container, along with examples (this used to be part of this PR).

As for basic tool call functionality, you can test it just with this PR:

  • Run llama-server w/ any model:

    cmake -B build -DLLAMA_CURL=1
    cmake --build build -t llama-server --parallel
    alias llama-server=./build/bin/llama-server
    
    # Native support for Mistral Nemo, Qwen 2.5, Hermes 3, Functionary 3.x
    # Note that some of these GGUFs lack the right template, so we override it
    # (otherwise they'd use the generic tool call support, which may be less efficient
    # and consume more tokens)
    
    # (-fa enables flash attention; -ctk/-ctv quantize the KV cache to save memory)
    llama-server --jinja -fa -ctk q4_0 -ctv q4_0 --verbose \
      -hfr bartowski/Qwen2.5-7B-Instruct-GGUF -hff Qwen2.5-7B-Instruct-Q4_K_M.gguf
    
    llama-server --jinja -fa -ctk q4_0 -ctv q4_0 --verbose \
      -hfr NousResearch/Hermes-3-Llama-3.1-8B-GGUF -hff Hermes-3-Llama-3.1-8B.Q4_K_M.gguf \
      --chat-template-file <( python scripts/get_hf_chat_template.py NousResearch/Hermes-3-Llama-3.1-8B tool_use )
    
    llama-server --jinja -fa --verbose \
      -hfr meetkai/functionary-small-v3.2-GGUF -hff functionary-small-v3.2.Q8_0.gguf \
      --chat-template-file <( python scripts/get_hf_chat_template.py meetkai/functionary-medium-v3.2 )
    
    llama-server --jinja -fa --verbose \
      -hfr lmstudio-community/Llama-3.2-3B-Instruct-GGUF -hff Llama-3.2-3B-Instruct-Q6_K.gguf \
      --chat-template-file <( python scripts/get_hf_chat_template.py meta-llama/Llama-3.2-3B-Instruct )
    
    # Note the --special flag: this is needed b/c of a regression from the last merge, will fix!
    llama-server --jinja -fa -ctk q8_0 -ctv q8_0 --verbose --special \
      -hfr bartowski/Mistral-Nemo-Instruct-2407-GGUF -hff Mistral-Nemo-Instruct-2407-Q8_0.gguf \
      --chat-template-file <( python scripts/get_hf_chat_template.py mistralai/Mistral-Nemo-Instruct-2407 )
    
    # Generic support, e.g. Phi 3.5, Gemma 2b, but really anything goes
    
    llama-server --jinja -fa --verbose \
      -hfr bartowski/Phi-3.5-mini-instruct-GGUF -hff Phi-3.5-mini-instruct-Q4_K_M.gguf
    
    llama-server --jinja -fa --verbose \
      -hfr bartowski/gemma-2-2b-it-GGUF -hff gemma-2-2b-it-Q4_K_M.gguf
  • Call the chat completions endpoint (not in streamed mode, as streaming support is still a TODO) with any OpenAI-compatible library, or plain curl:

    curl http://localhost:8080/v1/chat/completions -d '{
      "model": "gpt-3.5-turbo",
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "ipython",
            "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
            "parameters": {
              "type": "object",
              "properties": {
                "code": {
                  "type": "string",
                  "description": "The code to run in the ipython interpreter."
                }
              },
              "required": ["code"]
            }
          }
        }
      ],
      "messages": [
        {
          "role": "user",
          "content": "Print a hello world message with python."
        }
      ]
    }'
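
If the model goes for a tool call, the response carries an OpenAI-style tool_calls array instead of text content, roughly of this shape (abridged and hypothetical; exact fields such as id may differ):

    {
      "choices": [{
        "finish_reason": "tool_calls",
        "message": {
          "role": "assistant",
          "content": null,
          "tool_calls": [{
            "type": "function",
            "function": {
              "name": "ipython",
              "arguments": "{\"code\": \"print('Hello, world!')\"}"
            }
          }]
        }
      }]
    }

To force a tool call (no free-form content), add "tool_choice": "required" at the top level of the same request body. To continue an agent loop, append the assistant message and a {"role": "tool", "content": "<tool output>"} message to messages, then call the endpoint again.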

TODOs before undrafting:

  • Fix CI build (tests still failing on windows)
  • Test lazy grammars (cf. removed test-antiprompts.cpp)
  • Support DeepSeek-R1-Distill*
  • Add a way to require trigger word to be at start of output
  • Support streaming (of content - as long as it doesn't trigger any partial antiprompt match - and of individual tool calls)
  • Implement strftime_now in minja (for Llama 3.2), also update today's date for Llama 3.1
  • Functionary v3.2: strip leading "all\n" in non-tool-call outputs
  • Add grammar trigger words support to llama-cli
  • Support regexps as antiprompts? Would allow triggering the tool call grammar for small Llama 3.2 models (1B, 3B) on (^|\n)?{" without triggering spuriously elsewhere.
  • Add support for broken templates (GML3..., Command R Plus, DeepSeek)
  • e2e tests for agent
  • Add Google search tool as alternative to Brave
  • Simplify stop word / trigger word logic (push down to grammar)
  • Fix regression requiring --special for Nemo since last merge
  • Move minja to its own location w/ fuller testing (fuzzing, etc) or at least its own PR --> https://github.com/google/minja
  • Port former behave / feature tool call tests to new pytest setup (server : replace behave with pytest #10416)
  • Nemo: handle special [TOOL_CALLS] token
  • Qwen2.5-72B-Instruct
  • Llama: suspicious early terminations in hello world tests w/ the explicit python tool w/ JSON output (could be a failure to escape strings?). Also, need to keep the special <|python_tag|> token
  • Bring back generic thoughtful_steps tool support from [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389 (using JSON structured output even with models not trained for tool calling)
  • Add support for {"type": "code_interpreter"} (special-cased by functionary-medium-v3.1's template), maybe using ipython automatically for llama 3.1
  • Support jinja templates that explode on system prompts (replicate current chat template handling that puts system in user)
  • Add more tests (heavy e2e w/ actual models, tool_choice = none, parallel tool call, etc)
  • Add configurable network isolation of tools w/ a proxy (also caches pip & deb packages & limits access to host)
  • KV cache saving / reuse (within session & beyond) in agent (--cache-prompt defaults to true; a follow-up will be to allow in-slot restoration and saving of cache, see this branch for instance)
  • Add tool call grammar tests (although indirectly covered by server "required" test cases)
  • Add more tools (brave search) + agent examples
  • Refactorings?
    • Ideally would pass some kind of ChatHandler between OAI init & final callback, and make it handle streaming / non streaming cases? (should parallel tool calls be streamed?)
    • chat_template should maybe be resolved earlier? (now a llama_chat_template class)
    • llama_apply_chat_template would benefit from a massive facelift. Maybe passing in a struct? (have introduced a new C++ API llama_chat_template::apply)
    • llama_token_to_piece(ctx, token) should really take (model, token) instead, but that's a breaking API change
      • calls common-local _llama_token_to_piece that takes model. Moved llama_chat_template_from_model helper to common.cpp
  • Fix functionary-medium-* templates' golden generation
  • Add examples to server readme
  • Support key-value overrides for templates (e.g. builtin_tools and todays_date in llama3.1's template)
    • Done by tool call handler, not user-configurable
  • Unify test-chat-templates & test-minja (write each test case in a .jinja file)
    • Fix a couple of missing bos_token in the current chat template logic
  • Bring back agent / tool call loop example + python tools isolation in docker (examples/tool-call) from [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389
  • Test w/ meetkai/functionary-small-v3.2

Possible follow ups:

@ochafik (Collaborator, Author) commented Jan 21, 2025

Extracted the Python agent code w/ Docker siloing to this gist, and updated this PR's description with simpler instructions.

Also, --jinja support was merged: #11016
