Skip to content

Latest commit

 

History

History
 
 

openai

examples.openai: OpenAI-compatibility layer on top of server.cpp

New Python OpenAI API compatibility server, which calls into / spawns the C++ server under the hood:

python -m examples.openai.server --model model.gguf

Prerequisites

Note: To get conda, just install Miniforge (it's OSS): https://github.com/conda-forge/miniforge

conda create -n agent python=3.11
conda activate agent
pip install -r examples/openai/requirements.txt

Features

The new examples/openai/server.py:

  • Supports grammar-constrained tool calling for all models (incl. Mixtral 7x8B)

    • Optimised support for Functionary & Nous Hermes, easy to extend to other tool-calling schemes

    • Generic support w/ JSON schema that guides the model towards tool usage (at the cost of extra tokens):

        {
          // original_thought: string,
          thought_about_next_step_only: string,
          next_step: {tool_calls: {name: string, arguments: any}} | {result: T}
        }
        // Where T is the output JSON schema, or 'any'
      • Option to publicise schemas to models as TypeScript signatures (as for Functionary) or JSON schema.

      • Supports models that require user/assistant alternance (like Mixtral Instruct) by merging system messages into user messages.

  • Spawns the C++ llama.cpp server under the hood (unless passed --endpoint), but only uses its non-chat endpoint

    (depending on the prompting strategy, we weave the tool & output schema along with the chat template into the raw model grammar constraints)

  • Uses the actual Jinja2 templates stored in the GGUF models

  • Will eventually also spawn whisper.cpp and another server subprocess for the embeddings endpoint

Rationale: the C++ server lacks some OpenAI compatibility features (and can't realistically keep up with prompt templates w/o bringing in too many dependencies), this new layer could allow focusing the C++ server on serving efficiency and delegate OAI compliance to a layer easier to maintain.

Test

If you want to see tools in action, look at the agent example. Otherwise:

Start the server in Terminal 1:

python -m examples.openai --model  ~/AI/Models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf

Query it in Terminal 2 (or use it from any framework that makes use of tools: note tool calls are guaranteed to comply to the schema, so retries are likely not necessary!):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "tools": [{
          "type": "function",
          "function": {
              "name": "get_current_weather",
              "description": "Get the current weather",
              "parameters": {
                  "type": "object",
                  "properties": {
                      "location": {
                          "type": "string",
                          "description": "The city and state, e.g. San Francisco, CA"
                      },
                      "format": {
                          "type": "string",
                          "enum": ["celsius", "fahrenheit"],
                          "description": "The temperature unit to use. Infer this from the users location."
                      }
                  },
                  "required": ["location", "format"]
              }
          }
      }, {
          "type": "function",
          "function": {
              "name": "get_n_day_weather_forecast",
              "description": "Get an N-day weather forecast",
              "parameters": {
                  "type": "object",
                  "properties": {
                      "location": {
                          "type": "string",
                          "description": "The city and state, e.g. San Francisco, CA"
                      },
                      "format": {
                          "type": "string",
                          "enum": ["celsius", "fahrenheit"],
                          "description": "The temperature unit to use. Infer this from the users location."
                      },
                      "num_days": {
                          "type": "integer",
                          "description": "The number of days to forecast"
                      }
                  },
                  "required": ["location", "format", "num_days"]
              }
          }
      }],
    "messages": [
      {"role": "system", "content": "Do not make assumptions about what values to plug into functions. Ask for clarification if a user request is ambiguous."},
      {"role": "user", "content": "what is the weather going to be like in San Francisco and Glasgow over the next 4 days"}
    ]
  }'
Show output
{
  "id": "chatcmpl-3095057176",
  "object": "chat.completion",
  "created": 1711726921,
  "model": "gpt-3.5-turbo",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "name": null,
        "tool_call_id": null,
        "content": "In order to provide the required information, I need to call the get_n_day_weather_forecast function twice, once for San Francisco and once for Glasgow.",
        "tool_calls": [
          {
            "id": "call_970977",
            "type": "function",
            "function": {
              "name": "get_n_day_weather_forecast",
              "arguments": {
                "location": "San Francisco, CA",
                "format": "celsius",
                "num_days": 4
              }
            }
          }
        ]
      },
      "logprobs": null,
      "finish_reason": "tool_calls"
    }
  ],
  "usage": {
    "prompt_tokens": 546,
    "completion_tokens": 118,
    "total_tokens": 664
  },
  "system_fingerprint": "...",
  "error": null
}

TODO

  • Embedding endpoint w/ distinct server subprocess

  • Evaluate options for session caching

    • Pass session id & store / read from file?

    • Support parent session ids for trees of thought?

    • Support precaching long prompts from CLI / read session files?

  • Follow-ups

    • Remove OAI support from server

    • Remove non-Python json-schema-to-grammar versions

    • Reach out to frameworks to advertise new option.