New Python OpenAI API compatibility server, which calls into / spawns the C++ server under the hood:
```bash
python -m examples.openai.server --model model.gguf
```
Note: To get conda, just install Miniforge (it's OSS): https://github.com/conda-forge/miniforge
```bash
conda create -n agent python=3.11
conda activate agent
pip install -r examples/openai/requirements.txt
```
The new examples/openai/server.py:
- Supports grammar-constrained tool calling for all models (incl. Mixtral 8x7B)
- Optimised support for Functionary & Nous Hermes; easy to extend to other tool-calling schemes
- Generic support w/ a JSON schema that guides the model towards tool usage (at the cost of extra tokens):

  ```ts
  {
    original_thought: string,
    thought_about_next_step_only: string,
    next_step: {tool_calls: {name: string, arguments: any}} | {result: T}
  }
  // Where T is the output JSON schema, or 'any'
  ```

- Option to publicise schemas to models as TypeScript signatures (as for Functionary) or JSON schema.
- Supports models that require user/assistant alternation (like Mixtral Instruct) by merging system messages into user messages (see the sketch after this list).
- Spawns the C++ llama.cpp server under the hood (unless passed `--endpoint`), but only uses its non-chat endpoint (depending on the prompting strategy, we weave the tool & output schema along with the chat template into the raw model grammar constraints)
- Uses the actual Jinja2 templates stored in the GGUF models
- Will eventually also spawn whisper.cpp and another server subprocess for the embeddings endpoint
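Here's a minimal sketch of that system-message merging, for illustration only; the helper name and exact merge rules are assumptions, not code from this PR:

```python
# Illustrative sketch (assumed names/rules): fold system messages into the next
# user message so the conversation strictly alternates user/assistant turns,
# as required by models like Mixtral Instruct.
def merge_system_into_user(messages: list[dict]) -> list[dict]:
    merged: list[dict] = []
    pending: list[str] = []
    for msg in messages:
        if msg["role"] == "system":
            pending.append(msg["content"])
        elif msg["role"] == "user" and pending:
            merged.append({"role": "user", "content": "\n".join(pending + [msg["content"]])})
            pending = []
        else:
            merged.append(msg)
    if pending:
        # A trailing system message becomes a user message of its own.
        merged.append({"role": "user", "content": "\n".join(pending)})
    return merged
```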
Rationale: the C++ server lacks some OpenAI compatibility features (and can't realistically keep up with prompt templates w/o bringing in too many dependencies), so this new layer could let the C++ server focus on serving efficiency while delegating OAI compliance to a layer that's easier to maintain.
If you want to see tools in action, look at the agent example. Otherwise:
Start the server in Terminal 1:
```bash
python -m examples.openai --model ~/AI/Models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf
```
Query it in Terminal 2 (or use it from any framework that makes use of tools; note that tool calls are guaranteed to comply with the schema, so retries are likely not necessary!):
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g. San Francisco, CA"
            },
            "format": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"],
              "description": "The temperature unit to use. Infer this from the user'\''s location."
            }
          },
          "required": ["location", "format"]
        }
      }
    }, {
      "type": "function",
      "function": {
        "name": "get_n_day_weather_forecast",
        "description": "Get an N-day weather forecast",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g. San Francisco, CA"
            },
            "format": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"],
              "description": "The temperature unit to use. Infer this from the user'\''s location."
            },
            "num_days": {
              "type": "integer",
              "description": "The number of days to forecast"
            }
          },
          "required": ["location", "format", "num_days"]
        }
      }
    }],
    "messages": [
      {"role": "system", "content": "Do not make assumptions about what values to plug into functions. Ask for clarification if a user request is ambiguous."},
      {"role": "user", "content": "what is the weather going to be like in San Francisco and Glasgow over the next 4 days"}
    ]
  }'
```
Output:
```json
{
  "id": "chatcmpl-3095057176",
  "object": "chat.completion",
  "created": 1711726921,
  "model": "gpt-3.5-turbo",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "name": null,
        "tool_call_id": null,
        "content": "In order to provide the required information, I need to call the get_n_day_weather_forecast function twice, once for San Francisco and once for Glasgow.",
        "tool_calls": [
          {
            "id": "call_970977",
            "type": "function",
            "function": {
              "name": "get_n_day_weather_forecast",
              "arguments": {
                "location": "San Francisco, CA",
                "format": "celsius",
                "num_days": 4
              }
            }
          }
        ]
      },
      "logprobs": null,
      "finish_reason": "tool_calls"
    }
  ],
  "usage": {
    "prompt_tokens": 546,
    "completion_tokens": 118,
    "total_tokens": 664
  },
  "system_fingerprint": "...",
  "error": null
}
```
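You can also hit it from Python; here's a minimal sketch using the official `openai` client (assuming, as the curl example above suggests, that the `model` field is just a placeholder for the locally loaded GGUF):

```python
# Sketch: query the local compatibility server with the official openai client
# (pip install openai). The API key is unused but required by the client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-unused")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder; the locally loaded GGUF model answers
    messages=[{"role": "user", "content": "What's the weather in Glasgow over the next 4 days?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_n_day_weather_forecast",
            "description": "Get an N-day weather forecast",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "format": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                    "num_days": {"type": "integer"},
                },
                "required": ["location", "format", "num_days"],
            },
        },
    }],
)
print(response.choices[0].message.tool_calls)
```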
TODO:

- Embedding endpoint w/ distinct server subprocess
- Evaluate options for session caching (see the hypothetical sketch after this list)
  - Pass session id & store / read from file?
  - Support parent session ids for trees of thought?
  - Support precaching long prompts from CLI / read session files?
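To make the session-caching idea concrete, here's a purely hypothetical sketch of a file-backed store keyed by session id, with optional parent links for trees of thought; none of these names exist in the codebase:

```python
# Hypothetical sketch of "pass session id & store / read from file", with
# parent session ids so trees of thought can share a common message prefix.
import json
from pathlib import Path

CACHE_DIR = Path("~/.cache/llama-oai-sessions").expanduser()

def save_session(session_id: str, messages: list[dict], parent_id: str | None = None) -> None:
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    state = {"parent": parent_id, "messages": messages}
    (CACHE_DIR / f"{session_id}.json").write_text(json.dumps(state))

def load_session(session_id: str) -> list[dict]:
    path = CACHE_DIR / f"{session_id}.json"
    if not path.exists():
        return []
    state = json.loads(path.read_text())
    # Recursively prepend the parent session's messages, if any.
    prefix = load_session(state["parent"]) if state["parent"] else []
    return prefix + state["messages"]
```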
Follow-ups:

- Remove OAI support from server
- Remove non-Python json-schema-to-grammar versions
- Reach out to frameworks to advertise new option.