A lightweight, OpenAI-compatible HTTP proxy that routes requests to multiple backend LLM servers based on the requested model. It supports streaming (SSE), per-server parameter defaults and overrides, virtual profiles, and simple request shaping via a clean TOML configuration.
Key capabilities:
- Single endpoint per server
- Bearer auth via `api_key`
- Model allow-list
- Defaults (apply if missing), overrides (force-set), and deny (JSON Pointers)
- Virtual model profiles (suffix-based)
- Option to hide base models (expose only profile-suffixed models)
- Default system/developer messages upsert (added if missing)
- llama.cpp-style routing (model via `Authorization` header on non-`/v1` paths)
Prerequisites:
- Java 17+
- Maven 3.x
- Configure
  - Use `examples/config.toml` as a reference and place your config at the path configured by `Constants.CONFIG_FILE` (e.g. `config.toml`).
- Build
  - `./compile.sh` - Runs `mvn compile` and `mvn package`
- Run
  - Development (Maven Exec): `./run.sh` (runs `mvn exec:java`)
  - From JAR: `./run-jar.sh` (runs `java -jar target/llm-proxy-1.0.0.jar`)
The server listens on the port defined by `Constants.PROXY_PORT` (the current default is 3000).
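A typical local workflow might look like the following sketch (it assumes the default port of 3000 and a valid `config.toml` in the working directory):

```sh
./compile.sh     # mvn compile && mvn package
./run-jar.sh     # java -jar target/llm-proxy-1.0.0.jar

# In another terminal: quick smoke test that the proxy is up and serving models
curl http://localhost:3000/v1/models
```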
Define servers as top-level tables. Each server has one endpoint and optional behaviors and profiles. See `examples/config.toml` for a complete example.
Schema overview:
- Server table: `[ServerName]`
  - `endpoint` (string, required): Base URL to the backend (e.g. `https://api.openai.com/v1`)
    - If the endpoint ends with `/v1` and you call `/v1/...`, the proxy strips the duplicate `/v1` once.
  - `api_key` (string, optional): If provided, forwarded as `Authorization: Bearer <api_key>` to the backend
  - `models` (array[string], optional): Allow-list; only these models are exposed
  - `defaults` (object, optional): Deep merge applied only for missing fields (objects only)
  - `overrides` (object, optional): Deep merge that force-sets values (overwrites request fields)
  - `deny` (array[string], optional): Fields to remove (JSON Pointers, or dot-paths converted to pointers)
  - `hide_base_models` (bool, optional, default false): If true, base models are neither listed nor routable; only profile-suffixed models are available
  - `default_system_message` (string, optional): If a `messages` array exists and has no `system` role, insert one at the beginning
  - `default_developer_message` (string, optional): If a `messages` array exists and has no `developer` role, insert one right after the first `system` message (or at the beginning if none)
- Profiles (virtual models): `[ServerName.profileSuffix]`
  - Same optional fields as server-level for params/messages: `defaults`, `overrides`, `deny`, `default_system_message`, `default_developer_message`
  - The profile suffix forms virtual model IDs as `baseModel-profileSuffix`, but the backend receives the base model name (the proxy rewrites it).
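As a sketch only (the server name, endpoint, model, and parameter values below are placeholders, not part of this project), a single-server config exercising these fields might look like:

```sh
# Hypothetical minimal config.toml; adjust names, endpoint, and values to your setup.
cat > config.toml <<'EOF'
[LocalLlama]                              # server table name (arbitrary)
endpoint = "http://localhost:8080/v1"     # required: backend base URL
models = ["llama-3.1-8b-instruct"]        # optional allow-list
defaults = { max_tokens = 1024 }          # applied only when the field is missing
overrides = { stream = true }             # force-set on every request
deny = ["/temperature"]                   # removed from every request (JSON Pointer)

[LocalLlama.high]                         # exposes virtual model "llama-3.1-8b-instruct-high"
overrides = { top_p = 0.9 }
EOF
```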
Behavior notes:
- Arrays (e.g. `messages`) are not deeply merged:
  - `overrides.messages` replaces the entire array
  - `defaults.messages` is used only if `messages` is entirely absent
- `deny` removes object fields only (array element removal is not supported)
- Profile-level values override server-level values for that request
- llama.cpp-style routing:
  - For non-`/v1` paths, the proxy extracts the model from the inbound `Authorization: Bearer <model>` header
  - The inbound `Authorization` header is not forwarded to the backend (the backend `Authorization` is derived from `api_key`, if configured)
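To make these shaping rules concrete, here is an illustrative request against the `[OpenAI]` / `[OpenAI.high]` entries from the example config below; the proxy address and the exact backend payload are assumptions based on the defaults described in this README, not captured output:

```sh
# Client request to the proxy, using the virtual model "gpt-5-high":
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gpt-5-high",
        "temperature": 0.2,
        "messages": [{"role": "user", "content": "Hello"}]
      }'

# Roughly what the OpenAI backend would receive:
#   model                  -> "gpt-5"          (profile suffix stripped)
#   temperature            -> removed          (deny = ["/temperature"])
#   reasoning_effort       -> "high"           (profile-level override)
#   stream, stream_options -> forced on        (server-level overrides)
#   messages               -> a "developer" message ("Formatting re-enabled")
#                             is inserted at the front, since none was present
```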
Example `config.toml` which I use:

```toml
[OpenAI]
endpoint = "https://api.openai.com/v1"
api_key = "sk-proj-..."
models = [ "gpt-5", "o3", "o1" ]
overrides = { stream = true, stream_options = { include_usage = true } }
deny = ["/temperature"]
default_developer_message = "Formatting re-enabled"
hide_base_models = true
[OpenAI.high]
overrides = { reasoning_effort = "high" }
[OpenRouter]
endpoint = "https://openrouter.ai/api/v1"
api_key = "sk-or-v1-..."
models = ["anthropic/claude-sonnet-4", "anthropic/claude-opus-4.1", "google/gemini-2.5-pro"]
overrides = { temperature = 0.0, stream = true, stream_options = { include_usage = true } }
hide_base_models = true
[OpenRouter.high]
overrides = { reasoning = { "max_tokens" = 32000 } }
[DeepSeek]
endpoint = "https://api.deepseek.com/v1"
api_key = "sk-..."
models = ["deepseek-chat", "deepseek-reasoner"]
overrides = { temperature = 0.0, stream = true, stream_options = { include_usage = true } }
["SERVER-Z-8080"]
endpoint = "http://192.168.1.115:8080"
deny = ["/temperature"]
["SERVER-Z-8081"]
endpoint = "http://192.168.1.115:8081"
deny = ["/temperature"]
["MAC-STUDIO-8080"]
endpoint = "http://192.168.1.120:8080"
deny = ["/temperature"]
["MAC-STUDIO-8081"]
endpoint = "http://192.168.1.120:8081"
deny = ["/temperature"]See the example config.toml file for more detailed examples including multiple servers and profiles.
- Models listing: `GET /v1/models`
  - The proxy aggregates models from all configured servers, applying allow-lists and profile expansion
  - Respects `hide_base_models` (shows only profile-suffixed models if enabled); see the example calls below
- Chat Completions (OpenAI-compatible):
  - The proxy applies deny → defaults → overrides
  - Virtual models: an incoming `model: base-suffix` is rewritten to `base` for the backend
  - Streaming (SSE) is enabled by default unless `"stream": false` is explicitly set
- llama.cpp-style:
  - For requests not under `/v1`, the proxy extracts the model from `Authorization: Bearer <model>` and routes accordingly (see the examples after this list)
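A couple of quick call sketches, assuming the proxy at its default port 3000 (a full chat-completion shaping example appears under Behavior notes above):

```sh
# List every model the proxy exposes (allow-lists, profiles, hide_base_models applied):
curl http://localhost:3000/v1/models

# llama.cpp-style routing: for a path outside /v1 (the "/completion" path here is only
# an example of a backend route), the model is read from the Authorization header,
# and that header is not forwarded to the backend.
curl http://localhost:3000/completion \
  -H "Authorization: Bearer <model-name>" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello"}'
```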
- Build: `./compile.sh` → `mvn compile && mvn package`
- Run (dev): `./run.sh` → `mvn exec:java`
- Run JAR: `./run-jar.sh` → `java -jar target/llm-proxy-1.0.0.jar`
- Ensure `config.toml` (or whatever `Constants.CONFIG_FILE` points to) is readable and follows the schema
- If models aren't visible in `/v1/models`:
  - Check allow-lists (`models`)
  - Check `hide_base_models`
  - Verify the backend supports `/v1/models` and returns the expected data (see the checks after this list)
- Enable `Constants.DEBUG_REQUEST` to log transformed requests (be mindful of sensitive data)
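Two sanity checks that can help narrow things down (the backend address is taken from the example config above, and the proxy port assumes the default of 3000):

```sh
# Ask a backend directly, bypassing the proxy:
curl http://192.168.1.115:8080/v1/models

# Compare with what the proxy exposes after allow-lists, profiles, and hide_base_models:
curl http://localhost:3000/v1/models
```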
Apache 2.0 License - See LICENSE for details