# Model Router
TenzinGayche edited this page Dec 15, 2025
The Model Router is a core component that provides a unified interface to multiple LLM providers, handling model selection, configuration, and caching.
```
┌─────────────────────────────────────────────────────────────────┐
│                          Model Router                           │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                      get_model(name)                      │  │
│  │                                                           │  │
│  │  1. Check cache for existing instance                     │  │
│  │  2. Identify provider from model name                     │  │
│  │  3. Validate API key availability                         │  │
│  │  4. Create and configure model instance                   │  │
│  │  5. Cache and return                                      │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│           ┌─────────────────────────────────────────┐           │
│           │              Provider Layer             │           │
│           ├──────────┬──────────┬──────────┬────────┤           │
│           │Anthropic │  Google  │  OpenAI  │Dharma- │           │
│           │  Claude  │  Gemini  │   GPT    │ mitra  │           │
│           └──────────┴──────────┴──────────┴────────┘           │
└─────────────────────────────────────────────────────────────────┘
```
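The five-step `get_model` flow above can be sketched as a minimal router. This is an illustrative sketch, not the project's actual implementation: `MiniRouter`, `PROVIDER_PREFIXES`, and the plain-dict stand-in for a model instance are all hypothetical names.

```python
# Illustrative sketch only: MiniRouter and PROVIDER_PREFIXES are
# hypothetical names, and the "instance" is a plain dict stand-in.
PROVIDER_PREFIXES = {
    "claude": ("Anthropic", "ANTHROPIC_API_KEY"),
    "gemini": ("Google", "GEMINI_API_KEY"),
    "gpt": ("OpenAI", "OPENAI_API_KEY"),
    "dharamitra": ("Dharmamitra", "DHARMAMITRA_TOKEN"),
}

class MiniRouter:
    def __init__(self, api_keys):
        self._api_keys = api_keys  # e.g. {"ANTHROPIC_API_KEY": "sk-ant-..."}
        self._cache = {}

    def get_model(self, name, **kwargs):
        # 1. Check cache for an existing instance
        key = f"{name}_{hash(str(sorted(kwargs.items())))}"
        if key in self._cache:
            return self._cache[key]

        # 2. Identify the provider from the model name
        for prefix, (provider, env_var) in PROVIDER_PREFIXES.items():
            if name.startswith(prefix):
                break
        else:
            raise ValueError(f"Unsupported model: {name}")

        # 3. Validate API key availability
        if not self._api_keys.get(env_var):
            raise ValueError(f"{env_var} is required for {provider} models")

        # 4. Create and configure the model instance (stubbed as a dict)
        instance = {"model": name, "provider": provider, **kwargs}

        # 5. Cache and return
        self._cache[key] = instance
        return instance

router = MiniRouter({"ANTHROPIC_API_KEY": "sk-ant-..."})
a = router.get_model("claude-sonnet-4-20250514")
b = router.get_model("claude-sonnet-4-20250514")
# a is b → True (second call hits the cache)
```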
### Anthropic Claude

| Model ID | Description | Context Window |
|---|---|---|
| `claude-sonnet-4-20250514` | Claude Sonnet 4.0 | 200,000 tokens |
| `claude-sonnet-4-5-20250929` | Claude Sonnet 4.5 | 200,000 tokens |
| `claude-haiku-4-5-20251001` | Claude Haiku 4.5 (fast) | 200,000 tokens |
| `claude-3-5-haiku-20241022` | Claude 3.5 Haiku | 200,000 tokens |
| `claude-3-opus-20240229` | Claude 3 Opus (most capable) | 200,000 tokens |

**Environment Variable:** `ANTHROPIC_API_KEY`

**Capabilities:** text, reasoning, translation, structured output
### Google Gemini

| Model ID | Description | Thinking | Context Window |
|---|---|---|---|
| `gemini-2.5-pro` | Gemini 2.5 Pro | ✅ Enabled (12k budget) | 30,720 tokens |
| `gemini-2.5-flash` | Gemini 2.5 Flash (fast) | ❌ Disabled | 30,720 tokens |
| `gemini-2.5-flash-thinking` | Flash with thinking | ✅ Enabled (12k budget) | 30,720 tokens |

**Environment Variable:** `GEMINI_API_KEY`

**Special Features:**

- Thinking mode for reasoning tasks
- Default JSON response format
- Automatic generation config handling
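As an illustration of how these per-model defaults might be assembled, a small config builder is sketched below. The field names `response_mime_type` and `thinking_budget`, and the `build_generation_config` helper itself, are assumptions for illustration and are not taken from the project's code.

```python
# Sketch only: "response_mime_type" and "thinking_budget" are assumed
# field names, not verified against the project's Gemini integration.
THINKING_BUDGETS = {
    "gemini-2.5-pro": 12000,
    "gemini-2.5-flash": 0,
    "gemini-2.5-flash-thinking": 12000,
}

def build_generation_config(model_name, plain_text=False):
    """Assemble a per-model Gemini generation config with the documented defaults."""
    config = {
        "temperature": 0.3,   # Lower for consistency
        "max_tokens": 4000,   # Default output limit
    }
    # Default JSON response format, unless plain text mode is requested
    if not plain_text:
        config["response_mime_type"] = "application/json"
    # Inject a thinking budget only for models that enable thinking
    budget = THINKING_BUDGETS.get(model_name, 0)
    if budget:
        config["thinking_budget"] = budget
    return config
```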
### OpenAI GPT

| Model ID | Description | Context Window |
|---|---|---|
| `gpt-4` | GPT-4 | 128,000 tokens |
| `gpt-4-turbo` | GPT-4 Turbo (faster) | 128,000 tokens |
| `gpt-3.5-turbo` | GPT-3.5 Turbo (economical) | 16,385 tokens |

**Environment Variable:** `OPENAI_API_KEY`
### Dharmamitra

| Model ID | Description | Notes |
|---|---|---|
| `dharamitra` | Specialized Buddhist translation | Translation-only |

**Environment Variable:** `DHARMAMITRA_TOKEN`

**Limitations:**

- Translation endpoints only
- No structured output support
- Not available for UCCA/Gloss/Editor features
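Because Dharmamitra is translation-only, feature routes that need structured output can reject it up front. A minimal sketch, assuming a hypothetical `TRANSLATION_ONLY_MODELS` set and `require_structured_output` helper:

```python
# Hypothetical guard for feature routes; names are illustrative.
TRANSLATION_ONLY_MODELS = {"dharamitra"}

def require_structured_output(model_name):
    """Reject models that cannot back UCCA/Gloss/Editor features."""
    if model_name in TRANSLATION_ONLY_MODELS:
        raise ValueError(
            f"'{model_name}' supports translation only; "
            "choose a model with structured output support"
        )
    return model_name
```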
### Configuration

API keys are read from the environment:

```bash
# .env file

# Anthropic
ANTHROPIC_API_KEY=sk-ant-api03-...

# Google
GEMINI_API_KEY=AIzaSy...

# OpenAI
OPENAI_API_KEY=sk-...

# Dharmamitra
DHARMAMITRA_TOKEN=your-token
DHARMAMITRA_PASSWORD=your-password  # For proxy endpoints
```

Default generation parameters:

```python
default_configs = {
    "temperature": 0.3,  # Lower for consistency
    "max_tokens": 4000,  # Default output limit
}
```

### Usage

```python
from src.translation_api.models.model_router import get_model_router

# Get the global router instance
router = get_model_router()

# Get a model
model = router.get_model("claude-sonnet-4-20250514")

# Use the model
response = model.invoke("Translate: བྱང་ཆུབ་སེམས")
```

Override the defaults per call:

```python
model = router.get_model(
    "gemini-2.5-pro",
    temperature=0.1,
    max_tokens=8000
)
```

Check availability before requesting a model:

```python
# Get all available models (based on configured API keys)
available = router.get_available_models()

# Check if a specific model is available
if router.validate_model_availability("claude-sonnet-4-20250514"):
    model = router.get_model("claude-sonnet-4-20250514")
```

### Structured Output

```python
from pydantic import BaseModel

class Translation(BaseModel):
    text: str
    confidence: float

# Get model with structured output
model = router.get_model("claude-sonnet-4-20250514")
structured = model.with_structured_output(Translation)
result = structured.invoke("Translate: བྱང་ཆུབ་སེམས")
# result.text = "bodhicitta"
# result.confidence = 0.95
```

### Thinking Mode

Gemini models support "thinking": internal reasoning before responding.
```
┌─────────────────────────────────────────────────────────────────┐
│                          Thinking Mode                          │
│                                                                 │
│  Input: "Translate this complex Buddhist text..."               │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                     Internal Thinking                     │  │
│  │  "Let me analyze the grammatical structure..."            │  │
│  │  "The term བྱང་ཆུབ་སེམས has multiple meanings..."              │  │
│  │  "Given the context, I should use..."                     │  │
│  │                                                           │  │
│  │  (Up to 12,000 tokens of reasoning)                       │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  Output: "Bodhicitta, the mind of awakening..."                 │
└─────────────────────────────────────────────────────────────────┘
```
| Model | Thinking Budget |
|---|---|
| `gemini-2.5-flash` | 0 (disabled) |
| `gemini-2.5-flash-thinking` | 12,000 tokens |
| `gemini-2.5-pro` | 12,000 tokens |
```python
# Flash without thinking (fast)
fast_model = router.get_model("gemini-2.5-flash")

# Flash with thinking (slower, better quality)
thinking_model = router.get_model("gemini-2.5-flash-thinking")

# Pro with thinking (best quality)
pro_model = router.get_model("gemini-2.5-pro")
```

### Caching

The router caches model instances to avoid redundant initialization:

```python
# First call: creates new instance
model1 = router.get_model("claude-sonnet-4-20250514")

# Second call: returns cached instance
model2 = router.get_model("claude-sonnet-4-20250514")
# model1 is model2 → True

# Different params: new instance
model3 = router.get_model("claude-sonnet-4-20250514", temperature=0.5)
# model1 is model3 → False
```

Cache keys combine the model name with its keyword arguments:

```python
cache_key = f"{model_name}_{hash(str(sorted(kwargs.items())))}"
```

### Gemini Wrapper

The `_GeminiModelWrapper` handles Gemini-specific configuration:
```python
class _GeminiModelWrapper:
    """Injects generation_config into all Gemini calls."""

    def __init__(self, base_model, generation_config):
        self._base_model = base_model
        self._generation_config = generation_config

    async def ainvoke(self, input, **kwargs):
        # Merge default config with call-specific config
        merged = {**self._generation_config, **kwargs.get("generation_config", {})}
        return await self._base_model.ainvoke(input, generation_config=merged)
```

Features:
- Automatic JSON response format
- Thinking config injection
- Plain text mode support
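The merge behaviour can be exercised in isolation with a stub base model. This is a test sketch, not the project's code: it re-declares the wrapper shape shown above alongside a hypothetical `_StubBase` that simply echoes back the config it receives.

```python
import asyncio

class _StubBase:
    """Fake base model that returns the generation_config it receives."""
    async def ainvoke(self, input, generation_config=None):
        return generation_config

class _GeminiModelWrapper:
    """Same shape as the wrapper above."""
    def __init__(self, base_model, generation_config):
        self._base_model = base_model
        self._generation_config = generation_config

    async def ainvoke(self, input, **kwargs):
        # Call-specific config overrides the defaults
        merged = {**self._generation_config, **kwargs.get("generation_config", {})}
        return await self._base_model.ainvoke(input, generation_config=merged)

wrapper = _GeminiModelWrapper(_StubBase(), {"temperature": 0.3, "max_tokens": 4000})
merged = asyncio.run(wrapper.ainvoke("hi", generation_config={"temperature": 0.1}))
# merged == {"temperature": 0.1, "max_tokens": 4000}
```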
### Dharmamitra Wrapper

The `_DharmamitraModelWrapper` integrates with the Dharmamitra API:

```python
class _DharmamitraModelWrapper:
    """Translation-only wrapper for Dharmamitra."""

    def invoke(self, input, **kwargs):
        # Extract source text and target language,
        # call the Dharmamitra API, and return the translation
        pass

    def with_structured_output(self, schema):
        raise ValueError("'dharamitra' supports translation only")
```

Limitations:

- No `with_structured_output()`
- No batch operations
- Translation endpoints only
### Availability Checking

Models are only available if their API key is configured:

```python
def get_available_models(self) -> Dict[str, Dict[str, Any]]:
    available = {}

    if self.settings.anthropic_api_key:
        available.update({
            "claude-sonnet-4-20250514": {...},
            # ... other Claude models
        })

    if self.settings.gemini_api_key:
        available.update({
            "gemini-2.5-pro": {...},
            # ... other Gemini models
        })

    # ...
    return available
```

### Error Handling

Missing API key:

```python
try:
    model = router.get_model("claude-sonnet-4-20250514")
except ValueError as e:
    print(e)  # "ANTHROPIC_API_KEY is required for Claude models"
```

Unknown model:

```python
try:
    model = router.get_model("invalid-model")
except ValueError as e:
    print(e)  # "Unsupported model: invalid-model"
```

Fail gracefully in API routes:

```python
from fastapi import HTTPException

if not router.validate_model_availability("claude-sonnet-4-20250514"):
    available = list(router.get_available_models().keys())
    raise HTTPException(
        status_code=400,
        detail=f"Model not available. Available: {available}"
    )
```

### Demo Script

```python
# examples/thinking_models_demo.py
from src.translation_api.models.model_router import get_model_router

async def demo_thinking_models():
    router = get_model_router()

    # Test different models
    for model_name in ["gemini-2.5-flash", "gemini-2.5-flash-thinking"]:
        if router.validate_model_availability(model_name):
            model = router.get_model(model_name)
            response = await model.ainvoke("Translate: བྱང་ཆུབ་སེམས")
            print(f"{model_name}: {response.content}")
```

### Health Check

```bash
curl http://localhost:8001/health
```

```json
{
  "status": "healthy",
  "version": "1.0.0",
  "available_models": {
    "claude-sonnet-4-20250514": {"provider": "Anthropic", ...},
    "gemini-2.5-pro": {"provider": "Google", ...}
  }
}
```

### Related Pages

- Architecture - System design
- API Reference - Endpoint documentation
- Installation - Setup guide
- Usage Guide - Examples