From bc5a9826f59094179f44df24472a28ecba1e960f Mon Sep 17 00:00:00 2001 From: Mihai Criveti Date: Thu, 2 Oct 2025 22:03:25 -0400 Subject: [PATCH 1/2] Add scale docs Signed-off-by: Mihai Criveti --- docs/docs/manage/.pages | 5 +- docs/docs/manage/index.md | 2 + docs/docs/manage/scale.md | 1445 +++++++++++++++++++++++++++++++++++++ 3 files changed, 1451 insertions(+), 1 deletion(-) create mode 100644 docs/docs/manage/scale.md diff --git a/docs/docs/manage/.pages b/docs/docs/manage/.pages index f9bf97450..4255f2e8b 100644 --- a/docs/docs/manage/.pages +++ b/docs/docs/manage/.pages @@ -1,6 +1,8 @@ nav: - index.md - configuration.md + - scale.md + - tuning.md - backup.md - bulk-import.md - metadata-tracking.md @@ -20,7 +22,8 @@ nav: - sso-google-tutorial.md - sso-ibm-tutorial.md - sso-okta-tutorial.md - - tuning.md + - rbac.md + - teams.md - ui-customization.md - upgrade.md - well-known-uris.md diff --git a/docs/docs/manage/index.md b/docs/docs/manage/index.md index 3fc890cae..1a02c889d 100644 --- a/docs/docs/manage/index.md +++ b/docs/docs/manage/index.md @@ -20,6 +20,8 @@ Whether you're self-hosting, running in the cloud, or deploying to Kubernetes, t | Page | Description | |------|-------------| | [Configuration](configuration.md) | **Complete configuration reference** - databases, environment variables, and deployment settings | +| [Scaling Guide](scale.md) | 📈 **Production Scaling** - Horizontal/vertical scaling, Kubernetes HPA, connection pooling, and performance tuning | +| [Performance Tuning](tuning.md) | Optimize Gunicorn workers, database connections, and container resources | | [Dynamic Client Registration](dcr.md) | 🔐 **OAuth2 DCR** - Automatic client provisioning for streamable HTTP servers | | [Backups](backup.md) | How to persist and restore your database, configs, and resource state | | [Export & Import](export-import.md) | Complete configuration management with CLI, API, and Admin UI | diff --git a/docs/docs/manage/scale.md b/docs/docs/manage/scale.md new file mode 100644 index 000000000..da09ce92c --- /dev/null +++ b/docs/docs/manage/scale.md @@ -0,0 +1,1445 @@ +# Scaling MCP Gateway + +> Comprehensive guide to scaling MCP Gateway from development to production, covering vertical scaling, horizontal scaling, connection pooling, performance tuning, and Kubernetes deployment strategies. + +## Overview + +MCP Gateway is designed to scale from single-container development environments to distributed multi-node production deployments. This guide covers: + +- **Vertical Scaling**: Optimizing single-instance performance with Gunicorn workers +- **Horizontal Scaling**: Multi-container deployments with shared state +- **Database Optimization**: PostgreSQL connection pooling and settings +- **Cache Architecture**: Redis for distributed caching +- **Performance Tuning**: Configuration and benchmarking +- **Kubernetes Deployment**: HPA, resource limits, and best practices + +--- + +## Table of Contents + +1. [Understanding the GIL and Worker Architecture](#1-understanding-the-gil-and-worker-architecture) +2. [Vertical Scaling with Gunicorn](#2-vertical-scaling-with-gunicorn) +3. [Future: Python 3.14 and PostgreSQL 18](#3-future-python-314-and-postgresql-18) +4. [Horizontal Scaling with Kubernetes](#4-horizontal-scaling-with-kubernetes) +5. [Database Connection Pooling](#5-database-connection-pooling) +6. [Redis for Distributed Caching](#6-redis-for-distributed-caching) +7. [Performance Tuning](#7-performance-tuning) +8. [Benchmarking and Load Testing](#8-benchmarking-and-load-testing) +9. 
[Health Checks and Readiness](#9-health-checks-and-readiness) +10. [Stateless Architecture and Long-Running Connections](#10-stateless-architecture-and-long-running-connections) +11. [Kubernetes Production Deployment](#11-kubernetes-production-deployment) +12. [Monitoring and Observability](#12-monitoring-and-observability) + +--- + +## 1. Understanding the GIL and Worker Architecture + +### The Python Global Interpreter Lock (GIL) + +Python's Global Interpreter Lock (GIL) prevents multiple native threads from executing Python bytecode simultaneously. This means: + +- **Single worker** = Single CPU core usage (even on multi-core systems) +- **I/O-bound workloads** (API calls, database queries) benefit from async/await +- **CPU-bound workloads** (JSON parsing, encryption) require multiple processes + +### Pydantic v2: Rust-Powered Performance + +MCP Gateway leverages **Pydantic v2.11+** for all request/response validation and schema definitions. Unlike pure Python libraries, Pydantic v2 includes a **Rust-based core** (`pydantic-core`) that significantly improves performance: + +**Performance benefits:** +- **5-50x faster validation** compared to Pydantic v1 +- **JSON parsing** in Rust (bypasses GIL for serialization/deserialization) +- **Schema validation** runs in compiled Rust code +- **Reduced CPU overhead** for request processing + +**Impact on scaling:** +- 5,463 lines of Pydantic schemas (`mcpgateway/schemas.py`) +- Every API request validated through Rust-optimized code +- Lower CPU usage per request = higher throughput per worker +- Rust components release the GIL during execution + +This means that even within a single worker process, Pydantic's Rust core can run concurrently with Python code for validation-heavy workloads. + +### MCP Gateway's Solution: Gunicorn with Multiple Workers + +MCP Gateway uses **Gunicorn with UvicornWorker** to spawn multiple worker processes: + +```python +# gunicorn.config.py +workers = 8 # Multiple processes bypass the GIL +worker_class = "uvicorn.workers.UvicornWorker" # Async support +timeout = 600 # 10-minute timeout for long-running operations +preload_app = True # Load app once, then fork (memory efficient) +``` + +**Key benefits:** + +- Each worker is a separate process with its own GIL +- 8 workers = ability to use 8 CPU cores +- UvicornWorker enables async I/O within each worker +- Preloading reduces memory footprint (shared code segments) + +The trade-off is that you are running multiple Python interpreter instances, and each consumes additional memory. + +This also requires having shared state (e.g. Redis or a Database). +--- + +## 2. 
Vertical Scaling with Gunicorn + +### Worker Count Calculation + +**Formula**: `workers = (2 × CPU_cores) + 1` + +**Examples:** + +| CPU Cores | Recommended Workers | Use Case | +|-----------|---------------------|----------| +| 1 | 2-3 | Development/testing | +| 2 | 4-5 | Small production | +| 4 | 8-9 | Medium production | +| 8 | 16-17 | Large production | + +### Configuration Methods + +#### Environment Variables + +```bash +# Automatic detection based on CPU cores +export GUNICORN_WORKERS=auto + +# Manual override +export GUNICORN_WORKERS=16 +export GUNICORN_TIMEOUT=600 +export GUNICORN_MAX_REQUESTS=100000 +export GUNICORN_MAX_REQUESTS_JITTER=100 +export GUNICORN_PRELOAD_APP=true +``` + +#### Kubernetes ConfigMap + +```yaml +# charts/mcp-stack/values.yaml +mcpContextForge: + config: + GUNICORN_WORKERS: "16" # Number of worker processes + GUNICORN_TIMEOUT: "600" # Worker timeout (seconds) + GUNICORN_MAX_REQUESTS: "100000" # Requests before worker restart + GUNICORN_MAX_REQUESTS_JITTER: "100" # Prevents thundering herd + GUNICORN_PRELOAD_APP: "true" # Memory optimization +``` + +### Resource Allocation + +**CPU**: Allocate 1 CPU core per 2 workers (allows for I/O wait) + +**Memory**: +- Base: 256MB +- Per worker: 128-256MB (depending on workload) +- Formula: `memory = 256 + (workers × 200)` MB + +**Example for 16 workers:** +- CPU: `8-10 cores` (allows headroom) +- Memory: `3.5-4 GB` (256 + 16×200 = 3.5GB) + +```yaml +# Kubernetes resource limits +resources: + limits: + cpu: 10000m # 10 cores + memory: 4Gi + requests: + cpu: 8000m # 8 cores + memory: 3584Mi # 3.5GB +``` + +--- + +## 3. Future: Python 3.14 and PostgreSQL 18 + +### Python 3.14 (Free-Threaded Mode) + +**Status**: Beta (as of July 2025) - [PEP 703](https://peps.python.org/pep-0703/) + +Python 3.14 introduces **optional free-threading** (GIL removal), a groundbreaking change that enables true parallel multi-threading: + +```bash +# Enable free-threading mode +python3.14 -X gil=0 -m gunicorn ... + +# Or use PYTHON_GIL environment variable +PYTHON_GIL=0 python3.14 -m gunicorn ... +``` + +**Performance characteristics:** + +| Workload Type | Expected Impact | +|---------------|----------------| +| Single-threaded | **3-15% slower** (overhead from thread-safety mechanisms) | +| Multi-threaded (I/O-bound) | **Minimal impact** (already benefits from async/await) | +| Multi-threaded (CPU-bound) | **Near-linear scaling** with CPU cores | +| Multi-process (current) | **No change** (already bypasses GIL) | + +**Benefits when available:** +- **True parallel threads**: Multiple threads execute Python code simultaneously +- **Lower memory overhead**: Threads share memory (vs. separate processes) +- **Faster inter-thread communication**: Shared memory, no IPC overhead +- **Better resource efficiency**: One interpreter instance instead of multiple processes + +**Trade-offs:** +- **Single-threaded penalty**: 3-15% slower due to fine-grained locking +- **Library compatibility**: Some C extensions need updates (most popular libraries already compatible) +- **Different scaling model**: Move from `workers=16` to `workers=2 --threads=32` + +**Migration strategy:** + +1. **Now (Python 3.11-3.13)**: Continue using multi-process Gunicorn + ```python + workers = 16 # Multiple processes + worker_class = "uvicorn.workers.UvicornWorker" + ``` + +2. 
**Python 3.14 beta**: Test in staging environment + ```bash + # Build free-threaded Python + ./configure --enable-experimental-jit --with-pydebug + make + + # Test with free-threading + PYTHON_GIL=0 python3.14 -m pytest tests/ + ``` + +3. **Python 3.14 stable**: Evaluate hybrid approach + ```python + workers = 4 # Fewer processes + threads = 8 # More threads per process + worker_class = "uvicorn.workers.UvicornWorker" + ``` + +4. **Post-migration**: Thread-based scaling + ```python + workers = 2 # Minimal processes + threads = 32 # Scale with threads + preload_app = True # Single app load + ``` + +**Current recommendation**: +- **Production**: Use Python 3.11-3.13 with multi-process Gunicorn (proven, stable) +- **Testing**: Experiment with Python 3.14 beta in non-production environments +- **Monitoring**: Watch for library compatibility announcements + +**Why MCP Gateway is well-positioned for free-threading:** + +MCP Gateway's architecture already benefits from components that will perform even better with Python 3.14: + +1. **Pydantic v2 Rust core**: Already bypasses GIL for validation - will work seamlessly with free-threading +2. **FastAPI/Uvicorn**: Built for async I/O - natural fit for thread-based concurrency +3. **SQLAlchemy async**: Database operations already non-blocking +4. **Stateless design**: No shared mutable state between requests + +**Resources:** +- [Python 3.14 Free-Threading Guide](https://www.pythoncheatsheet.org/blog/python-3-14-breaking-free-from-gil) +- [PEP 703: Making the GIL Optional](https://peps.python.org/pep-0703/) +- [Python 3.14 Release Schedule](https://peps.python.org/pep-0745/) +- [Pydantic v2 Performance](https://docs.pydantic.dev/latest/blog/pydantic-v2/) + +### PostgreSQL 18 (Async I/O) + +**Status**: Development (expected 2025) + +PostgreSQL 18 introduces native async I/O: + +- **Improved connection handling**: Better async query performance +- **Reduced latency**: Non-blocking I/O operations +- **Better scalability**: Efficient connection multiplexing + +**Current recommendation**: PostgreSQL 16+ (stable async support via asyncpg) + +```bash +# Production-ready now +DATABASE_URL=postgresql+asyncpg://user:pass@postgres:5432/mcp +``` + +--- + +## 4. Horizontal Scaling with Kubernetes + +### Architecture Overview + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Load Balancer │ +│ (Kubernetes Service) │ +└────────────┬────────────────────────────────┬───────────────┘ + │ │ + ┌────────▼─────────┐ ┌────────▼─────────┐ + │ Gateway Pod 1 │ │ Gateway Pod 2 │ + │ (8 workers) │ │ (8 workers) │ + └────────┬─────────┘ └────────┬─────────┘ + │ │ + └────────────┬───────────────────┘ + │ + ┌───────────────▼───────────────────────┐ + │ │ + ┌─────▼──────┐ ┌──────────▼─────┐ + │ PostgreSQL │ │ Redis │ + │ (shared) │ │ (shared) │ + └────────────┘ └────────────────┘ +``` + +### Shared State Requirements + +For multi-pod deployments: + +1. **Shared PostgreSQL**: All data (servers, tools, users, teams) +2. **Shared Redis**: Distributed caching and session management +3. 
**Stateless pods**: No local state, can be killed/restarted anytime + +### Kubernetes Deployment + +#### Helm Chart Configuration + +```yaml +# charts/mcp-stack/values.yaml +mcpContextForge: + replicaCount: 3 # Start with 3 pods + + # Horizontal Pod Autoscaler + hpa: + enabled: true + minReplicas: 3 # Never scale below 3 + maxReplicas: 20 # Scale up to 20 pods + targetCPUUtilizationPercentage: 70 # Scale at 70% CPU + targetMemoryUtilizationPercentage: 80 # Scale at 80% memory + + # Pod resources + resources: + limits: + cpu: 2000m # 2 cores per pod + memory: 4Gi + requests: + cpu: 1000m # 1 core per pod + memory: 2Gi + + # Environment configuration + config: + GUNICORN_WORKERS: "8" # 8 workers per pod + CACHE_TYPE: redis # Shared cache + DB_POOL_SIZE: "50" # Per-pod pool size + +# Shared PostgreSQL +postgres: + enabled: true + resources: + limits: + cpu: 4000m # 4 cores + memory: 8Gi + requests: + cpu: 2000m + memory: 4Gi + + # Important: Set max_connections + # Formula: (num_pods × DB_POOL_SIZE × 1.2) + 20 + # Example: (20 pods × 50 pool × 1.2) + 20 = 1220 + config: + max_connections: 1500 # Adjust based on scale + +# Shared Redis +redis: + enabled: true + resources: + limits: + cpu: 2000m + memory: 4Gi + requests: + cpu: 1000m + memory: 2Gi +``` + +#### Deploy with Helm + +```bash +# Install/upgrade with custom values +helm upgrade --install mcp-stack ./charts/mcp-stack \ + --namespace mcp-gateway \ + --create-namespace \ + --values production-values.yaml + +# Verify HPA +kubectl get hpa -n mcp-gateway +``` + +### Horizontal Scaling Calculation + +**Total capacity** = `pods × workers × requests_per_second` + +**Example:** +- 10 pods × 8 workers × 100 RPS = **8,000 RPS** + +**Database connections needed:** +- 10 pods × 50 pool size = **500 connections** +- Add 20% overhead = **600 connections** +- Set `max_connections=1000` (buffer for maintenance) + +--- + +## 5. 
Database Connection Pooling + +### Connection Pool Architecture + +SQLAlchemy manages a connection pool per process: + +``` +Pod 1 (8 workers) → 8 connection pools → PostgreSQL +Pod 2 (8 workers) → 8 connection pools → PostgreSQL +Pod N (8 workers) → 8 connection pools → PostgreSQL +``` + +### Pool Configuration + +#### Environment Variables + +```bash +# Connection pool settings +DB_POOL_SIZE=50 # Persistent connections per worker +DB_MAX_OVERFLOW=10 # Additional connections allowed +DB_POOL_TIMEOUT=60 # Wait time before timeout (seconds) +DB_POOL_RECYCLE=3600 # Recycle connections after 1 hour +DB_MAX_RETRIES=5 # Retry attempts on failure +DB_RETRY_INTERVAL_MS=2000 # Retry interval +``` + +#### Configuration in Code + +```python +# mcpgateway/config.py +@property +def database_settings(self) -> dict: + return { + "pool_size": self.db_pool_size, # 50 + "max_overflow": self.db_max_overflow, # 10 + "pool_timeout": self.db_pool_timeout, # 60s + "pool_recycle": self.db_pool_recycle, # 3600s + } +``` + +### PostgreSQL Configuration + +#### Calculate max_connections + +```bash +# Formula +max_connections = (num_pods × num_workers × pool_size × 1.2) + buffer + +# Example: 10 pods, 8 workers, 50 pool size +max_connections = (10 × 8 × 50 × 1.2) + 200 = 5000 connections +``` + +#### PostgreSQL Configuration File + +```ini +# postgresql.conf +max_connections = 5000 +shared_buffers = 16GB # 25% of RAM +effective_cache_size = 48GB # 75% of RAM +work_mem = 16MB # Per operation +maintenance_work_mem = 2GB +``` + +#### Managed Services + +**IBM Cloud Databases for PostgreSQL:** +```bash +# Increase max_connections via CLI +ibmcloud cdb deployment-configuration postgres \ + --configuration max_connections=5000 +``` + +**AWS RDS:** +```bash +# Via parameter group +max_connections = {DBInstanceClassMemory/9531392} +``` + +**Google Cloud SQL:** +```bash +# Auto-scales based on instance size +# 4 vCPU = 400 connections +# 8 vCPU = 800 connections +``` + +### Connection Pool Monitoring + +```python +# Health endpoint checks pool status +@app.get("/health") +async def healthcheck(db: Session = Depends(get_db)): + try: + db.execute(text("SELECT 1")) + return {"status": "healthy"} + except Exception as e: + return {"status": "unhealthy", "error": str(e)} +``` + +```bash +# Check PostgreSQL connections +kubectl exec -it postgres-pod -- psql -U admin -d postgresdb \ + -c "SELECT count(*) FROM pg_stat_activity;" +``` + +--- + +## 6. 
Redis for Distributed Caching + +### Architecture + +Redis provides shared state across all Gateway pods: + +- **Session storage**: User sessions (TTL: 3600s) +- **Message cache**: Ephemeral data (TTL: 600s) +- **Federation cache**: Gateway peer discovery + +### Configuration + +#### Enable Redis Caching + +```bash +# .env or Kubernetes ConfigMap +CACHE_TYPE=redis +REDIS_URL=redis://redis-service:6379/0 +CACHE_PREFIX=mcpgw: +SESSION_TTL=3600 +MESSAGE_TTL=600 +REDIS_MAX_RETRIES=3 +REDIS_RETRY_INTERVAL_MS=2000 +``` + +#### Kubernetes Deployment + +```yaml +# charts/mcp-stack/values.yaml +redis: + enabled: true + + resources: + limits: + cpu: 2000m + memory: 4Gi + requests: + cpu: 1000m + memory: 2Gi + + # Enable persistence + persistence: + enabled: true + size: 10Gi +``` + +### Redis Sizing + +**Memory calculation:** +- Sessions: `concurrent_users × 50KB` +- Messages: `messages_per_minute × 100KB × (TTL/60)` + +**Example:** +- 10,000 users × 50KB = 500MB +- 1,000 msg/min × 100KB × 10min = 1GB +- **Total: 1.5GB + 50% overhead = 2.5GB** + +### High Availability + +**Redis Sentinel** (3+ nodes): +```yaml +redis: + sentinel: + enabled: true + quorum: 2 + + replicas: 3 # 1 primary + 2 replicas +``` + +**Redis Cluster** (6+ nodes): +```bash +REDIS_URL=redis://redis-cluster:6379/0?cluster=true +``` + +--- + +## 7. Performance Tuning + +### Application Architecture Performance + +MCP Gateway's technology stack is optimized for high performance: + +**Rust-Powered Components:** +- **Pydantic v2** (5-50x faster validation via Rust core) +- **Uvicorn** (ASGI server with Rust-based httptools) + +**Async-First Design:** +- **FastAPI** (async request handling) +- **SQLAlchemy 2.0** (async database operations) +- **asyncio** event loop per worker + +**Performance characteristics:** +- Request validation: **< 1ms** (Pydantic v2 Rust core) +- JSON serialization: **3-5x faster** than pure Python +- Database queries: Non-blocking async I/O +- Concurrent requests per worker: **1000+** (async event loop) + +### System-Level Optimization + +#### Kernel Parameters + +```bash +# /etc/sysctl.conf +net.core.somaxconn=4096 +net.ipv4.tcp_max_syn_backlog=4096 +net.ipv4.ip_local_port_range=1024 65535 +net.ipv4.tcp_tw_reuse=1 +fs.file-max=2097152 + +# Apply changes +sysctl -p +``` + +#### File Descriptors + +```bash +# /etc/security/limits.conf +* soft nofile 1048576 +* hard nofile 1048576 + +# Verify +ulimit -n +``` + +### Gunicorn Tuning + +#### Optimal Settings + +```python +# gunicorn.config.py +workers = (CPU_cores × 2) + 1 +timeout = 600 # Long enough for LLM calls +max_requests = 100000 # Prevent memory leaks +max_requests_jitter = 100 # Randomize restart +preload_app = True # Reduce memory +reuse_port = True # Load balance across workers +``` + +#### Worker Class Selection + +**UvicornWorker** (default - best for async): +```python +worker_class = "uvicorn.workers.UvicornWorker" +``` + +**Gevent** (alternative for I/O-heavy): +```bash +pip install gunicorn[gevent] +worker_class = "gevent" +worker_connections = 1000 +``` + +### Application Tuning + +```bash +# Resource limits +TOOL_TIMEOUT=60 +TOOL_CONCURRENT_LIMIT=10 +RESOURCE_CACHE_SIZE=1000 +RESOURCE_CACHE_TTL=3600 + +# Retry configuration +RETRY_MAX_ATTEMPTS=3 +RETRY_BASE_DELAY=1.0 +RETRY_MAX_DELAY=60 + +# Health check intervals +HEALTH_CHECK_INTERVAL=60 +HEALTH_CHECK_TIMEOUT=10 +UNHEALTHY_THRESHOLD=3 +``` + +--- + +## 8. 
Benchmarking and Load Testing + +### Tools + +**hey** - HTTP load generator +```bash +# Install +brew install hey # macOS +sudo apt install hey # Ubuntu + +# Or from source +go install github.com/rakyll/hey@latest +``` + +**k6** - Modern load testing +```bash +brew install k6 # macOS +``` + +### Baseline Test + +#### Prepare Environment + +```bash +# Get JWT token +export MCPGATEWAY_BEARER_TOKEN=$(python3 -m mcpgateway.utils.create_jwt_token \ + --username admin@example.com --exp 0 --secret my-test-key) + +# Create test payload +cat > payload.json <1000 RPS per pod) +- **P99 latency**: 99th percentile (target: <500ms) +- **Error rate**: 5xx responses (target: <0.1%) + +### Kubernetes Load Test + +```bash +# Deploy test pod +kubectl run load-test --image=williamyeh/hey:latest \ + --rm -it --restart=Never -- \ + -n 100000 -c 500 \ + -H "Authorization: Bearer $TOKEN" \ + http://mcp-gateway-service/ +``` + +### Advanced: k6 Script + +```javascript +// load-test.k6.js +import http from 'k6/http'; +import { check } from 'k6'; + +export let options = { + stages: [ + { duration: '2m', target: 100 }, // Ramp up + { duration: '5m', target: 100 }, // Sustained + { duration: '2m', target: 500 }, // Spike + { duration: '5m', target: 500 }, // High load + { duration: '2m', target: 0 }, // Ramp down + ], + thresholds: { + http_req_duration: ['p(99)<500'], // 99% < 500ms + http_req_failed: ['rate<0.01'], // <1% errors + }, +}; + +export default function () { + const payload = JSON.stringify({ + jsonrpc: '2.0', + id: 1, + method: 'tools/list', + params: {}, + }); + + const res = http.post('http://localhost:4444/', payload, { + headers: { + 'Content-Type': 'application/json', + 'Authorization': `Bearer ${__ENV.TOKEN}`, + }, + }); + + check(res, { + 'status is 200': (r) => r.status === 200, + 'response time < 500ms': (r) => r.timings.duration < 500, + }); +} +``` + +```bash +# Run k6 test +TOKEN=$MCPGATEWAY_BEARER_TOKEN k6 run load-test.k6.js +``` + +--- + +## 9. Health Checks and Readiness + +### Health Check Endpoints + +MCP Gateway provides two health endpoints: + +#### Liveness Probe: `/health` + +**Purpose**: Is the application alive? + +```python +@app.get("/health") +async def healthcheck(db: Session = Depends(get_db)): + """Check database connectivity""" + try: + db.execute(text("SELECT 1")) + return {"status": "healthy"} + except Exception as e: + return {"status": "unhealthy", "error": str(e)} +``` + +**Response:** +```json +{ + "status": "healthy" +} +``` + +#### Readiness Probe: `/ready` + +**Purpose**: Is the application ready to receive traffic? 
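+
+Unlike `/health`, a failed readiness check does not restart the container: Kubernetes simply removes the pod from the Service endpoints until the check passes again, which is why the handler below returns HTTP 503 rather than raising.
+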
+ +```python +@app.get("/ready") +async def readiness_check(db: Session = Depends(get_db)): + """Check if ready to serve traffic""" + try: + await asyncio.to_thread(db.execute, text("SELECT 1")) + return JSONResponse({"status": "ready"}, status_code=200) + except Exception as e: + return JSONResponse( + {"status": "not ready", "error": str(e)}, + status_code=503 + ) +``` + +### Kubernetes Probe Configuration + +```yaml +# charts/mcp-stack/templates/deployment-mcpgateway.yaml +containers: + - name: mcp-context-forge + + # Startup probe (initial readiness) + startupProbe: + exec: + command: + - python3 + - /app/mcpgateway/utils/db_isready.py + - --max-tries=1 + - --timeout=2 + initialDelaySeconds: 10 + periodSeconds: 5 + failureThreshold: 60 # 5 minutes max + + # Readiness probe (traffic routing) + readinessProbe: + httpGet: + path: /ready + port: 4444 + initialDelaySeconds: 15 + periodSeconds: 10 + timeoutSeconds: 2 + successThreshold: 1 + failureThreshold: 3 + + # Liveness probe (restart if unhealthy) + livenessProbe: + httpGet: + path: /health + port: 4444 + initialDelaySeconds: 10 + periodSeconds: 15 + timeoutSeconds: 2 + successThreshold: 1 + failureThreshold: 3 +``` + +### Probe Tuning Guidelines + +**Startup Probe:** +- Use for slow initialization (database migrations, model loading) +- `failureThreshold × periodSeconds` = max startup time +- Example: 60 × 5s = 5 minutes + +**Readiness Probe:** +- Aggressive: Remove pod from load balancer quickly +- `failureThreshold` = 3 (fail fast) +- `periodSeconds` = 10 (frequent checks) + +**Liveness Probe:** +- Conservative: Avoid unnecessary restarts +- `failureThreshold` = 5 (tolerate transient issues) +- `periodSeconds` = 15 (less frequent) + +### Monitoring Health + +```bash +# Check pod health +kubectl get pods -n mcp-gateway + +# Detailed status +kubectl describe pod -n mcp-gateway + +# Check readiness +kubectl get pods -n mcp-gateway \ + -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}' + +# Test health endpoint +kubectl exec -it -n mcp-gateway -- \ + curl http://localhost:4444/health + +# View probe failures +kubectl get events -n mcp-gateway \ + --field-selector involvedObject.name= +``` + +--- + +## 10. Stateless Architecture and Long-Running Connections + +### Stateless Design Principles + +MCP Gateway is designed to be **stateless**, enabling horizontal scaling: + +1. **No local session storage**: All sessions in Redis +2. **No in-memory caching** (in production): Use Redis +3. **Database-backed state**: All data in PostgreSQL +4. 
**Shared configuration**: Environment variables via ConfigMap + +### Session Management + +#### Stateful Sessions (Not Recommended for Scale) + +```bash +USE_STATEFUL_SESSIONS=true # Event store in database +``` + +**Limitations:** +- Sessions tied to specific pods +- Requires sticky sessions (session affinity) +- Doesn't scale horizontally + +#### Stateless Sessions (Recommended) + +```bash +USE_STATEFUL_SESSIONS=false +JSON_RESPONSE_ENABLED=true +CACHE_TYPE=redis +``` + +**Benefits:** +- Any pod can handle any request +- True horizontal scaling +- Automatic failover + +### Long-Running Connections + +MCP Gateway supports long-running connections for streaming: + +#### Server-Sent Events (SSE) + +```python +# Endpoint: /servers/{id}/sse +@app.get("/servers/{server_id}/sse") +async def sse_endpoint(server_id: int): + """Stream events to client""" + # Connection can last minutes/hours +``` + +#### WebSocket + +```python +# Endpoint: /servers/{id}/ws +@app.websocket("/servers/{server_id}/ws") +async def websocket_endpoint(server_id: int): + """Bidirectional streaming""" +``` + +### Load Balancer Configuration + +**Kubernetes Service** (default): +```yaml +# Distributes connections across pods +apiVersion: v1 +kind: Service +metadata: + name: mcp-gateway-service +spec: + type: ClusterIP + sessionAffinity: None # No sticky sessions + ports: + - port: 80 + targetPort: 4444 +``` + +**NGINX Ingress** (for WebSocket): +```yaml +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + annotations: + nginx.ingress.kubernetes.io/proxy-read-timeout: "3600" + nginx.ingress.kubernetes.io/proxy-send-timeout: "3600" + nginx.ingress.kubernetes.io/websocket-services: "mcp-gateway-service" +spec: + rules: + - host: gateway.example.com + http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: mcp-gateway-service + port: + number: 80 +``` + +### Connection Lifecycle + +``` +Client → Load Balancer → Pod A (SSE stream) + ↓ + (Pod A dies) + ↓ +Client ← Load Balancer → Pod B (reconnect) +``` + +**Best practices:** +1. Client implements reconnection logic +2. Server sets `SSE_KEEPALIVE_INTERVAL=30` (keepalive events) +3. Load balancer timeout > keepalive interval + +--- + +## 11. 
Kubernetes Production Deployment + +### Reference Architecture + +```yaml +# production-values.yaml +mcpContextForge: + # --- Scaling --- + replicaCount: 5 + + hpa: + enabled: true + minReplicas: 5 + maxReplicas: 50 + targetCPUUtilizationPercentage: 70 + targetMemoryUtilizationPercentage: 80 + + # --- Resources --- + resources: + limits: + cpu: 4000m # 4 cores per pod + memory: 8Gi + requests: + cpu: 2000m # 2 cores per pod + memory: 4Gi + + # --- Configuration --- + config: + # Gunicorn + GUNICORN_WORKERS: "16" + GUNICORN_TIMEOUT: "600" + GUNICORN_MAX_REQUESTS: "100000" + GUNICORN_PRELOAD_APP: "true" + + # Database + DB_POOL_SIZE: "50" + DB_MAX_OVERFLOW: "10" + DB_POOL_TIMEOUT: "60" + DB_POOL_RECYCLE: "3600" + + # Cache + CACHE_TYPE: redis + CACHE_PREFIX: mcpgw: + SESSION_TTL: "3600" + MESSAGE_TTL: "600" + + # Performance + TOOL_CONCURRENT_LIMIT: "20" + RESOURCE_CACHE_SIZE: "2000" + + # --- Health Checks --- + probes: + startup: + type: exec + command: ["python3", "/app/mcpgateway/utils/db_isready.py"] + periodSeconds: 5 + failureThreshold: 60 + + readiness: + type: http + path: /ready + port: 4444 + periodSeconds: 10 + failureThreshold: 3 + + liveness: + type: http + path: /health + port: 4444 + periodSeconds: 15 + failureThreshold: 5 + +# --- PostgreSQL --- +postgres: + enabled: true + + resources: + limits: + cpu: 8000m # 8 cores + memory: 32Gi + requests: + cpu: 4000m + memory: 16Gi + + persistence: + enabled: true + size: 100Gi + storageClassName: fast-ssd + + # Connection limits + # max_connections = (50 pods × 16 workers × 50 pool × 1.2) + 200 + config: + max_connections: 50000 + shared_buffers: 8GB + effective_cache_size: 24GB + work_mem: 32MB + +# --- Redis --- +redis: + enabled: true + + resources: + limits: + cpu: 4000m + memory: 16Gi + requests: + cpu: 2000m + memory: 8Gi + + persistence: + enabled: true + size: 50Gi +``` + +### Deployment Steps + +```bash +# 1. Create namespace +kubectl create namespace mcp-gateway + +# 2. Create secrets +kubectl create secret generic mcp-secrets \ + -n mcp-gateway \ + --from-literal=JWT_SECRET_KEY=$(openssl rand -hex 32) \ + --from-literal=AUTH_ENCRYPTION_SECRET=$(openssl rand -hex 32) \ + --from-literal=POSTGRES_PASSWORD=$(openssl rand -base64 32) + +# 3. Install with Helm +helm upgrade --install mcp-stack ./charts/mcp-stack \ + -n mcp-gateway \ + -f production-values.yaml \ + --wait \ + --timeout 10m + +# 4. Verify deployment +kubectl get pods -n mcp-gateway +kubectl get hpa -n mcp-gateway +kubectl get svc -n mcp-gateway + +# 5. Run migration job +kubectl get jobs -n mcp-gateway + +# 6. Test scaling +kubectl top pods -n mcp-gateway +``` + +### Pod Disruption Budget + +```yaml +# pdb.yaml +apiVersion: policy/v1 +kind: PodDisruptionBudget +metadata: + name: mcp-gateway-pdb + namespace: mcp-gateway +spec: + minAvailable: 3 # Keep 3 pods always running + selector: + matchLabels: + app: mcp-gateway +``` + +### Network Policies + +```yaml +# network-policy.yaml +apiVersion: networking.k8s.io/v1 +kind: NetworkPolicy +metadata: + name: mcp-gateway-policy + namespace: mcp-gateway +spec: + podSelector: + matchLabels: + app: mcp-gateway + policyTypes: + - Ingress + - Egress + ingress: + - from: + - podSelector: + matchLabels: + app: ingress-nginx + ports: + - protocol: TCP + port: 4444 + egress: + - to: + - podSelector: + matchLabels: + app: postgres + ports: + - protocol: TCP + port: 5432 + - to: + - podSelector: + matchLabels: + app: redis + ports: + - protocol: TCP + port: 6379 +``` + +--- + +## 12. 
Monitoring and Observability + +### OpenTelemetry Integration + +MCP Gateway includes built-in OpenTelemetry support: + +```bash +# Enable observability +OTEL_ENABLE_OBSERVABILITY=true +OTEL_TRACES_EXPORTER=otlp +OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317 +OTEL_SERVICE_NAME=mcp-gateway +``` + +### Prometheus Metrics + +Deploy Prometheus stack: + +```bash +# Add Prometheus Helm repo +helm repo add prometheus-community \ + https://prometheus-community.github.io/helm-charts + +# Install kube-prometheus-stack +helm install prometheus prometheus-community/kube-prometheus-stack \ + -n monitoring \ + --create-namespace +``` + +### Key Metrics to Monitor + +**Application Metrics:** +- Request rate: `rate(http_requests_total[1m])` +- Latency: `histogram_quantile(0.99, http_request_duration_seconds)` +- Error rate: `rate(http_requests_total{status=~"5.."}[1m])` + +**System Metrics:** +- CPU usage: `container_cpu_usage_seconds_total` +- Memory usage: `container_memory_working_set_bytes` +- Network I/O: `container_network_receive_bytes_total` + +**Database Metrics:** +- Connection pool usage: `db_pool_size` / `db_pool_connections_active` +- Query latency: `db_query_duration_seconds` +- Deadlocks: `pg_stat_database_deadlocks` + +**HPA Metrics:** +```bash +kubectl get hpa -n mcp-gateway -w +``` + +### Grafana Dashboards + +Import dashboards: +1. **Kubernetes Cluster Monitoring** (ID: 7249) +2. **PostgreSQL** (ID: 9628) +3. **Redis** (ID: 11835) +4. **NGINX Ingress** (ID: 9614) + +### Alerting Rules + +```yaml +# prometheus-rules.yaml +apiVersion: monitoring.coreos.com/v1 +kind: PrometheusRule +metadata: + name: mcp-gateway-alerts + namespace: monitoring +spec: + groups: + - name: mcp-gateway + interval: 30s + rules: + - alert: HighErrorRate + expr: | + rate(http_requests_total{status=~"5..", namespace="mcp-gateway"}[5m]) > 0.05 + for: 5m + annotations: + summary: "High error rate detected" + + - alert: HighLatency + expr: | + histogram_quantile(0.99, + rate(http_request_duration_seconds_bucket[5m])) > 1 + for: 5m + annotations: + summary: "P99 latency exceeds 1s" + + - alert: DatabaseConnectionPoolExhausted + expr: | + db_pool_connections_active / db_pool_size > 0.9 + for: 2m + annotations: + summary: "Database connection pool >90% utilized" +``` + +--- + +## Summary and Checklist + +### Performance Technology Stack + +MCP Gateway is built on a high-performance foundation: + +✅ **Pydantic v2.11+** - Rust-powered validation (5-50x faster than v1) +✅ **FastAPI** - Modern async framework with OpenAPI support +✅ **Uvicorn** - ASGI server with Rust-based HTTP parsing +✅ **SQLAlchemy 2.0** - Async database operations +✅ **Python 3.11+** - Current stable with excellent performance +🔮 **Python 3.14** - Future free-threading support (beta) + +### Scaling Checklist + +- [ ] **Vertical Scaling** + - [ ] Configure Gunicorn workers: `(2 × CPU) + 1` + - [ ] Allocate CPU: 1 core per 2 workers + - [ ] Allocate memory: 256MB + (workers × 200MB) + +- [ ] **Horizontal Scaling** + - [ ] Deploy to Kubernetes with HPA enabled + - [ ] Set `minReplicas` ≥ 3 for high availability + - [ ] Configure shared PostgreSQL and Redis + +- [ ] **Database Optimization** + - [ ] Calculate `max_connections`: `(pods × workers × pool) × 1.2` + - [ ] Set `DB_POOL_SIZE` per worker (recommended: 50) + - [ ] Configure `DB_POOL_RECYCLE=3600` to prevent stale connections + +- [ ] **Caching** + - [ ] Enable Redis: `CACHE_TYPE=redis` + - [ ] Set `REDIS_URL` to shared Redis instance + - [ ] Configure TTLs: `SESSION_TTL=3600`, `MESSAGE_TTL=600` 
+ +- [ ] **Performance** + - [ ] Tune Gunicorn: `GUNICORN_PRELOAD_APP=true` + - [ ] Set timeouts: `GUNICORN_TIMEOUT=600` + - [ ] Configure retries: `RETRY_MAX_ATTEMPTS=3` + +- [ ] **Health Checks** + - [ ] Configure `/health` liveness probe + - [ ] Configure `/ready` readiness probe + - [ ] Set appropriate thresholds and timeouts + +- [ ] **Monitoring** + - [ ] Enable OpenTelemetry: `OTEL_ENABLE_OBSERVABILITY=true` + - [ ] Deploy Prometheus and Grafana + - [ ] Configure alerts for errors, latency, and resources + +- [ ] **Load Testing** + - [ ] Benchmark with `hey` or `k6` + - [ ] Target: >1000 RPS per pod, P99 <500ms + - [ ] Test failover scenarios + +### Reference Documentation + +- [Gunicorn Configuration](../deployment/local.md) +- [Kubernetes Deployment](../deployment/kubernetes.md) +- [Helm Charts](../deployment/helm.md) +- [Performance Testing](../testing/performance.md) +- [Observability](observability.md) +- [Configuration Guide](configuration.md) +- [Database Tuning](tuning.md) + +--- + +## Additional Resources + +### External Links + +- [Gunicorn Documentation](https://docs.gunicorn.org/) +- [Kubernetes HPA](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/) +- [PostgreSQL Connection Pooling](https://www.postgresql.org/docs/current/runtime-config-connection.html) +- [Redis Cluster](https://redis.io/docs/reference/cluster-spec/) +- [OpenTelemetry Python](https://opentelemetry.io/docs/instrumentation/python/) + +### Community + +- [GitHub Discussions](https://github.com/ibm/mcp-context-forge/discussions) +- [Issue Tracker](https://github.com/ibm/mcp-context-forge/issues) + +--- + +*Last updated: 2025-10-02* From 3f3c9e101caa3b26cb1d0de21959ddac84c31bdd Mon Sep 17 00:00:00 2001 From: Mihai Criveti Date: Fri, 3 Oct 2025 17:58:19 +0100 Subject: [PATCH 2/2] Add scale docs Signed-off-by: Mihai Criveti --- docs/docs/architecture/plugins.md | 2 +- docs/docs/using/plugins/index.md | 10 +++++----- docs/docs/using/plugins/plugins.md | 6 +++--- 3 files changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/docs/architecture/plugins.md b/docs/docs/architecture/plugins.md index 309b9bf50..397895023 100644 --- a/docs/docs/architecture/plugins.md +++ b/docs/docs/architecture/plugins.md @@ -638,7 +638,7 @@ class PluginMode(str, Enum): DISABLED = "disabled" # Plugin loaded but not executed class HookType(str, Enum): - """Available hook points in MCP request lifecycle""" + """Available hook points in MCP request lifecycle""" PROMPT_PRE_FETCH = "prompt_pre_fetch" # Before prompt retrieval PROMPT_POST_FETCH = "prompt_post_fetch" # After prompt rendering TOOL_PRE_INVOKE = "tool_pre_invoke" # Before tool execution diff --git a/docs/docs/using/plugins/index.md b/docs/docs/using/plugins/index.md index 07aa84dff..d7284d4e4 100644 --- a/docs/docs/using/plugins/index.md +++ b/docs/docs/using/plugins/index.md @@ -20,7 +20,7 @@ The MCP Context Forge Plugin Framework provides a comprehensive, production-grad !!! details "Plugin Framework Specification" Check the [specification](https://ibm.github.io/mcp-context-forge/architecture/plugins/) docs for a detailed design of the plugin system. 
- + The plugin framework implements a **hybrid architecture** supporting both native and external service integrations: ### Native Plugins @@ -92,9 +92,9 @@ class MyPlugin(Plugin): super().__init__(config) async def prompt_pre_fetch(self, payload: PromptPrehookPayload, context: PluginContext) -> PromptPrehookResult: - # modify + # modify return PromptPrehookResult(modified_payload=payload) - + # or block # return PromptPrehookResult( # continue_processing=False, @@ -118,7 +118,7 @@ plugins: priority: 120 ``` -**External plugin quickstart:** +**External plugin quickstart:** !!! details "Plugins Lifecycle Guide" See the [plugin lifecycle guide](https://ibm.github.io/mcp-context-forge/using/plugins/lifecycle/) for building, testing, and serving extenal plugins. @@ -278,7 +278,7 @@ Available hook values for the `hooks` field: #### Condition Fields -Users may only want plugins to be invoked on specific servers, tools, and prompts. To address this, a set of conditionals can be applied to a plugin. The attributes in a conditional combine together in as a set of `and` operations, while each attribute list item is `or`ed with other items in the list. +Users may only want plugins to be invoked on specific servers, tools, and prompts. To address this, a set of conditionals can be applied to a plugin. The attributes in a conditional combine together in as a set of `and` operations, while each attribute list item is `or`ed with other items in the list. The `conditions` array contains objects that specify when plugins should execute: diff --git a/docs/docs/using/plugins/plugins.md b/docs/docs/using/plugins/plugins.md index c08c91fae..d1bc6eca6 100644 --- a/docs/docs/using/plugins/plugins.md +++ b/docs/docs/using/plugins/plugins.md @@ -4,9 +4,9 @@ MCP Context Forge provides a comprehensive collection of production-ready plugin ## Plugin Categories -- [Security & Safety](#security-safety) -- [Reliability & Performance](#reliability-performance) -- [Content Transformation & Formatting](#content-transformation-formatting) +- [Security & Safety](#security-safety) +- [Reliability & Performance](#reliability-performance) +- [Content Transformation & Formatting](#content-transformation-formatting) - [Content Filtering & Validation](#content-filtering-validation) - [Compliance & Governance](#compliance-governance) - [Network & Integration](#network-integration)