diff --git a/docs/development/backend/events.mdx b/docs/development/backend/events.mdx index d964be2..10e2a55 100644 --- a/docs/development/backend/events.mdx +++ b/docs/development/backend/events.mdx @@ -7,6 +7,8 @@ description: Type-safe event system for decoupled communication between core sys The Global Event Bus enables decoupled communication between core systems and plugins through a type-safe event system. Core systems emit events when important actions occur, and plugins can react without direct coupling to business logic. +**Note**: This documentation covers the **internal backend event bus** for plugin communication. If you're looking for the **satellite events system** (incoming events from satellites), see [Satellite Events](/development/backend/satellite-events). + ## Overview Key features: diff --git a/docs/development/backend/index.mdx b/docs/development/backend/index.mdx index 123a370..3e6a234 100644 --- a/docs/development/backend/index.mdx +++ b/docs/development/backend/index.mdx @@ -112,6 +112,14 @@ The development server starts at `http://localhost:3000` with API documentation > Database-backed job processing system with persistent storage, automatic retries, and rate limiting for long-running background tasks. + + } + href="/deploystack/development/backend/satellite-events" + title="Satellite Events" + > + Real-time event processing from satellites with convention-based handlers routing to business tables for operational visibility. 
+ ## Project Structure diff --git a/docs/development/backend/satellite-communication.mdx b/docs/development/backend/satellite-communication.mdx index 8a17548..4e3ecfb 100644 --- a/docs/development/backend/satellite-communication.mdx +++ b/docs/development/backend/satellite-communication.mdx @@ -71,6 +71,26 @@ Satellites use **outbound-only HTTPS polling** to communicate with the backend, └─────────────────┘ Command Response └─────────────────┘ ``` +### Communication Channels + +The system uses three distinct communication patterns: + +**Command Polling (Backend → Satellite)**: +- Backend creates commands, satellites poll and execute +- Adaptive intervals: 2-60 seconds based on priority +- Used for: MCP server configuration, process management + +**Heartbeat (Satellite → Backend, Periodic)**: +- Satellites report status every 30 seconds +- Contains: System metrics, process counts, resource usage +- Used for: Health monitoring, capacity planning + +**Events (Satellite → Backend, Immediate)**: +- Satellites emit events when actions occur, batched every 3 seconds +- Contains: Point-in-time occurrences with precise timestamps +- Used for: Real-time UI updates, audit trails, user notifications +- See [Satellite Events](/development/backend/satellite-events) for detailed implementation + ### Dual Deployment Models **Global Satellites**: Cloud-hosted by DeployStack team @@ -474,6 +494,7 @@ For detailed API endpoints, request/response formats, and authentication pattern For detailed satellite architecture and implementation: +- [Satellite Events](/development/backend/satellite-events) - Real-time event processing system - [API Security](/development/backend/api-security) - Security patterns and authorization - [Database Management](/development/backend/database) - Schema and data management - [OAuth2 Server](/development/backend/oauth2-server) - OAuth2 implementation details diff --git a/docs/development/backend/satellite-events.mdx 
b/docs/development/backend/satellite-events.mdx new file mode 100644 index 0000000..8affe0c --- /dev/null +++ b/docs/development/backend/satellite-events.mdx @@ -0,0 +1,563 @@ +--- +title: Satellite Events System +description: Real-time event processing from satellites to backend with convention-based handler architecture and business logic routing. +--- + +# Satellite Events System + +The Satellite Events System provides real-time communication from satellites to the backend for operational visibility, audit trails, and user feedback. Events are processed through a convention-based dispatcher that routes them to handlers updating existing business tables. + +## Architecture Overview + +### Event Flow + +``` +Satellite → EventBus (3s batching) → POST /api/satellites/{id}/events → Backend Dispatcher → Handler → Business Table +``` + +**Key Principle**: Events are **routing triggers** that update existing business tables, not raw event storage. Each handler performs meaningful business logic rather than storing JSON blobs. + +### Why Events vs Heartbeat? 
+ +DeployStack uses three distinct communication channels: + +**Heartbeat (Every 30 seconds)**: +- Aggregate metrics and system health +- Resource monitoring and capacity planning +- Process counts grouped by team + +**Events (Immediate with 3s batching)**: +- Point-in-time occurrences with precise timestamps +- Real-time UI updates and user notifications +- Audit trails for compliance + +**Commands (Polling)**: +- Backend-initiated tasks +- Configuration updates and process management + +## Backend Implementation + +### Directory Structure + +``` +services/backend/src/events/satellite/ +├── index.ts # Event dispatcher (auto-discovers handlers) +├── types.ts # Shared TypeScript interfaces +├── mcp-server-started.ts # Updates satelliteProcesses status +├── mcp-server-crashed.ts # Updates satelliteProcesses with error +├── mcp-tool-executed.ts # Logs to satelliteUsageLogs +└── [future-event-types].ts # Additional handlers as needed +``` + +### Convention-Based Handler Discovery + +The dispatcher automatically discovers and registers handlers from the `handlerModules` array in `index.ts`: + +```typescript +const handlerModules = [ + () => import('./mcp-server-started'), + () => import('./mcp-tool-executed'), + () => import('./mcp-server-crashed'), + // Add new handlers here - they will be automatically registered +]; +``` + +Each handler must export three components: + +1. **EVENT_TYPE**: String constant identifying the event +2. **SCHEMA**: JSON Schema for AJV validation +3. 
**handle()**: Async function that updates business tables + +### Handler Interface + +All event handlers must implement this interface: + +```typescript +export interface EventHandler { + EVENT_TYPE: string; + SCHEMA: Record<string, unknown>; + handle: ( + satelliteId: string, + eventData: Record<string, unknown>, + db: LibSQLDatabase, + eventTimestamp: Date + ) => Promise<void>; +} +``` + +## Event Processing + +### Batch Endpoint + +**Route**: `POST /api/satellites/{satelliteId}/events` + +**Authentication**: Satellite API key (Bearer token via `requireSatelliteAuth()` middleware) + +**Request Schema**: +```json +{ + "events": [ + { + "type": "mcp.server.started", + "timestamp": "2025-01-10T10:30:45.123Z", + "data": { + "processId": "proc-123", + "serverId": "filesystem-team-xyz", + "serverName": "Filesystem MCP", + "teamId": "team-xyz", + "pid": 12345, + "localPort": 8080 + } + } + ] +} +``` + +**Response Schema**: +```json +{ + "success": true, + "processed": 45, + "failed": 0, + "event_ids": ["evt_1736512345_abc123", "evt_1736512346_def456"] +} +``` + +### Batch Processing Strategy + +The dispatcher processes batched events with isolated error handling: + +1. Validate request structure (events array present) +2. Validate batch size (1-100 events) +3. Process each event individually: + - Check event type exists in registry + - Validate event data against handler schema using AJV + - Parse and validate timestamp + - Call handler.handle() for valid events + - Track successful and failed events +4. Return aggregated results + +**Error Isolation**: Invalid events are logged and skipped without failing the entire batch. Valid events in the same batch are still processed.
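The per-event isolation described above can be sketched in a few lines. This is a simplified illustration, not the actual dispatcher: the real implementation validates `event.data` with compiled AJV schemas and passes a Drizzle database handle to each handler, while the stub below only checks the schema's `required` list and omits the `db` argument.

```typescript
interface SatelliteEvent {
  type: string;
  timestamp: string;
  data: Record<string, unknown>;
}

interface EventHandler {
  EVENT_TYPE: string;
  SCHEMA: { required?: string[] };
  // Real handlers also receive a Drizzle db handle; omitted here for brevity.
  handle: (satelliteId: string, data: Record<string, unknown>, ts: Date) => Promise<void>;
}

interface BatchResult {
  success: boolean;
  processed: number;
  failed: number;
  failures: { index: number; type: string; error: string }[];
}

async function processBatch(
  satelliteId: string,
  events: SatelliteEvent[],
  registry: Map<string, EventHandler>
): Promise<BatchResult> {
  if (events.length < 1 || events.length > 100) {
    throw new Error('Batch size must be between 1 and 100 events');
  }
  let processed = 0;
  const failures: BatchResult['failures'] = [];
  for (let index = 0; index < events.length; index++) {
    const event = events[index];
    const handler = registry.get(event.type);
    if (!handler) {
      failures.push({ index, type: event.type, error: 'Unknown event type' });
      continue; // skip this event, don't fail the batch
    }
    // Stand-in for AJV validation: check required fields only.
    const missing = (handler.SCHEMA.required ?? []).find((f) => !(f in event.data));
    if (missing) {
      failures.push({ index, type: event.type, error: `Missing required field: ${missing}` });
      continue;
    }
    const ts = new Date(event.timestamp);
    if (Number.isNaN(ts.getTime())) {
      failures.push({ index, type: event.type, error: 'Invalid timestamp' });
      continue;
    }
    try {
      await handler.handle(satelliteId, event.data, ts);
      processed++;
    } catch (err) {
      // Handler errors are tracked and the loop continues.
      failures.push({ index, type: event.type, error: String(err) });
    }
  }
  return { success: true, processed, failed: failures.length, failures };
}
```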
+ +### Partial Success Handling + +When some events fail validation, the endpoint returns partial success: + +```json +{ + "success": true, + "processed": 43, + "failed": 2, + "event_ids": ["evt_001", "evt_002", "..."], + "failures": [ + { + "index": 5, + "type": "mcp.unknown.event", + "error": "Unknown event type" + }, + { + "index": 12, + "type": "mcp.tool.executed", + "error": "Missing required field: toolName" + } + ] +} +``` + +## Implemented Event Types + +### MCP Server Lifecycle + +#### mcp.server.started +Updates `satelliteProcesses` table when server successfully spawns. + +**Business Logic**: Sets status='running', records start time and process PID. + +**Required Fields**: `processId`, `serverId`, `serverName`, `teamId` + +**Optional Fields**: `pid`, `localPort` + +#### mcp.server.crashed +Updates `satelliteProcesses` table when server exits unexpectedly. + +**Business Logic**: Sets status='failed', logs error details and exit code. + +**Required Fields**: `processId`, `serverId`, `serverName`, `teamId` + +**Optional Fields**: `exitCode`, `signal`, `errorMessage`, `stackTrace` + +### Tool Execution + +#### mcp.tool.executed +Inserts record into `satelliteUsageLogs` for analytics and audit trails. + +**Business Logic**: Logs tool execution with metrics, user context, and performance data. 
**Required Fields**: `toolName`, `serverId`, `teamId` + +**Optional Fields**: `processId`, `userId`, `durationMs`, `statusCode`, `errorMessage`, `requestSizeBytes`, `responseSizeBytes`, `userAgent`, `ipAddress` + +## Creating New Event Handlers + +### Handler Template + +Create a new file in `services/backend/src/events/satellite/`: + +```typescript +import type { LibSQLDatabase } from 'drizzle-orm/libsql'; +import { yourTable } from '../../db/schema.sqlite'; +import { eq } from 'drizzle-orm'; + +export const EVENT_TYPE = 'your.event.type'; + +export const SCHEMA = { + type: 'object', + properties: { + requiredField: { + type: 'string', + minLength: 1, + description: 'Description of this field' + }, + optionalField: { + type: 'number', + description: 'Optional numeric field' + } + }, + required: ['requiredField'], + additionalProperties: true +} as const; + +interface YourEventData { + requiredField: string; + optionalField?: number; +} + +export async function handle( + satelliteId: string, + eventData: Record<string, unknown>, + db: LibSQLDatabase, + eventTimestamp: Date +): Promise<void> { + const data = eventData as unknown as YourEventData; + + // Update existing business table + await db.update(yourTable) + .set({ + status: 'updated', + updated_at: eventTimestamp + }) + .where(eq(yourTable.id, data.requiredField)); +} +``` + +### Registration Steps + +1. Create handler file in `services/backend/src/events/satellite/` +2. Export `EVENT_TYPE`, `SCHEMA`, and `handle()` function +3. Add import to `handlerModules` array in `index.ts`: +```typescript +const handlerModules = [ + () => import('./mcp-server-started'), + () => import('./mcp-tool-executed'), + () => import('./mcp-server-crashed'), + () => import('./your-new-handler'), // Add here +]; +``` +4.
Handler is automatically registered and ready to process events + +## Schema Validation + +### AJV Configuration + +The dispatcher uses AJV with specific configuration for compatibility: + +```typescript +const ajv = new Ajv({ + allErrors: true, // Report all validation errors + strict: false, // Allow unknown keywords + strictTypes: false // Disable strict type checking +}); +addFormats(ajv); // Add format validators (email, date-time, etc.) +``` + +### Validation Process + +For each event: +1. Compile handler SCHEMA with AJV +2. Validate event.data against compiled schema +3. Log validation errors with instance path details +4. Skip invalid events (don't fail entire batch) + +### Schema Best Practices + +- Use `minLength: 1` for required string fields +- Include descriptive `description` fields for documentation +- Set `additionalProperties: true` to allow future extensibility +- Use `required` array for mandatory fields +- Leverage AJV formats: `email`, `date-time`, `uri`, `uuid` + +## Database Integration + +### Event-to-Table Mapping + +Events route to existing business tables based on their purpose: + +| Event Type | Business Table | Action | +|-----------|----------------|--------| +| `mcp.server.started` | `satelliteProcesses` | Update status='running', set start time | +| `mcp.server.crashed` | `satelliteProcesses` | Update status='failed', log error details | +| `mcp.tool.executed` | `satelliteUsageLogs` | Insert usage record with metrics | + +### Transaction Strategy + +Each event is processed in a separate database transaction: +- Failed events don't rollback other events +- Maintains data consistency per event +- Isolated error handling prevents cascade failures + +### Database Driver Compatibility + +When updating records, use the driver-compatible pattern: + +```typescript +const result = await db.update(table).set(data).where(condition); + +// Handle both SQLite (changes) and Turso (rowsAffected) +const updated = (result.changes || 
result.rowsAffected || 0) > 0; +``` + +## Performance Considerations + +### Batch Processing Efficiency + +- **Target**: < 100ms per 100-event batch +- **Isolation**: Each event in separate transaction +- **Logging**: Structured logging with batch metrics +- **Monitoring**: Track processing duration and success rates + +### Database Performance + +- Updates use indexed lookups (processId, satelliteId) +- Inserts optimized for high-volume logging +- No generic JSON storage overhead +- Leverages existing optimized table schemas + +### Memory Usage + +- Batch size limited to 100 events (backend validation) +- Event processing is sequential (simple implementation) +- No long-lived memory allocations +- Efficient JSON parsing with TypeScript interfaces + +## Error Handling + +### Invalid Event Type + +**Response**: Partial success with failure details + +**Logging**: Warn level with event type + +**Action**: Skip event, continue batch processing + +### Schema Validation Failure + +**Response**: Partial success with validation errors + +**Logging**: Warn level with instance path details + +**Action**: Skip event, log validation errors + +### Handler Execution Error + +**Response**: Partial success with error message + +**Logging**: Error level with stack trace + +**Action**: Catch error, track failure, continue batch + +### Database Transaction Failure + +**Response**: Partial success with database error + +**Logging**: Error level with query details + +**Action**: Rollback transaction, track failure, continue batch + +## Testing + +### Unit Testing + +Test individual event handlers in isolation: + +```typescript +// Test handler validation +const validData = { processId: 'proc-123', serverId: 'server-xyz', ... 
}; +await handler.handle('satellite-id', validData, mockDb, new Date()); + +// Test schema validation +const validate = ajv.compile(handler.SCHEMA); +expect(validate(validData)).toBe(true); +``` + +### Integration Testing + +Test full endpoint with satellite authentication: + +```bash +curl -X POST http://localhost:3000/api/satellites/{satelliteId}/events \ + -H "Authorization: Bearer {satellite_api_key}" \ + -H "Content-Type: application/json" \ + -d '{ + "events": [ + { + "type": "mcp.server.started", + "timestamp": "2025-01-10T10:30:45.123Z", + "data": { + "processId": "proc-123", + "serverId": "filesystem-test", + "serverName": "Filesystem MCP", + "teamId": "test-team" + } + } + ] + }' +``` + +### Batch Processing Tests + +- Single event batch (1 event) +- Normal batch (50 events) +- Maximum batch (100 events) +- Oversized batch (> 100 events, should reject) +- Mixed success/failure batch +- Unknown event type handling +- Invalid timestamp handling +- Schema validation failures + +## Monitoring and Debugging + +### Structured Logging + +All event operations are logged with structured data: + +```bash +# Event processing started +{"level":"info","satelliteId":"sat-123","batchSize":45} + +# Successful processing +{"level":"info","satelliteId":"sat-123","eventType":"mcp.server.started","msg":"Event processed"} + +# Validation failure +{"level":"warn","eventType":"unknown.type","msg":"Unknown event type"} + +# Batch complete +{"level":"info","satelliteId":"sat-123","processed":43,"failed":2,"msg":"Batch complete"} +``` + +### Debug Queries + +Check registered event types: + +```typescript +import { getRegisteredEventTypes } from '../events/satellite'; + +const types = await getRegisteredEventTypes(); +console.log('Registered event types:', types); +``` + +Verify database updates: + +```sql +-- Check process status after mcp.server.started +SELECT status, started_at, process_pid +FROM satelliteProcesses +WHERE id = 'proc-123'; + +-- Check tool execution logs +SELECT 
tool_name, duration_ms, status_code, timestamp +FROM satelliteUsageLogs +WHERE satellite_id = 'sat-123' +ORDER BY timestamp DESC +LIMIT 10; +``` + +## Best Practices + +### Event Handler Design + +**DO**: +- Update existing business tables with structured data +- Use TypeScript interfaces for type safety +- Include comprehensive field descriptions in schemas +- Log important state changes +- Handle optional fields gracefully + +**DON'T**: +- Store raw JSON in generic events tables +- Assume all optional fields are present +- Skip error handling in database operations +- Use blocking operations (keep handlers async) +- Duplicate business logic across handlers + +### Schema Design + +**DO**: +- Use descriptive field names matching domain concepts +- Include `description` for documentation +- Set appropriate `minLength` and format constraints +- Use `additionalProperties: true` for extensibility +- Mark truly required fields in `required` array + +**DON'T**: +- Over-constrain with excessive validation +- Use generic field names like `data` or `info` +- Forget to set `as const` on schema objects +- Validate business logic in schemas (do that in handlers) +- Create schemas with circular references + +### Database Operations + +**DO**: +- Use parameterized queries via Drizzle ORM +- Handle both SQLite and Turso driver differences +- Include timestamps for all state changes +- Use transactions for multi-step operations +- Index frequently queried fields + +**DON'T**: +- Concatenate SQL strings manually +- Assume specific driver properties exist +- Skip error handling for database operations +- Create N+1 query patterns +- Store large BLOBs in event data + +## Future Enhancements + +### Planned Event Types + +- **Client Connections**: `mcp.client.connected`, `mcp.client.disconnected` +- **Tool Discovery**: `mcp.tools.discovered`, `mcp.tools.updated` +- **Configuration**: `config.refreshed`, `config.error` +- **Satellite Lifecycle**: `satellite.registered`, 
`satellite.deregistered` +- **Process Management**: `mcp.server.restarted`, `mcp.server.permanently_failed` + +### Performance Optimizations + +- Batch database insertions for high-volume events +- Async event processing with job queue +- Event sampling for high-frequency events +- Compression for large event payloads + +### Analytics Features + +- Real-time event aggregation +- Custom alert rules based on events +- Event replay for debugging +- Historical event analysis dashboards + +## Related Documentation + +- [Satellite Event System](/development/satellite/event-system) - Satellite-side event emission +- [Satellite Communication](/development/backend/satellite-communication) - Full satellite communication architecture +- [API Documentation](/development/backend/api) - OpenAPI specification generation +- [Database Management](/development/backend/database) - Schema and migrations diff --git a/docs/development/frontend/ui-design-charts.mdx b/docs/development/frontend/ui-design-charts.mdx new file mode 100644 index 0000000..10c0922 --- /dev/null +++ b/docs/development/frontend/ui-design-charts.mdx @@ -0,0 +1,197 @@ +--- +title: Charts with Vue ECharts +description: Guide to using Apache ECharts for data visualization in DeployStack frontend +sidebar: Charts +--- + +# Charts with Vue ECharts + +DeployStack uses [vue-echarts](https://github.com/ecomfe/vue-echarts) for data visualization, providing powerful and performant charts with Apache ECharts integration for Vue 3. + +## Chart Components + +The chart components are available in `/components/ui/chart/` following the shadcn/vue pattern with CVA variants. + +### LineChart Component + +The `LineChart` component provides a simplified API for time-series data visualization, perfect for displaying metrics like HTTP calls, user activity, or any trend data. 
+ +A minimal usage example (the import path below is assumed from the component location above): + +```vue +<script setup lang="ts"> +import { LineChart } from '@/components/ui/chart' + +const data = [120, 200, 150, 80, 270, 310, 240] +const labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'] +</script> + +<template> + <LineChart :data="data" :labels="labels" name="HTTP Calls" /> +</template> +``` + +### LineChart Props + +| Prop | Type | Default | Description | +|------|------|---------|-------------| +| `data` | `number[]` | required | Array of data points | +| `labels` | `string[]` | required | Array of x-axis labels | +| `name` | `string` | `'Data'` | Series name for tooltip | +| `smooth` | `boolean` | `true` | Enable smooth line interpolation | +| `showArea` | `boolean` | `true` | Show area fill under line | +| `color` | `string` | `'#0f766e'` | Line color (DeployStack teal) | +| `areaColor` | `string` | `'rgba(15, 118, 110, 0.3)'` | Area gradient color | +| `size` | `'sm' \| 'md' \| 'lg' \| 'xl'` | `'md'` | Chart height | +| `loading` | `boolean` | `false` | Show loading state | +| `autoresize` | `boolean` | `true` | Auto-resize with container | + +### Size Variants + +The `size` prop controls chart height: + +```vue +<template> + <LineChart :data="data" :labels="labels" size="sm" /> + <LineChart :data="data" :labels="labels" size="md" /> + <LineChart :data="data" :labels="labels" size="lg" /> + <LineChart :data="data" :labels="labels" size="xl" /> +</template> +``` + +### Custom Styling + +Colors and line behavior can be overridden via props (example values): + +```html +<template> + <LineChart + :data="data" + :labels="labels" + color="#7c3aed" + area-color="rgba(124, 58, 237, 0.25)" + :smooth="false" + /> +</template> +``` + +## Advanced Usage: Base Chart Component + +For full control over chart configuration, use the base `Chart` component with custom ECharts options (this example assumes the base component accepts a standard ECharts `option` object): + +```html +<script setup lang="ts"> +import { ref } from 'vue' +import { Chart } from '@/components/ui/chart' + +const option = ref({ + xAxis: { type: 'category', data: ['Mon', 'Tue', 'Wed'] }, + yAxis: { type: 'value' }, + series: [{ type: 'line', data: [120, 200, 150], smooth: true }] +}) +</script> + +<template> + <Chart :option="option" /> +</template> +``` + +## Best Practices + +1. **Use LineChart for simple cases** - The simplified API reduces boilerplate +2. **Use Chart for complex visualizations** - When you need full ECharts control +3. **Set explicit height** - Always use size prop for consistent layouts +4. **Enable autoresize** - Charts automatically adapt to container size changes +5. **Handle loading states** - Use the `loading` prop for better UX +6.
**Use CanvasRenderer** - Better performance for most use cases + +## Resources + +- [GitHub Repository](https://github.com/ecomfe/vue-echarts) +- [Apache ECharts Documentation](https://echarts.apache.org/) +- [Chart Examples](https://echarts.apache.org/examples/) diff --git a/docs/development/satellite/architecture.mdx b/docs/development/satellite/architecture.mdx index 9237bb4..571ae08 100644 --- a/docs/development/satellite/architecture.mdx +++ b/docs/development/satellite/architecture.mdx @@ -19,13 +19,15 @@ Satellites operate as edge workers similar to GitHub Actions runners, providing: - **MCP Transport Protocols**: SSE, Streamable HTTP, Direct HTTP communication - **Dual MCP Server Management**: HTTP proxy + stdio subprocess support (ready for implementation) - **Team Isolation**: nsjail sandboxing with built-in resource limits (ready for implementation) -- **OAuth 2.1 Resource Server**: Token introspection with Backend (implemented) -- **Backend Polling Communication**: Outbound-only, firewall-friendly (implemented) +- **OAuth 2.1 Resource Server**: Token introspection with Backend +- **Backend Polling Communication**: Outbound-only, firewall-friendly +- **Real-Time Event System**: Immediate satellite → backend event emission with automatic batching - **Process Lifecycle Management**: Spawn, monitor, terminate MCP servers (ready for implementation) +- **Background Jobs System**: Cron-like recurring tasks with automatic error handling ## Current Implementation Architecture -### Phase 1: MCP Transport Layer (Implemented) +### Phase 1: MCP Transport Layer The current satellite implementation provides complete MCP client interface support: @@ -82,7 +84,7 @@ MCP Client Satellite │◀─── Response via SSE ─────│ (Stream response back) ``` -### Core Components (Implemented) +### Core Components **Session Manager:** - Cryptographically secure 32-byte base64url session IDs @@ -236,7 +238,7 @@ Each satellite instance will contain five core components: ## Communication 
Patterns -### Client-to-Satellite Communication (Implemented) +### Client-to-Satellite Communication **Multiple Transport Protocols:** - **SSE (Server-Sent Events)**: Real-time streaming with session management @@ -262,7 +264,7 @@ MCP Client Satellite - **Activity Tracking**: Updated on each message - **State Management**: Client info and initialization status -### Satellite-to-Backend Communication (Implemented) +### Satellite-to-Backend Communication **HTTP Polling Pattern:** ``` @@ -288,6 +290,34 @@ Satellite Backend For complete implementation details, see [Backend Polling Implementation](/development/satellite/polling). +### Real-Time Event System + +**Event Emission with Batching:** +``` +Satellite Operations EventBus Backend + │ │ │ + │─── mcp.server.started ──▶│ │ + │─── mcp.tool.executed ───▶│ [Queue] │ + │─── mcp.client.connected ─▶│ │ + │ [Every 3 seconds] │ + │ │ │ + │ │─── POST /events ───▶│ + │ │◀─── 200 OK ─────────│ +``` + +**Event Features:** +- **Immediate Emission**: Events emitted when actions occur (not delayed by 30s heartbeat) +- **Automatic Batching**: Events collected for 3 seconds, then sent as single batch (max 100 events) +- **Memory Management**: In-memory queue (10,000 event limit) with overflow protection +- **Graceful Error Handling**: 429 exponential backoff, 400 drops invalid events, 500/network errors retry +- **10 Event Types**: Server lifecycle, client connections, tool discovery, configuration updates + +**Difference from Heartbeat:** +- **Heartbeat** (every 30s): Aggregate metrics, system health, resource usage +- **Events** (immediate): Point-in-time occurrences, user actions, precise timestamps + +For complete event system documentation, see [Event System](/development/satellite/event-system). 
+ ## Security Architecture ### Current Security (No Authentication) @@ -468,9 +498,13 @@ The satellite service has completed **Phase 1: MCP Transport Implementation** an - **Command Processing**: HTTP MCP server management (spawn/kill/restart/health_check) - **Heartbeat Service**: Process status reporting and system metrics - **Configuration Sync**: Real-time MCP server configuration updates +- **Event System**: Real-time event emission with automatic batching (10 event types) **Foundation Infrastructure:** - **HTTP Server**: Fastify with Swagger documentation - **Logging System**: Pino with structured logging - **Build Pipeline**: TypeScript compilation and bundling - **Development Workflow**: Hot reload and code quality tools +- **Background Jobs System**: Cron-like job management for recurring tasks + +For details on the background jobs system, see [Background Jobs System](/development/satellite/background-jobs). diff --git a/docs/development/satellite/backend-communication.mdx b/docs/development/satellite/backend-communication.mdx index c022988..5464501 100644 --- a/docs/development/satellite/backend-communication.mdx +++ b/docs/development/satellite/backend-communication.mdx @@ -43,6 +43,30 @@ Satellites adjust polling frequency based on Backend guidance: - **Backoff Mode**: Exponential backoff up to 5 minutes on errors - **Maintenance Mode**: Reduced polling during maintenance windows +### Communication Channels + +The satellite uses three distinct communication channels with the Backend: + +**1. Command Polling (Backend → Satellite)** +- Backend creates commands, satellite polls and executes +- Adaptive intervals: 2-60 seconds based on command priority +- Used for: MCP server configuration, process management, system updates +- Direction: Backend initiates, satellite responds + +**2. 
Heartbeat (Satellite → Backend, Periodic)** +- Satellite reports status every 30 seconds +- Contains: System metrics, process counts, resource usage +- Used for: Health monitoring, capacity planning, aggregate analytics +- Direction: Satellite reports on fixed schedule + +**3. Events (Satellite → Backend, Immediate)** +- Satellite emits events when actions occur, batched every 3 seconds +- Contains: Point-in-time occurrences with precise timestamps +- Used for: Real-time UI updates, audit trails, user notifications +- Direction: Satellite reports immediately (not waiting for heartbeat) + +For detailed event system documentation, see [Event System](/development/satellite/event-system). + ## Current Implementation ### Phase 1: Basic Connection Testing ✅ diff --git a/docs/development/satellite/background-jobs.mdx b/docs/development/satellite/background-jobs.mdx new file mode 100644 index 0000000..47a58be --- /dev/null +++ b/docs/development/satellite/background-jobs.mdx @@ -0,0 +1,593 @@ +--- +title: Background Jobs System +description: Cron-like job system for managing recurring background tasks in DeployStack Satellite with automatic error handling and monitoring. +sidebar: Satellite Development +--- + +import { Callout } from 'fumadocs-ui/components/callout'; + +# Background Jobs System + +DeployStack Satellite implements a centralized job management system for recurring background tasks. The system provides a consistent pattern for cron-like operations with automatic error handling, execution metrics, and lifecycle management. 
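The pattern can be reduced to a small sketch before diving into the architecture. This is illustrative only and uses assumed names: the actual `BaseJob` in `src/jobs/` takes a Fastify logger and tracks richer metrics (execution timing, structured log events).

```typescript
interface JobStats {
  executionCount: number;
  errorCount: number;
  isRunning: boolean;
}

abstract class BaseJob {
  private timer?: ReturnType<typeof setInterval>;
  private stats: JobStats = { executionCount: 0, errorCount: 0, isRunning: false };

  constructor(protected readonly name: string, protected readonly intervalMs: number) {}

  /** Concrete jobs implement their work here. */
  protected abstract execute(): Promise<void>;

  /** Run immediately, then on every interval. Errors never kill the loop. */
  async start(): Promise<void> {
    this.stats.isRunning = true;
    await this.runOnce(); // immediate first run
    this.timer = setInterval(() => void this.runOnce(), this.intervalMs);
  }

  stop(): void {
    if (this.timer) clearInterval(this.timer);
    this.stats.isRunning = false;
  }

  getStats(): JobStats {
    return { ...this.stats };
  }

  private async runOnce(): Promise<void> {
    try {
      await this.execute();
      this.stats.executionCount++;
    } catch {
      this.stats.errorCount++; // caught: job keeps running on next interval
    }
  }
}

// Example concrete job
class CounterJob extends BaseJob {
  public ticks = 0;
  constructor() {
    super('counter', 60_000);
  }
  protected async execute(): Promise<void> {
    this.ticks++;
  }
}
```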
+ +## Architecture Overview + +The job system consists of three core components: + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Job System Architecture │ +│ │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ +│ │ BaseJob │ │ JobManager │ │ Concrete Job │ │ +│ │ │ │ │ │ │ │ +│ │ • Interval │◄───│ • Registry │◄───│ HeartbeatJob │ │ +│ │ • Execute │ │ • Lifecycle │ │ CleanupJob │ │ +│ │ • Metrics │ │ • Monitoring │ │ CustomJob │ │ +│ └──────────────┘ └──────────────┘ └──────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ +``` + +### BaseJob Abstract Class + +All jobs extend `BaseJob`, which provides: + +- **Automatic Interval Execution**: Jobs run on configured intervals +- **Immediate First Run**: Execute immediately on start, then follow interval +- **Error Handling**: Automatic error catching with structured logging +- **Execution Metrics**: Track execution count, timing, and errors +- **Lifecycle Management**: Start/stop methods with state tracking + +### JobManager + +The `JobManager` provides centralized control: + +- **Job Registry**: Register and track all jobs +- **Lifecycle Control**: Start/stop all jobs or individual jobs +- **Status Monitoring**: Query job statistics and execution state +- **Graceful Shutdown**: Stop all jobs cleanly on satellite shutdown + +## Current Jobs + +| Job Name | Interval | Purpose | Status | +|----------|----------|---------|--------| +| `heartbeat` | 30 seconds | Send status updates to backend | ✅ Active | +| `cleanup` | 5 minutes | Template for new jobs | 📝 Example | + +## Creating a New Job + +Add a new background job in three steps: + +### Step 1: Create Job File + +Create `src/jobs/process-health-job.ts`: + +```typescript +import { BaseJob } from './base-job'; +import { FastifyBaseLogger } from 'fastify'; + +export class ProcessHealthJob extends BaseJob { + constructor(logger: FastifyBaseLogger) { + super('process-health', 120000, logger); // 2 minutes 
+ } + + protected async execute(): Promise<void> { + this.logger.info({ + operation: 'process_health_check' + }, 'Checking process health...'); + + // Your job logic here + + this.logger.info({ + operation: 'process_health_complete' + }, 'Health check completed'); + } +} +``` + +### Step 2: Export from Index + +Add to `src/jobs/index.ts`: + +```typescript +export { ProcessHealthJob } from './process-health-job'; +``` + +### Step 3: Register in Server + +Add to `src/server.ts`: + +```typescript +import { JobManager, HeartbeatJob, CleanupJob, ProcessHealthJob } from './jobs'; + +const jobManager = new JobManager(server.log); +jobManager.register(new HeartbeatJob(heartbeatService)); +jobManager.register(new CleanupJob(server.log)); +jobManager.register(new ProcessHealthJob(server.log)); + +await jobManager.startAll(); +``` + +<Callout> +That's it! Your job will start running immediately and then execute every 2 minutes automatically. +</Callout> + +## Job Intervals + +Common interval values in milliseconds: + +```typescript +// Seconds +30 * 1000 // 30 seconds +60 * 1000 // 1 minute + +// Minutes +2 * 60 * 1000 // 2 minutes +5 * 60 * 1000 // 5 minutes +10 * 60 * 1000 // 10 minutes +15 * 60 * 1000 // 15 minutes +30 * 60 * 1000 // 30 minutes + +// Hours +60 * 60 * 1000 // 1 hour +6 * 60 * 60 * 1000 // 6 hours +24 * 60 * 60 * 1000 // 24 hours +``` + +### Environment-Configurable Intervals + +Make job intervals configurable: + +```typescript +export class MyJob extends BaseJob { + constructor(logger: FastifyBaseLogger) { + const interval = parseInt( + process.env.MY_JOB_INTERVAL || '300000', + 10 + ); + super('my-job', interval, logger); + } +} +``` + +Add to `.env.example`: +```bash +# My Job interval in milliseconds (default: 5 minutes) +MY_JOB_INTERVAL=300000 +``` + +## Jobs with Dependencies + +If your job needs access to services, inject them via constructor: + +```typescript +export class ProcessHealthJob extends BaseJob { + constructor( + logger: FastifyBaseLogger, + private
processManager: ProcessManager, + private runtimeState: RuntimeState + ) { + super('process-health', 120000, logger); + } + + protected async execute(): Promise<void> { + const processes = this.processManager.getAllProcesses(); + + for (const proc of processes) { + if (proc.errorCount > 10) { + this.logger.warn({ + process_id: proc.config.installation_id, + error_count: proc.errorCount + }, 'Process has high error count'); + } + } + } +} +``` + +Register with dependencies: + +```typescript +jobManager.register( + new ProcessHealthJob(server.log, processManager, runtimeState) +); +``` + +## Job Lifecycle + +### Initialization Flow + +``` +Satellite Startup + │ + ├── Register Satellite with Backend + │ + ├── Initialize Services + │ + ├── Create JobManager + │ + ├── Register Jobs + │ ├── new HeartbeatJob(heartbeatService) + │ ├── new CleanupJob(logger) + │ └── new CustomJob(logger) + │ + ├── jobManager.startAll() + │ ├── Start Job 1 → Execute immediately → Set interval + │ ├── Start Job 2 → Execute immediately → Set interval + │ └── Start Job 3 → Execute immediately → Set interval + │ + └── Satellite Ready +``` + +### Job Execution Flow + +``` +Job Start + │ + ├── Execute Immediately (first run) + │ ├── Log: job_execute_start + │ ├── Run execute() method + │ ├── Track execution time + │ ├── Log: job_execute_success + │ └── Update metrics + │ + ├── Wait for Interval + │ + └── Execute on Interval (repeating) + ├── Log: job_execute_start + ├── Run execute() method + ├── Handle errors (if any) + ├── Log: job_execute_success or job_execute_error + └── Update metrics → Repeat +``` + +## Monitoring and Observability + +### Structured Logging + +All job events are logged with structured data: + +```typescript +// Job started +{ + "operation": "job_start", + "job_name": "process-health", + "interval_ms": 120000, + "interval_seconds": 120 +} + +// Job executing +{ + "operation": "job_execute_start", + "job_name": "process-health", + "execution_number": 5 +} + +// Job completed +{ + 
"operation": "job_execute_success", + "job_name": "process-health", + "execution_number": 5, + "execution_time_ms": 234 +} + +// Job error +{ + "operation": "job_execute_error", + "job_name": "process-health", + "execution_number": 5, + "error_count": 2, + "error": "Connection timeout" +} +``` + +### Job Statistics + +Query job statistics via JobManager: + +```typescript +const stats = jobManager.getStats('process-health'); +// Returns: +{ + executionCount: 42, + errorCount: 2, + averageExecutionTime: 234, + isRunning: true +} +``` + +Get all job statistics: + +```typescript +const allStats = jobManager.getAllStats(); +// Returns array of all job statistics +``` + +## Error Handling + +### Automatic Error Recovery + +The `BaseJob` class automatically handles errors: + +```typescript +protected async execute(): Promise<void> { + // If this throws, BaseJob catches it + await someOperationThatMightFail(); + + // Job continues running on next interval +} +``` + +### Custom Error Handling + +Add custom error handling for specific scenarios: + +```typescript +protected async execute(): Promise<void> { + try { + await this.criticalOperation(); + } catch (error) { + this.logger.error({ error }, 'Critical operation failed'); + // Don't throw - let BaseJob track the error + } +} +``` + +### Timeout Protection + +Add timeouts for long-running operations: + +```typescript +protected async execute(): Promise<void> { + const controller = new AbortController(); + const timeout = setTimeout(() => controller.abort(), 60000); + + try { + await this.longOperation({ signal: controller.signal }); + } finally { + clearTimeout(timeout); + } +} +``` + +## Best Practices + +### 1. 
Keep Jobs Focused + +Each job should have a single responsibility: + +**Good:** +```typescript +export class SessionCleanupJob extends BaseJob { + protected async execute(): Promise<void> { + await this.cleanupExpiredSessions(); + } +} +``` + +**Bad:** +```typescript +export class MaintenanceJob extends BaseJob { + protected async execute(): Promise<void> { + await this.cleanupSessions(); + await this.checkProcessHealth(); + await this.rotateLogs(); + await this.updateMetrics(); + } +} +``` + +### 2. Choose Appropriate Intervals + +- **High-frequency (30s-1m)**: Health checks, critical monitoring +- **Medium (5m-15m)**: Cleanup tasks, periodic updates +- **Low (1h+)**: Reports, analytics, maintenance + +### 3. Document Job Purpose + +Add clear comments explaining what the job does: + +```typescript +/** + * Process Health Check Job + * + * Monitors all running MCP server processes and restarts unhealthy ones. + * Runs every 2 minutes to ensure quick failure detection. + * + * Checks: + * - Process still running + * - Error count within limits + * - Response time acceptable + * - Memory usage not excessive + */ +export class ProcessHealthJob extends BaseJob { + // ... +} +``` + +### 4. Use Structured Logging + +Always log with operation context: + +```typescript +protected async execute(): Promise<void> { + this.logger.info({ + operation: 'cleanup_start', + session_count: sessions.length + }, 'Starting session cleanup...'); + + // ... cleanup logic ... 
+ + + this.logger.info({ + operation: 'cleanup_complete', + removed_count: removed + }, 'Session cleanup completed'); +} +``` + +## Common Job Patterns + +### Health Check Pattern + +```typescript +export class HealthCheckJob extends BaseJob { + constructor( + logger: FastifyBaseLogger, + private service: ServiceToMonitor + ) { + super('health-check', 120000, logger); + } + + protected async execute(): Promise<void> { + const isHealthy = await this.service.checkHealth(); + + if (!isHealthy) { + this.logger.warn({ + operation: 'health_check_failed', + service: 'my-service' + }, 'Service health check failed'); + + await this.service.restart(); + } + } +} +``` + +### Cleanup Pattern + +```typescript +export class CleanupJob extends BaseJob { + constructor( + logger: FastifyBaseLogger, + private manager: ResourceManager + ) { + super('cleanup', 900000, logger); // 15 minutes + } + + protected async execute(): Promise<void> { + const expired = await this.manager.findExpired(); + + for (const resource of expired) { + await this.manager.cleanup(resource); + } + + this.logger.info({ + operation: 'cleanup_complete', + count: expired.length + }, 'Cleanup completed'); + } +} +``` + +### Metrics Collection Pattern + +```typescript +export class MetricsJob extends BaseJob { + constructor( + logger: FastifyBaseLogger, + private collector: MetricsCollector + ) { + super('metrics', 300000, logger); // 5 minutes + } + + protected async execute(): Promise<void> { + const metrics = await this.collector.collect(); + await this.collector.report(metrics); + } +} +``` + +## Troubleshooting + +### Job Not Starting + +Check if the job is registered: + +```bash +# Look for job_start logs +grep "job_start" satellite.log | grep "my-job" +``` + +Verify registration in code: + +```typescript +const jobs = jobManager.getRegisteredJobs(); +console.log(jobs); // Should include your job name +``` + +### Job Failing Repeatedly + +Check error logs: + +```bash +# Find job errors +grep "job_execute_error" satellite.log | 
grep "my-job" +``` + +Review error count in statistics: + +```typescript +const stats = jobManager.getStats('my-job'); +console.log(`Error count: ${stats.errorCount}`); +``` + +### Performance Issues + +Monitor execution time: + +```bash +# Check execution times +grep "job_execute_success" satellite.log | grep "my-job" +``` + +If execution time approaches interval: +- Increase the interval +- Optimize job logic +- Consider breaking into smaller jobs + +### Job Not Executing on Time + +Verify interval configuration: + +```typescript +// Log interval on job creation +this.logger.info({ + job_name: 'my-job', + interval_ms: this.intervalMs, + interval_seconds: this.intervalMs / 1000 +}, 'Job interval configured'); +``` + +Check system clock drift if timing is critical. + +## Future Enhancements + +Planned improvements to the job system: + +- Job dependencies (Job B waits for Job A completion) +- Conditional execution (skip job if condition not met) +- Job state persistence (resume after satellite restart) +- Distributed coordination (multi-satellite job scheduling) +- Retry logic with exponential backoff +- Dynamic interval adjustment based on load +- Prometheus metrics export +- Web UI for job management + +## Implementation Status + +**Current Features:** +- ✅ BaseJob abstract class with interval management +- ✅ JobManager for centralized control +- ✅ Automatic error handling and logging +- ✅ Execution metrics tracking +- ✅ HeartbeatJob integration +- ✅ Template job for reference + +**In Development:** +- 🚧 Job priority levels +- 🚧 Job status API endpoint +- 🚧 Advanced monitoring features + + +The job system is production-ready and actively used for the heartbeat service. The pattern has proven stable and is ready for additional jobs. 
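## Appendix: BaseJob Contract Sketch

The actual `src/jobs/base-job.ts` is not reproduced in this guide. The sketch below is an illustrative condensation of the contract described above (immediate first run, interval scheduling, error containment, metrics), not the real implementation; the `MinimalLogger` type and method bodies are assumptions, and the real class adds the structured `job_execute_*` logging shown earlier.

```typescript
// Illustrative sketch, not the actual src/jobs/base-job.ts.
type MinimalLogger = { info: (msg: string) => void; warn: (msg: string) => void };

abstract class BaseJob {
  private timer?: ReturnType<typeof setInterval>;
  private executionCount = 0;
  private errorCount = 0;

  constructor(
    protected readonly name: string,
    protected readonly intervalMs: number,
    protected readonly logger: MinimalLogger
  ) {}

  // Concrete jobs implement their work here.
  protected abstract execute(): Promise<void>;

  async start(): Promise<void> {
    await this.runOnce(); // immediate first run, then interval
    this.timer = setInterval(() => void this.runOnce(), this.intervalMs);
  }

  stop(): void {
    if (this.timer !== undefined) clearInterval(this.timer);
    this.timer = undefined;
  }

  getStats() {
    return {
      executionCount: this.executionCount,
      errorCount: this.errorCount,
      isRunning: this.timer !== undefined,
    };
  }

  private async runOnce(): Promise<void> {
    this.executionCount++;
    try {
      await this.execute();
    } catch {
      this.errorCount++; // error is caught; the job keeps its schedule
      this.logger.warn(`${this.name}: execution failed`);
    }
  }
}
```

A concrete job only supplies `execute()`; scheduling, metrics, and error containment are inherited, which is why adding a job is the three-step process shown earlier.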
+ diff --git a/docs/development/satellite/event-system.mdx b/docs/development/satellite/event-system.mdx new file mode 100644 index 0000000..c1f861c --- /dev/null +++ b/docs/development/satellite/event-system.mdx @@ -0,0 +1,914 @@ +--- +title: Event System +description: Real-time event emission from satellite to backend for operational visibility, audit trails, and user feedback. +sidebar: Satellite Development +--- + +import { Callout } from 'fumadocs-ui/components/callout'; + +# Satellite Event System + +The Satellite Event System provides real-time communication from satellites to the backend for operational visibility. Unlike the 30-second heartbeat cycle, events are emitted immediately when significant actions occur and batched for efficient transmission. + +## Why Events? + +DeployStack Satellites use a **polling-based communication pattern** where satellites make outbound HTTP requests to the backend. This is firewall-friendly and NAT-compatible, but creates a timing gap: + +**Problem**: Important satellite operations need immediate backend visibility +- MCP client connects (user expects instant UI feedback) +- Tool executes (audit trail needs precise timestamps) +- Process crashes (alerts need immediate dispatch) +- Security events (compliance requires real-time logging) + +**Solution**: Event emission with automatic batching +- Events emitted immediately when actions occur +- Batched every 3 seconds for network efficiency +- Sent to backend via existing authentication +- Zero impact on satellite performance + +## Architecture Overview + +``` +Satellite Components EventBus Backend + │ │ │ + │─── emit('mcp.server.started') ───▶│ │ + │ │ │ + │─── emit('mcp.tool.executed') ────▶│ │ + │ [Queue] │ + │─── emit('mcp.client.connected') ─▶│ │ + │ │ │ + │ [Every 3 seconds] │ + │ │ │ + │ │─── POST /events ───▶│ + │ │ │ + │ │◀─── 200 OK ─────────│ +``` + +### Key Components + +**EventBus Service** (`src/services/event-bus.ts`) +- In-memory queue for event collection +- 
3-second batch window (configurable) +- Automatic transmission to backend +- Graceful error handling and retry + +**Event Registry** (`src/events/registry.ts`) +- Type-safe event definitions +- Event data structures +- Compile-time validation + +**Backend Integration** (`src/services/backend-client.ts`) +- `sendEvents()` method for batch transmission +- Uses existing satellite authentication +- Handles partial success responses + +## Event Types + +The satellite emits 10 event types across 4 categories: + +### MCP Server Lifecycle + +#### `mcp.server.started` +Emitted when MCP server process successfully spawns and completes handshake. + +**Data Structure:** +```typescript +{ + server_id: string; // installation_id + server_slug: string; // installation_name + team_id: string; + process_id: number; // OS process ID + transport: 'stdio'; + tool_count: number; // Tools discovered (0 initially) + spawn_duration_ms: number; +} +``` + +**Example:** +```typescript +eventBus.emit('mcp.server.started', { + server_id: 'inst_abc123', + server_slug: 'filesystem', + team_id: 'team_xyz', + process_id: 12345, + transport: 'stdio', + tool_count: 0, + spawn_duration_ms: 234 +}); +``` + +#### `mcp.server.crashed` +Emitted when MCP server process exits unexpectedly with non-zero code. + +**Data Structure:** +```typescript +{ + server_id: string; + server_slug: string; + team_id: string; + process_id: number; + exit_code: number; + signal: string; // 'SIGTERM', 'SIGKILL', etc. + uptime_seconds: number; + crash_count: number; + will_restart: boolean; +} +``` + +#### `mcp.server.restarted` +Emitted after successful automatic restart following a crash. + +**Data Structure:** +```typescript +{ + server_id: string; + server_slug: string; + team_id: string; + old_process_id: number; + new_process_id: number; + restart_reason: 'crash'; + attempt_number: number; // 1, 2, or 3 +} +``` + +#### `mcp.server.permanently_failed` +Emitted when server exhausts all 3 restart attempts. 
+ +**Data Structure:** +```typescript +{ + server_id: string; + server_slug: string; + team_id: string; + total_crashes: number; + last_error: string; + failed_at: string; // ISO 8601 timestamp +} +``` + +### Client Connections + +#### `mcp.client.connected` +Emitted when MCP client establishes SSE connection to satellite. + +**Data Structure:** +```typescript +{ + session_id: string; + client_type: 'vscode' | 'cursor' | 'claude' | 'unknown'; + user_agent: string; + team_id: string; + transport: 'sse'; + ip_address: string; +} +``` + +**Client Type Detection:** +- Parses `User-Agent` header +- Detects VS Code, Cursor, Claude Desktop +- Falls back to 'unknown' for unrecognized clients + +#### `mcp.client.disconnected` +Emitted when SSE connection closes (client disconnect, timeout, or error). + +**Data Structure:** +```typescript +{ + session_id: string; + team_id: string; + connection_duration_seconds: number; + tool_execution_count: number; + disconnect_reason: 'client_close' | 'timeout' | 'error'; +} +``` + +### Tool Discovery + +#### `mcp.tools.discovered` +Emitted after successful tool discovery from HTTP or stdio MCP server. + +**Data Structure:** +```typescript +{ + server_id: string; + server_slug: string; + team_id: string; + tool_count: number; + tool_names: string[]; + discovery_duration_ms: number; + previous_tool_count: number; +} +``` + +#### `mcp.tools.updated` +Emitted when tool list changes during configuration refresh. + +**Data Structure:** +```typescript +{ + server_id: string; + server_slug: string; + team_id: string; + added_tools: string[]; + removed_tools: string[]; + total_tools: number; +} +``` + +### Configuration Management + +#### `config.refreshed` +Emitted after successful configuration fetch from backend. 
+ +**Data Structure:** +```typescript +{ + config_hash: string; + server_count: number; + teams_count: number; + change_detected: boolean; + fetch_duration_ms: number; +} +``` + +#### `config.error` +Emitted when configuration fetch fails. + +**Data Structure:** +```typescript +{ + error_type: 'server_error'; + error_message: string; + status_code: number | null; + retry_in: number; // Seconds until next retry +} +``` + +## Event Batching + +### Batch Window: 3 Seconds + +Events are collected in memory for 3 seconds, then sent as a single batch: + +``` +0s 3s 6s 9s +│───────────────│───────────────│───────────────│ +│ Collect events│ Send batch │ Collect events│ +│ (6 events) │ (6 events) │ (2 events) │ +``` + +**Benefits:** +- Reduces HTTP request overhead +- Efficient network usage +- Near real-time (3s latency acceptable) +- Backend-friendly batching + +### Max Batch Size: 100 Events + +If more than 100 events accumulate, only first 100 are sent: + +``` +0-3s: Collect 150 events +3s: Send first 100, keep 50 in queue +3-6s: Collect 30 more (queue = 80) +6s: Send 80 events +``` + +### Empty Batch Handling + +If no events occur, no HTTP request is made: + +``` +0-3s: No events +3s: Skip sending (no request) +3-6s: No events +6s: Skip sending (no request) +``` + +## Memory Management + +### Queue Limit: 10,000 Events + +The in-memory queue holds a maximum of 10,000 events: + +**Normal Operation:** +- Events queued and sent every 3 seconds +- Queue size typically under 100 events + +**Backend Outage:** +- Events accumulate in queue +- Queue grows up to 10,000 events +- Oldest events dropped when limit reached +- Dropped count logged for monitoring + +**Memory Usage:** +- Average event: 1-2KB +- 10,000 events ≈ 10-20MB RAM +- Acceptable footprint for satellite process + + +Queue is in-memory only. Satellite restarts clear the queue. This is acceptable because events are operational telemetry, not critical data requiring persistence. 
+ + +## Error Handling + +### Backend Response Codes + +**400 Bad Request** (Invalid event data) +- Drops the invalid event immediately +- Logs error with event details +- Continues processing other events +- No retry for malformed data + +**401 Unauthorized** (Authentication failed) +- Keeps events in queue +- Logs authentication error +- Retries in next batch cycle +- May indicate satellite needs re-registration + +**429 Too Many Requests** (Rate limited) +- Implements exponential backoff +- Backoff sequence: 3s → 6s → 12s → 24s → 48s (max) +- Keeps all events in queue +- Resumes normal 3s batching after successful send + +**500 Internal Server Error** (Backend failure) +- Keeps events in queue +- Logs backend error +- Retries in next 3s batch cycle +- Continues normal operations + +**Network Timeout / Connection Refused** +- Keeps events in queue +- Logs connection failure +- Retries in next 3s batch cycle +- Satellite continues operating normally + +### Retry Strategy + +**Natural Retry Pattern:** +- Failed batches remain in queue +- Next 3-second cycle automatically includes them +- No explicit retry logic needed + +**Exponential Backoff:** +- Only applies to 429 rate limit responses +- Temporary increase in batch interval +- Returns to 3s after successful send + +**Event Dropping:** +- Only drops events for 400 validation errors +- Never drops events for temporary failures +- Queue overflow drops oldest events (logged) + +## Graceful Shutdown + +When satellite receives SIGTERM or SIGINT: + +``` +1. Stop accepting new events +2. Cancel next scheduled batch +3. Flush all queued events immediately +4. Wait up to 5 seconds for completion +5. If successful: Log success, proceed with shutdown +6. 
If timeout: Force shutdown, log lost event count +``` + +**Configuration:** +```bash +EVENT_FLUSH_TIMEOUT_MS=5000 # 5-second grace period +``` + +## Integration Points + +### ProcessManager + +**Location:** `src/process/manager.ts` + +**Events Emitted:** +- `mcp.server.started` - After spawn + handshake +- `mcp.server.crashed` - On unexpected exit +- `mcp.server.restarted` - After auto-restart +- `mcp.server.permanently_failed` - After 3 failed restarts + +**Implementation Pattern:** +```typescript +// Constructor accepts optional EventBus +constructor( + logger: Logger, + eventBus?: EventBus +) { + this.eventBus = eventBus; +} + +// Emit events with try-catch protection +try { + this.eventBus?.emit('mcp.server.started', { + server_id: config.installation_id, + server_slug: config.installation_name, + team_id: config.team_id, + process_id: process.pid, + transport: 'stdio', + tool_count: 0, + spawn_duration_ms: elapsed + }); +} catch (error) { + this.logger.warn({ error }, 'Failed to emit event (non-fatal)'); +} +``` + +### SessionManager + +**Location:** `src/core/session-manager.ts` + +**Events Emitted:** +- `mcp.client.connected` - On new SSE session creation +- `mcp.client.disconnected` - On session cleanup + +**Client Type Detection:** +```typescript +private detectClientType(userAgent?: string): string { + if (!userAgent) return 'unknown'; + const ua = userAgent.toLowerCase(); + if (ua.includes('vscode')) return 'vscode'; + if (ua.includes('cursor')) return 'cursor'; + if (ua.includes('claude')) return 'claude'; + return 'unknown'; +} +``` + +### RemoteToolDiscoveryManager + +**Location:** `src/services/remote-tool-discovery-manager.ts` + +**Events Emitted:** +- `mcp.tools.discovered` - After initial tool discovery +- `mcp.tools.updated` - When tool list changes + +**Tool Change Detection:** +```typescript +// Compare previous and current tool lists +const addedTools = currentTools.filter(t => !previousTools.includes(t)); +const removedTools = 
previousTools.filter(t => !currentTools.includes(t)); + +if (addedTools.length > 0 || removedTools.length > 0) { + eventBus.emit('mcp.tools.updated', { + server_id, + server_slug, + team_id, + added_tools: addedTools, + removed_tools: removedTools, + total_tools: currentTools.length + }); +} +``` + +### DynamicConfigManager + +**Location:** `src/services/dynamic-config-manager.ts` + +**Events Emitted:** +- `config.refreshed` - After successful config fetch +- `config.error` - On config fetch failure + +**Configuration Hash:** +```typescript +import crypto from 'crypto'; + +const configHash = crypto + .createHash('sha256') + .update(JSON.stringify(config)) + .digest('hex') + .substring(0, 12); +``` + +## Configuration + +### Environment Variables + +```bash +# Event batching (default: 3000ms = 3 seconds) +EVENT_BATCH_INTERVAL_MS=3000 + +# Max events per batch (default: 100) +EVENT_MAX_BATCH_SIZE=100 + +# Max events in memory (default: 10000) +EVENT_MAX_QUEUE_SIZE=10000 + +# Graceful shutdown timeout (default: 5000ms) +EVENT_FLUSH_TIMEOUT_MS=5000 +``` + +### Development vs Production + +**Development:** +```bash +EVENT_BATCH_INTERVAL_MS=1000 # 1s for faster feedback +EVENT_MAX_QUEUE_SIZE=1000 # Smaller queue +LOG_LEVEL=debug # Verbose logging +``` + +**Production:** +```bash +EVENT_BATCH_INTERVAL_MS=3000 # Standard 3s +EVENT_MAX_QUEUE_SIZE=10000 # Full queue +LOG_LEVEL=info # Standard logging +``` + +## Monitoring Events + +### Structured Logging + +All event operations are logged with structured data: + +```bash +# Event emission +{"level":"debug","operation":"event_emitted","event_type":"mcp.server.started","queue_size":23} + +# Batch sending +{"level":"info","operation":"event_batch_sending","event_count":45,"queue_size":45} + +# Batch success +{"level":"info","operation":"event_batch_success","event_count":45,"duration_ms":234} + +# Queue overflow +{"level":"warn","operation":"event_queue_overflow","dropped_count":10,"queue_size":10000} + +# Backend errors 
+{"level":"error","operation":"event_batch_error","error":"Connection refused"} +``` + +### Log Searches + +```bash +# All event emissions +grep "event_emitted" logs/satellite.log + +# Specific event type +grep "mcp.server.started" logs/satellite.log + +# Batch operations +grep "event_batch" logs/satellite.log + +# Errors only +grep "event.*error" logs/satellite.log + +# Queue issues +grep "event_queue_overflow" logs/satellite.log +``` + +### EventBus Statistics + +Access runtime statistics programmatically: + +```typescript +const stats = eventBus.getStats(); +console.log({ + queueSize: stats.queueSize, // Current events in queue + totalEmitted: stats.totalEmitted, // Total events emitted + totalSent: stats.totalSent, // Total events sent to backend + totalFailed: stats.totalFailed, // Total send failures + totalDropped: stats.totalDropped, // Total events dropped + lastBatchSentAt: stats.lastBatchSentAt, // ISO timestamp + lastErrorAt: stats.lastErrorAt, // ISO timestamp + isShuttingDown: stats.isShuttingDown // Graceful shutdown status +}); +``` + +## Type Safety + +### Compile-Time Validation + +The event system uses TypeScript for complete type safety: + +```typescript +// ✅ Valid: Correct event type and data structure +eventBus.emit('mcp.server.started', { + server_id: 'inst_123', + server_slug: 'filesystem', + team_id: 'team_xyz', + process_id: 12345, + transport: 'stdio', + tool_count: 0, + spawn_duration_ms: 234 +}); + +// ❌ TypeScript Error: Unknown event type +eventBus.emit('invalid.event.type', { ... }); + +// ❌ TypeScript Error: Missing required field +eventBus.emit('mcp.server.started', { + server_id: 'inst_123', + // Missing: server_slug, team_id, process_id, etc. +}); + +// ❌ TypeScript Error: Wrong field type +eventBus.emit('mcp.server.started', { + server_id: 123, // Should be string + // ... 
+}); +``` + +### Event Registry + +**Location:** `src/events/registry.ts` + +```typescript +// Event type union +export type EventType = + | 'mcp.server.started' + | 'mcp.server.crashed' + | 'mcp.client.connected' + // ... all event types + +// Event data mapping +export interface EventDataMap { + 'mcp.server.started': { + server_id: string; + server_slug: string; + team_id: string; + process_id: number; + transport: 'stdio'; + tool_count: number; + spawn_duration_ms: number; + }; + // ... all event data structures +} + +// Complete event structure +export interface SatelliteEvent { + type: EventType; + timestamp: string; // ISO 8601 + data: EventDataMap[EventType]; +} +``` + +## Best Practices + +### DO ✅ + +**Wrap emit() calls in try-catch:** +```typescript +try { + this.eventBus?.emit('event.type', { ... }); +} catch (error) { + this.logger.warn({ error }, 'Failed to emit event (non-fatal)'); +} +``` + +**Use optional chaining:** +```typescript +// EventBus might be undefined during initialization +this.eventBus?.emit('event.type', { ... }); +``` + +**Include all required fields:** +```typescript +// TypeScript enforces this, but be explicit +eventBus.emit('mcp.server.started', { + server_id: config.installation_id, // Required + server_slug: config.installation_name, // Required + team_id: config.team_id, // Required + // ... all required fields +}); +``` + +**Calculate metrics before emitting:** +```typescript +const duration = Date.now() - startTime; +this.logger.info({ duration_ms: duration }); +eventBus.emit('operation.completed', { duration_ms: duration }); +``` + +**Use descriptive event names:** +```typescript +// ✅ Clear intent +'mcp.server.crashed' +'mcp.client.connected' + +// ❌ Vague +'server.event' +'client.update' +``` + +### DON'T ❌ + +**Never block on event emission:** +```typescript +// ❌ BAD: Don't await event emission +await eventBus.emit('event.type', { ... }); + +// ✅ GOOD: Fire-and-forget +eventBus.emit('event.type', { ... 
}); +``` + +**Never throw errors from emission failures:** +```typescript +// ❌ BAD: Event failure crashes service +eventBus.emit('event.type', { ... }); // Might throw + +// ✅ GOOD: Wrapped in try-catch +try { + eventBus.emit('event.type', { ... }); +} catch (error) { + logger.warn({ error }, 'Event emission failed (non-fatal)'); +} +``` + +**Never emit sensitive data:** +```typescript +// ❌ BAD: Includes passwords +eventBus.emit('auth.failed', { + username: 'user@example.com', + password: 'secret123' // Never log passwords! +}); + +// ✅ GOOD: Sanitized data +eventBus.emit('auth.failed', { + username: 'user@example.com', + reason: 'invalid_credentials' +}); +``` + +**Avoid high-frequency emission without sampling:** +```typescript +// ❌ BAD: Emits thousands of events +for (const item of largeArray) { + eventBus.emit('item.processed', { item }); +} + +// ✅ GOOD: Emit summary after batch +const processed = largeArray.map(processItem); +eventBus.emit('batch.processed', { + item_count: largeArray.length, + duration_ms: elapsed +}); +``` + +**Never assume EventBus is defined:** +```typescript +// ❌ BAD: Crashes if EventBus not initialized +this.eventBus.emit('event.type', { ... }); + +// ✅ GOOD: Optional chaining +this.eventBus?.emit('event.type', { ... 
}); +``` + +## Troubleshooting + +### Events Not Emitting + +**Symptom:** No `event_emitted` logs in satellite logs + +**Diagnosis:** +```typescript +// Check if EventBus is defined +console.log('EventBus defined:', !!this.eventBus); +``` + +**Fix:** Verify EventBus is assigned to service in `src/server.ts`: +```typescript +(yourService as any).eventBus = eventBus; +``` + +### Events Not Reaching Backend + +**Symptom:** Events emitted but not in backend database + +**Check backend connectivity:** +```bash +curl http://localhost:3000/api/health +``` + +**Check event batch errors:** +```bash +grep "event_batch_error" logs/satellite.log +``` + +**Check backend logs:** +```bash +# Backend should log received events +grep "satelliteEvents" logs/backend.log +``` + +### High Queue Size + +**Symptom:** `event_queue_overflow` warnings in logs + +**Causes:** +- Backend unreachable (network issues) +- Backend overloaded (429 responses) +- Very high event volume + +**Solutions:** +```bash +# Check backend connectivity +curl http://localhost:3000/api/satellites/{id}/events + +# Check for rate limiting +grep "429" logs/satellite.log + +# Monitor queue size +grep "queue_size" logs/satellite.log | tail -20 +``` + +### Batch Send Failures + +**Symptom:** Repeated `event_batch_error` logs + +**Check error details:** +```bash +grep "event_batch_error" logs/satellite.log | jq . 
+```
+
+**Common causes:**
+- Network timeout → Check network connectivity
+- 401 Unauthorized → Verify satellite API key
+- 500 Server Error → Check backend logs
+- Connection refused → Verify backend running
+
+## Performance Considerations
+
+### Network Efficiency
+
+**Batch Size Impact:**
+- 100 events/batch ≈ 100-200KB payload
+- Single HTTP request vs 100 individual requests
+- Reduced network overhead
+- Backend-friendly batching
+
+**Batch Interval Trade-offs:**
+- 3s default: Near real-time with efficient batching
+- 1s interval: More real-time, more requests
+- 5s interval: Less real-time, fewer requests
+
+### Memory Usage
+
+**Queue Memory:**
+- Average event: 1-2KB
+- Max queue: 10,000 events
+- Total memory: 10-20MB
+- Acceptable for satellite process
+
+**Queue Growth:**
+- Normal: < 100 events
+- Backend outage: Grows to 10,000
+- Overflow: Oldest events dropped
+
+### CPU Impact
+
+**Event Emission:**
+- Synchronous queue operation
+- No I/O during emit()
+- < 1ms overhead per event
+
+**Batch Processing:**
+- JSON serialization every 3 seconds
+- Single HTTP POST request
+- Minimal CPU impact
+
+## Future Enhancements
+
+### Disk-Based Queue (Planned)
+
+**Benefits:**
+- Survive satellite restarts
+- No event loss during crashes
+- Longer backend outage tolerance
+
+**Trade-offs:**
+- Increased complexity
+- Disk I/O overhead
+- Not needed for operational telemetry
+
+### Event Sampling (Planned)
+
+**High-Volume Events:**
+- Sample 10% of tool executions
+- 100% sampling for errors
+- Configurable sampling rates
+
+**Benefits:**
+- Reduced network traffic
+- Lower backend load
+- Maintained visibility into patterns
+
+### Real-Time Streaming (Future)
+
+**WebSocket Event Stream:**
+- Real-time event delivery to frontend
+- Sub-second latency
+- Live operational dashboards
+
+**Requirements:**
+- WebSocket infrastructure
+- Frontend event handling
+- Connection management
+
+## Related Documentation
+
+- [Backend Communication](/development/satellite/backend-communication) - Satellite-backend communication patterns
+- [Polling](/development/satellite/polling) - Command polling system
+- [Logging](/development/satellite/logging) - Structured logging configuration
+- [Process Management](/development/satellite/process-management) - MCP server lifecycle
diff --git a/docs/development/satellite/index.mdx b/docs/development/satellite/index.mdx
index 290feab..725a492 100644
--- a/docs/development/satellite/index.mdx
+++ b/docs/development/satellite/index.mdx
@@ -5,7 +5,7 @@ sidebar: Getting Started
 ---
 
 import { Card, Cards } from 'fumadocs-ui/components/card';
-import { Cloud, Shield, Plug, Settings, Network, TestTube, Wrench, BookOpen, Terminal, Users } from 'lucide-react';
+import { Cloud, Shield, Plug, Settings, Network, TestTube, Wrench, BookOpen, Terminal, Users, Timer } from 'lucide-react';
 
 # DeployStack Satellite Development
 
@@ -23,11 +23,13 @@ DeployStack Satellites are **edge workers** (similar to GitHub Actions runners)
 - ✅ **TypeScript + Webpack** build system with full type safety
 - ✅ **Development Workflow** with hot reload and linting
 - ✅ **Backend Communication** (polling, commands, heartbeat with team-grouped processes)
+- ✅ **Real-Time Event System** (immediate event emission with 3s batching, 10 event types)
 - ✅ **OAuth 2.1 Authentication** (token introspection, team context)
 - ✅ **stdio MCP Server Process Management** (spawn, monitor, auto-restart, terminate)
 - ✅ **Team Isolation** (environment-based: nsjail in production, plain spawn in dev)
 - ✅ **Auto-Restart Protection** (max 3 attempts, permanently_failed status)
 - ✅ **Tool Discovery** (HTTP and stdio MCP servers)
+- ✅ **Background Jobs System** (cron-like recurring tasks with automatic error handling)
 
 ## Architecture Vision
 
@@ -89,30 +91,6 @@ npm run lint # ESLint with auto-fix
 npm run release # Release management
 ```
 
-### Current MCP Transport Endpoints
-
-- **GET** `/sse` - Establish SSE connection with session management
-- **POST** `/message?session={id}` - Send JSON-RPC messages via SSE sessions
-- **GET/POST** `/mcp` - Streamable HTTP transport with optional sessions
-- **OPTIONS** `/mcp` - CORS preflight handling
-
-### Testing MCP Transport
-
-```bash
-# Test SSE connection
-curl -N -H "Accept: text/event-stream" http://localhost:3001/sse
-
-# Send JSON-RPC message (replace SESSION_ID)
-curl -X POST "http://localhost:3001/message?session=SESSION_ID" \
-  -H "Content-Type: application/json" \
-  -d '{"jsonrpc":"2.0","id":"1","method":"initialize","params":{}}'
-
-# Direct HTTP transport
-curl -X POST http://localhost:3001/mcp \
-  -H "Content-Type: application/json" \
-  -d '{"jsonrpc":"2.0","id":"1","method":"tools/list","params":{}}'
-```
-
 ## Development Guides
 
@@ -195,11 +173,27 @@ curl -X POST http://localhost:3001/mcp \
   >
     Satellite deployment patterns, monitoring, scaling, and operational considerations.
   </Card>
+  <Card
+    icon={<Timer />}
+    href="/development/satellite/background-jobs"
+    title="Background Jobs"
+  >
+    Cron-like job system for recurring tasks with automatic error handling and monitoring.
+  </Card>
+  <Card
+    icon={<Network />}
+    href="/development/satellite/event-system"
+    title="Event System"
+  >
+    Real-time event emission from satellite to backend with automatic batching and error handling.
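The queue behavior specified in the satellite event-system performance notes above (synchronous `emit()` with no I/O, a 3-second flush that sends one HTTP POST per batch, and a 10,000-event cap that drops the oldest events on overflow) can be sketched in TypeScript. This is an illustrative sketch only; `EventQueue` and its members are hypothetical names, not the actual satellite implementation:

```typescript
type SatelliteEvent = {
  type: string;                         // e.g. "mcp.server.started"
  timestamp: string;                    // ISO-8601 point-in-time occurrence
  data: Record<string, unknown>;
};

class EventQueue {
  private queue: SatelliteEvent[] = [];

  constructor(
    // In the real system this would POST the serialized batch to the backend.
    private readonly send: (batch: SatelliteEvent[]) => void,
    private readonly maxSize = 10_000,  // ~10-20MB at 1-2KB per event
  ) {}

  // Synchronous queue operation: no I/O, well under 1ms per event.
  emit(type: string, data: Record<string, unknown>): void {
    this.queue.push({ type, timestamp: new Date().toISOString(), data });
    if (this.queue.length > this.maxSize) {
      this.queue.shift();               // overflow: drop the oldest event
    }
  }

  // Driven by a timer every 3 seconds: one request per batch
  // instead of one request per event.
  flush(): number {
    const batch = this.queue.splice(0, this.queue.length);
    if (batch.length > 0) {
      this.send(batch);
    }
    return batch.length;
  }
}

// Usage with the default 3s interval:
// const queue = new EventQueue(batch => postToBackend(batch));
// setInterval(() => queue.flush(), 3_000);
```

Shrinking the interval to 1s trades more requests for lower latency, and growing it to 5s does the opposite, which is exactly the trade-off the interval notes above describe.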
+  </Card>
 </Cards>
 
 ## Current Features
 
-### MCP Transport Layer (Implemented)
+### MCP Transport Layer
 
 - **SSE Transport**: Server-Sent Events with session management
 - **SSE Messaging**: JSON-RPC message sending via established sessions
 - **Streamable HTTP**: Direct HTTP communication with optional streaming
 
@@ -222,29 +216,30 @@ curl -X POST http://localhost:3001/mcp \
 ## Implemented Features
 
-### Phase 2: MCP Server Process Management ✅ COMPLETED
+### Phase 2: MCP Server Process Management
 
 - **Process Lifecycle**: Spawn, monitor, auto-restart (max 3), and terminate MCP servers
 - **stdio Communication**: Full JSON-RPC 2.0 protocol over stdin/stdout
-- **HTTP Proxy**: Reverse proxy for external MCP server endpoints ✅ working
+- **HTTP Proxy**: Reverse proxy for external MCP server endpoints
 - **Health Monitoring**: Process crash detection with auto-restart
 - **Resource Limits**: nsjail with 100MB RAM, 60s CPU, 50 processes (production Linux)
 - **Tool Discovery**: Automatic tool caching from both HTTP and stdio servers
 - **Team-Grouped Heartbeat**: processes_by_team reporting every 30 seconds
 
-### Phase 3: Team Isolation (Infrastructure Ready)
+### Phase 3: Team Isolation
 
 - **nsjail Sandboxing**: Complete process isolation with built-in resource limits
 - **Namespace Isolation**: PID, mount, UTS, IPC namespaces per team
 - **Filesystem Isolation**: Team-specific read-only and writable directories
 - **Credential Management**: Secure environment injection via nsjail
 
-### Phase 4: Backend Integration ✅ COMPLETED
+### Phase 4: Backend Integration
 
 - **HTTP Polling**: Outbound communication with DeployStack Backend
 - **Configuration Sync**: Dynamic configuration updates from Backend
 - **Status Reporting**: Real-time satellite health and usage metrics
 - **Command Processing**: Execute Backend commands with acknowledgment
+- **Event System**: Real-time event emission with automatic batching (10 event types)
 
 ### Phase 5: Enterprise Features
 
-- **OAuth 2.1 Authentication**: Resource server with token introspection ✅ COMPLETED
+- **OAuth 2.1 Authentication**: Resource server with token introspection
 - **Audit Logging**: Complete audit trails for compliance
 - **Multi-Region Support**: Global satellite deployment
 - **Auto-Scaling**: Dynamic resource allocation based on demand
 
@@ -269,28 +264,12 @@ Follow established patterns when adding new routes:
 4. Use manual JSON serialization with `JSON.stringify()`
 5. Register routes in `src/routes/index.ts`
 
-### Logging Best Practices
-
-- Use structured logging with context objects
-- Pass logger instances as parameters to services
-- Include operation identifiers for traceability
-- Use appropriate log levels (debug, info, warn, error)
-- Avoid console.log statements in favor of Pino logger
-
 ### Configuration Management
 
 - Use environment variables for configuration
 - Provide sensible defaults for development
 - Document all configuration options
 - Support both development and production modes
 
-## Strategic Context
-
-The satellite service represents DeployStack's evolution from a developer tool into a comprehensive enterprise MCP management platform. This strategic pivot addresses:
-
-- **Adoption Friction**: Eliminates CLI installation barriers (12x better conversion)
-- **Market Differentiation**: Creates new "MCP-as-a-Service" category
-- **Enterprise Requirements**: Provides team isolation and compliance features
-- **Scalability**: Enables horizontal scaling and global deployment
-
 ## Contributing
 
@@ -300,5 +279,4 @@ When contributing to satellite development:
 3. **Document Changes**: Update relevant documentation for new features
 4. **Test Thoroughly**: Ensure changes work in both development and production
 5. **Consider Enterprise**: Design features with team isolation and security in mind
-6. **MCP Compliance**: Ensure JSON-RPC 2.0 protocol compliance
diff --git a/docs/development/satellite/process-management.mdx b/docs/development/satellite/process-management.mdx
index fd4f0a2..e8c746b 100644
--- a/docs/development/satellite/process-management.mdx
+++ b/docs/development/satellite/process-management.mdx
@@ -244,6 +244,39 @@ The ProcessManager emits events for monitoring and integration:
 - Request tracking includes: `request_id`, `method`, `duration_ms`
 - Error context includes: error messages, exit codes, signals
 
+## Event Emission
+
+The ProcessManager emits real-time events to the Backend for operational visibility and audit trails. These events are batched every 3 seconds and sent via the Event System.
+
+### Lifecycle Events
+
+**mcp.server.started**
+- Emitted after successful spawn and handshake completion
+- Includes: server_id, process_id, spawn_duration_ms, tool_count
+- Provides immediate visibility into new MCP server availability
+
+**mcp.server.crashed**
+- Emitted on unexpected process exit with a non-zero code
+- Includes: exit_code, signal, uptime_seconds, crash_count, will_restart
+- Enables real-time alerting for process failures
+
+**mcp.server.restarted**
+- Emitted after successful automatic restart
+- Includes: old_process_id, new_process_id, restart_reason, attempt_number
+- Tracks restart attempts for reliability monitoring
+
+**mcp.server.permanently_failed**
+- Emitted when the restart limit (3 attempts) is exceeded
+- Includes: total_crashes, last_error, failed_at timestamp
+- Critical alert requiring manual intervention
+
+**Event System vs. Internal Events:**
+- ProcessManager internal events (processSpawned, processTerminated, etc.) are for satellite-internal coordination
+- Event System events (mcp.server.started, etc.) are sent to the Backend for external visibility
+- Both work together: internal events trigger state changes, Event System events provide the audit trail
+
+For complete event system documentation and all event types, see [Event System](/development/satellite/event-system).
+
 ## Team Isolation
 
 ### Installation Name Format