Summary
HAPI experiences a memory leak when Socket.IO connections fail and enter a reconnection loop. After prolonged reconnection failures, the system runs out of memory and crashes.
Environment
- HAPI Version: 0.9.2
- Installation: `bun install -g @twsxtd/hapi-linux-x64`
- OS: Ubuntu 24.04.3 LTS (Noble Numbat)
- Kernel: 6.8.0-90-generic x86_64
- CPU: AMD EPYC 7K62 48-Core Processor (4 vCPUs)
- Memory: 7.5 GB RAM + 1.9 GB Swap
- Node.js: v24.13.0
- Bun: 1.3.6
Running Services
The following HAPI processes were running:
- `hapi server`
- `hapi claude --yolo`
- `hapi daemon start-sync`
- `hapi codex` (spawned by daemon for remote sessions)
Timeline of Events
1. Socket Disconnection (00:36:59)
The initial disconnection was caused by a ping timeout:
```
[00:36:59.175] [API] Socket disconnected: ping timeout
```
2. Continuous Reconnection Failures (00:37 - 01:11)
Socket.IO entered a reconnection loop with 35+ consecutive failures over ~35 minutes:
```
[00:37:50.616] [API] Socket connection error: {}
[00:39:20.261] [API] Socket connection error: {}
[00:40:28.289] [API] Socket connection error: {}
... (35+ more errors)
[01:11:25.322] [API] Socket connection error: {}
```
3. Memory Pressure Begins (01:08:44)
System started experiencing memory pressure during the reconnection loop:
```
Jan 19 01:08:44 systemd-journald[74650]: Under memory pressure, flushing caches.
Jan 19 01:08:46 systemd-journald[74650]: Under memory pressure, flushing caches.
... (continuous until crash)
Jan 19 01:14:49 systemd-journald[74650]: Under memory pressure, flushing caches.
```
4. System Crash (01:17:26)
The system became unresponsive and automatically rebooted at 01:17:26.
Code Analysis
After reviewing the source code, I identified potential issues in the Socket.IO client configuration:
In `cli/src/api/apiSession.ts` (lines 74-87):
```typescript
this.socket = io(`${configuration.serverUrl}/cli`, {
    auth: { ... },
    path: '/socket.io/',
    reconnection: true,
    reconnectionAttempts: Infinity, // retries forever
    reconnectionDelay: 1000,
    reconnectionDelayMax: 5000,     // backoff capped at 5 s
    transports: ['websocket'],
    autoConnect: false
})
```
In `cli/src/api/apiMachine.ts` (lines 217-229):
```typescript
this.socket = io(`${configuration.serverUrl}/cli`, {
    transports: ['websocket'],
    auth: { ... },
    path: '/socket.io/',
    reconnection: true,
    reconnectionDelay: 1000,
    reconnectionDelayMax: 5000      // backoff capped at 5 s
})
```
Potential Issues
- Unlimited reconnection attempts (`reconnectionAttempts: Infinity`): the client will attempt to reconnect forever, potentially accumulating resources with each failed attempt.
- Short max delay (`reconnectionDelayMax: 5000`): the backoff saturates at 5 seconds after only a few attempts, so during a prolonged outage the client fires a reconnection attempt roughly every 5 seconds indefinitely.
- No cleanup between attempts: the disconnect handler performs no explicit cleanup of internal buffers or event listeners, which may accumulate across attempts.
- Concurrent resource usage: the Codex session was processing a large code review task with 4,409,050 input tokens, adding to memory pressure.
Memory Usage Before Crash
Based on current measurements after restart, HAPI processes typically consume:
| Process | Memory |
|---|---|
| hapi claude (2 instances) | ~360 MB |
| hapi server | ~170 MB |
| hapi daemon | ~150 MB |
| hapi codex | ~160 MB |
| hapi mcp | ~140 MB |
| Total HAPI | ~1 GB |
Suggested Fixes
1. Add maximum reconnection attempts
```typescript
reconnectionAttempts: 50, // Stop after 50 failed attempts
```
2. Increase maximum reconnection delay (exponential backoff)
```typescript
reconnectionDelayMax: 30000, // 30 seconds max delay
```
3. Add a reconnection failure handler (note: Socket.IO only emits `reconnect_failed` once `reconnectionAttempts` is finite, so this depends on fix 1)
```typescript
this.socket.on('reconnect_failed', () => {
    logger.error('[API] Reconnection failed after max attempts')
    // Optionally: notify user, trigger graceful shutdown, or restart
})
```
4. Consider using the existing `withRetry` utility
The codebase already has a well-designed retry utility in `cli/src/utils/time.ts` with proper exponential backoff and max attempts. Consider using similar logic for Socket.IO reconnection.
5. Add memory monitoring
Consider adding a memory watchdog that logs warnings when memory usage exceeds thresholds and takes corrective action before OOM.
Logs
Full HAPI log file: `~/.hapi/logs/2026-01-18-01-44-23-pid-651323.log` (4.7 MB)
Steps to Reproduce
- Start HAPI services (`hapi server`, `hapi claude --yolo`, `hapi daemon start-sync`)
- Run a resource-intensive task (e.g., large code review with Codex)
- Simulate network instability or server unavailability causing Socket.IO disconnection
- Wait for prolonged reconnection failures (30+ minutes)
- Observe memory pressure and eventual OOM
Impact
- Long-running HAPI services are at risk of OOM crashes
- Requires manual restart after crash
- Can cause data loss for in-progress tasks
Thank you for developing this excellent tool! Hope this detailed report helps identify and fix the issue. 🙏