
Memory leak during Socket.IO reconnection failures leads to OOM crash #88

@Lies188

Description

Summary

HAPI experiences a memory leak when Socket.IO connections fail and enter a reconnection loop. After prolonged reconnection failures, the system runs out of memory and crashes.

Environment

  • HAPI Version: 0.9.2
  • Installation: bun install -g @twsxtd/hapi-linux-x64
  • OS: Ubuntu 24.04.3 LTS (Noble Numbat)
  • Kernel: 6.8.0-90-generic x86_64
  • CPU: AMD EPYC 7K62 48-Core Processor (4 vCPUs)
  • Memory: 7.5 GB RAM + 1.9 GB Swap
  • Node.js: v24.13.0
  • Bun: 1.3.6

Running Services

The following HAPI processes were running:

  • `hapi server`
  • `hapi claude --yolo`
  • `hapi daemon start-sync`
  • `hapi codex` (spawned by daemon for remote sessions)

Timeline of Events

1. Socket Disconnection (00:36:59)

The initial disconnection was caused by a ping timeout:

```
[00:36:59.175] [API] Socket disconnected: ping timeout
```

2. Continuous Reconnection Failures (00:37 - 01:11)

Socket.IO entered a reconnection loop with 35+ consecutive failures over ~35 minutes:

```
[00:37:50.616] [API] Socket connection error: {}
[00:39:20.261] [API] Socket connection error: {}
[00:40:28.289] [API] Socket connection error: {}
... (35+ more errors)
[01:11:25.322] [API] Socket connection error: {}
```

3. Memory Pressure Begins (01:08:44)

System started experiencing memory pressure during the reconnection loop:

```
Jan 19 01:08:44 systemd-journald[74650]: Under memory pressure, flushing caches.
Jan 19 01:08:46 systemd-journald[74650]: Under memory pressure, flushing caches.
... (continuous until crash)
Jan 19 01:14:49 systemd-journald[74650]: Under memory pressure, flushing caches.
```

4. System Crash (01:17:26)

The system became unresponsive and automatically rebooted at 01:17:26.

Code Analysis

After reviewing the source code, I identified potential issues in the Socket.IO client configuration:

In `cli/src/api/apiSession.ts` (lines 74-87):

```typescript
this.socket = io(`${configuration.serverUrl}/cli`, {
    auth: { ... },
    path: '/socket.io/',
    reconnection: true,
    reconnectionAttempts: Infinity, // ⚠️ No limit on reconnection attempts
    reconnectionDelay: 1000,
    reconnectionDelayMax: 5000, // ⚠️ Max delay only 5 seconds
    transports: ['websocket'],
    autoConnect: false
})
```

In `cli/src/api/apiMachine.ts` (lines 217-229):

```typescript
this.socket = io(`${configuration.serverUrl}/cli`, {
    transports: ['websocket'],
    auth: { ... },
    path: '/socket.io/',
    reconnection: true,
    reconnectionDelay: 1000,
    reconnectionDelayMax: 5000 // ⚠️ Same issue
})
```

Potential Issues

  1. Unlimited reconnection attempts (`reconnectionAttempts: Infinity`): The client will attempt to reconnect forever, potentially accumulating resources with each failed attempt.

  2. Short max delay (`reconnectionDelayMax: 5000`): With only 5 seconds max delay, during prolonged outages the client makes frequent reconnection attempts (every 5 seconds after reaching max delay).

  3. No cleanup between attempts: Looking at the disconnect handler, there's no explicit cleanup of internal buffers or event listeners that might accumulate.

  4. Concurrent resource usage: The Codex session was processing a large code review task with 4,409,050 input tokens, adding to memory pressure.
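Issue 3 is the classic accumulation pattern: if event handlers are re-registered on every reconnect without removing the previous ones, each cycle leaks a listener plus everything it captures. A minimal sketch of the leak and its fix using Node's stdlib `EventEmitter` as a stand-in (the handler and function names here are illustrative; Socket.IO sockets expose the same `on`/`off` API):

```typescript
import { EventEmitter } from 'node:events'

const socket = new EventEmitter() // stand-in for a Socket.IO socket

function onMessage(data: unknown) { /* handle incoming data */ }

// Leaky pattern: if this runs on every (re)connect, each call stacks
// one more copy of the listener.
function subscribeLeaky() {
    socket.on('message', onMessage)
}

// Fixed pattern: detach the previous handler before re-attaching it,
// so repeated calls keep the listener count bounded.
function subscribeSafe() {
    socket.off('message', onMessage)
    socket.on('message', onMessage)
}

for (let i = 0; i < 5; i++) subscribeLeaky()
console.log(socket.listenerCount('message')) // 5 accumulated listeners

socket.removeAllListeners('message')
for (let i = 0; i < 5; i++) subscribeSafe()
console.log(socket.listenerCount('message')) // stays at 1
```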

Memory Usage Before Crash

Based on current measurements after restart, HAPI processes typically consume:

| Process                     | Memory      |
| --------------------------- | ----------- |
| `hapi claude` (2 instances) | ~360 MB     |
| `hapi server`               | ~170 MB     |
| `hapi daemon`               | ~150 MB     |
| `hapi codex`                | ~160 MB     |
| `hapi mcp`                  | ~140 MB     |
| **Total HAPI**              | **~1 GB**   |

Suggested Fixes

1. Add maximum reconnection attempts

```typescript
reconnectionAttempts: 50, // Stop after 50 failed attempts
```

2. Increase maximum reconnection delay (exponential backoff)

```typescript
reconnectionDelayMax: 30000, // 30 seconds max delay
```

3. Add reconnection failure handler

```typescript
// Note: in Socket.IO v3+, reconnection events are emitted by the
// Manager (socket.io), not by the socket itself.
this.socket.io.on('reconnect_failed', () => {
    logger.error('[API] Reconnection failed after max attempts')
    // Optionally: notify the user, trigger graceful shutdown, or restart
})
```
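Taken together, fixes 1–3 amount to a reconnection configuration along these lines (the values are illustrative, not tuned; `randomizationFactor` is Socket.IO's built-in jitter option):

```typescript
// Illustrative Socket.IO reconnection options combining fixes 1-3.
// All values are placeholders to be tuned for real deployments.
const reconnectOptions = {
    reconnection: true,
    reconnectionAttempts: 50,     // finite, instead of Infinity
    reconnectionDelay: 1000,      // 1 s initial delay
    reconnectionDelayMax: 30000,  // cap backoff at 30 s, instead of 5 s
    randomizationFactor: 0.5      // jitter to avoid synchronized retry storms
}
```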

4. Consider using the existing `withRetry` utility

The codebase already has a well-designed retry utility in `cli/src/utils/time.ts` with proper exponential backoff and max attempts. Consider using similar logic for Socket.IO reconnection.
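The exact signature of `withRetry` isn't shown here, but the core of any such utility is a bounded, capped exponential backoff schedule; a hypothetical sketch:

```typescript
// Hypothetical capped exponential backoff, similar in spirit to a
// withRetry-style utility: the delay doubles per attempt up to a ceiling.
function backoffDelay(attempt: number, baseMs = 1000, maxMs = 30000): number {
    return Math.min(baseMs * 2 ** attempt, maxMs)
}

// With the defaults, attempts 0..5 wait 1 s, 2 s, 4 s, 8 s, 16 s, 30 s.
```

Contrast this with the current configuration, where the 5-second cap means the client settles into an attempt every ~5 seconds indefinitely.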

5. Add memory monitoring

Consider adding a memory watchdog that logs warnings when memory usage exceeds thresholds and takes corrective action before OOM.
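A minimal watchdog sketch using Node's `process.memoryUsage()` (the threshold, poll interval, and log format are illustrative placeholders):

```typescript
// Minimal memory watchdog sketch: warns when resident set size crosses
// a threshold. Threshold, interval, and the corrective action taken are
// illustrative and would need tuning for HAPI's real footprint.
const MEMORY_LIMIT_BYTES = 1.5 * 1024 * 1024 * 1024 // 1.5 GB, illustrative

function checkMemory(limit = MEMORY_LIMIT_BYTES): boolean {
    const { rss } = process.memoryUsage()
    if (rss > limit) {
        console.warn(`[watchdog] RSS ${Math.round(rss / 1024 / 1024)} MB exceeds limit`)
        return true
    }
    return false
}

// Poll periodically; unref() so the timer does not keep the process alive.
setInterval(() => checkMemory(), 60_000).unref()
```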

Logs

Full HAPI log file: `~/.hapi/logs/2026-01-18-01-44-23-pid-651323.log` (4.7 MB)

Steps to Reproduce

  1. Start HAPI services (`hapi server`, `hapi claude --yolo`, `hapi daemon start-sync`)
  2. Run a resource-intensive task (e.g., large code review with Codex)
  3. Simulate network instability or server unavailability causing Socket.IO disconnection
  4. Wait for prolonged reconnection failures (30+ minutes)
  5. Observe memory pressure and eventual OOM

Impact

  • Long-running HAPI services are at risk of OOM crashes
  • Requires manual restart after crash
  • Can cause data loss for in-progress tasks

Thank you for developing this excellent tool! Hope this detailed report helps identify and fix the issue. 🙏
