
Memory leak during Socket.IO reconnection failures leads to OOM crash #88

@Lies188

Description

Summary

HAPI experiences a memory leak when Socket.IO connections fail and enter a reconnection loop. After prolonged reconnection failures, the system runs out of memory and crashes.

Environment

  • HAPI Version: 0.9.2
  • Installation: bun install -g @twsxtd/hapi-linux-x64
  • OS: Ubuntu 24.04.3 LTS (Noble Numbat)
  • Kernel: 6.8.0-90-generic x86_64
  • CPU: AMD EPYC 7K62 48-Core Processor (4 vCPUs)
  • Memory: 7.5 GB RAM + 1.9 GB Swap
  • Node.js: v24.13.0
  • Bun: 1.3.6

Running Services

The following HAPI processes were running:

  • `hapi server`
  • `hapi claude --yolo`
  • `hapi daemon start-sync`
  • `hapi codex` (spawned by daemon for remote sessions)

Timeline of Events

1. Socket Disconnection (00:36:59)

The initial disconnection was caused by a ping timeout:

```
[00:36:59.175] [API] Socket disconnected: ping timeout
```

2. Continuous Reconnection Failures (00:37 - 01:11)

Socket.IO entered a reconnection loop with 35+ consecutive failures over ~35 minutes:

```
[00:37:50.616] [API] Socket connection error: {}
[00:39:20.261] [API] Socket connection error: {}
[00:40:28.289] [API] Socket connection error: {}
... (35+ more errors)
[01:11:25.322] [API] Socket connection error: {}
```

3. Memory Pressure Begins (01:08:44)

System started experiencing memory pressure during the reconnection loop:

```
Jan 19 01:08:44 systemd-journald[74650]: Under memory pressure, flushing caches.
Jan 19 01:08:46 systemd-journald[74650]: Under memory pressure, flushing caches.
... (continuous until crash)
Jan 19 01:14:49 systemd-journald[74650]: Under memory pressure, flushing caches.
```

4. System Crash (01:17:26)

The system became unresponsive and automatically rebooted at 01:17:26.

Code Analysis

After reviewing the source code, I identified potential issues in the Socket.IO client configuration:

In `cli/src/api/apiSession.ts` (lines 74-87):

```typescript
this.socket = io(`${configuration.serverUrl}/cli`, {
    auth: { ... },
    path: '/socket.io/',
    reconnection: true,
    reconnectionAttempts: Infinity, // ⚠️ No limit on reconnection attempts
    reconnectionDelay: 1000,
    reconnectionDelayMax: 5000, // ⚠️ Max delay only 5 seconds
    transports: ['websocket'],
    autoConnect: false
})
```

In `cli/src/api/apiMachine.ts` (lines 217-229):

```typescript
this.socket = io(`${configuration.serverUrl}/cli`, {
    transports: ['websocket'],
    auth: { ... },
    path: '/socket.io/',
    reconnection: true,
    reconnectionDelay: 1000,
    reconnectionDelayMax: 5000 // ⚠️ Same issue
})
```

Potential Issues

  1. Unlimited reconnection attempts (`reconnectionAttempts: Infinity`): The client will attempt to reconnect forever, potentially accumulating resources with each failed attempt.

  2. Short max delay (`reconnectionDelayMax: 5000`): With only 5 seconds max delay, during prolonged outages the client makes frequent reconnection attempts (every 5 seconds after reaching max delay).

  3. No cleanup between attempts: Looking at the disconnect handler, there's no explicit cleanup of internal buffers or event listeners that might accumulate.

  4. Concurrent resource usage: The Codex session was processing a large code review task with 4,409,050 input tokens, adding to memory pressure.
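Issue 3 is the classic accumulation pattern: if event handlers are re-registered on every reconnect without removing the previous ones, each cycle leaks a listener plus everything it captures. A minimal sketch of the leak and its fix using Node's stdlib `EventEmitter` as a stand-in (the handler and function names here are illustrative; Socket.IO sockets expose the same `on`/`off` API):

```typescript
import { EventEmitter } from 'node:events'

const socket = new EventEmitter() // stand-in for a Socket.IO socket

function onMessage(data: unknown) { /* handle incoming data */ }

// Leaky pattern: if this runs on every (re)connect, each call stacks
// one more copy of the listener.
function subscribeLeaky() {
    socket.on('message', onMessage)
}

// Fixed pattern: detach the previous handler before re-attaching it,
// so repeated calls keep the listener count bounded.
function subscribeSafe() {
    socket.off('message', onMessage)
    socket.on('message', onMessage)
}

for (let i = 0; i < 5; i++) subscribeLeaky()
console.log(socket.listenerCount('message')) // 5 accumulated listeners

socket.removeAllListeners('message')
for (let i = 0; i < 5; i++) subscribeSafe()
console.log(socket.listenerCount('message')) // stays at 1
```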

Memory Usage Before Crash

Based on current measurements after restart, HAPI processes typically consume:

| Process                     | Memory      |
| --------------------------- | ----------- |
| `hapi claude` (2 instances) | ~360 MB     |
| `hapi server`               | ~170 MB     |
| `hapi daemon`               | ~150 MB     |
| `hapi codex`                | ~160 MB     |
| `hapi mcp`                  | ~140 MB     |
| **Total HAPI**              | **~1 GB**   |

Suggested Fixes

1. Add maximum reconnection attempts

```typescript
reconnectionAttempts: 50, // Stop after 50 failed attempts
```

2. Increase maximum reconnection delay (exponential backoff)

```typescript
reconnectionDelayMax: 30000, // 30 seconds max delay
```

3. Add reconnection failure handler

```typescript
// Note: in Socket.IO v3+, reconnection events are emitted by the
// Manager (socket.io), not by the socket itself.
this.socket.io.on('reconnect_failed', () => {
    logger.error('[API] Reconnection failed after max attempts')
    // Optionally: notify the user, trigger graceful shutdown, or restart
})
```
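Taken together, fixes 1–3 amount to a reconnection configuration along these lines (the values are illustrative, not tuned; `randomizationFactor` is Socket.IO's built-in jitter option):

```typescript
// Illustrative Socket.IO reconnection options combining fixes 1-3.
// All values are placeholders to be tuned for real deployments.
const reconnectOptions = {
    reconnection: true,
    reconnectionAttempts: 50,     // finite, instead of Infinity
    reconnectionDelay: 1000,      // 1 s initial delay
    reconnectionDelayMax: 30000,  // cap backoff at 30 s, instead of 5 s
    randomizationFactor: 0.5      // jitter to avoid synchronized retry storms
}
```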

4. Consider using the existing `withRetry` utility

The codebase already has a well-designed retry utility in `cli/src/utils/time.ts` with proper exponential backoff and max attempts. Consider using similar logic for Socket.IO reconnection.
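The exact signature of `withRetry` isn't shown here, but the core of any such utility is a bounded, capped exponential backoff schedule; a hypothetical sketch:

```typescript
// Hypothetical capped exponential backoff, similar in spirit to a
// withRetry-style utility: the delay doubles per attempt up to a ceiling.
function backoffDelay(attempt: number, baseMs = 1000, maxMs = 30000): number {
    return Math.min(baseMs * 2 ** attempt, maxMs)
}

// With the defaults, attempts 0..5 wait 1 s, 2 s, 4 s, 8 s, 16 s, 30 s.
```

Contrast this with the current configuration, where the 5-second cap means the client settles into an attempt every ~5 seconds indefinitely.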

5. Add memory monitoring

Consider adding a memory watchdog that logs warnings when memory usage exceeds thresholds and takes corrective action before OOM.
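A minimal watchdog sketch using Node's `process.memoryUsage()` (the threshold, poll interval, and log format are illustrative placeholders):

```typescript
// Minimal memory watchdog sketch: warns when resident set size crosses
// a threshold. Threshold, interval, and the corrective action taken are
// illustrative and would need tuning for HAPI's real footprint.
const MEMORY_LIMIT_BYTES = 1.5 * 1024 * 1024 * 1024 // 1.5 GB, illustrative

function checkMemory(limit = MEMORY_LIMIT_BYTES): boolean {
    const { rss } = process.memoryUsage()
    if (rss > limit) {
        console.warn(`[watchdog] RSS ${Math.round(rss / 1024 / 1024)} MB exceeds limit`)
        return true
    }
    return false
}

// Poll periodically; unref() so the timer does not keep the process alive.
setInterval(() => checkMemory(), 60_000).unref()
```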

Logs

Full HAPI log file: `~/.hapi/logs/2026-01-18-01-44-23-pid-651323.log` (4.7 MB)

Steps to Reproduce

  1. Start HAPI services (`hapi server`, `hapi claude --yolo`, `hapi daemon start-sync`)
  2. Run a resource-intensive task (e.g., large code review with Codex)
  3. Simulate network instability or server unavailability causing Socket.IO disconnection
  4. Wait for prolonged reconnection failures (30+ minutes)
  5. Observe memory pressure and eventual OOM

Impact

  • Long-running HAPI services are at risk of OOM crashes
  • Requires manual restart after crash
  • Can cause data loss for in-progress tasks

Thank you for developing this excellent tool! Hope this detailed report helps identify and fix the issue. 🙏
