@xingdi-eric-yuan xingdi-eric-yuan commented Jan 8, 2026

Fix nohup commands not returning immediately with timeout wrapper

Problem

When nohup commands with background execution (&) are run through the timeout wrapper in Docker/Kubernetes terminals without a TTY, the timeout command doesn't return immediately. As a result, the requests package setup in SWE-Bench hung for the full 300-second timeout when starting gunicorn servers.

Root Cause - Bash Job Control in Non-TTY Mode

timeout 300 /bin/bash -c 'nohup gunicorn ... > /dev/null 2>&1 &'

In non-TTY mode:

  • The timeout command monitors the bash shell it launches
  • Even with &, bash's job control tracks the backgrounded process
  • Bash waits for background jobs to fully detach before exiting
  • This waiting behavior persists even with output redirection and setsid
  • Results in the timeout wrapper waiting unnecessarily

Why TTY matters:

  • With TTY: Background processes detach cleanly from the controlling terminal
  • Without TTY (docker exec, kubectl exec): Bash job control behaves differently

Solution - Use Subshell (...)

The fix wraps the command in a bash subshell (...), which creates a subprocess that exits immediately after backgrounding the command:

Before (insufficient):

self.terminal.run("nohup gunicorn ... > /dev/null 2>&1 &")

After (correct):

self.terminal.run("(nohup gunicorn ... > /dev/null 2>&1 &)")

Why this works:

  • The (...) creates a subshell that is a separate process
  • The subshell backgrounds the nohup command and exits immediately
  • The parent bash shell (inside timeout) sees the subshell exit with status 0
  • The timeout wrapper returns as soon as its direct child (the bash -c shell) exits, which happens right after the subshell exits
  • The background process keeps running after the subshell exits; it is orphaned and reparented (typically to PID 1 in the container)
  • Simpler syntax than sh -c with equivalent behavior
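
The command shape described above can be tried locally with a short Python sketch. This is only an illustration under assumptions: `run_with_timeout` is a hypothetical helper (not part of debug_gym), `sleep 3` stands in for the gunicorn server, and a local subprocess does not reproduce the non-TTY docker exec / kubectl exec environment where the hang was observed.

```python
import subprocess
import time


def run_with_timeout(command: str, timeout_s: int = 5) -> tuple[int, float]:
    """Run `timeout <s> /bin/bash -c <command>` and return (exit status, seconds).

    Hypothetical helper that mirrors the wrapper shape shown in the PR.
    """
    start = time.monotonic()
    result = subprocess.run(
        ["timeout", str(timeout_s), "/bin/bash", "-c", command],
        stdin=subprocess.DEVNULL,
    )
    return result.returncode, time.monotonic() - start


# `sleep 3` is a stand-in for the long-running server process.
returncode, elapsed = run_with_timeout("(nohup sleep 3 > /dev/null 2>&1 &)")
print(f"exit status {returncode}, returned after {elapsed:.2f}s")
```

The call should come back well before the 3-second sleep finishes, because bash only waits for the subshell, and the subshell exits as soon as it has started the background command.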

Changes

File | Changes
debug_gym/gym/envs/swe_bench.py | Use (...) subshell wrapper for 2 nohup gunicorn commands
tests/gym/terminals/test_docker.py | Added test with warmup to verify subshell fix
tests/gym/terminals/test_kubernetes.py | Added test with warmup to verify subshell fix

Commit history:

  1. Initial fix with output redirection (insufficient for non-TTY mode)
  2. Tried setsid approach (still had 5+ second delay due to bash job control)
  3. Tried sh -c approach (3+ second delay, improved but still not optimal)
  4. Final fix with (...) subshell + warmup commands to exclude startup overhead

Testing

Added tests verifying that the (...) subshell with nohup works in non-TTY Docker/Kubernetes exec:

Test: test_*_nohup_with_subshell_returns_immediately

  • ✅ Warms up terminal with dummy command to exclude container/pod startup time
  • ✅ Verifies (nohup ... > /dev/null 2>&1 &) returns in < 2 seconds
  • ✅ Confirms background process actually starts and runs
  • ✅ Tests that subshell approach avoids bash job control delays

Test: test_*_nohup_without_redirection_may_timeout

  • ⚠️ Demonstrates the original problem
  • ⚠️ Shows commands without proper wrapping hit the timeout
  • ⚠️ Provides regression testing

Each test:

  • Runs in non-TTY mode (via Docker/Kubernetes exec)
  • Warms up the terminal first to isolate command execution time from startup overhead
  • Measures execution time with time.time()
  • Verifies background processes are running using pgrep
  • Properly cleans up with pkill
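
The test pattern listed above (warm up, time the command, confirm the process is alive, clean up) might look roughly like the following sketch. It is not the PR's test code: `run`, `is_alive`, and `check_nohup_returns_immediately` are hypothetical names, a local bash subprocess stands in for the Docker/Kubernetes terminal, and a PID file with `os.kill` is used in place of pgrep/pkill so the sketch has no dependency on procps.

```python
import os
import shlex
import signal
import subprocess
import tempfile
import time


def run(command: str) -> float:
    """Run a command through bash and return the elapsed wall-clock seconds."""
    start = time.time()
    subprocess.run(["/bin/bash", "-c", command], stdin=subprocess.DEVNULL, check=True)
    return time.time() - start


def is_alive(pid: int) -> bool:
    """Signal 0 only checks that the process exists; nothing is delivered."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    return True


def check_nohup_returns_immediately() -> tuple[float, bool]:
    # Warm-up command, so startup overhead is not counted in the measurement.
    run("echo warmup > /dev/null")
    with tempfile.TemporaryDirectory() as tmp:
        pid_file = shlex.quote(os.path.join(tmp, "bg.pid"))
        # `sleep 30` stands in for the server; $! is the PID of the background job.
        elapsed = run(f"(nohup sleep 30 > /dev/null 2>&1 & echo $! > {pid_file})")
        with open(os.path.join(tmp, "bg.pid")) as fh:
            pid = int(fh.read().strip())
    alive = is_alive(pid)
    if alive:
        os.kill(pid, signal.SIGTERM)  # clean up the background process
    return elapsed, alive


elapsed, alive = check_nohup_returns_immediately()
print(f"returned in {elapsed:.2f}s, background process alive: {alive}")
```

The 2-second threshold from the PR's tests would then be an assertion on `elapsed`, with `alive` confirming that the background process actually started.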

Impact

  • ✅ Eliminates 300-second hangs when setting up requests package environments
  • ✅ Works correctly in non-TTY execution contexts (docker exec, kubectl exec)
  • ✅ Bypasses bash job control issues
  • ✅ Applies to all SWE-Bench tasks that use the requests package
  • ✅ Simpler syntax than sh -c or setsid
  • ✅ Tests account for startup overhead to accurately measure performance

Technical Note

This approach uses bash subshells for clean process separation:

  • (...) - Creates subshell that exits immediately after backgrounding
  • nohup - Makes the command ignore SIGHUP
  • > /dev/null 2>&1 - Redirects stdout/stderr to /dev/null so the process no longer holds the shell's output file descriptors
  • & - Backgrounds the process within the subshell

The key insight is that subshells (...) create separate processes that don't participate in the parent shell's job control, allowing immediate return.
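
The role of the redirection component can be seen in isolation with a small local experiment. This is a sketch under assumptions: `timed` is a hypothetical helper, and a Python pipe stands in for whatever stream docker exec / kubectl exec attach to the command. It illustrates only the file-descriptor part, which the PR reports was not sufficient on its own in the non-TTY environment.

```python
import subprocess
import time


def timed(command: str) -> float:
    """Run a command with stdout captured through a pipe; return elapsed seconds.

    subprocess.run reads the pipe until EOF, so it cannot return while any
    process still holds the write end open.
    """
    start = time.monotonic()
    subprocess.run(
        ["/bin/bash", "-c", command],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
    )
    return time.monotonic() - start


# The background sleep inherits the pipe, so the reader waits for it to exit.
inherited = timed("sleep 2 &")

# With stdout/stderr pointed at /dev/null, bash is the only writer on the pipe.
redirected = timed("sleep 2 > /dev/null 2>&1 &")

print(f"inherited stdout: {inherited:.2f}s, redirected: {redirected:.2f}s")
```

In the first call the reader is held for about as long as the background process lives; in the second it returns as soon as bash exits.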

Related

Fixes timeout issue reported in #325 (screenshot shows the 300s timeout being triggered in Kubernetes exec without TTY)


🤖 Generated with Claude Code

xingdi-eric-yuan and others added 3 commits January 8, 2026 14:49
When nohup commands with background execution (&) are run through the
timeout wrapper in Docker/Kubernetes terminals, the timeout command
doesn't return immediately because background processes inherit the
shell's stdout/stderr file descriptors.

This fix adds proper output redirection (> /dev/null 2>&1) to the
gunicorn nohup commands in SWE-Bench setup, ensuring the timeout
wrapper returns immediately after the shell exits instead of waiting
for the full timeout period.

Also adds comprehensive tests for both Docker and Kubernetes terminals
to verify nohup commands with proper redirection return immediately.

Fixes issue reported in #325 where requests package setup would hang
for 300 seconds when starting background gunicorn servers.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

The previous fix using only output redirection was insufficient because
in non-TTY mode, the timeout command monitors the entire process group,
not just file descriptors. Even with > /dev/null 2>&1, backgrounded
processes remain in the same process group as the shell.

Using setsid creates a new session, completely detaching the process
from timeout's process group. This ensures the timeout-wrapped command
returns immediately after the shell exits, even in non-TTY execution
contexts like docker exec and kubectl exec.

Changes:
- Added setsid before nohup gunicorn commands
- Updated test names and documentation to reflect setsid usage
- Tests verify processes detach properly in non-TTY mode

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@MarcCote MarcCote merged commit 200d308 into main Jan 9, 2026
8 checks passed
@MarcCote MarcCote deleted the timeout_nohup branch January 9, 2026 14:51