
SWT bench: Evaluation Jobs Stuck in Infinite Conda Dependency Resolution Loop #378

@juanmichelini

Description

🚨 Critical: Evaluation Jobs Stuck in Infinite Conda Dependency Resolution Loop

Summary

Multiple long-running evaluation jobs are getting stuck during the evaluation phase when conda-env create enters an infinite dependency resolution loop, consuming 99.9% CPU and preventing jobs from completing despite successful inference.

Affected Pods

| Pod | Model | Runtime | Status | Stuck Containers |
| --- | --- | --- | --- | --- |
| eval-eval-21371512858-gpt-5-2-co-kdvhp | gpt-5.2-codex | 42h+ | ❌ Killed | 1+ (recurring) |
| eval-21413268827-nemotron-3-9jmwh | NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 | 42h+ | ⚠️ Stuck | 2 containers |

Pattern: All affected pods are running the SWT-bench_Verified benchmark.


Problem Description

Timeline

  1. Inference Phase: Completes successfully (432/432 instances)
  2. Output Generation: All JSONL files created correctly
  3. ⚠️ Evaluation Phase: swtbench-eval starts processing test instances
  4. 🚨 Stuck: Some test instances require creating conda environments that enter infinite dependency resolution

Symptoms

Stuck Process Pattern:

USER    PID    %CPU  %MEM      VSZ      RSS  COMMAND
root     13    99.9   9.6  12.4GB   11.9GB  /opt/miniconda3/bin/python /opt/miniconda3/bin/conda-env create --file environment.yml
  • CPU Usage: Stuck at 99.9% indefinitely
  • Memory: Consumes 8-12 GB per stuck process
  • Duration: Continues for hours without progress
  • Status: Process is in 'R' (running) state but making no progress
  • Behavior: Conda's dependency resolver enters an infinite loop and never completes or fails (a quick check follows this list)
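
One way to confirm this pattern on a live pod is to inspect the suspect PID directly. The sketch below is illustrative: the pod/container names are placeholders, and PID 13 is simply the PID observed in the affected containers.

# Show state, elapsed time, cumulative CPU time, and memory for the suspect process
kubectl exec -n evaluation-jobs <POD_NAME> -c eval-container -- \
  ps -o pid,stat,etime,time,pcpu,rss,args -p 13

A stuck resolver shows state R, ~100 %CPU, and a TIME value that keeps growing while the evaluation log stays silent.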

Multiple Stuck Containers:

  • Evaluation harness retries failed environment setups
  • Each retry spawns new Docker container
  • New container immediately gets stuck in same conda loop
  • Results in 2+ concurrent stuck processes per pod (the listing below shows how to enumerate them)
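
Because the harness runs Docker-in-Docker, the accumulating retry containers can be enumerated from inside the eval container. A rough listing, assuming the docker CLI is available there (as the dind setup noted under Additional Context suggests):

# List evaluation containers with their uptimes; several long-lived entries indicate stuck retries
kubectl exec -n evaluation-jobs <POD_NAME> -c eval-container -- \
  docker ps --format "table {{.ID}}\t{{.RunningFor}}\t{{.Status}}"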

Impact

Resource Waste

  • CPU: ~200% CPU per pod (2 stuck processes × 99.9% each)
  • Memory: ~20 GB per pod
  • Time: Jobs running 40+ hours when they should complete in <10 hours
  • Cluster: Multiple pods affected simultaneously

Data Loss Risk

  • Inference work is complete but inaccessible
  • Pods must be killed to free resources
  • Evaluation metrics cannot be calculated

Root Cause Analysis

Problem: Specific SWT-bench test instances have environment.yml files with complex or conflicting dependencies that conda's SAT solver cannot resolve efficiently.

Why Conda Gets Stuck:

  • Conda's dependency resolver uses a SAT solver
  • Complex dependency graphs can cause exponential search space
  • Some SWT-bench instances have legacy/conflicting requirements
  • No timeout mechanism in conda-env create
  • Process never fails, just loops indefinitely

Why It Keeps Retrying:

  • Evaluation harness expects environment creation to eventually succeed or fail
  • Conda process never returns error code (still "running")
  • Harness kills container after timeout and retries
  • The same problematic environment.yml gets retried indefinitely (a capped-retry sketch follows this list)
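
The harness itself was not modified for this report; the following is only an illustration of what a capped-retry wrapper could look like, where run_instance_eval is a hypothetical stand-in for whatever per-instance command the harness runs:

# Hypothetical wrapper: at most 3 attempts per instance, then record it as skipped
MAX_ATTEMPTS=3
for attempt in $(seq 1 "$MAX_ATTEMPTS"); do
  if timeout 300 run_instance_eval "$INSTANCE_ID"; then
    break
  fi
  echo "Attempt $attempt failed for instance $INSTANCE_ID"
  if [ "$attempt" -eq "$MAX_ATTEMPTS" ]; then
    echo "$INSTANCE_ID" >> skipped_instances.txt
  fi
done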

Evidence

Pod: eval-eval-21371512858-gpt-5-2-co-kdvhp

Status: Killed after 42 hours

Observations:

  • Inference completed: 432/432 instances
  • Stuck process: PID 13 in container, 99.9% CPU for 4+ hours
  • Container killed, respawned, immediately stuck again
  • Multiple kill/respawn cycles before pod termination

Diagnostics: Archived in eval-21371512858-gpt-5-2-codex-COMPLETE.tar.gz (304 MB)

Pod: eval-21413268827-nemotron-3-9jmwh

Status: Currently stuck (42+ hours)

Two Stuck Containers:

Container 0b11aeaf7f3b:

  • Created: 2026-01-29 10:28:10 UTC
  • PID 13 at 99.9% CPU
  • Memory: 11.9 GB
  • Duration: 4.5+ hours stuck

Container 39de478efa10:

  • Created: 2026-01-29 12:50:17 UTC
  • PID 12 at 99.9% CPU
  • Memory: 8.8 GB
  • Duration: 2.2+ hours stuck

Process Details:

# Inside container 0b11aeaf7f3b
root  13  99.9  9.6  12386412  11929916  ?  R  10:28  273:14  /opt/miniconda3/bin/python /opt/miniconda3/bin/conda-env create --file environment.yml

Diagnostics: Available in eval-21413268827-nemotron-3-9jmwh_logs/ (969 MB)


Reproduction Steps

  1. Start SWT-bench_Verified evaluation job with any model
  2. Wait for inference to complete (works fine)
  3. Wait for evaluation phase to start
  4. Monitor for stuck conda-env create processes
  5. Eventually 1-2 containers will get stuck at 99.9% CPU

Consistency: Observed in 2/2 SWT-bench_Verified jobs monitored over 40+ hours each.


Proposed Solutions

Option 1: Add Timeout to Conda Environment Creation (RECOMMENDED)

Implementation:

# In evaluation harness, replace:
conda-env create --file environment.yml

# With timeout wrapper:
timeout 300 conda-env create --file environment.yml || {
  echo "Conda environment creation timed out after 5 minutes"
  echo "Trying with mamba as fallback..."
  timeout 300 mamba env create --file environment.yml || {
    echo "Environment creation failed, skipping instance"
    exit 1
  }
}
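
One caveat: if conda traps SIGTERM for cleanup, a solver busy in native code may not exit promptly when the soft timeout fires. With GNU coreutils timeout, --kill-after guarantees a hard stop:

# Hard-kill 30 seconds after the 5-minute soft timeout if the process ignores SIGTERM
timeout --kill-after=30 300 conda-env create --file environment.yml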

Benefits:

  • Prevents infinite loops
  • Allows harness to skip problematic instances
  • Provides clear failure signal

Option 2: Use Mamba Instead of Conda

Rationale:

  • Mamba uses libsolv (faster SAT solver)
  • Typically 10-100x faster than conda
  • More robust with complex dependencies
  • Drop-in replacement for conda

Implementation:

# Install mamba in eval container base image
mamba env create --file environment.yml
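
The eval image build is not part of this report, so the following is only a sketch of what the base-image change could look like, assuming conda-forge is reachable at build time:

# Option A: install mamba alongside conda and call "mamba env create" in the harness
conda install -n base -c conda-forge mamba

# Option B: keep the existing conda-env commands but switch conda's solver backend (requires conda >= 22.11)
conda install -n base -c conda-forge conda-libmamba-solver
conda config --set solver libmamba

Option B has the advantage of requiring no changes to the harness commands themselves.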

Option 3: Pre-create Problematic Environments

Steps:

  1. Identify which instances have problematic environments
  2. Pre-build and cache these environments (see the sketch after this list)
  3. Skip conda resolution during evaluation
  4. Use cached environments
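
As a sketch of steps 2-4, the environment could be built once on a machine where resolution succeeds and shipped as an archive. conda-pack is an extra dependency, not something present in the current image, so this is an assumption about tooling rather than the existing setup:

# One-time build and packaging (on a host where resolution completes)
conda env create -n swt_<INSTANCE_ID> --file environment.yml
conda install -n base -c conda-forge conda-pack
conda pack -n swt_<INSTANCE_ID> -o swt_<INSTANCE_ID>.tar.gz

# At evaluation time: unpack the cached archive instead of resolving
mkdir -p /opt/envs/swt_<INSTANCE_ID>
tar -xzf swt_<INSTANCE_ID>.tar.gz -C /opt/envs/swt_<INSTANCE_ID>
source /opt/envs/swt_<INSTANCE_ID>/bin/activate
conda-unpack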

Option 4: Skip Environment Creation for Known Bad Instances

Implementation:

  • Maintain allowlist/denylist of instance IDs
  • Skip evaluation for instances with known-bad environments (sketched after this list)
  • Log skip reason for metrics
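
A minimal form of the denylist check, with known_bad_instances.txt as a hypothetical file maintained alongside the harness:

# Skip instances whose environment.yml is known to hang the resolver
DENYLIST=known_bad_instances.txt
if grep -qxF "$INSTANCE_ID" "$DENYLIST"; then
  echo "Skipping $INSTANCE_ID: environment known to hang conda resolution" >> skipped_instances.log
  exit 0
fi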

Immediate Workaround

For Stuck Jobs:

# Find stuck containers
kubectl exec -n evaluation-jobs <POD_NAME> -c eval-container -- \
  ps aux --sort=-%cpu | grep "conda-env create" | grep -v grep

# Kill specific Docker container
kubectl exec -n evaluation-jobs <POD_NAME> -c eval-container -- \
  docker kill <CONTAINER_ID>

# Or kill entire pod to free resources
kubectl delete pod -n evaluation-jobs <POD_NAME>

Note: Killing containers only provides temporary relief; the harness respawns them and they get stuck again. Killing the pod is the only way to fully stop the retry loop.


Monitoring

Detection Script:

# Check all eval pods for stuck conda processes
for pod in $(kubectl get pods -n evaluation-jobs -o name | grep eval-); do
  echo "=== $pod ==="
  kubectl exec -n evaluation-jobs ${pod#pod/} -c eval-container -- \
    ps aux --sort=-%cpu | head -5 | grep -E "conda-env|PID"
done

Indicators of Stuck Job:

  • Process at 99.9% CPU for >10 minutes
  • Memory usage stable (not growing)
  • CPU time growing but no log output (see the check below)
  • Multiple containers with same pattern
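
These indicators can be turned into a rough automated check by watching cumulative CPU time (the TIME column) rather than instantaneous load; the 10-minute threshold above is a starting point, not a measured cutoff.

# Inspect cumulative CPU time for conda-env processes; values that keep climbing with no new log output match the stuck pattern
kubectl exec -n evaluation-jobs <POD_NAME> -c eval-container -- \
  ps -eo pid,etime,time,pcpu,args | grep -E "conda-env|TIME" | grep -v grep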

Related Issues

This may be related to:

  • SWT-bench dataset quality (some instances may have unsolvable environments)
  • Conda SAT solver performance with legacy Python packages
  • Lack of timeout mechanisms in evaluation harness
  • Docker container retry logic

Request for Action

  1. Immediate: Add timeout wrapper to conda environment creation
  2. Short-term: Switch to mamba for faster dependency resolution
  3. Long-term: Identify and fix/skip problematic SWT-bench instances
  4. Monitoring: Add detection for stuck processes in eval dashboard

Additional Context

  • Benchmark: SWT-bench_Verified (princeton-nlp/SWE-bench_Verified)
  • Namespace: evaluation-jobs
  • Image: Python 3.12-slim with conda/miniconda3
  • Workers: 24 parallel workers
  • Container Runtime: Docker-in-Docker (dind)

All diagnostic logs, output files, and process snapshots are archived and available for analysis.
