🚨 Critical: Evaluation Jobs Stuck in Infinite Conda Dependency Resolution Loop
Summary
Multiple long-running evaluation jobs are getting stuck during the evaluation phase when conda-env create enters an infinite dependency resolution loop, consuming 99.9% CPU and preventing jobs from completing despite successful inference.
Affected Pods
| Pod | Model | Runtime | Status | Stuck Containers |
|---|---|---|---|---|
| eval-eval-21371512858-gpt-5-2-co-kdvhp | gpt-5.2-codex | 42h+ | ❌ Killed | 1+ (recurring) |
| eval-21413268827-nemotron-3-9jmwh | NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 | 42h+ | 🚨 Stuck | 2 |
Pattern: All affected pods are running the SWT-bench_Verified benchmark.
Problem Description
Timeline
- ✅ Inference Phase: Completes successfully (432/432 instances)
- ✅ Output Generation: All JSONL files created correctly
- ⚠️ Evaluation Phase: `swtbench-eval` starts processing test instances
- 🚨 Stuck: Some test instances require creating conda environments that enter infinite dependency resolution
Symptoms
Stuck Process Pattern:
```
USER PID %CPU %MEM VSZ    RSS    COMMAND
root 13  99.9 9.6  12.4GB 11.9GB /opt/miniconda3/bin/python /opt/miniconda3/bin/conda-env create --file environment.yml
```
- CPU Usage: Stuck at 99.9% indefinitely
- Memory: Consumes 8-12 GB per stuck process
- Duration: Continues for hours without progress
- Status: Process is in 'R' (running) state but making no progress
- Behavior: Conda dependency resolver enters infinite loop, never completes or fails
Multiple Stuck Containers:
- Evaluation harness retries failed environment setups
- Each retry spawns new Docker container
- New container immediately gets stuck in same conda loop
- Results in 2+ concurrent stuck processes per pod
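Because the eval containers run under Docker-in-Docker (see Additional Context), the pile-up can be confirmed by listing containers from inside the pod. A possible check, reusing the pod/container names from the Immediate Workaround section below:

```bash
# List eval containers inside the pod's dind daemon to spot respawned
# containers accumulating alongside the stuck ones
kubectl exec -n evaluation-jobs <POD_NAME> -c eval-container -- \
  docker ps --format 'table {{.ID}}\t{{.CreatedAt}}\t{{.Status}}\t{{.Command}}'
```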
Impact
Resource Waste
- CPU: ~200% CPU per pod (2 stuck processes × 99.9% each)
- Memory: ~20 GB per pod
- Time: Jobs running 40+ hours when they should complete in <10 hours
- Cluster: Multiple pods affected simultaneously
Data Loss Risk
- Inference work is complete but inaccessible
- Pods must be killed to free resources
- Evaluation metrics cannot be calculated
Root Cause Analysis
Problem: Specific SWT-bench test instances have environment.yml files with complex or conflicting dependencies that conda's SAT solver cannot resolve efficiently.
Why Conda Gets Stuck:
- Conda's dependency resolver uses a SAT solver
- Complex dependency graphs can cause exponential search space
- Some SWT-bench instances have legacy/conflicting requirements
- No timeout mechanism in `conda-env create`
- Process never fails, just loops indefinitely
Why It Keeps Retrying:
- Evaluation harness expects environment creation to eventually succeed or fail
- Conda process never returns error code (still "running")
- Harness kills container after timeout and retries
- Same problematic environment.yml gets retried indefinitely
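One way to break this cycle in the harness is to bound retries per instance. A minimal sketch follows; `$INSTANCE_ID`, the retry limit, and the skip-list file are illustrative assumptions, not existing harness options (see also Option 1 below for the timeout itself):

```bash
# Hypothetical guard: stop retrying an instance after a few failed or
# timed-out environment builds instead of respawning containers forever
MAX_RETRIES=3
attempt=0
until timeout 300 conda-env create --file environment.yml; do
  attempt=$((attempt + 1))
  if [ "$attempt" -ge "$MAX_RETRIES" ]; then
    echo "$INSTANCE_ID" >> skipped_instances.txt  # record for later triage
    break
  fi
done
```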
Evidence
Pod: eval-eval-21371512858-gpt-5-2-co-kdvhp
Status: Killed after 42 hours
Observations:
- Inference completed: 432/432 instances
- Stuck process: PID 13 in container, 99.9% CPU for 4+ hours
- Container killed, respawned, immediately stuck again
- Multiple kill/respawn cycles before pod termination
Diagnostics: Archived in eval-21371512858-gpt-5-2-codex-COMPLETE.tar.gz (304 MB)
Pod: eval-21413268827-nemotron-3-9jmwh
Status: Currently stuck (42+ hours)
Two Stuck Containers:
Container 0b11aeaf7f3b:
- Created: 2026-01-29 10:28:10 UTC
- PID 13 at 99.9% CPU
- Memory: 11.9 GB
- Duration: 4.5+ hours stuck
Container 39de478efa10:
- Created: 2026-01-29 12:50:17 UTC
- PID 12 at 99.9% CPU
- Memory: 8.8 GB
- Duration: 2.2+ hours stuck
Process Details:
```
# Inside container 0b11aeaf7f3b
root 13 99.9 9.6 12386412 11929916 ? R 10:28 273:14 /opt/miniconda3/bin/python /opt/miniconda3/bin/conda-env create --file environment.yml
```
Diagnostics: Available in eval-21413268827-nemotron-3-9jmwh_logs/ (969 MB)
Reproduction Steps
- Start SWT-bench_Verified evaluation job with any model
- Wait for inference to complete (works fine)
- Wait for evaluation phase to start
- Monitor for stuck `conda-env create` processes
- Eventually 1-2 containers will get stuck at 99.9% CPU
Consistency: Observed in 2/2 SWT-bench_Verified jobs monitored over 40+ hours each.
Proposed Solutions
Option 1: Add Timeout to Conda Environment Creation (RECOMMENDED)
Implementation:
```bash
# In evaluation harness, replace:
conda-env create --file environment.yml

# With timeout wrapper:
timeout 300 conda-env create --file environment.yml || {
  echo "Conda environment creation timed out after 5 minutes"
  echo "Trying with mamba as fallback..."
  timeout 300 mamba env create --file environment.yml || {
    echo "Environment creation failed, skipping instance"
    exit 1
  }
}
```
Benefits:
- Prevents infinite loops
- Allows harness to skip problematic instances
- Provides clear failure signal
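If the stuck solver ever ignores SIGTERM, GNU timeout's `--kill-after` flag adds a hard SIGKILL as a backstop. A variant of the wrapper above could also record which instance timed out, feeding Option 4; the skip-list file and `$INSTANCE_ID` below are illustrative:

```bash
# Escalate to SIGKILL 30s after the 5-minute SIGTERM, and log the instance
timeout --kill-after=30 300 conda-env create --file environment.yml || {
  echo "$INSTANCE_ID" >> conda_timeout_instances.txt  # hypothetical skip-list
  echo "Environment creation timed out or failed for $INSTANCE_ID" >&2
  exit 1
}
```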
Option 2: Use Mamba Instead of Conda
Rationale:
- Mamba uses libsolv (faster SAT solver)
- Typically 10-100x faster than conda
- More robust with complex dependencies
- Drop-in replacement for conda
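If mamba is not already present in the eval base image, one way to add it is sketched below, assuming the miniconda3 install at /opt/miniconda3 seen in the process listings (a sufficiently recent conda can alternatively switch its own solver via `conda config --set solver libmamba`):

```bash
# Add mamba to the base conda install during image build or container setup
/opt/miniconda3/bin/conda install -n base -c conda-forge mamba -y
```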
Implementation:
```bash
# Install mamba in eval container base image
mamba env create --file environment.yml
```
Option 3: Pre-create Problematic Environments
Steps:
- Identify which instances have problematic environments
- Pre-build and cache these environments
- Skip conda resolution during evaluation
- Use cached environments
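A rough sketch of the pre-build step; the instance list, directory layout, and 10-minute budget are assumptions about how such a cache could be organized:

```bash
# Pre-build environments for known-problematic instances ahead of evaluation,
# so the eval phase only activates cached envs instead of resolving them
while read -r instance; do
  env_file="environments/${instance}/environment.yml"  # assumed layout
  timeout 600 conda env create -n "cache-${instance}" -f "$env_file" \
    || echo "$instance" >> unresolvable_instances.txt
done < problematic_instances.txt
```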
Option 4: Skip Environment Creation for Known Bad Instances
Implementation:
- Maintain allowlist/denylist of instance IDs
- Skip evaluation for instances with known-bad environments
- Log skip reason for metrics
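A minimal denylist check (the file name and `$INSTANCE_ID` variable are hypothetical) could run just before environment creation:

```bash
# Skip instances whose environments are known to hang the solver,
# logging the reason so skips remain visible in metrics
if grep -qxF "$INSTANCE_ID" known_bad_environments.txt; then
  echo "SKIP $INSTANCE_ID: environment on denylist (conda resolution hangs)" >> eval_skips.log
  exit 0
fi
```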
Immediate Workaround
For Stuck Jobs:
```bash
# Find stuck containers
kubectl exec -n evaluation-jobs <POD_NAME> -c eval-container -- \
  ps aux --sort=-%cpu | grep "conda-env create" | grep -v grep

# Kill specific Docker container
kubectl exec -n evaluation-jobs <POD_NAME> -c eval-container -- \
  docker kill <CONTAINER_ID>

# Or kill entire pod to free resources
kubectl delete pod -n evaluation-jobs <POD_NAME>
```
Note: Killing containers only provides temporary relief - they will respawn and get stuck again. Killing the pod is the only way to fully stop the retry loop.
Monitoring
Detection Script:
```bash
# Check all eval pods for stuck conda processes
for pod in $(kubectl get pods -n evaluation-jobs -o name | grep eval-); do
  echo "=== $pod ==="
  kubectl exec -n evaluation-jobs ${pod#pod/} -c eval-container -- \
    ps aux --sort=-%cpu | head -5 | grep -E "conda-env|PID"
done
```
Indicators of Stuck Job:
- Process at 99.9% CPU for >10 minutes
- Memory usage stable (not growing)
- CPU time growing but no log output
- Multiple containers with same pattern
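These indicators can be checked automatically. One possible one-liner against a suspect pod (the 10-minute and 95% thresholds are arbitrary examples):

```bash
# Flag conda-env processes running >10 minutes at near-full CPU;
# etimes = elapsed seconds since process start, pcpu = CPU percentage
kubectl exec -n evaluation-jobs <POD_NAME> -c eval-container -- \
  ps -eo pid,etimes,pcpu,args --sort=-pcpu | \
  awk '/conda-env create/ && $2 > 600 && $3 > 95 {print "STUCK:", $0}'
```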
Related Issues
This may be related to:
- SWT-bench dataset quality (some instances may have unsolvable environments)
- Conda SAT solver performance with legacy Python packages
- Lack of timeout mechanisms in evaluation harness
- Docker container retry logic
Request for Action
- Immediate: Add timeout wrapper to conda environment creation
- Short-term: Switch to mamba for faster dependency resolution
- Long-term: Identify and fix/skip problematic SWT-bench instances
- Monitoring: Add detection for stuck processes in eval dashboard
Additional Context
- Benchmark: SWT-bench_Verified (princeton-nlp/SWE-bench_Verified)
- Namespace: `evaluation-jobs`
- Image: Python 3.12-slim with conda/miniconda3
- Workers: 24 parallel workers
- Container Runtime: Docker-in-Docker (dind)
All diagnostic logs, output files, and process snapshots are archived and available for analysis.