🚨 Critical: Evaluation Jobs Stuck in Infinite Conda Dependency Resolution Loop
Summary
Multiple long-running evaluation jobs are getting stuck during the evaluation phase when conda-env create enters an infinite dependency resolution loop, consuming 99.9% CPU and preventing jobs from completing despite successful inference.
Affected Pods
| Pod | Model | Runtime | Status | Stuck Containers |
|---|---|---|---|---|
| eval-eval-21371512858-gpt-5-2-co-kdvhp | gpt-5.2-codex | 42h+ | ❌ Killed | 1+ (recurring) |
| eval-21413268827-nemotron-3-9jmwh | NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 | 42h+ | 🚨 Stuck | 2 |
Pattern: All affected pods are running the SWT-bench_Verified benchmark.
Problem Description
Timeline
- ✅ Inference Phase: Completes successfully (432/432 instances)
- ✅ Output Generation: All JSONL files created correctly
- ⚠️ Evaluation Phase: `swtbench-eval` starts processing test instances
- 🚨 Stuck: Some test instances require creating conda environments that enter infinite dependency resolution
Symptoms
Stuck Process Pattern:
```
USER PID %CPU %MEM VSZ    RSS    COMMAND
root 13  99.9 9.6  12.4GB 11.9GB /opt/miniconda3/bin/python /opt/miniconda3/bin/conda-env create --file environment.yml
```
- CPU Usage: Stuck at 99.9% indefinitely
- Memory: Consumes 8-12 GB per stuck process
- Duration: Continues for hours without progress
- Status: Process is in 'R' (running) state but making no progress
- Behavior: Conda dependency resolver enters infinite loop, never completes or fails
Multiple Stuck Containers:
- Evaluation harness retries failed environment setups
- Each retry spawns new Docker container
- New container immediately gets stuck in same conda loop
- Results in 2+ concurrent stuck processes per pod
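Because the eval containers run under Docker-in-Docker (see Additional Context), the pile-up can be confirmed by listing containers from inside the pod. A possible check, reusing the pod/container names from the Immediate Workaround section below:

```bash
# List eval containers inside the pod's dind daemon to spot respawned
# containers accumulating alongside the stuck ones
kubectl exec -n evaluation-jobs <POD_NAME> -c eval-container -- \
  docker ps --format 'table {{.ID}}\t{{.CreatedAt}}\t{{.Status}}\t{{.Command}}'
```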
Impact
Resource Waste
- CPU: ~200% CPU per pod (2 stuck processes × 99.9% each)
- Memory: ~20 GB per pod
- Time: Jobs running 40+ hours when they should complete in <10 hours
- Cluster: Multiple pods affected simultaneously
Data Loss Risk
- Inference work is complete but inaccessible
- Pods must be killed to free resources
- Evaluation metrics cannot be calculated
Root Cause Analysis
Problem: Specific SWT-bench test instances have environment.yml files with complex or conflicting dependencies that conda's SAT solver cannot resolve efficiently.
Why Conda Gets Stuck:
- Conda's dependency resolver uses a SAT solver
- Complex dependency graphs can cause exponential search space
- Some SWT-bench instances have legacy/conflicting requirements
- No timeout mechanism in `conda-env create`
- Process never fails, just loops indefinitely
Why It Keeps Retrying:
- Evaluation harness expects environment creation to eventually succeed or fail
- Conda process never returns error code (still "running")
- Harness kills container after timeout and retries
- Same problematic environment.yml gets retried indefinitely
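One way to break this cycle in the harness is to bound retries per instance. A minimal sketch follows; `$INSTANCE_ID`, the retry limit, and the skip-list file are illustrative assumptions, not existing harness options (see also Option 1 below for the timeout itself):

```bash
# Hypothetical guard: stop retrying an instance after a few failed or
# timed-out environment builds instead of respawning containers forever
MAX_RETRIES=3
attempt=0
until timeout 300 conda-env create --file environment.yml; do
  attempt=$((attempt + 1))
  if [ "$attempt" -ge "$MAX_RETRIES" ]; then
    echo "$INSTANCE_ID" >> skipped_instances.txt  # record for later triage
    break
  fi
done
```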
Evidence
Pod: eval-eval-21371512858-gpt-5-2-co-kdvhp
Status: Killed after 42 hours
Observations:
- Inference completed: 432/432 instances
- Stuck process: PID 13 in container, 99.9% CPU for 4+ hours
- Container killed, respawned, immediately stuck again
- Multiple kill/respawn cycles before pod termination
Diagnostics: Archived in eval-21371512858-gpt-5-2-codex-COMPLETE.tar.gz (304 MB)
Pod: eval-21413268827-nemotron-3-9jmwh
Status: Currently stuck (42+ hours)
Two Stuck Containers:
Container 0b11aeaf7f3b:
- Created: 2026-01-29 10:28:10 UTC
- PID 13 at 99.9% CPU
- Memory: 11.9 GB
- Duration: 4.5+ hours stuck
Container 39de478efa10:
- Created: 2026-01-29 12:50:17 UTC
- PID 12 at 99.9% CPU
- Memory: 8.8 GB
- Duration: 2.2+ hours stuck
Process Details:
```
# Inside container 0b11aeaf7f3b
root 13 99.9 9.6 12386412 11929916 ? R 10:28 273:14 /opt/miniconda3/bin/python /opt/miniconda3/bin/conda-env create --file environment.yml
```
Diagnostics: Available in eval-21413268827-nemotron-3-9jmwh_logs/ (969 MB)
Reproduction Steps
- Start SWT-bench_Verified evaluation job with any model
- Wait for inference to complete (works fine)
- Wait for evaluation phase to start
- Monitor for stuck `conda-env create` processes
- Eventually 1-2 containers will get stuck at 99.9% CPU
Consistency: Observed in 2/2 SWT-bench_Verified jobs monitored over 40+ hours each.
Proposed Solutions
Option 1: Add Timeout to Conda Environment Creation (RECOMMENDED)
Implementation:
```bash
# In evaluation harness, replace:
conda-env create --file environment.yml

# With timeout wrapper:
timeout 300 conda-env create --file environment.yml || {
  echo "Conda environment creation timed out after 5 minutes"
  echo "Trying with mamba as fallback..."
  timeout 300 mamba env create --file environment.yml || {
    echo "Environment creation failed, skipping instance"
    exit 1
  }
}
```
Benefits:
- Prevents infinite loops
- Allows harness to skip problematic instances
- Provides clear failure signal
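If the stuck solver ever ignores SIGTERM, GNU timeout's `--kill-after` flag adds a hard SIGKILL as a backstop. A variant of the wrapper above could also record which instance timed out, feeding Option 4; the skip-list file and `$INSTANCE_ID` below are illustrative:

```bash
# Escalate to SIGKILL 30s after the 5-minute SIGTERM, and log the instance
timeout --kill-after=30 300 conda-env create --file environment.yml || {
  echo "$INSTANCE_ID" >> conda_timeout_instances.txt  # hypothetical skip-list
  echo "Environment creation timed out or failed for $INSTANCE_ID" >&2
  exit 1
}
```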
Option 2: Use Mamba Instead of Conda
Rationale:
- Mamba uses libsolv (faster SAT solver)
- Typically 10-100x faster than conda
- More robust with complex dependencies
- Drop-in replacement for conda
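If mamba is not already present in the eval base image, one way to add it is sketched below, assuming the miniconda3 install at /opt/miniconda3 seen in the process listings (a sufficiently recent conda can alternatively switch its own solver via `conda config --set solver libmamba`):

```bash
# Add mamba to the base conda install during image build or container setup
/opt/miniconda3/bin/conda install -n base -c conda-forge mamba -y
```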
Implementation:
```bash
# Install mamba in eval container base image
mamba env create --file environment.yml
```
Option 3: Pre-create Problematic Environments
Steps:
- Identify which instances have problematic environments
- Pre-build and cache these environments
- Skip conda resolution during evaluation
- Use cached environments
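A rough sketch of the pre-build step; the instance list, directory layout, and 10-minute budget are assumptions about how such a cache could be organized:

```bash
# Pre-build environments for known-problematic instances ahead of evaluation,
# so the eval phase only activates cached envs instead of resolving them
while read -r instance; do
  env_file="environments/${instance}/environment.yml"  # assumed layout
  timeout 600 conda env create -n "cache-${instance}" -f "$env_file" \
    || echo "$instance" >> unresolvable_instances.txt
done < problematic_instances.txt
```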
Option 4: Skip Environment Creation for Known Bad Instances
Implementation:
- Maintain allowlist/denylist of instance IDs
- Skip evaluation for instances with known-bad environments
- Log skip reason for metrics
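A minimal denylist check (the file name and `$INSTANCE_ID` variable are hypothetical) could run just before environment creation:

```bash
# Skip instances whose environments are known to hang the solver,
# logging the reason so skips remain visible in metrics
if grep -qxF "$INSTANCE_ID" known_bad_environments.txt; then
  echo "SKIP $INSTANCE_ID: environment on denylist (conda resolution hangs)" >> eval_skips.log
  exit 0
fi
```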
Immediate Workaround
For Stuck Jobs:
```bash
# Find stuck containers
kubectl exec -n evaluation-jobs <POD_NAME> -c eval-container -- \
  ps aux --sort=-%cpu | grep "conda-env create" | grep -v grep

# Kill specific Docker container
kubectl exec -n evaluation-jobs <POD_NAME> -c eval-container -- \
  docker kill <CONTAINER_ID>

# Or kill entire pod to free resources
kubectl delete pod -n evaluation-jobs <POD_NAME>
```
Note: Killing containers only provides temporary relief - they will respawn and get stuck again. Killing the pod is the only way to fully stop the retry loop.
Monitoring
Detection Script:
```bash
# Check all eval pods for stuck conda processes
for pod in $(kubectl get pods -n evaluation-jobs -o name | grep eval-); do
  echo "=== $pod ==="
  kubectl exec -n evaluation-jobs ${pod#pod/} -c eval-container -- \
    ps aux --sort=-%cpu | head -5 | grep -E "conda-env|PID"
done
```
Indicators of Stuck Job:
- Process at 99.9% CPU for >10 minutes
- Memory usage stable (not growing)
- CPU time growing but no log output
- Multiple containers with same pattern
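These indicators can be checked automatically. One possible one-liner against a suspect pod (the 10-minute and 95% thresholds are arbitrary examples):

```bash
# Flag conda-env processes running >10 minutes at near-full CPU;
# etimes = elapsed seconds since process start, pcpu = CPU percentage
kubectl exec -n evaluation-jobs <POD_NAME> -c eval-container -- \
  ps -eo pid,etimes,pcpu,args --sort=-pcpu | \
  awk '/conda-env create/ && $2 > 600 && $3 > 95 {print "STUCK:", $0}'
```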
Related Issues
This may be related to:
- SWT-bench dataset quality (some instances may have unsolvable environments)
- Conda SAT solver performance with legacy Python packages
- Lack of timeout mechanisms in evaluation harness
- Docker container retry logic
Request for Action
- Immediate: Add timeout wrapper to conda environment creation
- Short-term: Switch to mamba for faster dependency resolution
- Long-term: Identify and fix/skip problematic SWT-bench instances
- Monitoring: Add detection for stuck processes in eval dashboard
Additional Context
- Benchmark: SWT-bench_Verified (princeton-nlp/SWE-bench_Verified)
- Namespace: `evaluation-jobs`
- Image: Python 3.12-slim with conda/miniconda3
- Workers: 24 parallel workers
- Container Runtime: Docker-in-Docker (dind)
All diagnostic logs, output files, and process snapshots are archived and available for analysis.