Security and Code Quality Improvements #327

xingdi-eric-yuan · 2026-01-02T16:25:39Z

AFTER #325

Security and Code Quality Improvements

Summary

This PR addresses security vulnerabilities and code quality issues identified during a comprehensive codebase review. The changes focus on preventing command injection, fixing resource leaks, and improving input validation across LLM providers and terminal operations.

Critical Security Fixes

Command Injection Prevention

workspace.py: Added shlex.quote() to all shell command path interpolations (cat, mkdir, echo, tree commands)
local.py: Fixed cp -r command to quote source and target paths
swe_bench_debug.py: Fixed git commit message injection vulnerability
r2egym.py: Added shell quoting for all path variables in symlink and find commands

Path Traversal Fix

workspace.py: Fixed path validation to use proper prefix matching (==path/* instead of ==path*) preventing escape via paths like /testbed_evil/

High Priority Fixes

HTTP Client Timeouts

openai.py: Added 300-second timeout to prevent indefinite hangs
copilot.py: Added 300-second timeout for HTTP client, 30-second timeout for Node.js subprocess

Input Validation

anthropic.py: Added hasattr() validation before accessing response attributes
copilot.py: Improved .env file parsing with proper quote handling

Medium Priority Improvements

Resource Management

base.py: Added base close() method for HTTP client cleanup
openai.py, anthropic.py: Implemented close() methods to properly release HTTP clients
azure_openai.py, copilot.py: Added close() methods that call super().close() and invalidate caches

Memory Management

openai.py: Added LRU-cached tiktoken encoder (maxsize=10)
huggingface.py: Added LRU-cached HuggingFace tokenizer (maxsize=5)
openai.py: Added 1MB size limit for JSON payload parsing

Input Validation

simple_agent.py: Added tool name validation against registered tools before execution
workspace.py: Added bounds validation for max_depth parameter (1-20)

Files Changed

debug_gym/gym/workspace.py
debug_gym/gym/terminals/local.py
debug_gym/gym/envs/swe_bench_debug.py
debug_gym/gym/envs/r2egym.py
debug_gym/llms/base.py
debug_gym/llms/openai.py
debug_gym/llms/anthropic.py
debug_gym/llms/azure_openai.py
debug_gym/llms/copilot.py
debug_gym/llms/huggingface.py
debug_gym/agents/simple_agent.py

Testing

All changes are backwards compatible
No new dependencies added
Existing functionality preserved

Use shlex.quote() instead of Python repr for proper shell escaping when wrapping commands with timeout. This fixes syntax errors when commands contain special characters like parentheses (e.g., in git apply patches). Also update test_eval_timeout to expect the new timeout message format. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- DockerTerminal.close(): Close docker_client to release connection pool resources. Previously only the container was stopped but the client connection was never closed, causing pool exhaustion after ~10 iterations. - ShellSession.close(): Add process.wait() after terminate() to avoid zombie processes. If wait times out, forcefully kill the process. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The previous fix closed the Docker client in close() to prevent connection leaks, but setting docker_client = None broke terminal reuse since setup_container() would fail with 'NoneType' has no attribute 'containers'. This fix converts docker_client to a lazy-initialized property that recreates the client on demand if it was previously closed. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The close() method was only stopping the listener thread but not closing FileHandler instances, leaving file handles open. This caused processes to hang indefinitely when many loggers were created (e.g., per-agent loggers in parallel evaluation). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Previously, _load_prompt_template() created a Jinja2 Environment without a loader, causing include and from directives to fail with no loader for this environment specified error. Now uses FileSystemLoader with the templates parent directory as root, enabling relative includes for modular prompt templates. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Allows users to specify a custom root directory for the Jinja2 FileSystemLoader, enabling modular template structures with shared components across sibling directories. Example usage: agent_args = AgentArgs( system_prompt="prompts/exploration/main.jinja", prompt_loader_root="prompts/" # enables common/ includes ) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

xingdi-eric-yuan and others added 13 commits December 19, 2025 12:45

Update azure_openai.py

4887c6a

command timeout (for bash/eval)

44a1ae8

Merge branch 'main' into aoai_timeout

3534f2d

update json log viewer to allow cross-origin requests

83fbeb9

security and code quality change

fcc3c66

Merge branch 'aoai_timeout' into minor_fixes

db02c95

Merge branch 'main' into minor_fixes

88d5776

MarcCote approved these changes Jan 7, 2026

View reviewed changes

Update .gitignore

c2aa65c

MarcCote merged commit 19cda64 into main Jan 7, 2026
8 checks passed

MarcCote deleted the minor_fixes branch January 7, 2026 15:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Security and Code Quality Improvements #327

Security and Code Quality Improvements #327

xingdi-eric-yuan commented Jan 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Security and Code Quality Improvements #327

Security and Code Quality Improvements #327

Conversation

xingdi-eric-yuan commented Jan 2, 2026