Skip to content

Conversation

@xingdi-eric-yuan
Copy link
Collaborator

AFTER #325

Security and Code Quality Improvements

Summary

This PR addresses security vulnerabilities and code quality issues identified during a comprehensive codebase review. The changes focus on preventing command injection, fixing resource leaks, and improving input validation across LLM providers and terminal operations.

Critical Security Fixes

Command Injection Prevention

  • workspace.py: Added shlex.quote() to all shell command path interpolations (cat, mkdir, echo, tree commands)
  • local.py: Fixed cp -r command to quote source and target paths
  • swe_bench_debug.py: Fixed git commit message injection vulnerability
  • r2egym.py: Added shell quoting for all path variables in symlink and find commands

Path Traversal Fix

  • workspace.py: Fixed path validation to use proper prefix matching (==path/* instead of ==path*) preventing escape via paths like /testbed_evil/

High Priority Fixes

HTTP Client Timeouts

  • openai.py: Added 300-second timeout to prevent indefinite hangs
  • copilot.py: Added 300-second timeout for HTTP client, 30-second timeout for Node.js subprocess

Input Validation

  • anthropic.py: Added hasattr() validation before accessing response attributes
  • copilot.py: Improved .env file parsing with proper quote handling

Medium Priority Improvements

Resource Management

  • base.py: Added base close() method for HTTP client cleanup
  • openai.py, anthropic.py: Implemented close() methods to properly release HTTP clients
  • azure_openai.py, copilot.py: Added close() methods that call super().close() and invalidate caches

Memory Management

  • openai.py: Added LRU-cached tiktoken encoder (maxsize=10)
  • huggingface.py: Added LRU-cached HuggingFace tokenizer (maxsize=5)
  • openai.py: Added 1MB size limit for JSON payload parsing

Input Validation

  • simple_agent.py: Added tool name validation against registered tools before execution
  • workspace.py: Added bounds validation for max_depth parameter (1-20)

Files Changed

  • debug_gym/gym/workspace.py
  • debug_gym/gym/terminals/local.py
  • debug_gym/gym/envs/swe_bench_debug.py
  • debug_gym/gym/envs/r2egym.py
  • debug_gym/llms/base.py
  • debug_gym/llms/openai.py
  • debug_gym/llms/anthropic.py
  • debug_gym/llms/azure_openai.py
  • debug_gym/llms/copilot.py
  • debug_gym/llms/huggingface.py
  • debug_gym/agents/simple_agent.py

Testing

  • All changes are backwards compatible
  • No new dependencies added
  • Existing functionality preserved

xingdi-eric-yuan and others added 13 commits December 19, 2025 12:45
Use shlex.quote() instead of Python repr for proper shell escaping when
wrapping commands with timeout. This fixes syntax errors when commands
contain special characters like parentheses (e.g., in git apply patches).

Also update test_eval_timeout to expect the new timeout message format.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- DockerTerminal.close(): Close docker_client to release connection pool
  resources. Previously only the container was stopped but the client
  connection was never closed, causing pool exhaustion after ~10 iterations.

- ShellSession.close(): Add process.wait() after terminate() to avoid
  zombie processes. If wait times out, forcefully kill the process.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The previous fix closed the Docker client in close() to prevent connection
leaks, but setting docker_client = None broke terminal reuse since
setup_container() would fail with 'NoneType' has no attribute 'containers'.

This fix converts docker_client to a lazy-initialized property that
recreates the client on demand if it was previously closed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The close() method was only stopping the listener thread but not
closing FileHandler instances, leaving file handles open. This
caused processes to hang indefinitely when many loggers were
created (e.g., per-agent loggers in parallel evaluation).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, _load_prompt_template() created a Jinja2 Environment without
a loader, causing include and from directives to fail with
no loader for this environment specified error.

Now uses FileSystemLoader with the templates parent directory as root,
enabling relative includes for modular prompt templates.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Allows users to specify a custom root directory for the Jinja2
FileSystemLoader, enabling modular template structures with shared
components across sibling directories.

Example usage:
  agent_args = AgentArgs(
      system_prompt="prompts/exploration/main.jinja",
      prompt_loader_root="prompts/"  # enables common/ includes
  )

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@MarcCote MarcCote merged commit 19cda64 into main Jan 7, 2026
8 checks passed
@MarcCote MarcCote deleted the minor_fixes branch January 7, 2026 15:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants