A coding environment for agent evaluations. Provides bash and file editing tools.
⚠️ This is a template. Before building, customize Dockerfile.hud for your project.
To test the template with the sample repository:
hud build . --build-arg REPO_URL=https://github.com/hud-evals/coding-template-sample
hud dev --port 8765
python local_test.py

The Dockerfile uses two build arguments:
| Argument | Required | Default | Description |
|---|---|---|---|
| REPO_URL | Yes | (none) | Git repository URL to clone |
| FOLDER_NAME | No | project | Folder name for the cloned repo |
# Required: Pass REPO_URL as a build argument
ARG REPO_URL
ARG FOLDER_NAME="project"
# The repo is cloned to /home/ubuntu/${FOLDER_NAME}
WORKDIR /home/ubuntu/${FOLDER_NAME}

For private repos, set CODING_GITHUB_TOKEN locally before building:
export CODING_GITHUB_TOKEN=github_pat_XXX
hud build . --build-arg REPO_URL=https://github.com/your-org/your-repo \
--secret id=CODING_GITHUB_TOKEN,env=CODING_GITHUB_TOKEN

Every task in tasks/*.py follows the 3-branch pattern, where each branch exists in the source repo cloned in the Dockerfile:
| Branch | Purpose |
|---|---|
| baseline | Starting state the agent sees, where the agent makes changes |
| test | Contains tests that grade the agent's solution |
| golden | Correct solution for validation and/or training |
Git patches will automatically be generated for every branch defined in each task.
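For illustration only (the template generates these patches for you; the function name here is hypothetical), a branch patch is essentially the git diff from one branch to another:

import subprocess

def branch_patch(repo_dir: str, base: str, branch: str) -> str:
    """Return the patch that takes `base` (e.g. baseline) to `branch` (e.g. golden)."""
    result = subprocess.run(
        ["git", "diff", f"{base}..{branch}"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout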
If you're not using git-based problems, comment out the git clone section in Dockerfile.hud (lines ~63-87).
If you haven't already, connect this repo to hud.ai:
- Push to GitHub
- Go to hud.ai → New → Environment
- Connect your GitHub repo
- Your environment builds automatically on each push
Once deployed, your environment is accessible by its slug (e.g., my-org/coding).
Tools are functions agents can call. Scenarios define the evaluation lifecycle.
@env.tool()
async def bash(command: str) -> str:
"""Run a bash command."""
@env.tool()
async def editor(command: str, path: str, ...) -> str:
"""View, create, and edit files."""@env.scenario("solve-task")
async def solve_task(problem_id: str, hints_enabled: bool = False):
await env.call_tool("_start_services") # Setup
prompt = spec_to_statement(get_problem_spec(problem_id))
_ = yield prompt # Prompt → agent runs
result = await env.call_tool("_grade_solution", {"problem_id": problem_id})
yield result["score"] # Rewardfrom grading import problem, Grade, EnvironmentState, AgentPatchGrader
@problem(
id="fix-bug",
description="Fix the login bug in auth.py",
difficulty="easy",
base="main", test="test-branch", golden="golden-branch",
)
def fix_bug(state: EnvironmentState) -> Grade:
return Grade.from_subscores([
AgentPatchGrader.grade(state, weight=1.0, ...)
])Tasks are scenario instances with specific arguments.
In Code:
task = env("solve-task", problem_id="fix-bug")

From JSON (remote_tasks.json):
{
"env": {"name": "my-org/coding"},
"scenario": "solve-task",
"args": {"problem_id": "fix-bug", "hints_enabled": false}
}On Platform:
After deploying, create tasks from your scenarios on hud.ai. Access them by slug:
from hud.datasets import load_tasks
tasks = load_tasks("my-org/coding-tasks")

Run tasks and see results on hud.ai. You have three options:
On Platform:
Run evaluations at scale directly on hud.ai with parallel execution and automatic tracing.
CLI:
hud eval ./remote_tasks.json --model gpt-4o --remote # https://hud.ai/models
hud eval my-org/coding-tasks --model gpt-4o --remote --group 5

Python:
import hud
from hud.agents import OpenAIChatAgent # See all models: https://hud.ai/models
task = env("solve-task", problem_id="fix-bug")
async with hud.eval(task) as ctx:
    agent = OpenAIChatAgent.create(model="gpt-4o")  # Uses inference.hud.ai
    await agent.run(ctx)
    # Results are automatically traced to hud.ai

With Variants (A/B Testing):
tasks = [env("solve-task", problem_id="fix-bug"), env("solve-task", problem_id="add-feature")]
variants = {"model": ["gpt-4o-mini", "gpt-4o"]}
async with hud.eval(tasks, variants=variants, group=2) as ctx:
    agent = OpenAIChatAgent.create(model=ctx.variants["model"])
    await agent.run(ctx)

This environment requires Docker. Use hud dev with hot-reload:
# 1. Build the Docker image (first time only)
hud build . --build-arg REPO_URL=https://github.com/your-org/your-repo
# For private repos, also pass the secret:
hud build . --build-arg REPO_URL=https://github.com/your-org/your-repo \
--secret id=CODING_GITHUB_TOKEN,env=CODING_GITHUB_TOKEN
# 2. Start with hot-reload on tasks/grading
hud dev -w tasks -w grading --port 8765
# 3. Test locally
python local_test.py
⚠️ Local execution runs one task at a time. The local environment uses a single container, so tasks run sequentially. For parallel execution, push and run remotely:

hud push
hud eval ./remote_tasks.json --model gpt-4o --remote --group 5
When you save a watched file, the MCP server restarts with fresh imports:
| Component | Reloaded? |
|---|---|
| tasks/*.py | ✅ Yes |
| grading/*.py | ✅ Yes |
| tools/*.py | ✅ Yes (if watched) |
When to rebuild: Dockerfile changes, system packages, service configs.
coding-template/
├── env.py # Tools + scenario registration
├── scenarios.py # Shared helpers + scenarios
├── tools/ # bash, editor
├── grading/ # @problem decorator, graders
├── tasks/ # Problem definitions
├── local_test.py # Dev testing
└── Dockerfile.hud # Container config
This template is designed to be adapted for different tech stacks. Here's how to customize it for your project.
Edit pyproject.toml:
[project]
name = "your-company-evaluation-framework"
description = "AI Agent Evaluation Framework for [Your Project]"In Dockerfile.hud, uncomment and configure the section for your language:
Python:
RUN python3 -m pip install --upgrade pip
RUN pip install poetry # or: pipenv, pip-tools
RUN cd /home/ubuntu/${FOLDER_NAME} && pip install -r requirements.txt

Java/Maven:
RUN apt-get install -y openjdk-17-jdk maven
RUN cd /home/ubuntu/${FOLDER_NAME} && mvn dependency:resolve

Go:
RUN wget https://go.dev/dl/go1.21.0.linux-amd64.tar.gz && tar -C /usr/local -xzf go*.tar.gz
RUN cd /home/ubuntu/${FOLDER_NAME} && go mod download

Rust:
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
RUN cd /home/ubuntu/${FOLDER_NAME} && cargo fetch

The grading system requires JUnit XML output. Configure your test framework:
pytest:
# In grading/runner.py
def run_pytest_tests(self) -> str:
    result = subprocess.run(
        ["pytest", "--junit-xml=pytest_results.xml"] + self.test_files,
        cwd=self.grade_working_dir, capture_output=True, text=True,
    )
    return Path(self.grade_working_dir, "pytest_results.xml").read_text()

Go test:
def run_go_tests(self) -> str:
    result = subprocess.run(
        ["go", "test", "-v", "./...", "-json"],
        cwd=self.grade_working_dir, capture_output=True, text=True,
    )
    return self._convert_go_json_to_junit(result.stdout)
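The _convert_go_json_to_junit helper is referenced above but not shown. A minimal sketch of what such a converter could look like, assuming it lives on the same runner class and only needs per-test pass/fail/skip events (illustrative, not the template's actual implementation):

import json
import xml.etree.ElementTree as ET

def _convert_go_json_to_junit(self, stdout: str) -> str:
    """Fold `go test -json` events into a minimal JUnit XML document."""
    results = {}  # (package, test) -> {"status", "elapsed", "output"}
    for line in stdout.splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines
        test = event.get("Test")
        if not test:
            continue  # package-level events carry no "Test" field
        key = (event.get("Package", ""), test)
        entry = results.setdefault(key, {"status": "pass", "elapsed": 0.0, "output": []})
        action = event.get("Action")
        if action == "output":
            entry["output"].append(event.get("Output", ""))
        elif action in ("pass", "fail", "skip"):
            entry["status"] = action
            entry["elapsed"] = event.get("Elapsed", 0.0)

    suite = ET.Element("testsuite", name="go", tests=str(len(results)))
    failures = 0
    for (package, test), entry in results.items():
        case = ET.SubElement(suite, "testcase", classname=package, name=test,
                             time=str(entry["elapsed"]))
        if entry["status"] == "fail":
            failures += 1
            failure = ET.SubElement(case, "failure", message="go test reported a failure")
            failure.text = "".join(entry["output"])
        elif entry["status"] == "skip":
            ET.SubElement(case, "skipped")
    suite.set("failures", str(failures))
    return ET.tostring(suite, encoding="unicode")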
Adapt or remove database setup in grading/runner.py:

No database:
# Comment out database reset steps in run_grading()

MySQL:
drop_cmd = f"mysql -u root -p{password} -e 'DROP DATABASE IF EXISTS {db_name}'"
create_cmd = f"mysql -u root -p{password} -e 'CREATE DATABASE {db_name}'"MongoDB:
drop_cmd = f"mongo {db_name} --eval 'db.dropDatabase()'"If your project doesn't run as ubuntu:
If your project doesn't run as ubuntu:

# In tools/bash.py - remove sudo wrapper
subprocess.run(["bash", "-lc", command], ...)
# Or use a different user
subprocess.run(["sudo", "-u", "youruser", "bash", "-lc", command], ...)Full documentation: docs.hud.ai