Eval Testing Quick Reference
- 14 tools ported from old mixin system to ToolRegistry
- 45 unit tests created and passing (100% coverage for ported tools)
- 34 eval tests created following web tool pattern
- All code quality checks passed (no errors)
- Documentation in /docs/TOOL_MIGRATION_SUMMARY.md
Single LangGraph Workflow
```
START → model → tools → model → END
                  ↑_______|
```
- One workflow for ALL tasks
- All 37 tools potentially active
- Tool redundancy detection
- Database checkpoint persistence
- No Mode-Based Routing → All tools active at once
- No Trajectory Evaluation → Only checking response content
- No Intent Classification → No automatic mode switching
- No Specialized Subgraphs → One workflow handles everything
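To make the current shape concrete, below is a minimal sketch of how a single-workflow graph of this form is typically wired in LangGraph. The node bodies are placeholders and the state fields are illustrative; they are not the project's actual `WorkflowState` or node implementations.

```python
from typing import Annotated
from typing_extensions import TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages


class WorkflowState(TypedDict):
    messages: Annotated[list, add_messages]


def model_node(state: WorkflowState) -> dict:
    # In the real workflow this calls the LLM with all 37 tools bound.
    return {"messages": []}


def tools_node(state: WorkflowState) -> dict:
    # Executes whichever tools the model requested.
    return {"messages": []}


def should_continue(state: WorkflowState) -> str:
    # Loop back through the tools node while the last message has tool calls.
    last = state["messages"][-1] if state["messages"] else None
    return "tools" if getattr(last, "tool_calls", None) else END


builder = StateGraph(WorkflowState)
builder.add_node("model", model_node)
builder.add_node("tools", tools_node)
builder.add_edge(START, "model")
builder.add_conditional_edges("model", should_continue, ["tools", END])
builder.add_edge("tools", "model")
graph = builder.compile()  # the real app also passes a checkpointer here
```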
```
            ┌────────────────┐
            │  Intent Router │
            └────────┬───────┘
                     │
     ┌───────────────┼───────────────┐
     │               │               │
     ▼               ▼               ▼
┌──────────┐   ┌──────────┐   ┌──────────┐
│  Author  │   │   Code   │   │ Research │
│   Mode   │   │   Mode   │   │   Mode   │
│ (5 tools)│   │ (8 tools)│   │ (6 tools)│
└──────────┘   └──────────┘   └──────────┘
```
Benefits:
- Focused tool sets per mode (5-10 vs 37 global)
- Better LLM performance with fewer tool choices
- Easier evaluation with clear expected paths
- Scalable - easy to add new modes
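A sketch of the proposed parent-plus-subgraphs shape follows. The subgraphs are trivial placeholders and a keyword check stands in for the real intent classifier; `RouterState`, `make_mode_graph`, and the routing logic are illustrative, not existing project code.

```python
from typing import Annotated, Literal
from typing_extensions import TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages


class RouterState(TypedDict):
    messages: Annotated[list, add_messages]
    mode: str


def make_mode_graph(name: str):
    # Placeholder subgraph; the real ones would bind their focused tool sets.
    g = StateGraph(RouterState)
    g.add_node("model", lambda state: {"messages": []})
    g.add_edge(START, "model")
    g.add_edge("model", END)
    return g.compile()


def intent_router(state: RouterState) -> dict:
    # Classify the latest user message into a mode; an LLM call in practice,
    # a trivial keyword check in this sketch.
    text = str(state["messages"][-1]) if state["messages"] else ""
    return {"mode": "code" if "def " in text else "author"}


def route_by_mode(state: RouterState) -> Literal["author", "code", "research"]:
    return state["mode"]


parent = StateGraph(RouterState)
parent.add_node("intent_router", intent_router)
for mode in ("author", "code", "research"):
    parent.add_node(mode, make_mode_graph(mode))  # compiled subgraph as a node
parent.add_edge(START, "intent_router")
parent.add_conditional_edges("intent_router", route_by_mode)
for mode in ("author", "code", "research"):
    parent.add_edge(mode, END)
parent_graph = parent.compile()
```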
- ✅ Review LangSmith documentation - DONE
- ✅ Create architecture plan - /wiki/Mode-Based-Agent-Architecture.md
- ✅ Create eval enhancement plan - /wiki/Evaluation-Framework-Enhancement.md
- Decision needed: Proceed with mode-based architecture?
Priority: HIGH - Sets foundation for everything else
- Create trajectory tracking utilities
  - `trajectory_evaluator.py` - Subsequence matching (see the sketch after this list)
  - `tracking.py` - Event streaming helper
- Update existing eval tests
  - Add expected trajectories to all tests
  - Track actual paths through nodes/tools
  - Validate tool call sequences
- Create new eval tests
  - Intent classification (single-step)
  - Multi-tool workflows (trajectory)
  - Error recovery paths
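A possible shape for the subsequence-matching half of `trajectory_evaluator.py`: score the actual node/tool path against an expected sequence, with partial credit for expected steps that appear in order. The function name and the example node/tool names are assumptions, not decided APIs.

```python
from typing import Sequence


def trajectory_score(expected: Sequence[str], actual: Sequence[str]) -> float:
    """Return the fraction of expected steps found, in order, within actual."""
    if not expected:
        return 1.0
    matched = 0
    remaining = iter(actual)
    for step in expected:
        for seen in remaining:
            if seen == step:
                matched += 1
                break
    return matched / len(expected)


# Example: the agent took an extra step but hit every expected node in order.
expected = ["intent_router", "code", "run_tests_tool"]
actual = ["intent_router", "code", "read_file_tool", "run_tests_tool"]
assert trajectory_score(expected, actual) == 1.0
```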
Files to Create:
```
src/airunner/components/eval/
├── utils/
│   ├── __init__.py
│   ├── trajectory_evaluator.py              # NEW
│   └── tracking.py                          # NEW
└── tests/
    ├── test_intent_classification_eval.py   # NEW
    └── test_trajectory_eval.py              # NEW
```
Files to Update:
```
src/airunner/components/eval/tests/
├── test_user_data_tool_eval.py     # Add trajectory tracking
├── test_agent_tool_eval.py         # Add trajectory tracking
├── test_rag_tool_eval.py           # Add trajectory tracking
└── test_knowledge_tool_eval.py     # Add trajectory tracking
```
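A hedged sketch of what the "add trajectory tracking" change to one of these tests might look like, assuming a `graph` fixture with mocked tools and the proposed `trajectory_score` helper; the prompt, node names, and threshold are illustrative.

```python
import pytest

# Proposed module from the file plan above.
from airunner.components.eval.utils.trajectory_evaluator import trajectory_score


@pytest.mark.eval
def test_user_data_lookup_trajectory(graph):
    expected_trajectory = ["model", "tools", "model"]
    actual_trajectory = []

    # stream_mode="updates" yields one dict per executed node, keyed by node name.
    for update in graph.stream(
        {"messages": [("user", "What is my username?")]},
        stream_mode="updates",
    ):
        actual_trajectory.extend(update.keys())

    # Keep the existing response-content assertions, then add the new check.
    assert trajectory_score(expected_trajectory, actual_trajectory) >= 0.8
```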
Priority: MEDIUM - Major architectural change
- Reorganize tool categories (Week 1) - sketched below, after this list
  - Define mode-based categories (AUTHOR, CODE, RESEARCH, QA)
  - Reclassify existing tools
  - Create mode-specific tool modules
- Implement parent routing graph (Week 2)
  - Build intent classifier node
  - Create parent StateGraph
  - Integrate with WorkflowManager
- Build specialized subgraphs (Week 3-4)
  - Author agent (writing tools)
  - Code agent (programming tools)
  - Research agent (search/knowledge tools)
  - QA agent (retrieval tools)
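One way the Week 1 reclassification could be expressed: an enum of modes plus a mapping each subgraph uses to filter the tool registry. The mode names come from this plan; the tool names and the `get_tools_for_mode` helper are hypothetical.

```python
from enum import Enum


class AgentMode(str, Enum):
    AUTHOR = "author"
    CODE = "code"
    RESEARCH = "research"
    QA = "qa"


# Hypothetical mapping from mode to the tool names that subgraph may bind.
MODE_TOOLS: dict[AgentMode, list[str]] = {
    AgentMode.AUTHOR: ["outline_document", "edit_prose", "summarize_text"],
    AgentMode.CODE: ["read_file", "write_file", "run_tests"],
    AgentMode.RESEARCH: ["web_search", "rag_search", "store_knowledge"],
    AgentMode.QA: ["rag_search", "retrieve_user_data"],
}


def get_tools_for_mode(registry: dict, mode: AgentMode) -> list:
    """Return only the registry entries the given mode is allowed to use."""
    return [registry[name] for name in MODE_TOOLS[mode] if name in registry]
```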
| Aspect | Current | LangSmith Pattern | Impact |
|---|---|---|---|
| Architecture | Single workflow | Parent + subgraphs | Clearer intent routing |
| Tool Access | All 37 tools | 5-10 per mode | Reduced LLM confusion |
| Evaluation | Response content | Response + trajectory | Better debugging |
| Mode Switching | Manual (tool_categories) | Automatic (intent) | Better UX |
| Scalability | Add tools → more confusion | Add modes → isolated tools | Easy to extend |
- /wiki/Mode-Based-Agent-Architecture.md - Complete architecture plan
  - 6-week implementation timeline
  - File structure
  - Success criteria
- /wiki/Evaluation-Framework-Enhancement.md - Trajectory evaluation guide
  - Code examples
  - Priority order
  - Success metrics
- /docs/TOOL_MIGRATION_SUMMARY.md - Tool porting summary
  - Testing coverage
  - Running tests guide
```bash
# All tool unit tests
pytest src/airunner/components/llm/tools/tests/ -v

# Specific tests
pytest src/airunner/components/llm/tools/tests/test_user_data_tools.py -v
pytest src/airunner/components/llm/tools/tests/test_agent_tools.py -v
```

```bash
# All eval tests
pytest src/airunner/components/eval/tests/test_*_tool_eval.py -v -m eval

# Specific eval tests
pytest src/airunner/components/eval/tests/test_user_data_tool_eval.py -v -m eval
pytest src/airunner/components/eval/tests/test_agent_tool_eval.py -v -m eval
```
- Hierarchical Graphs
  - Parent graph routes to specialized subgraphs
  - Each subgraph has focused purpose + tools
  - Clean separation of concerns
- Trajectory Evaluation
  - Track path through nodes/tools
  - Compare to expected sequence
  - Partial credit for correct steps
- Single-Step Testing
  - Test components in isolation
  - Validate intent classification separately
  - Faster iteration on specific failures
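A sketch of single-step testing against the intent classifier alone, without running any subgraph. It reuses the illustrative `intent_router` from the routing sketch earlier; the prompts and expected modes are made up.

```python
import pytest

# intent_router: the illustrative node function from the routing sketch above,
# not an existing project symbol.


@pytest.mark.eval
@pytest.mark.parametrize(
    "prompt,expected_mode",
    [
        ("Write a short story about a lighthouse", "author"),
        ("def fibonacci(n): fix this function", "code"),
    ],
)
def test_intent_classification(prompt, expected_mode):
    state = {"messages": [("user", prompt)], "mode": ""}
    result = intent_router(state)  # call the node function directly
    assert result["mode"] == expected_mode
```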
- Configurable Environments
  - `config={"env": "test"}` for mocking
  - Separate test/prod behaviors
  - Easier evaluation without side effects
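A sketch of the configurable-environment pattern. Here the env flag travels under `config["configurable"]`, which is how LangGraph propagates custom config to nodes (the wiki's `config={"env": "test"}` shorthand maps onto this); the node name and fixture content are assumptions.

```python
from langchain_core.runnables import RunnableConfig


def web_search_node(state: dict, config: RunnableConfig) -> dict:
    env = (config.get("configurable") or {}).get("env", "prod")
    if env == "test":
        # Mocked result: no network access, no side effects.
        return {"results": [{"title": "fixture result", "url": "https://example.test"}]}
    # Production path would call the real search tool here.
    raise NotImplementedError("real search call elided in this sketch")


# Eval tests would then invoke the graph with the test environment:
# graph.invoke(inputs, config={"configurable": {"env": "test"}})
```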
- ✅ Database Checkpoints - Via `DatabaseCheckpointSaver`
- ✅ Tool Registry - Via `@tool` decorator
- ✅ State Management - Via `WorkflowState` (TypedDict)
- ✅ Mock-based Eval Tests - Avoid external dependencies
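For quick reference, the tool-registration pattern sketched with the `langchain_core` `@tool` decorator; the project's own ToolRegistry wiring may differ, so treat the tool name and body as illustrative.

```python
from langchain_core.tools import tool


@tool
def get_user_name(user_id: str) -> str:
    """Look up the display name for a user id."""
    # Placeholder body; the real tool would query user data storage.
    return f"user-{user_id}"
```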
- They are CORRECT for checking tool triggering + response quality
- They follow the pattern from existing web tool evals
- We're ADDING trajectory validation, not replacing them
- They will be even better once we add trajectory tracking
- Major change but high value
- Backward compatible via `general_agent` fallback
- Enables better evaluation with clear expected paths
- User benefit - faster, more focused responses
- Trajectory evaluation FIRST - Works with current architecture
- Mode-based routing SECOND - Benefits from trajectory validation
- This order minimizes risk and validates patterns early
- Should we proceed with mode-based architecture?
  - Impact: 6 weeks development
  - Benefit: Better UX, easier scaling, clearer evaluation
  - Risk: Architectural change, testing burden
- Should we start with trajectory evaluation?
  - Impact: 1-2 weeks
  - Benefit: Better debugging, sets foundation
  - Risk: Low (additive change)
- Tool category reorganization strategy?
  - Option A: Mode-based (AUTHOR, CODE, RESEARCH)
  - Option B: Keep current + add mode metadata
  - Option C: Hybrid approach
- Timeline expectations?
  - Aggressive: 6 weeks total (both phases)
  - Conservative: 2 weeks eval + 4 weeks architecture
  - Incremental: Eval first, architecture after validation