
Eval Testing Quick Reference


AI Runner Eval Testing & Mode-Based Architecture - Quick Reference

📊 Current State Summary

✅ Completed Work

  • 14 tools ported from old mixin system to ToolRegistry
  • 45 unit tests created and passing (100% coverage for ported tools)
  • 34 eval tests created following web tool pattern
  • All code quality checks passed (no errors)
  • Documentation in /docs/TOOL_MIGRATION_SUMMARY.md

🔄 Current Architecture

Single LangGraph Workflow

START → model → tools → model → END
        ↑_______|
  • One workflow for ALL tasks
  • All 37 tools potentially active
  • Tool redundancy detection
  • Database checkpoint persistence
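
For orientation, a minimal sketch of this single-workflow shape in LangGraph. Here `llm` and `all_tools` are placeholders for the project's tool-bound chat model and the full registry tool list, and `MemorySaver` stands in for the project's `DatabaseCheckpointSaver`; treat this as an illustration of the shape, not the actual WorkflowManager code.

```python
from typing import Annotated
from typing_extensions import TypedDict

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode, tools_condition


class WorkflowState(TypedDict):
    # Chat history; add_messages appends new messages instead of overwriting.
    messages: Annotated[list, add_messages]


def model_node(state: WorkflowState) -> dict:
    # llm is a placeholder for the project's tool-bound chat model.
    return {"messages": [llm.invoke(state["messages"])]}


graph = StateGraph(WorkflowState)
graph.add_node("model", model_node)
graph.add_node("tools", ToolNode(all_tools))  # all_tools: every registered tool

graph.add_edge(START, "model")
# Route to "tools" when the last message contains tool calls, otherwise END:
# this is the model -> tools -> model loop shown above.
graph.add_conditional_edges("model", tools_condition)
graph.add_edge("tools", "model")

# MemorySaver stands in for DatabaseCheckpointSaver; both are checkpointers.
app = graph.compile(checkpointer=MemorySaver())
```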

❌ Missing Components (Based on LangSmith Guide)

  1. No Mode-Based Routing → All tools active at once
  2. No Trajectory Evaluation → Only checking response content
  3. No Intent Classification → No automatic mode switching
  4. No Specialized Subgraphs → One workflow handles everything

🎯 Recommended Architecture (LangSmith Pattern)

Parent Graph with Specialized Subgraphs

                    ┌────────────────┐
                    │ Intent Router  │
                    └────────┬───────┘
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
        ▼                    ▼                    ▼
   ┌─────────┐         ┌─────────┐         ┌─────────┐
   │ Author  │         │  Code   │         │Research │
   │  Mode   │         │  Mode   │         │  Mode   │
   │(5 tools)│         │(8 tools)│         │(6 tools)│
   └─────────┘         └─────────┘         └─────────┘

Benefits:

  • Focused tool sets per mode (5-10 vs 37 global)
  • Better LLM performance with fewer tool choices
  • Easier evaluation with clear expected paths
  • Scalable - easy to add new modes
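
A rough sketch of the parent graph, assuming the same LangGraph APIs as above. `author_graph`, `code_graph`, and `research_graph` are placeholders for compiled subgraphs (each built like the single workflow, but over compatible state and with its own focused tool set), and the keyword heuristic in the router is illustrative only; the real node would classify intent with the LLM.

```python
from typing_extensions import TypedDict
from langgraph.graph import StateGraph, START, END


class ParentState(TypedDict):
    messages: list
    mode: str  # "author" | "code" | "research", set by the intent router


def intent_router(state: ParentState) -> dict:
    # Illustrative keyword heuristic; the real node would ask the LLM.
    text = str(state["messages"][-1]).lower() if state["messages"] else ""
    if any(word in text for word in ("write", "story", "chapter")):
        return {"mode": "author"}
    if any(word in text for word in ("code", "function", "bug")):
        return {"mode": "code"}
    return {"mode": "research"}


parent = StateGraph(ParentState)
parent.add_node("intent_router", intent_router)
# Compiled subgraphs are runnables, so they can be added as nodes directly.
parent.add_node("author", author_graph)
parent.add_node("code", code_graph)
parent.add_node("research", research_graph)

parent.add_edge(START, "intent_router")
parent.add_conditional_edges(
    "intent_router",
    lambda state: state["mode"],
    {"author": "author", "code": "code", "research": "research"},
)
for mode in ("author", "code", "research"):
    parent.add_edge(mode, END)

app = parent.compile()
```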

📋 Action Items

Immediate (This Week)

  1. Review LangSmith documentation - DONE
  2. Create architecture plan - /wiki/Mode-Based-Agent-Architecture.md
  3. Create eval enhancement plan - /wiki/Evaluation-Framework-Enhancement.md
  4. Decision needed: Proceed with mode-based architecture?

Phase 1: Enhanced Evaluation (1-2 weeks)

Priority: HIGH - Sets foundation for everything else

  1. Create trajectory tracking utilities

    • trajectory_evaluator.py - Subsequence matching (see the sketch after this list)
    • tracking.py - Event streaming helper
  2. Update existing eval tests

    • Add expected trajectories to all tests
    • Track actual paths through nodes/tools
    • Validate tool call sequences
  3. Create new eval tests

    • Intent classification (single-step)
    • Multi-tool workflows (trajectory)
    • Error recovery paths

Files to Create:

src/airunner/components/eval/
├── utils/
│   ├── __init__.py
│   ├── trajectory_evaluator.py    # NEW
│   └── tracking.py                 # NEW
└── tests/
    ├── test_intent_classification_eval.py  # NEW
    └── test_trajectory_eval.py             # NEW

Files to Update:

src/airunner/components/eval/tests/
├── test_user_data_tool_eval.py    # Add trajectory tracking
├── test_agent_tool_eval.py        # Add trajectory tracking
├── test_rag_tool_eval.py          # Add trajectory tracking
└── test_knowledge_tool_eval.py    # Add trajectory tracking

Phase 2: Mode-Based Architecture (3-4 weeks)

Priority: MEDIUM - Major architectural change

  1. Reorganize tool categories (Week 1)

    • Define mode-based categories (AUTHOR, CODE, RESEARCH, QA); see the sketch after this list
    • Reclassify existing tools
    • Create mode-specific tool modules
  2. Implement parent routing graph (Week 2)

    • Build intent classifier node
    • Create parent StateGraph
    • Integrate with WorkflowManager
  3. Build specialized subgraphs (Week 3-4)

    • Author agent (writing tools)
    • Code agent (programming tools)
    • Research agent (search/knowledge tools)
    • QA agent (retrieval tools)
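
One possible shape for the mode-based categories in step 1. The tool names below are purely illustrative, and the registry lookup is an assumption; the real lists come from reclassifying the 37 registered tools, and the accessor should match whatever ToolRegistry actually exposes.

```python
from enum import Enum


class AgentMode(str, Enum):
    AUTHOR = "author"
    CODE = "code"
    RESEARCH = "research"
    QA = "qa"


# Illustrative grouping only; the goal is 5-10 focused tools per mode.
MODE_TOOLS: dict[AgentMode, list[str]] = {
    AgentMode.AUTHOR: ["write_section", "edit_text", "store_user_data"],
    AgentMode.CODE: ["run_tests", "read_file", "write_file"],
    AgentMode.RESEARCH: ["search_web", "rag_search", "store_knowledge"],
    AgentMode.QA: ["rag_search", "retrieve_user_data"],
}


def tools_for_mode(mode: AgentMode, registry) -> list:
    """Resolve a mode's tool names against the tool registry.

    registry.get(name) is an assumed lookup; swap in the ToolRegistry's
    actual accessor."""
    return [registry.get(name) for name in MODE_TOOLS[mode]]
```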

🔍 Key Differences from Current Implementation

| Aspect | Current | LangSmith Pattern | Impact |
| --- | --- | --- | --- |
| Architecture | Single workflow | Parent + subgraphs | Clearer intent routing |
| Tool Access | All 37 tools | 5-10 per mode | Reduced LLM confusion |
| Evaluation | Response content | Response + trajectory | Better debugging |
| Mode Switching | Manual (tool_categories) | Automatic (intent) | Better UX |
| Scalability | Add tools → more confusion | Add modes → isolated tools | Easy to extend |

📖 Documentation Created

  1. /wiki/Mode-Based-Agent-Architecture.md

    • Complete architecture plan
    • 6-week implementation timeline
    • File structure
    • Success criteria
  2. /wiki/Evaluation-Framework-Enhancement.md

    • Trajectory evaluation guide
    • Code examples
    • Priority order
    • Success metrics
  3. /docs/TOOL_MIGRATION_SUMMARY.md

    • Tool porting summary
    • Testing coverage
    • Running tests guide

🚀 Running Current Tests

Unit Tests (All Passing)

# All tool unit tests
pytest src/airunner/components/llm/tools/tests/ -v

# Specific tests
pytest src/airunner/components/llm/tools/tests/test_user_data_tools.py -v
pytest src/airunner/components/llm/tools/tests/test_agent_tools.py -v

Eval Tests (Ready to Run)

# All eval tests
pytest src/airunner/components/eval/tests/test_*_tool_eval.py -v -m eval

# Specific eval tests
pytest src/airunner/components/eval/tests/test_user_data_tool_eval.py -v -m eval
pytest src/airunner/components/eval/tests/test_agent_tool_eval.py -v -m eval

🎓 Learning from LangSmith Guide

Key Patterns We Should Adopt

  1. Hierarchical Graphs

    • Parent graph routes to specialized subgraphs
    • Each subgraph has focused purpose + tools
    • Clean separation of concerns
  2. Trajectory Evaluation

    • Track path through nodes/tools
    • Compare to expected sequence
    • Partial credit for correct steps
  3. Single-Step Testing

    • Test components in isolation
    • Validate intent classification separately
    • Faster iteration on specific failures
  4. Configurable Environments

    • config={"env": "test"} for mocking
    • Separate test/prod behaviors
    • Easier evaluation without side effects
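
A sketch of the configurable-environment idea: a tool that short-circuits its side effect when the eval harness flags a test run. The exact key placement is an assumption (the guide shows config={"env": "test"}; when invoking through LangChain runnables, custom values usually travel under "configurable"), and _send_for_real is a hypothetical production helper.

```python
from langchain_core.runnables import RunnableConfig
from langchain_core.tools import tool


@tool
def send_email(to: str, body: str, config: RunnableConfig) -> str:
    """Send an email, or mock the send when running under evals."""
    # A RunnableConfig-typed parameter is injected by LangChain and excluded
    # from the tool's input schema, so the LLM never sees it.
    env = (config.get("configurable") or {}).get("env", "prod")
    if env == "test":
        return f"[test] would send email to {to}"
    return _send_for_real(to, body)  # hypothetical production helper
```

An eval test would then invoke the graph with config={"configurable": {"env": "test"}} so the trajectory and response can be checked without any real email going out.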

Patterns We Already Use

  1. Database Checkpoints - Via DatabaseCheckpointSaver
  2. Tool Registry - Via @tool decorator
  3. State Management - Via WorkflowState(TypedDict)
  4. Mock-based Eval Tests - Avoid external dependencies

⚠️ Important Notes

About Current Eval Tests

  • They are CORRECT for checking tool triggering + response quality
  • They follow the pattern from existing web tool evals
  • We're ADDING trajectory validation, not replacing them
  • They will be even better once we add trajectory tracking

About Mode-Based Architecture

  • Major change but high value
  • Backward compatible via general_agent fallback
  • Enables better evaluation with clear expected paths
  • User benefit - faster, more focused responses

About Implementation Order

  1. Trajectory evaluation FIRST - Works with current architecture
  2. Mode-based routing SECOND - Benefits from trajectory validation
  3. This order minimizes risk and validates patterns early

📞 Questions to Answer

  1. Should we proceed with mode-based architecture?

    • Impact: 6 weeks development
    • Benefit: Better UX, easier scaling, clearer evaluation
    • Risk: Architectural change, testing burden
  2. Should we start with trajectory evaluation?

    • Impact: 1-2 weeks
    • Benefit: Better debugging, sets foundation
    • Risk: Low (additive change)
  3. Tool category reorganization strategy?

    • Option A: Mode-based (AUTHOR, CODE, RESEARCH)
    • Option B: Keep current + add mode metadata
    • Option C: Hybrid approach
  4. Timeline expectations?

    • Aggressive: 6 weeks total (both phases)
    • Conservative: 2 weeks eval + 4 weeks architecture
    • Incremental: Eval first, architecture after validation
