
Eval Testing Quick Reference


AI Runner Eval Testing & Mode-Based Architecture - Quick Reference

📊 Current State Summary

✅ Completed Work

  • 14 tools ported from old mixin system to ToolRegistry
  • 45 unit tests created and passing (100% coverage for ported tools)
  • 34 eval tests created following web tool pattern
  • All code quality checks passed (no errors)
  • Documentation in /docs/TOOL_MIGRATION_SUMMARY.md

🔄 Current Architecture

Single LangGraph Workflow

START → model → tools → model → END
        ↑_______|
  • One workflow for ALL tasks
  • All 37 tools potentially active
  • Tool redundancy detection
  • Database checkpoint persistence
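
For orientation, a minimal sketch of this single-workflow shape in LangGraph. Here `llm` and `all_tools` are placeholders for the project's tool-bound chat model and the full registry tool list, and `MemorySaver` stands in for the project's `DatabaseCheckpointSaver`; treat this as an illustration of the shape, not the actual WorkflowManager code.

```python
from typing import Annotated
from typing_extensions import TypedDict

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode, tools_condition


class WorkflowState(TypedDict):
    # Chat history; add_messages appends new messages instead of overwriting.
    messages: Annotated[list, add_messages]


def model_node(state: WorkflowState) -> dict:
    # llm is a placeholder for the project's tool-bound chat model.
    return {"messages": [llm.invoke(state["messages"])]}


graph = StateGraph(WorkflowState)
graph.add_node("model", model_node)
graph.add_node("tools", ToolNode(all_tools))  # all_tools: every registered tool

graph.add_edge(START, "model")
# Route to "tools" when the last message contains tool calls, otherwise END:
# this is the model -> tools -> model loop shown above.
graph.add_conditional_edges("model", tools_condition)
graph.add_edge("tools", "model")

# MemorySaver stands in for DatabaseCheckpointSaver; both are checkpointers.
app = graph.compile(checkpointer=MemorySaver())
```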

❌ Missing Components (Based on LangSmith Guide)

  1. No Mode-Based Routing → All tools active at once
  2. No Trajectory Evaluation → Only checking response content
  3. No Intent Classification → No automatic mode switching
  4. No Specialized Subgraphs → One workflow handles everything

🎯 Recommended Architecture (LangSmith Pattern)

Parent Graph with Specialized Subgraphs

                    ┌────────────────┐
                    │ Intent Router  │
                    └────────┬───────┘
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
        ▼                    ▼                    ▼
   ┌─────────┐         ┌─────────┐         ┌─────────┐
   │ Author  │         │  Code   │         │Research │
   │  Mode   │         │  Mode   │         │  Mode   │
   │(5 tools)│         │(8 tools)│         │(6 tools)│
   └─────────┘         └─────────┘         └─────────┘

Benefits:

  • Focused tool sets per mode (5-10 vs 37 global)
  • Better LLM performance with fewer tool choices
  • Easier evaluation with clear expected paths
  • Scalable - easy to add new modes
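
A rough sketch of the parent graph, assuming the same LangGraph APIs as above. `author_graph`, `code_graph`, and `research_graph` are placeholders for compiled subgraphs (each built like the single workflow, but over compatible state and with its own focused tool set), and the keyword heuristic in the router is illustrative only; the real node would classify intent with the LLM.

```python
from typing_extensions import TypedDict
from langgraph.graph import StateGraph, START, END


class ParentState(TypedDict):
    messages: list
    mode: str  # "author" | "code" | "research", set by the intent router


def intent_router(state: ParentState) -> dict:
    # Illustrative keyword heuristic; the real node would ask the LLM.
    text = str(state["messages"][-1]).lower() if state["messages"] else ""
    if any(word in text for word in ("write", "story", "chapter")):
        return {"mode": "author"}
    if any(word in text for word in ("code", "function", "bug")):
        return {"mode": "code"}
    return {"mode": "research"}


parent = StateGraph(ParentState)
parent.add_node("intent_router", intent_router)
# Compiled subgraphs are runnables, so they can be added as nodes directly.
parent.add_node("author", author_graph)
parent.add_node("code", code_graph)
parent.add_node("research", research_graph)

parent.add_edge(START, "intent_router")
parent.add_conditional_edges(
    "intent_router",
    lambda state: state["mode"],
    {"author": "author", "code": "code", "research": "research"},
)
for mode in ("author", "code", "research"):
    parent.add_edge(mode, END)

app = parent.compile()
```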

📋 Action Items

Immediate (This Week)

  1. Review LangSmith documentation - DONE
  2. Create architecture plan - /wiki/Mode-Based-Agent-Architecture.md
  3. Create eval enhancement plan - /wiki/Evaluation-Framework-Enhancement.md
  4. Decision needed: Proceed with mode-based architecture?

Phase 1: Enhanced Evaluation (1-2 weeks)

Priority: HIGH - Sets foundation for everything else

  1. Create trajectory tracking utilities

    • trajectory_evaluator.py - Subsequence matching (see the sketch after this list)
    • tracking.py - Event streaming helper
  2. Update existing eval tests

    • Add expected trajectories to all tests
    • Track actual paths through nodes/tools
    • Validate tool call sequences
  3. Create new eval tests

    • Intent classification (single-step)
    • Multi-tool workflows (trajectory)
    • Error recovery paths

Files to Create:

src/airunner/components/eval/
├── utils/
│   ├── __init__.py
│   ├── trajectory_evaluator.py    # NEW
│   └── tracking.py                 # NEW
└── tests/
    ├── test_intent_classification_eval.py  # NEW
    └── test_trajectory_eval.py             # NEW

Files to Update:

src/airunner/components/eval/tests/
├── test_user_data_tool_eval.py    # Add trajectory tracking
├── test_agent_tool_eval.py        # Add trajectory tracking
├── test_rag_tool_eval.py          # Add trajectory tracking
└── test_knowledge_tool_eval.py    # Add trajectory tracking

Phase 2: Mode-Based Architecture (3-4 weeks)

Priority: MEDIUM - Major architectural change

  1. Reorganize tool categories (Week 1)

    • Define mode-based categories (AUTHOR, CODE, RESEARCH, QA); see the sketch after this list
    • Reclassify existing tools
    • Create mode-specific tool modules
  2. Implement parent routing graph (Week 2)

    • Build intent classifier node
    • Create parent StateGraph
    • Integrate with WorkflowManager
  3. Build specialized subgraphs (Week 3-4)

    • Author agent (writing tools)
    • Code agent (programming tools)
    • Research agent (search/knowledge tools)
    • QA agent (retrieval tools)
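
One possible shape for the mode-based categories in step 1. The tool names below are purely illustrative, and the registry lookup is an assumption; the real lists come from reclassifying the 37 registered tools, and the accessor should match whatever ToolRegistry actually exposes.

```python
from enum import Enum


class AgentMode(str, Enum):
    AUTHOR = "author"
    CODE = "code"
    RESEARCH = "research"
    QA = "qa"


# Illustrative grouping only; the goal is 5-10 focused tools per mode.
MODE_TOOLS: dict[AgentMode, list[str]] = {
    AgentMode.AUTHOR: ["write_section", "edit_text", "store_user_data"],
    AgentMode.CODE: ["run_tests", "read_file", "write_file"],
    AgentMode.RESEARCH: ["search_web", "rag_search", "store_knowledge"],
    AgentMode.QA: ["rag_search", "retrieve_user_data"],
}


def tools_for_mode(mode: AgentMode, registry) -> list:
    """Resolve a mode's tool names against the tool registry.

    registry.get(name) is an assumed lookup; swap in the ToolRegistry's
    actual accessor."""
    return [registry.get(name) for name in MODE_TOOLS[mode]]
```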

🔍 Key Differences from Current Implementation

| Aspect | Current | LangSmith Pattern | Impact |
| --- | --- | --- | --- |
| Architecture | Single workflow | Parent + subgraphs | Clearer intent routing |
| Tool Access | All 37 tools | 5-10 per mode | Reduced LLM confusion |
| Evaluation | Response content | Response + trajectory | Better debugging |
| Mode Switching | Manual (tool_categories) | Automatic (intent) | Better UX |
| Scalability | Add tools → more confusion | Add modes → isolated tools | Easy to extend |

📖 Documentation Created

  1. /wiki/Mode-Based-Agent-Architecture.md

    • Complete architecture plan
    • 6-week implementation timeline
    • File structure
    • Success criteria
  2. /wiki/Evaluation-Framework-Enhancement.md

    • Trajectory evaluation guide
    • Code examples
    • Priority order
    • Success metrics
  3. /docs/TOOL_MIGRATION_SUMMARY.md

    • Tool porting summary
    • Testing coverage
    • Running tests guide

🚀 Running Current Tests

Unit Tests (All Passing)

# All tool unit tests
pytest src/airunner/components/llm/tools/tests/ -v

# Specific tests
pytest src/airunner/components/llm/tools/tests/test_user_data_tools.py -v
pytest src/airunner/components/llm/tools/tests/test_agent_tools.py -v

Eval Tests (Ready to Run)

# All eval tests
pytest src/airunner/components/eval/tests/test_*_tool_eval.py -v -m eval

# Specific eval tests
pytest src/airunner/components/eval/tests/test_user_data_tool_eval.py -v -m eval
pytest src/airunner/components/eval/tests/test_agent_tool_eval.py -v -m eval

🎓 Learning from LangSmith Guide

Key Patterns We Should Adopt

  1. Hierarchical Graphs

    • Parent graph routes to specialized subgraphs
    • Each subgraph has focused purpose + tools
    • Clean separation of concerns
  2. Trajectory Evaluation

    • Track path through nodes/tools
    • Compare to expected sequence
    • Partial credit for correct steps
  3. Single-Step Testing

    • Test components in isolation
    • Validate intent classification separately
    • Faster iteration on specific failures
  4. Configurable Environments

    • config={"env": "test"} for mocking
    • Separate test/prod behaviors
    • Easier evaluation without side effects
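
A sketch of the configurable-environment idea: a tool that short-circuits its side effect when the eval harness flags a test run. The exact key placement is an assumption (the guide shows config={"env": "test"}; when invoking through LangChain runnables, custom values usually travel under "configurable"), and _send_for_real is a hypothetical production helper.

```python
from langchain_core.runnables import RunnableConfig
from langchain_core.tools import tool


@tool
def send_email(to: str, body: str, config: RunnableConfig) -> str:
    """Send an email, or mock the send when running under evals."""
    # A RunnableConfig-typed parameter is injected by LangChain and excluded
    # from the tool's input schema, so the LLM never sees it.
    env = (config.get("configurable") or {}).get("env", "prod")
    if env == "test":
        return f"[test] would send email to {to}"
    return _send_for_real(to, body)  # hypothetical production helper
```

An eval test would then invoke the graph with config={"configurable": {"env": "test"}} so the trajectory and response can be checked without any real email going out.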

Patterns We Already Use

  1. Database Checkpoints - Via DatabaseCheckpointSaver
  2. Tool Registry - Via @tool decorator
  3. State Management - Via WorkflowState(TypedDict)
  4. Mock-based Eval Tests - Avoid external dependencies

⚠️ Important Notes

About Current Eval Tests

  • They are CORRECT for checking tool triggering + response quality
  • They follow the pattern from existing web tool evals
  • We're ADDING trajectory validation, not replacing them
  • They will be even better once we add trajectory tracking

About Mode-Based Architecture

  • Major change but high value
  • Backward compatible via general_agent fallback
  • Enables better evaluation with clear expected paths
  • User benefit - faster, more focused responses

About Implementation Order

  1. Trajectory evaluation FIRST - Works with current architecture
  2. Mode-based routing SECOND - Benefits from trajectory validation
  3. This order minimizes risk and validates patterns early

📞 Questions to Answer

  1. Should we proceed with mode-based architecture?

    • Impact: 6 weeks development
    • Benefit: Better UX, easier scaling, clearer evaluation
    • Risk: Architectural change, testing burden
  2. Should we start with trajectory evaluation?

    • Impact: 1-2 weeks
    • Benefit: Better debugging, sets foundation
    • Risk: Low (additive change)
  3. Tool category reorganization strategy?

    • Option A: Mode-based (AUTHOR, CODE, RESEARCH)
    • Option B: Keep current + add mode metadata
    • Option C: Hybrid approach
  4. Timeline expectations?

    • Aggressive: 6 weeks total (both phases)
    • Conservative: 2 weeks eval + 4 weeks architecture
    • Incremental: Eval first, architecture after validation
