Add benchmark infrastructure for scientific ACE evaluation #50

Lanzelot1 · 2025-12-09T17:02:56Z

Summary

Add comprehensive benchmark framework for evaluating ACE performance
Support 14 benchmark datasets (FiNER, GSM8K, MMLU, HellaSwag, ARC, SWE-Bench, etc.)
Implement train/test split with overfitting prevention for rigorous evaluation
Add Docker containerization for reproducible benchmark execution
Include specialized evaluation environments for different task types
Add 33 new processor tests for data transformation validation
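
The train/test split mentioned above could look roughly like the sketch below. This is not the PR's implementation: the function name `split_train_test`, the `train_ratio` parameter, and the seeded shuffle are assumptions made for illustration. The point it shows is that adaptation should only ever see the training portion, while scores are reported on a disjoint held-out portion.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(
    samples: Sequence[T], train_ratio: float = 0.8, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Hypothetical helper: deterministic, disjoint train/test split."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")

    # Shuffle indices with a private, seeded RNG so the split is reproducible
    # and does not disturb the global random state.
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)

    cut = int(len(indices) * train_ratio)
    train = [samples[i] for i in indices[:cut]]  # used for adaptation only
    test = [samples[i] for i in indices[cut:]]   # used for evaluation only
    return train, test
```

Fixing the seed keeps the held-out set identical across runs, so baseline and adapted results are measured on the same samples.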

Changes

New: tests/test_processors.py - comprehensive tests for all data processors
Updated: CLAUDE.md - dedicated benchmarking section with commands
Updated: README.md - added benchmarking to features list
Removed: scripts/explain_ace_performance.py (broken script using removed module)
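
For a sense of what a data-transformation test in tests/test_processors.py might check, here is a minimal sketch. The processor `process_gsm8k_row` and its output keys are hypothetical, since the PR description does not show the real processor names or signatures. The only outside fact relied on is that GSM8K solutions end with a `#### <final answer>` line.

```python
from typing import Dict


def process_gsm8k_row(row: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical processor: reduce a GSM8K-style record to question + final answer."""
    # Split on the last "####" marker; everything after it is the final answer.
    _reasoning, marker, final = row["answer"].rpartition("####")
    if not marker:
        raise ValueError("answer field has no '####' final-answer marker")
    return {
        "question": row["question"].strip(),
        # Drop thousands separators so "1,000" and "1000" compare equal.
        "ground_truth": final.strip().replace(",", ""),
    }


def test_extracts_final_answer() -> None:
    row = {"question": " What is 6 * 7? ", "answer": "6 * 7 = 42\n#### 42"}
    assert process_gsm8k_row(row) == {
        "question": "What is 6 * 7?",
        "ground_truth": "42",
    }


def test_rejects_row_without_marker() -> None:
    try:
        process_gsm8k_row({"question": "q", "answer": "no marker here"})
    except ValueError:
        return
    raise AssertionError("expected ValueError for a row without '####'")
```

Tests of this shape run under pytest without extra fixtures, which keeps them cheap to include in the suite listed in the test plan.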

Test plan

Run test suite: uv run pytest tests/test_benchmarks.py tests/test_processors.py (48 tests pass)
Verify benchmark list: uv run python scripts/run_benchmark.py list
Test baseline mode: uv run python scripts/run_benchmark.py simple_qa --limit 5 --skip-adaptation
Test comparison mode: uv run python scripts/run_benchmark.py simple_qa --limit 10 --compare
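
The commands above imply a small command-line surface: a positional benchmark name (or `list`), plus `--limit`, `--skip-adaptation`, and `--compare`. The sketch below shows one way such a surface could be declared with `argparse`. It is an assumption for illustration only and is not taken from scripts/run_benchmark.py; the function names and help strings are hypothetical.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the test plan."""
    parser = argparse.ArgumentParser(prog="run_benchmark.py")
    parser.add_argument(
        "benchmark",
        help="benchmark name (e.g. simple_qa), or 'list' to show available benchmarks",
    )
    parser.add_argument(
        "--limit", type=int, default=None, help="maximum number of samples to run"
    )
    parser.add_argument(
        "--skip-adaptation", action="store_true", help="baseline mode"
    )
    parser.add_argument(
        "--compare", action="store_true", help="comparison mode"
    )
    return parser


def parse(argv: Optional[List[str]] = None) -> argparse.Namespace:
    # argv=None makes argparse fall back to sys.argv[1:].
    return build_parser().parse_args(argv)
```

Passing an explicit list, for example `parse(["simple_qa", "--limit", "5", "--skip-adaptation"])`, makes the parsing easy to exercise in a unit test without spawning a process.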

Features:
- Comprehensive benchmark framework with 14+ datasets
- Support for MMLU, GSM8K, FiNER, HellaSwag, ARC, SWE-Bench, etc.
- Train/test split with overfitting prevention
- Docker containerization for reproducible execution
- 33 new processor tests for data transformation
- Updated documentation with benchmark commands

Cleanup:
- Remove outdated, date-named documentation files (1205.md, GUIDE1130.md)
- Remove broken explain_ace_performance.py script

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Lanzelot1 force-pushed the feature/benchmarks branch from a2f8b5b to 9f67006 on December 9, 2025 17:27
