This repo is the engine for the evaluations displayed in our Agents v2.0 announcement post.
You can use it to test agents built with different frameworks on different benchmarks:
- GAIA
- Our custom agent reasoning benchmark, which includes tasks from GSM8K, HotpotQA, and GAIA
And with different models (see the benchmark results below).
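
To make the workflow concrete, here is a minimal sketch of what running an agent over one of these benchmarks could look like. It assumes the benchmark is a Hugging Face dataset with `question` and `true_answer` columns and that the agent exposes a `run(question)` method; both names are illustrative, not this repo's actual API.

```python
from datasets import load_dataset

def run_agent_on_benchmark(agent, dataset_name: str, split: str = "test") -> list[dict]:
    """Run an agent on every task in a benchmark and collect its predictions."""
    tasks = load_dataset(dataset_name, split=split)
    results = []
    for task in tasks:
        # Illustrative call: substitute the entry point your agent framework exposes.
        prediction = agent.run(task["question"])
        results.append({
            "question": task["question"],
            "true_answer": task["true_answer"],
            "prediction": prediction,
        })
    return results
```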
We also implement LLM-judge evaluation, with parallel processing for faster results.
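
As a rough sketch of that idea, the judging step can be fanned out over a thread pool. The example below assumes a callable `llm` that takes a prompt string and returns the judge's text reply; the actual judge prompt and concurrency setup in this repo may differ.

```python
from concurrent.futures import ThreadPoolExecutor

JUDGE_PROMPT = """Given a question, a reference answer, and a model's answer,
reply with exactly CORRECT or INCORRECT.

Question: {question}
Reference answer: {true_answer}
Model answer: {prediction}"""

def judge_one(llm, result: dict) -> bool:
    """Ask the judge model to grade a single prediction against the reference."""
    verdict = llm(JUDGE_PROMPT.format(**result))
    return verdict.strip().upper().startswith("CORRECT")

def judge_all(llm, results: list[dict], max_workers: int = 8) -> float:
    """Grade all predictions in parallel threads and return the accuracy."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        verdicts = list(pool.map(lambda r: judge_one(llm, r), results))
    return sum(verdicts) / len(verdicts)
```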