Benchmark agent workflows: try the models of your choice on the framework that you want

This repo is the engine for the evaluations displayed in our Agents v2.0 announcement post.

You can use it to test agents on different frameworks:

On different benchmarks:

GAIA
Our custom agent reasoning benchmark, which includes tasks from GSM8K, HotpotQA and GAIA

And with different models (cf. the benchmark below).

We also implement LLM-judge evaluation, with parallel processing for faster results.
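As a rough illustration of the idea (not the code used in this repo), parallel LLM-judge scoring could be sketched as below. The names `stub_judge` and `judge_in_parallel` are hypothetical, and the judge is a plain string-matching stand-in so the example runs without any model or API key; in practice it would be replaced by a call to the judge LLM.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def stub_judge(question: str, answer: str, reference: str) -> int:
    """Hypothetical stand-in for an LLM judge call.

    Returns 1 if the reference string appears in the agent's answer, else 0.
    A real judge would prompt a model with the question, answer and reference.
    """
    return int(reference.strip().lower() in answer.strip().lower())


def judge_in_parallel(
    examples: List[Dict[str, str]],
    judge: Callable[[str, str, str], int] = stub_judge,
    max_workers: int = 4,
) -> List[int]:
    """Score every example concurrently; results keep the input order."""

    def score(example: Dict[str, str]) -> int:
        return judge(example["question"], example["answer"], example["reference"])

    # Threads suit this workload because real judge calls are network-bound.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(score, examples))


examples = [
    {"question": "What is 12 * 7?", "answer": "The result is 84.", "reference": "84"},
    {"question": "Capital of France?", "answer": "I believe it is Lyon.", "reference": "Paris"},
]
scores = judge_in_parallel(examples)
accuracy = sum(scores) / len(scores)
```

A thread pool is assumed here because judge calls typically spend most of their time waiting on a remote model; `pool.map` returns scores in the same order as the inputs, so they can be joined back to the agent outputs directly.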
