ZeroEval: A Unified Framework for Evaluating Language Models

ZeroEval is a simple, unified framework for evaluating (large) language models on various tasks. This repository aims to evaluate instruction-tuned LLMs on their zero-shot performance on reasoning tasks such as MMLU and GSM. We evaluate LLMs under a unified setup by controlling factors such as prompting, sampling, and output parsing. In ZeroEval, we use zero-shot prompting and instruct the LM to output both its reasoning and its answer in a JSON-formatted response. We are actively adding new tasks; contributions are welcome!
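
For example, a model response is expected to look roughly like the sketch below. This is only illustrative; the exact field names and instructions are defined per task in src/task_configs.py.

```json
{
  "reasoning": "There are 3 apples at $2 each, so the total cost is 3 * 2 = 6 dollars.",
  "answer": "6"
}
```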

Todo

  • Support new tasks (GPQA, AIME, etc.)
  • Prefix-prefill for open models so that output parsing is easier
  • Add other formatting options (e.g., a markup language instead of JSON)

Installation

conda create -n zeroeval python=3.10
conda activate zeroeval
pip install vllm -U  # alternatively: pip install -e vllm (editable install from source)
pip install -r requirements.txt
# export HF_HOME=/path/to/your/custom/cache_dir/
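
As an optional sanity check (not part of the original instructions, just a common way to confirm the environment is usable):

```bash
# Verify that vLLM and PyTorch import cleanly and that a GPU is visible
python -c "import vllm, torch; print(vllm.__version__, torch.cuda.is_available())"
```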

Tasks

Currently supported tasks include mmlu-redux, gsm, math-l5, zebra-grid, crux, and alpaca_eval; see src/task_configs.py for the full, up-to-date list.

Usage

zero_eval_local.sh and zero_eval_api.sh are the two main scripts for running evaluations: zero_eval_local.sh runs open-weight models locally (with vllm or hf as the engine), while zero_eval_api.sh queries API-served models (e.g., openai, anthropic).

Examples

  • bash zero_eval_local.sh -d mmlu-redux -m meta-llama/Meta-Llama-3-8B-Instruct -p Meta-Llama-3-8B-Instruct -s 4 (Run Llama-3-8B-Instruct with greedy decoding on mmlu-redux)

  • bash zero_eval_api.sh -d gsm -f openai -m openai/gpt-4o-mini-2024-07-18 -p gpt-4o-mini-2024-07-18 -s 8 (Run gpt-4o-mini with greedy decoding on gsm)

  • bash zero_eval_api.sh -d zebra-grid -f openai -m deepseek-chat -p deepseek-chat -s 8 (Run deepseek-chat via an OpenAI-compatible API with greedy decoding on zebra-grid)

More examples can be found in the scripts folder, e.g., the scripts/_MMLU_redux.md and scripts/_GSM.md files as well as scripts/local/crux.sh.

Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| -d | DATA_NAME: mmlu-redux, gsm, math-l5, zebra-grid, alpaca_eval, ... (see src/task_configs.py) | |
| -m | model_name | |
| -p | model_pretty_name | |
| -s | Number of shards. With -s 1, all of your GPUs are used to load the model and run inference; with -s K, K GPUs are used, the data is divided into K shards (one per GPU), and the results are merged at the end. | 1 |
| -f | Engine. vllm by default for zero_eval_local.sh (can be changed to hf); for zero_eval_api.sh, use openai, anthropic, ... | vllm / openai for zero_eval_local.sh / zero_eval_api.sh |
| -r | run_name: when specified, results are saved in a sub-folder named after it | "default" |
| -t | Temperature | 0 (greedy decoding) |
| -o | top_p for nucleus sampling | 1.0 |
| -e | Repetition penalty | 1.0 |
| -b | Batch size | 4 |
| -x | max_length | 4096 |
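
As an illustrative, hypothetical invocation combining several of these flags, one might run a local model with sampling instead of greedy decoding and save the outputs under a custom run name:

```bash
# Hypothetical example: GSM with temperature 0.7, top_p 0.9, 2 shards,
# results stored under the run name "sampling_test"
bash zero_eval_local.sh -d gsm -m meta-llama/Meta-Llama-3-8B-Instruct -p Meta-Llama-3-8B-Instruct \
    -s 2 -t 0.7 -o 0.9 -r sampling_test
```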

Results

🚨 View results on our Leaderboard: https://hf.co/spaces/allenai/ZeroEval

  • MMLU-Redux: python src/evaluation/mcqa_eval.py mmlu-redux --> Full results
  • GSM/MATH-L5: python src/evaluation/math_eval.py math-l5/gsm --> Full results
  • ZebraLogic: python src/evaluation/zebra_grid_eval.py --> Full results and Leaderboard
  • CRUX: python src/evaluation/crux_eval.py --> Full results
  • All: python src/evaluation/summarize.py --> Full results ⬇️

Citation

If you find ZeroEval useful, please cite it as follows in your publication:

@software{Lin_ZeroEval_A_Unified_2024,
    author = {Lin, Bill Yuchen},
    month = jul,
    title = {{ZeroEval: A Unified Framework for Evaluating Language Models}},
    url = {https://github.com/WildEval/ZeroEval},
    year = {2024}
}

Star History

[Star History Chart]
