ZeroEval: A Unified Framework for Evaluating Language Models

ZeroEval is a simple, unified framework for evaluating (large) language models on various tasks. This repository aims to evaluate instruction-tuned LLMs on their zero-shot performance on reasoning tasks such as MMLU and GSM. We evaluate LLMs under a unified setup by controlling factors such as prompting, sampling, and output parsing. In ZeroEval, we use zero-shot prompting and instruct the LM to output both its reasoning and its answer in a JSON-formatted response. We are actively adding new tasks; contributions are welcome!
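
For example, a model response is expected to look roughly like the sketch below. This is only illustrative; the exact field names and instructions are defined per task in src/task_configs.py.

```json
{
  "reasoning": "There are 3 apples at $2 each, so the total cost is 3 * 2 = 6 dollars.",
  "answer": "6"
}
```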

Todo

  • Support new tasks (GPQA, AIME, etc.)
  • Prefix-prefill for open models so that output parsing is easier
  • Add other formatting options (e.g., a markup language instead of JSON)

Installation

conda create -n zeroeval python=3.10
conda activate zeroeval
pip install vllm -U  # alternatively: pip install -e vllm (editable install from source)
pip install -r requirements.txt
# export HF_HOME=/path/to/your/custom/cache_dir/
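
As an optional sanity check (not part of the original instructions, just a common way to confirm the environment is usable):

```bash
# Verify that vLLM and PyTorch import cleanly and that a GPU is visible
python -c "import vllm, torch; print(vllm.__version__, torch.cuda.is_available())"
```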

Tasks

Currently supported tasks include mmlu-redux, gsm, math-l5, zebra-grid, crux, and alpaca_eval; see src/task_configs.py for the full, up-to-date list.

Usage

zero_eval_local.sh and zero_eval_api.sh are the two main scripts for running evaluations: zero_eval_local.sh runs open-weight models locally (with vllm or hf as the engine), while zero_eval_api.sh queries API-served models (e.g., openai, anthropic).

Examples

  • bash zero_eval_local.sh -d mmlu-redux -m meta-llama/Meta-Llama-3-8B-Instruct -p Meta-Llama-3-8B-Instruct -s 4 (Run Llama-3-8B-Instruct with greedy decoding on mmlu-redux)

  • bash zero_eval_api.sh -d gsm -f openai -m openai/gpt-4o-mini-2024-07-18 -p gpt-4o-mini-2024-07-18 -s 8 (Run gpt-4o-mini with greedy decoding on gsm)

  • bash zero_eval_api.sh -d zebra-grid -f openai -m deepseek-chat -p deepseek-chat -s 8 (Run deepseek-chat via an OpenAI-compatible API with greedy decoding on zebra-grid)

More examples can be found in the scripts folder, e.g., the scripts/_MMLU_redux.md and scripts/_GSM.md files as well as scripts/local/crux.sh.

Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| -d | DATA_NAME: mmlu-redux, gsm, math-l5, zebra-grid, alpaca_eval, ... (see src/task_configs.py) | |
| -m | model_name | |
| -p | model_pretty_name | |
| -s | Number of shards. With -s 1, all of your GPUs are used to load the model and run inference; with -s K, K GPUs are used, the data is divided into K shards (one per GPU), and the results are merged at the end. | 1 |
| -f | Engine. vllm by default for zero_eval_local.sh (can be changed to hf); for zero_eval_api.sh, use openai, anthropic, ... | vllm / openai for zero_eval_local.sh / zero_eval_api.sh |
| -r | run_name: when specified, results are saved in a sub-folder named after it | "default" |
| -t | Temperature | 0 (greedy decoding) |
| -o | top_p for nucleus sampling | 1.0 |
| -e | Repetition penalty | 1.0 |
| -b | Batch size | 4 |
| -x | max_length | 4096 |
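
As an illustrative, hypothetical invocation combining several of these flags, one might run a local model with sampling instead of greedy decoding and save the outputs under a custom run name:

```bash
# Hypothetical example: GSM with temperature 0.7, top_p 0.9, 2 shards,
# results stored under the run name "sampling_test"
bash zero_eval_local.sh -d gsm -m meta-llama/Meta-Llama-3-8B-Instruct -p Meta-Llama-3-8B-Instruct \
    -s 2 -t 0.7 -o 0.9 -r sampling_test
```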

Results

🚨 View results on our Leaderboard: https://hf.co/spaces/allenai/ZeroEval

  • MMLU-Redux: python src/evaluation/mcqa_eval.py mmlu-redux --> Full results
  • GSM/MATH-L5: python src/evaluation/math_eval.py math-l5/gsm --> Full results
  • ZebraLogic: python src/evaluation/zebra_grid_eval.py --> Full results and Leaderboard
  • CRUX: python src/evaluation/crux_eval.py --> Full results
  • All: python src/evaluation/summarize.py --> Full results ⬇️

Citation

If you find ZeroEval useful, please cite it as follows in your publication:

@software{Lin_ZeroEval_A_Unified_2024,
    author = {Lin, Bill Yuchen},
    month = jul,
    title = {{ZeroEval: A Unified Framework for Evaluating Language Models}},
    url = {https://github.com/WildEval/ZeroEval},
    year = {2024}
}

Star History

[Star History Chart]
