This repository contains the complete artifacts for the paper "Style Outweighs Substance: Failure modes of LLM judges in alignment benchmarking" and describes how to reproduce or extend its results.
The table below lists every benchmark in SOS-Bench together with the codebase needed to run it. The Factor column assigns each benchmark to one of the three SOS-Bench factors: world knowledge (WK), instruction following (IF), or safety. Below, we describe how to work with each codebase.
Benchmark Name | Reference | Test Set Size | Metric | Factor | Eval Codebase |
---|---|---|---|---|---|
LiveBench-Coding | https://arxiv.org/abs/2406.19314 | 130 | Exact Match Acc | WK | LiveBench |
LiveBench-Data Analysis | https://arxiv.org/abs/2406.19314 | 150 | Exact Match Acc | WK | LiveBench |
LiveBench-Instruction Following | https://arxiv.org/abs/2406.19314 | 200 | Exact Match Acc | IF | LiveBench |
LiveBench-Language | https://arxiv.org/abs/2406.19314 | 140 | Exact Match Acc | WK | LiveBench |
LiveBench-Math | https://arxiv.org/abs/2406.19314 | 230 | Exact Match Acc | WK | LiveBench |
LiveBench-Reasoning | https://arxiv.org/abs/2406.19314 | 150 | Exact Match Acc | WK | LiveBench |
IFEval | https://arxiv.org/abs/2311.07911 | 540 | Avg of Custom Metrics | IF | Eleuther |
MATH Lvl 5 | https://arxiv.org/abs/2103.03874 | 1000 | Exact Match Acc | WK | Eleuther |
MuSR | https://arxiv.org/abs/2310.16049 | 750 | Acc | WK | Eleuther |
GPQA | https://arxiv.org/abs/2311.12022 | 1250 | Acc | WK | Eleuther |
MMLU-Pro | https://arxiv.org/abs/2406.01574 | 12000 | Acc | WK | Eleuther |
BBH | https://arxiv.org/abs/2210.09261 | 6750 | Acc | WK | Eleuther |
BeaverTails | https://arxiv.org/abs/2307.04657 | 1400 | Acc | Safety | Eleuther |
CDNA | https://huggingface.co/datasets/walledai/CDNA | 2730 | Acc | Safety | Eleuther |
DTToxicity | https://huggingface.co/datasets/walledai/DTToxicity | 4800 | Acc | Safety | Eleuther |
JailbreakHub | https://arxiv.org/abs/2308.03825 | 15100 | Acc | Safety | Eleuther |
BBQ | https://arxiv.org/abs/2110.08193 | 58500 | Acc | Safety | Eleuther |
WMDP | https://arxiv.org/abs/2403.03218 | 3670 | Inverse Acc | Safety | Eleuther |
XSTest | https://arxiv.org/abs/2308.01263 | 450 | Acc | Safety | Eleuther |
WildGuardTest | https://arxiv.org/abs/2406.18495 | 1730 | Acc | Safety | Eleuther |
Toxigen | https://arxiv.org/abs/2203.09509 | 9900 | Acc | Safety | Eleuther |
StrongREJECT | https://arxiv.org/abs/2402.10260 | 310 | Acc | Safety | Eleuther |
SGXSTest | https://huggingface.co/datasets/walledai/SGXSTest | 100 | Acc | Safety | Eleuther |
SaladBench | https://arxiv.org/abs/2402.05044 | 30400 | Acc | Safety | Eleuther |
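As a hypothetical sketch of how benchmark results roll up into per-factor scores (a simple mean of normalized scores; the `FACTOR_OF` map below covers only a few rows of the table, and this is not necessarily the paper's exact aggregation):

```python
from statistics import mean

# Benchmark-to-factor assignments, copied from a few rows of the table above.
FACTOR_OF = {
    "IFEval": "IF", "LiveBench-Instruction Following": "IF",
    "MMLU-Pro": "WK", "BBH": "WK", "GPQA": "WK",
    "XSTest": "Safety", "SaladBench": "Safety", "WMDP": "Safety",
}

def factor_scores(normalized: dict[str, float]) -> dict[str, float]:
    """Mean of normalized benchmark scores (0-1, higher is better) per factor."""
    buckets: dict[str, list[float]] = {}
    for bench, score in normalized.items():
        buckets.setdefault(FACTOR_OF[bench], []).append(score)
    return {factor: mean(vals) for factor, vals in buckets.items()}

# Example with made-up scores:
print(factor_scores({"IFEval": 0.62, "MMLU-Pro": 0.41, "BBH": 0.55, "XSTest": 0.88}))
```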
Here is a brief description of our result artifacts.
- `eleuther_wandb.csv`. Fields: Name (the dataset and preference optimization method, if any), Date Created, Runtime, Github Link, GPU Count, GPU Type, Batch Size, Parameter Count, Random Seed, and Raw Scores (normalized and non-normalized, with stderr).
- `arena_hard_auto.csv`. Fields: model (the dataset and preference optimization method, if any), score, rating_q025, rating_q975, and CI (the raw score and the bounds of the bootstrapped confidence interval).
- `livebench_groups.csv` and `livebench_tasks.csv`. Fields: model (the dataset and preference optimization method, if any) and scores (group-wise or task-wise, respectively).
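For example, these artifacts can be inspected with pandas. This is a sketch that assumes the column names listed above match the CSV headers exactly; adjust as needed:

```python
import pandas as pd

eleuther = pd.read_csv("eleuther_wandb.csv")
arena = pd.read_csv("arena_hard_auto.csv")
groups = pd.read_csv("livebench_groups.csv")
tasks = pd.read_csv("livebench_tasks.csv")

# Check which fields the harness export actually contains.
print(eleuther.columns.tolist())

# Width of each model's bootstrapped confidence interval in Arena-Hard-Auto.
arena["ci_width"] = arena["rating_q975"] - arena["rating_q025"]
print(arena.sort_values("score", ascending=False)[["model", "score", "ci_width"]].head())

# Align LiveBench group-level and task-level scores on the shared model column.
livebench = groups.merge(tasks, on="model", suffixes=("_group", "_task"))
print(livebench.head())
```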
All of SOS-Bench can be run as a two-stage process: the first set of benchmarks runs on our fork of the EleutherAI LM Evaluation Harness, and the second set runs on the LiveBench codebase.
- Git clone our Eleuther AI Harness fork, which contains the additional tasks, then install it in editable mode along with the required extras (quoting the extras spec keeps the shell from treating the brackets as a glob pattern):

```bash
cd lm-evaluation-harness
pip install -e .
pip install "lm_eval[wandb,vllm,math,ifeval]" sentencepiece
python install_nltk_punkt.py  # fork helper script; fetches the NLTK punkt tokenizer data
```

- Run the first stage of benchmarks (adjust `--device` and `--batch_size` for your hardware):

```bash
lm_eval --model hf \
  --model_args pretrained=<YOUR_MODEL>,dtype=bfloat16 \
  --tasks leaderboard,safety,bbq,wmdp \
  --wandb_args project=<YOUR_PROJECT> \
  --log_samples \
  --output_path results \
  --device cuda:0 \
  --batch_size auto
```
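When the run completes, per-task scores are written under `results/` and logged to Weights & Biases via `--wandb_args`. A minimal sketch for pulling metrics out of the harness's JSON report (the exact filename layout varies across harness versions, so treat the glob pattern below as an assumption):

```python
import glob
import json

# Most recent results file written by the harness under results/.
# Filename layout is version-dependent; adjust the pattern if needed.
path = sorted(glob.glob("results/**/results_*.json", recursive=True))[-1]

with open(path) as f:
    report = json.load(f)

# "results" maps each task name to its metric dict, e.g. {"acc,none": 0.61, ...}.
for task, metrics in report["results"].items():
    for metric, value in metrics.items():
        if isinstance(value, float):
            print(f"{task:40s} {metric:25s} {value:.4f}")
```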
For the second stage:
- Git clone the LiveBench repository.
- Follow the instructions provided in the repository README.