🔵 NeurIPS 2024, Safe Generative AI
This repository contains the MMLU-Pro+ dataset and evaluation scripts. MMLU-Pro+ is an enhanced version of the MMLU-Pro benchmark designed to assess higher-order reasoning capabilities in Large Language Models (LLMs).
MMLU-Pro+ introduces multiple-choice questions with more than one correct answer, together with distractors that probe the higher-order reasoning capabilities and potential shortcut learning of LLMs. It is also available through EleutherAI's lm-evaluation-harness.
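To make the question format concrete, here is a hypothetical sketch of what an MMLU-Pro+-style record could look like. The field names (`question`, `options`, `answer_index`) and the exact wording of the combined option are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical MMLU-Pro+-style record (field names and wording are illustrative,
# not the dataset's actual schema). The key idea: one option asserts that two
# statements are simultaneously correct, so a model that relies on picking the
# single familiar answer can be distinguished from one reasoning over all options.
example_record = {
    "question": "Which of the following are prime numbers?",
    "options": [
        "A. 21",
        "B. 13",
        "C. 29",
        "D. Both B and C are correct",
    ],
    "answer_index": 3,  # the combined option is the intended answer
}
```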
We evaluated the following state-of-the-art LLMs:
- O1-preview
- GPT-4o
- Claude-Sonnet-3.5
- Gemini-1.5-Pro
- Llama-3.1-405B-Instruct
- Qwen-2-72B-Instruct
We used API calls to evaluate the models:
- Gemini 1.5 Pro, GPT-4o, and O1-preview: We used their original APIs.
- Llama 3.1 405B Instruct and Qwen 2 72B Instruct: We used the API from DeepInfra.
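As an illustration of the API-based setup, the snippet below sketches one way to query an open-weight model through DeepInfra's OpenAI-compatible endpoint using the `openai` Python client. The model identifier, environment variable name, and prompt format are assumptions; the evaluation scripts in this repository may structure their requests differently.

```python
import os
from openai import OpenAI  # pip install openai

# Minimal sketch of an API call as used for the open-weight models.
# Assumptions: DeepInfra exposes an OpenAI-compatible endpoint at this base URL,
# the API key is stored in the DEEPINFRA_API_KEY environment variable, and the
# model identifier below matches DeepInfra's catalog. Adjust to your setup.
client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key=os.environ["DEEPINFRA_API_KEY"],
)

prompt = (
    "Answer the following multiple-choice question. "
    "Reply with the letter of the correct option only.\n\n"
    "Which of the following are prime numbers?\n"
    "A. 21\nB. 13\nC. 29\nD. Both B and C are correct\n"
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct",  # assumed DeepInfra model ID
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)
print(response.choices[0].message.content)
```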
There are three main evaluation scripts:
- `evaluate_from_api.py`: Can be used for all models.
- `evaluate_from_api_multiprocess.py`: Same as above, but supports multi-processing if the API allows it.
- `evaluate_from_api_claude_rate_limit.py`: Specifically for the Claude API, which has rate limits. You can set your own limits in the code (a minimal sketch of such rate limiting follows this list).
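The sketch below illustrates the idea of client-side rate limiting with a fixed requests-per-minute budget that you choose yourself. It is a minimal illustration, not the exact logic used in `evaluate_from_api_claude_rate_limit.py`; the `send_request` callable is a placeholder for whatever function issues the Claude API call.

```python
import time

# Minimal sketch of client-side rate limiting, assuming you set your own
# requests-per-minute budget. This mirrors the idea behind
# evaluate_from_api_claude_rate_limit.py but is not its exact implementation.
REQUESTS_PER_MINUTE = 50  # assumption: adjust to your Anthropic tier
MIN_INTERVAL = 60.0 / REQUESTS_PER_MINUTE

_last_call = 0.0

def rate_limited_call(send_request, *args, **kwargs):
    """Wait long enough between calls to stay under REQUESTS_PER_MINUTE."""
    global _last_call
    elapsed = time.monotonic() - _last_call
    if elapsed < MIN_INTERVAL:
        time.sleep(MIN_INTERVAL - elapsed)
    _last_call = time.monotonic()
    return send_request(*args, **kwargs)
```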
Here are the accuracy results (%) on MMLU-Pro+ categories with performance drop from MMLU-Pro:
If you use MMLU-Pro+ in your research, please cite our paper:
@article{taghanaki2024mmlu,
title={MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs},
author={Taghanaki, Saeid Asgari and Khani, Aliasghar and Khasahmadi, Amir},
journal={arXiv preprint arXiv:2409.02257},
year={2024}
}

We thank the creators of MMLU-Pro for their foundational work, which made MMLU-Pro+ possible.