🔵 NeurIPS 2024, Safe Generative AI
This repository contains the MMLU-Pro+ dataset and evaluation scripts. MMLU-Pro+ is an enhanced version of the MMLU-Pro benchmark designed to assess higher-order reasoning capabilities in Large Language Models (LLMs).
MMLU-Pro+ introduces multiple-choice questions with more than one correct answer, together with distractors that probe the higher-order reasoning capabilities and potential shortcut learning of LLMs. It is also available through EleutherAI's lm-evaluation-harness.
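To make the question format concrete, here is a hypothetical sketch of what an MMLU-Pro+-style record could look like. The field names (`question`, `options`, `answer_index`) and the exact wording of the combined option are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical MMLU-Pro+-style record (field names and wording are illustrative,
# not the dataset's actual schema). The key idea: one option asserts that two
# statements are simultaneously correct, so a model that relies on picking the
# single familiar answer can be distinguished from one reasoning over all options.
example_record = {
    "question": "Which of the following are prime numbers?",
    "options": [
        "A. 21",
        "B. 13",
        "C. 29",
        "D. Both B and C are correct",
    ],
    "answer_index": 3,  # the combined option is the intended answer
}
```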
We evaluated the following state-of-the-art LLMs:
- O1-preview
- GPT-4o
- Claude-Sonnet-3.5
- Gemini-1.5-Pro
- Llama-3.1-405B-Instruct
- Qwen-2-72B-Instruct
We used API calls to evaluate the models:
- Gemini 1.5 Pro, GPT-4o, and O1-preview: We used their original APIs.
- Llama 3.1 405B Instruct and Qwen 2 72B Instruct: We used the API from DeepInfra.
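As an illustration of the API-based setup, the snippet below sketches one way to query an open-weight model through DeepInfra's OpenAI-compatible endpoint using the `openai` Python client. The model identifier, environment variable name, and prompt format are assumptions; the evaluation scripts in this repository may structure their requests differently.

```python
import os
from openai import OpenAI  # pip install openai

# Minimal sketch of an API call as used for the open-weight models.
# Assumptions: DeepInfra exposes an OpenAI-compatible endpoint at this base URL,
# the API key is stored in the DEEPINFRA_API_KEY environment variable, and the
# model identifier below matches DeepInfra's catalog. Adjust to your setup.
client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key=os.environ["DEEPINFRA_API_KEY"],
)

prompt = (
    "Answer the following multiple-choice question. "
    "Reply with the letter of the correct option only.\n\n"
    "Which of the following are prime numbers?\n"
    "A. 21\nB. 13\nC. 29\nD. Both B and C are correct\n"
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct",  # assumed DeepInfra model ID
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)
print(response.choices[0].message.content)
```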
There are three main evaluation scripts:
- `evaluate_from_api.py`: Can be used for all models.
- `evaluate_from_api_multiprocess.py`: Same as above, but supports multi-processing if the API allows it.
- `evaluate_from_api_claude_rate_limit.py`: Specifically for the Claude API, which has rate limits. You can set your own limits in the code (a minimal sketch of such rate limiting follows this list).
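The sketch below illustrates the idea of client-side rate limiting with a fixed requests-per-minute budget that you choose yourself. It is a minimal illustration, not the exact logic used in `evaluate_from_api_claude_rate_limit.py`; the `send_request` callable is a placeholder for whatever function issues the Claude API call.

```python
import time

# Minimal sketch of client-side rate limiting, assuming you set your own
# requests-per-minute budget. This mirrors the idea behind
# evaluate_from_api_claude_rate_limit.py but is not its exact implementation.
REQUESTS_PER_MINUTE = 50  # assumption: adjust to your Anthropic tier
MIN_INTERVAL = 60.0 / REQUESTS_PER_MINUTE

_last_call = 0.0

def rate_limited_call(send_request, *args, **kwargs):
    """Wait long enough between calls to stay under REQUESTS_PER_MINUTE."""
    global _last_call
    elapsed = time.monotonic() - _last_call
    if elapsed < MIN_INTERVAL:
        time.sleep(MIN_INTERVAL - elapsed)
    _last_call = time.monotonic()
    return send_request(*args, **kwargs)
```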
Here are the accuracy results (%) on MMLU-Pro+ categories with performance drop from MMLU-Pro:
If you use MMLU-Pro+ in your research, please cite our paper:
@article{taghanaki2024mmlu,
title={MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs},
author={Taghanaki, Saeid Asgari and Khani, Aliasghar and Khasahmadi, Amir},
journal={arXiv preprint arXiv:2409.02257},
year={2024}
}

We thank the creators of MMLU-Pro for their foundational work, which made MMLU-Pro+ possible.