# MMLU Evaluation

This repository contains code for running the [MMLU](https://arxiv.org/abs/2009.03300) (Massive Multitask Language Understanding) evaluation of large language models. It is recoded from scratch following the logic of the [original repo](https://github.com/hendrycks/test), with the following improvements:

- **Accelerated inference**: Uses multithreaded API calls (a sketch of this pattern follows the list).
- **Enhanced stability**: Adds timeouts and retries for API calls.
- **Modularity**: You can easily evaluate your custom LLM (see [Evaluate your custom model](#evaluate-custom)).
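The concurrency-plus-retries pattern can be illustrated with a minimal sketch. This is the general idea, not the repository's actual code; the helper names `call_with_retries` and `predict_all` and the `predict` callable are hypothetical:

```python
import concurrent.futures
import time
from typing import Callable, List


def call_with_retries(prompt: str, predict: Callable[[str], str],
                      retries: int = 3) -> str:
    # Retry transient API failures with exponential backoff.
    for attempt in range(retries):
        try:
            return predict(prompt)  # the API call should enforce its own timeout
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)


def predict_all(prompts: List[str], predict: Callable[[str], str],
                max_workers: int = 8) -> List[str]:
    # Issue API calls concurrently; pool.map preserves the input order.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: call_with_retries(p, predict), prompts))
```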
## Setup

1. Download the dataset [here](https://people.eecs.berkeley.edu/~hendrycks/data.tar) and extract it.
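   For example, on a Unix shell (the archive should unpack into a `data/` directory):

   ```bash
   wget https://people.eecs.berkeley.edu/~hendrycks/data.tar
   tar -xf data.tar
   ```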
2. Install the required dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Ensure you have set the necessary environment variables if you want to use OpenAI or Azure models, e.g. for Azure:

   ```bash
   export OPENAI_API_BASE=https://your-azure-endpoint.com
   export OPENAI_API_KEY=your-azure-key
   ```

## Usage

Run the evaluation script. The results are stored as `.csv` files in the given result directory.

```bash
python evaluate_azure.py --data_dir path-to-data --result_dir path-to-results --k_shot 0
```
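`--k_shot` controls how many dev examples are prepended to each question; for the 5-shot setting used in the original MMLU paper, pass `--k_shot 5` (paths are illustrative):

```bash
python evaluate_azure.py --data_dir data --result_dir results --k_shot 5
```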
## Evaluate your custom model <a id="evaluate-custom"></a>

To evaluate a custom LLM, simply use the following template and replace `predict_function` with your own callable:

```python
from pathlib import Path

from mmlu.evaluation import predict_dataset, evaluate_results


def predict_function(prompt: str) -> str:
    # Dummy predictor that always answers 'A'; replace with your model call.
    return 'A'


if __name__ == '__main__':
    data_dir = Path('data')
    result_dir = Path('results')
    predict_dataset(data_dir=data_dir,
                    result_dir=result_dir,
                    predict_function=predict_function,
                    k_shot=0)
    evaluate_results(result_dir=result_dir)
```
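As a rough sketch of what a real `predict_function` might look like, here is a variant backed by the OpenAI chat API. It assumes the `openai` v1 Python client; the model name and the answer-letter parsing are illustrative assumptions, not part of this repository:

```python
import os
import re

from openai import OpenAI

client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])


def predict_function(prompt: str) -> str:
    # Ask for a single token so the reply is just the answer letter.
    response = client.chat.completions.create(
        model='gpt-4o-mini',  # illustrative model name
        messages=[{'role': 'user', 'content': prompt}],
        max_tokens=1,
        temperature=0.0,
    )
    text = response.choices[0].message.content or ''
    # Keep only the first A-D letter in case the model returns extra text.
    match = re.search(r'[ABCD]', text)
    return match.group(0) if match else 'A'
```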
## Languages other than English

### Evaluating on other languages

We will provide additional datasets (starting with German) that are translated via Azure and can be used ad hoc with the standard evaluation script; simply point to the translated data.
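For example, evaluating against a translated dataset only changes the paths (directory names are illustrative):

```bash
python evaluate_azure.py --data_dir data_de --result_dir results_de --k_shot 0
```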
A translated dataset is formatted in the same way as the original dataset but contains an additional file `subjects.json` that includes the translated prompt header and subjects:

```
data_de/
├── dev/
├── test/
└── subjects.json
```

For German, the `subjects.json` looks like:

```json
{
  "header": "Im Folgenden finden Sie Multiple-Choice-Fragen (mit Antworten) zum Thema",
  "answer": "Antwort",
  "subjects": {
    "abstract_algebra": "abstrakte Algebra",
    "astronomy": "Astronomie",
    ...
  }
}
```

### Translating the dataset to another language

You can use the translation script that calls the Azure translation service:

```bash
export AZURE_ENDPOINT=your-azure-translation-endpoint
export AZURE_KEY=your-azure-key
export AZURE_REGION=your-azure-region
PYTHONPATH=. python mmlu/translate --data_dir data --target_dir /tmp/data_de --lang de
```

The translated data will be stored in `target_dir` in the format described above. Note that only the `dev` and `test` data will be translated.

## References

* [Measuring Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300)
* [Original Implementation](https://github.com/hendrycks/test)