Development Status :: 5 - Production/Stable
Copyright (c) 2023 MinWoo Park
Updates to the Open LLM Leaderboard Report (this repository) will officially cease on November 13, 2023. Due to concerns about contamination and leaks of the test datasets, I have determined that the rankings on Hugging Face's Open LLM Leaderboard can no longer be fully trusted. Users referring to the Open LLM Leaderboard should now carefully assess not only the rankings of models but also whether models have artificially boosted their benchmark scores by training on contaminated data. It is also advisable to consider benchmark datasets tailored to different purposes and to conduct qualitative evaluations as well.
Nevertheless, Hugging Face's Open LLM Leaderboard, with its free GPU instances, can still provide a rough estimate of model performance for many users and serve as one form of quantitative validation. We appreciate Hugging Face for their contributions.
Although updates will no longer be carried out, the code used to generate the corresponding plots remains valid, allowing you to configure and analyze the data as needed.
This repository offers visualizations that showcase the performance of open-source Large Language Models (LLMs), based on evaluation metrics sourced from Hugging Face's Open-LLM-Leaderboard.
You can refer to this CSV file for the underlying data used in the visualizations. The raw data is a JSON file formatted as a 2D list. All images and backing data can be found in the assets folder.
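As a minimal sketch of how the data can be loaded (the file names assets/leaderboard.csv and assets/raw_data.json below are assumptions for illustration, not the repository's actual paths; check the assets folder for the real file names):

```python
import json

import pandas as pd

# Tabular data behind the plots (path is an assumed placeholder).
df = pd.read_csv("assets/leaderboard.csv")
print(df.head())

# Raw data: a JSON file containing a 2D list, one inner list per model
# (path and row layout are assumed placeholders).
with open("assets/raw_data.json", encoding="utf-8") as f:
    rows = json.load(f)
print(rows[0])
```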
Discussion and analysis during the revision
Configure the settings in config.py, then run:
git clone https://github.com/dsdanielpark/open_llm_leaderboard
cd open_llm_leaderboard
python main.py
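The options in config.py control how the plots are generated. The sketch below is illustrative only; the variable names are assumptions, so refer to config.py in the repository for the options it actually exposes.

```python
# config.py -- illustrative sketch; variable names are assumptions.
CSV_PATH = "assets/leaderboard.csv"  # input data for the plots (assumed path)
SAVE_DIR = "assets"                  # directory where generated images are written
TOP_N = 30                           # number of top-ranked models to plot
FIG_SIZE = (16, 9)                   # matplotlib figure size in inches
```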
- Parameters: The model with the largest parameter count observed so far is scaled to 100 so that values can be plotted as percentages. When a model name contains no parameter information, the value is shown as 0, which makes the graph appear to drop abruptly to 0 (see the sketch below).
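A minimal sketch of that normalization, assuming parameter counts are parsed from model names with a hypothetical helper (the regex and function below are illustrative, not the repository's actual code):

```python
import re


def param_percentage(model_names: list[str]) -> list[float]:
    """Scale parsed parameter counts so the largest model maps to 100.

    Names with no parameter information (no '7b', '13b', ...) become 0,
    which is why the plotted line can drop abruptly to 0.
    """

    def parse_params(name: str) -> float:
        match = re.search(r"(\d+(?:\.\d+)?)\s*[bB]\b", name)
        return float(match.group(1)) if match else 0.0

    params = [parse_params(name) for name in model_names]
    largest = max(params) or 1.0  # avoid division by zero if nothing was parsed
    return [p / largest * 100 for p in params]


# Example: a 70B model maps to 100, a 13B model to ~18.6, an unparsed name to 0.
print(param_percentage(["llama-2-70b", "vicuna-13b", "gpt-neox"]))
```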
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
The Open LLM Leaderboard tracks, ranks, and evaluates large language models and chatbots. It evaluates models based on benchmarks from the Eleuther AI Language Model Evaluation Harness, covering science questions, commonsense inference, multitask accuracy, and truthfulness in generating answers.
The benchmarks aim to test reasoning and general knowledge in different fields using 0-shot and few-shot settings.
Evaluation is performed against 4 popular benchmarks:
- AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions.
- HellaSwag (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
- MMLU (5-shot) - a test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
- TruthfulQA (0-shot) - a benchmark to measure whether a language model is truthful in generating answers to questions.
These benchmarks were chosen because they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings.
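The leaderboard's overall score is essentially the average of these benchmark results. A quick sketch of that aggregation (the column names follow the benchmark list above, and the values are made-up placeholders purely for illustration, not real leaderboard results):

```python
import pandas as pd

# Placeholder scores purely for illustration; not real leaderboard results.
scores = pd.DataFrame(
    {
        "model": ["model-a", "model-b"],
        "ARC (25-shot)": [61.0, 55.2],
        "HellaSwag (10-shot)": [84.3, 78.9],
        "MMLU (5-shot)": [60.1, 52.4],
        "TruthfulQA (0-shot)": [45.7, 41.3],
    }
)
benchmark_cols = [c for c in scores.columns if c != "model"]
scores["Average"] = scores[benchmark_cols].mean(axis=1)
print(scores.sort_values("Average", ascending=False)[["model", "Average"]])
```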
@software{Open-LLM-Leaderboard-Report-2023,
author = {Daniel Park},
title = {Open-LLM-Leaderboard-Report},
url = {https://github.com/dsdanielpark/Open-LLM-Leaderboard-Report},
year = {2023}
}