pip install mteb
- Using a python script (see scripts/run_mteb_english.py and mteb/mtebscripts for more):
from mteb import MTEB
from sentence_transformers import SentenceTransformer
# Define the sentence-transformers model name
model_name = "average_word_embeddings_komninos"
model = SentenceTransformer(model_name)
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder=f"results/{model_name}")
- Using CLI
mteb --available_tasks
mteb -m average_word_embeddings_komninos \
-t Banking77Classification \
--output_folder results/average_word_embeddings_komninos \
--verbosity 3
Datasets can be selected by providing the list of datasets, but also
- by their task (e.g. "Clustering" or "Classification")
evaluation = MTEB(task_types=['Clustering', 'Retrieval']) # Only select clustering and retrieval tasks
- by their categories e.g. "S2S" (sentence to sentence) or "P2P" (paragraph to paragraph)
evaluation = MTEB(task_categories=['S2S']) # Only select sentence2sentence datasets
- by their languages
evaluation = MTEB(task_langs=["en", "de"]) # Only select datasets which are "en", "de" or "en-de"
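As a rough illustration of the language selection described in the comment above, the sketch below treats a language pair such as "en-de" as selected when both halves were requested. This is not MTEB's actual filtering code; the function names and the task mapping are hypothetical, and codes that themselves contain a hyphen (e.g. regional variants) are not handled.

```python
def lang_matches(eval_lang: str, wanted: list[str]) -> bool:
    """Return True if a language or language pair is covered by `wanted`.

    Illustrative rule only: a pair like "en-de" matches when both halves
    were requested; a single code matches when it was requested itself.
    """
    return all(part in wanted for part in eval_lang.split("-"))


def filter_by_langs(task_langs: dict[str, list[str]], wanted: list[str]) -> list[str]:
    """Keep the names of tasks that have at least one matching language."""
    return [
        name
        for name, langs in task_langs.items()
        if any(lang_matches(lang, wanted) for lang in langs)
    ]


# Hypothetical task -> languages mapping, used only for this example.
example_tasks = {
    "TaskEnglish": ["en"],
    "TaskFrench": ["fr"],
    "TaskPairs": ["de-en", "fr-en"],
}

selected = filter_by_langs(example_tasks, ["en", "de"])
print(selected)  # ['TaskEnglish', 'TaskPairs']
```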
You can also specify which languages to load for multilingual/crosslingual tasks like below:
from mteb.tasks import AmazonReviewsClassification, BUCCBitextMining
evaluation = MTEB(tasks=[
    AmazonReviewsClassification(langs=["en", "fr"]),  # Only load "en" and "fr" subsets of Amazon Reviews
    BUCCBitextMining(langs=["de-en"]),  # Only load "de-en" subset of BUCC
])
You can evaluate only on test splits of all tasks by doing the following:
evaluation.run(model, eval_splits=["test"])
Note that the public leaderboard uses the test splits for all datasets except MSMARCO, where the "dev" split is used.
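To mirror the leaderboard's split choice when looping over tasks, a small lookup can help. This is a minimal sketch: the helper name and the override table are hypothetical, and the only rule encoded is the one stated above (MSMARCO uses "dev", everything else uses "test").

```python
# Assumption: MSMARCO is the only exception, as stated in the note above.
LEADERBOARD_SPLIT_OVERRIDES = {"MSMARCO": ["dev"]}


def leaderboard_splits(task_name: str) -> list[str]:
    """Return the eval splits the public leaderboard uses for a task."""
    return list(LEADERBOARD_SPLIT_OVERRIDES.get(task_name, ["test"]))


for name in ["Banking77Classification", "MSMARCO"]:
    print(name, leaderboard_splits(name))
    # With mteb installed, one could then run (not executed here):
    # MTEB(tasks=[name]).run(model, eval_splits=leaderboard_splits(name))
```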
Models should implement the following interface: an encode function that takes a list of sentences as input and returns a list of embeddings (embeddings can be np.array, torch.tensor, etc.). For inspiration, you can look at the mteb/mtebscripts repo used for running diverse models via SLURM scripts for the paper.
class MyModel():
    def encode(self, sentences, batch_size=32, **kwargs):
        """Returns a list of embeddings for the given sentences.

        Args:
            sentences (`List[str]`): List of sentences to encode
            batch_size (`int`): Batch size for the encoding

        Returns:
            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
        """
        pass
model = MyModel()
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model)
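For a concrete, if deliberately naive, instance of the interface above, the sketch below embeds sentences with a hashed bag of words. The class name and the hashing scheme are illustrative choices rather than anything MTEB provides, and the resulting embeddings are not expected to score well; the point is only the shape of `encode`.

```python
import hashlib

import numpy as np


class HashingBagOfWordsModel:
    """Toy encoder: hashes each token into one of `dim` buckets."""

    def __init__(self, dim: int = 64):
        self.dim = dim

    def _embed(self, sentence: str) -> np.ndarray:
        vec = np.zeros(self.dim, dtype=np.float32)
        for token in sentence.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "little") % self.dim
            vec[bucket] += 1.0
        norm = np.linalg.norm(vec)
        # L2-normalise so cosine similarity and dot product agree.
        return vec / norm if norm > 0 else vec

    def encode(self, sentences, batch_size=32, **kwargs):
        """Return one embedding per sentence as a 2D array."""
        embeddings = []
        for start in range(0, len(sentences), batch_size):
            batch = sentences[start:start + batch_size]
            embeddings.extend(self._embed(s) for s in batch)
        if not embeddings:
            return np.zeros((0, self.dim), dtype=np.float32)
        return np.stack(embeddings)


model = HashingBagOfWordsModel(dim=64)
vectors = model.encode(["How do I reset my card PIN?", "What is the exchange rate?"])
print(vectors.shape)  # (2, 64)
```

An instance of such a class can be passed to `evaluation.run(model)` in place of a SentenceTransformer model, since only `encode` is required by the interface described above.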
If you'd like to use different encoding functions for queries and corpus when evaluating a Dense Retrieval Exact Search (DRES) model on retrieval tasks from BeIR, you can make your model DRES compatible. A model that is compatible, as in the example below, will be used for the BeIR tasks during evaluation.
from mteb import AbsTaskRetrieval, DRESModel
class MyModel(DRESModel):
    # Refer to the code of DRESModel for the methods to overwrite
    pass
assert AbsTaskRetrieval.is_dres_compatible(MyModel)
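The sketch below shows what separate query and corpus encoders might look like. It assumes the BeIR convention of `encode_queries` and `encode_corpus` methods, with corpus entries given as dicts holding "title" and "text"; check the DRESModel code for the exact methods to overwrite. The class names, the prefixes, and the stand-in encoder are all hypothetical, and the wrapper does not subclass DRESModel so that it runs without mteb installed.

```python
import numpy as np


class CountingEncoder:
    """Stand-in encoder for this example: embeds text as [chars, words]."""

    def encode(self, sentences, batch_size=32, **kwargs):
        return np.array(
            [[len(s), len(s.split())] for s in sentences], dtype=np.float32
        )


class QueryCorpusModel:
    """Applies different preprocessing to queries and corpus documents."""

    def __init__(self, base_model, query_prefix="query: ",
                 corpus_prefix="passage: ", sep=" "):
        self.base_model = base_model
        self.query_prefix = query_prefix
        self.corpus_prefix = corpus_prefix
        self.sep = sep

    def encode_queries(self, queries, batch_size=32, **kwargs):
        texts = [self.query_prefix + q for q in queries]
        return self.base_model.encode(texts, batch_size=batch_size, **kwargs)

    def encode_corpus(self, corpus, batch_size=32, **kwargs):
        # Assumed corpus format: a list of {"title": ..., "text": ...} dicts.
        texts = [
            self.corpus_prefix
            + (doc.get("title", "") + self.sep + doc["text"]).strip()
            for doc in corpus
        ]
        return self.base_model.encode(texts, batch_size=batch_size, **kwargs)


dres_like = QueryCorpusModel(CountingEncoder())
query_vecs = dres_like.encode_queries(["what is mteb?"])
corpus_vecs = dres_like.encode_corpus([{"title": "MTEB", "text": "A benchmark."}])
print(query_vecs.shape, corpus_vecs.shape)  # (1, 2) (1, 2)
```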
To add a new task, you need to implement a new class that inherits from the AbsTask associated with the task type (e.g. AbsTaskReranking for reranking tasks). You can find the supported task types here.
from mteb import MTEB
from mteb.abstasks.AbsTaskReranking import AbsTaskReranking
from sentence_transformers import SentenceTransformer
class MindSmallReranking(AbsTaskReranking):
    @property
    def description(self):
        return {
            "name": "MindSmallReranking",
            "hf_hub_name": "mteb/mind_small",
            "description": "Microsoft News Dataset: A Large-Scale English Dataset for News Recommendation Research",
            "reference": "https://www.microsoft.com/en-us/research/uploads/prod/2019/03/nl4se18LinkSO.pdf",
            "type": "Reranking",
            "category": "s2s",
            "eval_splits": ["validation"],
            "eval_langs": ["en"],
            "main_score": "map",
        }
model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=[MindSmallReranking()])
evaluation.run(model)
Note: for multilingual tasks, make sure your class also inherits from the MultilingualTask class like in this example.
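When writing the description dictionary for a new task, a quick check for forgotten fields can save a failed run. The sketch below is an assumption-laden helper: the set of expected keys is copied from the MindSmallReranking example above rather than from any documented MTEB schema, and the function name is hypothetical.

```python
# Keys taken from the MindSmallReranking example; not an official schema.
EXPECTED_KEYS = {
    "name", "hf_hub_name", "description", "reference", "type",
    "category", "eval_splits", "eval_langs", "main_score",
}


def missing_description_keys(description: dict) -> list[str]:
    """Return the expected keys absent from a task description, sorted."""
    return sorted(EXPECTED_KEYS - description.keys())


draft = {
    "name": "MyNewReranking",  # hypothetical task name
    "type": "Reranking",
    "category": "s2s",
    "eval_splits": ["test"],
    "eval_langs": ["en"],
}

missing = missing_description_keys(draft)
print(missing)  # ['description', 'hf_hub_name', 'main_score', 'reference']
```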
The MTEB Leaderboard is available here. To submit:
- Run on MTEB: You can reference scripts/run_mteb_english.py for all MTEB English datasets used in the main ranking. Advanced scripts with different models are available in the mteb/mtebscripts repo.
- Format the json files into metadata using the script at scripts/mteb_meta.py. For example, python scripts/mteb_meta.py path_to_results_folder will create a mteb_metadata.md file. If you ran CQADupstack retrieval, make sure to merge the results first with python scripts/merge_cqadupstack.py path_to_results_folder.
- Copy the content of the mteb_metadata.md file to the top of a README.md file of your model on the Hub. See here for an example.
- Refresh the leaderboard and you should see your scores 🥇
- To have the scores appear without refreshing, you can open an issue on the Community Tab of the LB and someone will restart the Space to cache your average scores.
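The ordering of the two script calls in the steps above (merge CQADupstack first, then build the metadata) can be captured in a small helper. This is a sketch only: the helper name is hypothetical, it merely builds the command lines quoted above, and it assumes the commands are run from the repository root.

```python
def submission_commands(results_folder: str,
                        ran_cqadupstack: bool = False) -> list[list[str]]:
    """Build the commands for preparing leaderboard metadata, in order."""
    commands = []
    if ran_cqadupstack:
        # CQADupstack results must be merged before metadata is generated.
        commands.append(["python", "scripts/merge_cqadupstack.py", results_folder])
    commands.append(["python", "scripts/mteb_meta.py", results_folder])
    return commands


commands = submission_commands(
    "results/average_word_embeddings_komninos", ran_cqadupstack=True
)
for cmd in commands:
    print(" ".join(cmd))
    # To execute from the repository root instead of printing:
    # subprocess.run(cmd, check=True)
```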
Name | Hub URL | Description | Type | Category | #Languages | Train #Samples | Dev #Samples | Test #Samples | Avg. chars / train | Avg. chars / dev | Avg. chars / test |
---|---|---|---|---|---|---|---|---|---|---|---|
BUCC | mteb/bucc-bitext-mining | BUCC bitext mining dataset | BitextMining | s2s | 4 | 0 | 0 | 641684 | 0 | 0 | 101.3 |
Tatoeba | mteb/tatoeba-bitext-mining | 1,000 English-aligned sentence pairs for each language based on the Tatoeba corpus | BitextMining | s2s | 112 | 0 | 0 | 2000 | 0 | 0 | 39.4 |
AmazonCounterfactualClassification | mteb/amazon_counterfactual | A collection of Amazon customer reviews annotated for counterfactual detection pair classification. | Classification | s2s | 4 | 4018 | 335 | 670 | 107.3 | 109.2 | 106.1 |
AmazonPolarityClassification | mteb/amazon_polarity | Amazon Polarity Classification Dataset. | Classification | s2s | 1 | 3600000 | 0 | 400000 | 431.6 | 0 | 431.4 |
AmazonReviewsClassification | mteb/amazon_reviews_multi | A collection of Amazon reviews specifically designed to aid research in multilingual text classification. | Classification | s2s | 6 | 1200000 | 30000 | 30000 | 160.5 | 159.2 | 160.4 |
Banking77Classification | mteb/banking77 | Dataset composed of online banking queries annotated with their corresponding intents. | Classification | s2s | 1 | 10003 | 0 | 3080 | 59.5 | 0 | 54.2 |
EmotionClassification | mteb/emotion | Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. For more detailed information please refer to the paper. | Classification | s2s | 1 | 16000 | 2000 | 2000 | 96.8 | 95.3 | 96.6 |
ImdbClassification | mteb/imdb | Large Movie Review Dataset | Classification | p2p | 1 | 25000 | 0 | 25000 | 1325.1 | 0 | 1293.8 |
MassiveIntentClassification | mteb/amazon_massive_intent | MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages | Classification | s2s | 51 | 11514 | 2033 | 2974 | 35.0 | 34.8 | 34.6 |
MassiveScenarioClassification | mteb/amazon_massive_scenario | MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages | Classification | s2s | 51 | 11514 | 2033 | 2974 | 35.0 | 34.8 | 34.6 |
MTOPDomainClassification | mteb/mtop_domain | MTOP: Multilingual Task-Oriented Semantic Parsing | Classification | s2s | 6 | 15667 | 2235 | 4386 | 36.6 | 36.5 | 36.8 |
MTOPIntentClassification | mteb/mtop_intent | MTOP: Multilingual Task-Oriented Semantic Parsing | Classification | s2s | 6 | 15667 | 2235 | 4386 | 36.6 | 36.5 | 36.8 |
ToxicConversationsClassification | mteb/toxic_conversations_50k | Collection of comments from the Civil Comments platform together with annotations if the comment is toxic or not. | Classification | s2s | 1 | 50000 | 0 | 50000 | 298.8 | 0 | 296.6 |
TweetSentimentExtractionClassification | mteb/tweet_sentiment_extraction | | Classification | s2s | 1 | 27481 | 0 | 3534 | 68.3 | 0 | 67.8 |
ArxivClusteringP2P | mteb/arxiv-clustering-p2p | Clustering of titles+abstract from arxiv. Clustering of 30 sets, either on the main or secondary category | Clustering | p2p | 1 | 0 | 0 | 732723 | 0 | 0 | 1009.9 |
ArxivClusteringS2S | mteb/arxiv-clustering-s2s | Clustering of titles from arxiv. Clustering of 30 sets, either on the main or secondary category | Clustering | s2s | 1 | 0 | 0 | 732723 | 0 | 0 | 74.0 |
BiorxivClusteringP2P | mteb/biorxiv-clustering-p2p | Clustering of titles+abstract from biorxiv. Clustering of 10 sets, based on the main category. | Clustering | p2p | 1 | 0 | 0 | 75000 | 0 | 0 | 1666.2 |
BiorxivClusteringS2S | mteb/biorxiv-clustering-s2s | Clustering of titles from biorxiv. Clustering of 10 sets, based on the main category. | Clustering | s2s | 1 | 0 | 0 | 75000 | 0 | 0 | 101.6 |
BlurbsClusteringP2P | slvnwhrl/blurbs-clustering-p2p | Clustering of book titles+blurbs. Clustering of 28 sets, either on the main or secondary genre | Clustering | p2p | 1 | 0 | 0 | 174637 | 0 | 0 | 664.09 |
BlurbsClusteringS2S | slvnwhrl/blurbs-clustering-s2s | Clustering of book titles. Clustering of 28 sets, either on the main or secondary genre. | Clustering | s2s | 1 | 0 | 0 | 174637 | 0 | 0 | 23.02 |
MedrxivClusteringP2P | mteb/medrxiv-clustering-p2p | Clustering of titles+abstract from medrxiv. Clustering of 10 sets, based on the main category. | Clustering | p2p | 1 | 0 | 0 | 37500 | 0 | 0 | 1981.2 |
MedrxivClusteringS2S | mteb/medrxiv-clustering-s2s | Clustering of titles from medrxiv. Clustering of 10 sets, based on the main category. | Clustering | s2s | 1 | 0 | 0 | 37500 | 0 | 0 | 114.7 |
RedditClustering | mteb/reddit-clustering | Clustering of titles from 199 subreddits. Clustering of 25 sets, each with 10-50 classes, and each class with 100 - 1000 sentences. | Clustering | s2s | 1 | 0 | 0 | 420464 | 0 | 0 | 64.7 |
RedditClusteringP2P | mteb/reddit-clustering-p2p | Clustering of title+posts from reddit. Clustering of 10 sets of 50k paragraphs and 40 sets of 10k paragraphs. | Clustering | p2p | 1 | 0 | 0 | 459399 | 0 | 0 | 727.7 |
StackExchangeClustering | mteb/stackexchange-clustering | Clustering of titles from 121 stackexchanges. Clustering of 25 sets, each with 10-50 classes, and each class with 100 - 1000 sentences. | Clustering | s2s | 1 | 0 | 417060 | 373850 | 0 | 56.8 | 57.0 |
StackExchangeClusteringP2P | mteb/stackexchange-clustering-p2p | Clustering of title+body from stackexchange. Clustering of 5 sets of 10k paragraphs and 5 sets of 5k paragraphs. | Clustering | p2p | 1 | 0 | 0 | 75000 | 0 | 0 | 1090.7 |
TenKGnadClusteringP2P | slvnwhrl/tenkgnad-clustering-p2p | Clustering of news article titles+subheadings+texts. Clustering of 10 splits on the news article category. | Clustering | p2p | 1 | 0 | 0 | 45914 | 0 | 0 | 2641.03 |
TenKGnadClusteringS2S | slvnwhrl/tenkgnad-clustering-s2s | Clustering of news article titles. Clustering of 10 splits on the news article category. | Clustering | s2s | 1 | 0 | 0 | 45914 | 0 | 0 | 50.96 |
TwentyNewsgroupsClustering | mteb/twentynewsgroups-clustering | Clustering of the 20 Newsgroups dataset (subject only). | Clustering | s2s | 1 | 0 | 0 | 59545 | 0 | 0 | 32.0 |
SprintDuplicateQuestions | mteb/sprintduplicatequestions-pairclassification | Duplicate questions from the Sprint community. | PairClassification | s2s | 1 | 0 | 101000 | 101000 | 0 | 65.2 | 67.9 |
TwitterSemEval2015 | mteb/twittersemeval2015-pairclassification | Paraphrase-Pairs of Tweets from the SemEval 2015 workshop. | PairClassification | s2s | 1 | 0 | 0 | 16777 | 0 | 0 | 38.3 |
TwitterURLCorpus | mteb/twitterurlcorpus-pairclassification | Paraphrase-Pairs of Tweets. | PairClassification | s2s | 1 | 0 | 0 | 51534 | 0 | 0 | 79.5 |
AskUbuntuDupQuestions | mteb/askubuntudupquestions-reranking | AskUbuntu Question Dataset - Questions from AskUbuntu with manual annotations marking pairs of questions as similar or non-similar | Reranking | s2s | 1 | 0 | 0 | 2255 | 0 | 0 | 52.5 |
MindSmallReranking | mteb/mind_small | Microsoft News Dataset: A Large-Scale English Dataset for News Recommendation Research | Reranking | s2s | 1 | 231530 | 0 | 107968 | 69.0 | 0 | 70.9 |
SciDocsRR | mteb/scidocs-reranking | Ranking of related scientific papers based on their title. | Reranking | s2s | 1 | 0 | 19594 | 19599 | 0 | 69.4 | 69.0 |
StackOverflowDupQuestions | mteb/stackoverflowdupquestions-reranking | Stack Overflow Duplicate Questions Task for questions with the tags Java, JavaScript and Python | Reranking | s2s | 1 | 23018 | 0 | 3467 | 49.6 | 0 | 49.8 |
ArguAna | BeIR/arguana | ArguAna: Retrieval of the Best Counterargument without Prior Topic Knowledge | Retrieval | p2p | 1 | 0 | 0 | 10080 | 0 | 0 | 1052.9 |
ClimateFEVER | BeIR/climate-fever | CLIMATE-FEVER is a dataset adopting the FEVER methodology that consists of 1,535 real-world claims regarding climate-change. | Retrieval | s2p | 1 | 0 | 0 | 5418128 | 0 | 0 | 539.1 |
CQADupstackAndroidRetrieval | BeIR/cqadupstack/android | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2p | 1 | 0 | 0 | 23697 | 0 | 0 | 578.7 |
CQADupstackEnglishRetrieval | BeIR/cqadupstack/english | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2p | 1 | 0 | 0 | 41791 | 0 | 0 | 467.1 |
CQADupstackGamingRetrieval | BeIR/cqadupstack/gaming | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2p | 1 | 0 | 0 | 46896 | 0 | 0 | 474.7 |
CQADupstackGisRetrieval | BeIR/cqadupstack/gis | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2p | 1 | 0 | 0 | 38522 | 0 | 0 | 991.1 |
CQADupstackMathematicaRetrieval | BeIR/cqadupstack/mathematica | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2p | 1 | 0 | 0 | 17509 | 0 | 0 | 1103.7 |
CQADupstackPhysicsRetrieval | BeIR/cqadupstack/physics | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2p | 1 | 0 | 0 | 39355 | 0 | 0 | 799.4 |
CQADupstackProgrammersRetrieval | BeIR/cqadupstack/programmers | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2p | 1 | 0 | 0 | 33052 | 0 | 0 | 1030.2 |
CQADupstackStatsRetrieval | BeIR/cqadupstack/stats | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2p | 1 | 0 | 0 | 42921 | 0 | 0 | 1041.0 |
CQADupstackTexRetrieval | BeIR/cqadupstack/tex | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2p | 1 | 0 | 0 | 71090 | 0 | 0 | 1246.9 |
CQADupstackUnixRetrieval | BeIR/cqadupstack/unix | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2p | 1 | 0 | 0 | 48454 | 0 | 0 | 984.7 |
CQADupstackWebmastersRetrieval | BeIR/cqadupstack/webmasters | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2p | 1 | 0 | 0 | 17911 | 0 | 0 | 689.8 |
CQADupstackWordpressRetrieval | BeIR/cqadupstack/wordpress | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2p | 1 | 0 | 0 | 49146 | 0 | 0 | 1111.9 |
DBPedia | BeIR/dbpedia-entity | DBpedia-Entity is a standard test collection for entity search over the DBpedia knowledge base | Retrieval | s2p | 1 | 0 | 4635989 | 4636322 | 0 | 310.2 | 310.1 |
FEVER | BeIR/fever | FEVER (Fact Extraction and VERification) consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. | Retrieval | s2p | 1 | 0 | 0 | 5423234 | 0 | 0 | 538.6 |
FiQA2018 | BeIR/fiqa | Financial Opinion Mining and Question Answering | Retrieval | s2p | 1 | 0 | 0 | 58286 | 0 | 0 | 760.4 |
HotpotQA | BeIR/hotpotqa | HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems. | Retrieval | s2p | 1 | 0 | 0 | 5240734 | 0 | 0 | 288.6 |
MSMARCO | BeIR/msmarco | MS MARCO is a collection of datasets focused on deep learning in search. Note that the dev set is used for the leaderboard. | Retrieval | s2p | 1 | 0 | 8848803 | 8841866 | 0 | 336.6 | 336.8 |
MSMARCOv2 | BeIR/msmarco-v2 | MS MARCO is a collection of datasets focused on deep learning in search | Retrieval | s2p | 1 | 138641342 | 138368101 | 0 | 341.4 | 342.0 | 0 |
NFCorpus | BeIR/nfcorpus | NFCorpus: A Full-Text Learning to Rank Dataset for Medical Information Retrieval | Retrieval | s2p | 1 | 0 | 0 | 3956 | 0 | 0 | 1462.7 |
NQ | BeIR/nq | Natural Questions: A Benchmark for Question Answering Research | Retrieval | s2p | 1 | 0 | 0 | 2684920 | 0 | 0 | 492.7 |
QuoraRetrieval | BeIR/quora | QuoraRetrieval is based on questions that are marked as duplicates on the Quora platform. Given a question, find other (duplicate) questions. | Retrieval | s2s | 1 | 0 | 0 | 532931 | 0 | 0 | 62.9 |
SCIDOCS | BeIR/scidocs | SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction, to document classification and recommendation. | Retrieval | s2p | 1 | 0 | 0 | 26657 | 0 | 0 | 1161.9 |
SciFact | BeIR/scifact | SciFact verifies scientific claims using evidence from the research literature containing scientific paper abstracts. | Retrieval | s2p | 1 | 0 | 0 | 5483 | 0 | 0 | 1422.3 |
Touche2020 | BeIR/webis-touche2020 | Touché Task 1: Argument Retrieval for Controversial Questions | Retrieval | s2p | 1 | 0 | 0 | 382594 | 0 | 0 | 1720.1 |
TRECCOVID | BeIR/trec-covid | TRECCOVID is an ad-hoc search challenge based on the CORD-19 dataset containing scientific articles related to the COVID-19 pandemic | Retrieval | s2p | 1 | 0 | 0 | 171382 | 0 | 0 | 1117.4 |
BIOSSES | mteb/biosses-sts | Biomedical Semantic Similarity Estimation. | STS | s2s | 1 | 0 | 0 | 200 | 0 | 0 | 156.6 |
SICK-R | mteb/sickr-sts | Semantic Textual Similarity SICK-R dataset. | STS | s2s | 1 | 0 | 0 | 19854 | 0 | 0 | 46.1 |
STS12 | mteb/sts12-sts | SemEval STS 2012 dataset. | STS | s2s | 1 | 4468 | 0 | 6216 | 100.7 | 0 | 64.7 |
STS13 | mteb/sts13-sts | SemEval STS 2013 dataset. | STS | s2s | 1 | 0 | 0 | 3000 | 0 | 0 | 54.0 |
STS14 | mteb/sts14-sts | SemEval STS 2014 dataset. Currently only the English dataset | STS | s2s | 1 | 0 | 0 | 7500 | 0 | 0 | 54.3 |
STS15 | mteb/sts15-sts | SemEval STS 2015 dataset | STS | s2s | 1 | 0 | 0 | 6000 | 0 | 0 | 57.7 |
STS16 | mteb/sts16-sts | SemEval STS 2016 dataset | STS | s2s | 1 | 0 | 0 | 2372 | 0 | 0 | 65.3 |
STS17 | mteb/sts17-crosslingual-sts | STS 2017 dataset | STS | s2s | 11 | 0 | 0 | 500 | 0 | 0 | 43.3 |
STS22 | mteb/sts22-crosslingual-sts | SemEval 2022 Task 8: Multilingual News Article Similarity | STS | s2s | 18 | 0 | 0 | 8060 | 0 | 0 | 1992.8 |
STSBenchmark | mteb/stsbenchmark-sts | Semantic Textual Similarity Benchmark (STSbenchmark) dataset. | STS | s2s | 1 | 11498 | 3000 | 2758 | 57.6 | 64.0 | 53.6 |
SummEval | mteb/summeval | News Article Summary Semantic Similarity Estimation. | Summarization | s2s | 1 | 0 | 0 | 2800 | 0 | 0 | 359.8 |
If you find MTEB useful, feel free to cite our publication MTEB: Massive Text Embedding Benchmark:
@article{muennighoff2022mteb,
doi = {10.48550/ARXIV.2210.07316},
url = {https://arxiv.org/abs/2210.07316},
author = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo{\"\i}c and Reimers, Nils},
title = {MTEB: Massive Text Embedding Benchmark},
publisher = {arXiv},
journal={arXiv preprint arXiv:2210.07316},
year = {2022}
}