MTEB: Massive Text Embedding Benchmark #627

irthomasthomas · 2024-02-27T19:07:43Z

blog/mteb.md at main · huggingface/blog

Title: blog/mteb.md at main · huggingface/blog

Description:
"---
title: "MTEB: Massive Text Embedding Benchmark"
thumbnail: /blog/assets/110_mteb/thumbnail.png
authors:

user: Muennighoff

MTEB: Massive Text Embedding Benchmark

MTEB is a massive benchmark for measuring the performance of text embedding models on diverse embedding tasks.

The 🥇 leaderboard provides a holistic view of the best text embedding models out there on a variety of tasks.

The 📝 paper gives background on the tasks and datasets in MTEB and analyzes leaderboard results!

The 💻 Github repo contains the code for benchmarking and submitting any model of your choice to the leaderboard.

Why Text Embeddings?

Text Embeddings are vector representations of text that encode semantic information. As machines require numerical inputs to perform computations, text embeddings are a crucial component of many downstream NLP applications. For example, Google uses text embeddings to power their search engine. Text Embeddings can also be used for finding patterns in large amount of text via clustering or as inputs to text classification models, such as in our recent SetFit work. The quality of text embeddings, however, is highly dependent on the embedding model used. MTEB is designed to help you find the best embedding model out there for a variety of tasks!

MTEB

🐋 Massive: MTEB includes 56 datasets across 8 tasks and currently summarizes >2000 results on the leaderboard.

🌎 Multilingual: MTEB contains up to 112 different languages! We have benchmarked several multilingual models on Bitext Mining, Classification, and STS.

🦚 Extensible: Be it new tasks, datasets, metrics, or leaderboard additions, any contribution is very welcome. Check out the GitHub repository to submit to the leaderboard or solve open issues. We hope you join us on the journey of finding the best text embedding model!

Overview of tasks and datasets in MTEB. Multilingual datasets are marked with a purple shade.

Models

For the initial benchmarking of MTEB, we focused on models claiming state-of-the-art results and popular models on the Hub. This led to a high representation of transformers. 🤖

Models by average English MTEB score (y) vs speed (x) vs embedding size (circle size).

We grouped models into the following three attributes to simplify finding the best model for your task:

🏎 Maximum speed Models like Glove offer high speed, but suffer from a lack of context awareness resulting in low average MTEB scores.

⚖️ Speed and performance Slightly slower, but significantly stronger, all-mpnet-base-v2 or all-MiniLM-L6-v2 provide a good balance between speed and performance.

💪 Maximum performance Multi-billion parameter models like ST5-XXL, GTR-XXL or SGPT-5.8B-msmarco dominate on MTEB. They tend to also produce bigger embeddings like SGPT-5.8B-msmarco which produces 4096 dimensional embeddings requiring more storage!

Model performance varies a lot depending on the task and dataset, so we recommend checking the various tabs of the leaderboard before deciding which model to use!

Benchmark your model

Using the MTEB library, you can benchmark any model that produces embeddings and add its results to the public leaderboard. Let's run through a quick example!

First, install the library:

pip install mteb

Next, benchmark a model on a dataset, for example komninos word embeddings on Banking77.

from mteb import MTEB
from sentence_transformers import SentenceTransformer

model_name = "average_word_embeddings_komninos"
model = SentenceTransformer(model_name)

evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder=f"results/{model_name}")

This should produce a results/average_word_embeddings_komninos/Banking77Classification.json file!

Now you can submit the results to the leaderboard by adding it to the metadata of the README.md of any model on the Hub.

Run our automatic script to generate the metadata:

python mteb_meta.py results/average_word_embeddings_komninos

The script will produce a mteb_metadata.md file that looks like this:

---
tags:
- mteb
model-index:
- name: average_word_embeddings_komninos
  results:
  - task:
      type: Classification
    dataset:
      type: mteb/banking77
      name: MTEB Banking77Classification
      config: default
      split: test
      revision: 0fd18e25b25c072e09e0d92ab615fda904d66300
    metrics:
    - type: accuracy
      value: 66.76623376623377
    - type: f1
      value: 66.59096432882667
---

Now add the metadata to the top of a README.md of any model on the Hub, like this SGPT-5.8B-msmarco model, and it will show up on the leaderboard after refreshing!

Next steps

Go out there and benchmark any model you like! Let us know if you have questions or feedback by opening an issue on our GitHub repo or the leaderboard community tab 🤗

Happy embedding!
"

URL: blog/mteb.md at main · huggingface/blog

Suggested labels

The text was updated successfully, but these errors were encountered:

irthomasthomas · 2024-02-27T19:07:45Z

Related issues

#455: Embeddings - OpenAI API

### Details

Similarity score: 0.86 - [ ] [Embeddings - OpenAI API](https://platform.openai.com/docs/guides/embeddings)

Embeddings

Learn how to turn text into numbers, unlocking use cases like search.

New embedding models

Our newest and most performant embedding models, text-embedding-3-small and text-embedding-3-large, are now available. These models offer lower costs, higher multilingual performance, and new parameters to control the overall size.

Suggested labels

null

#552: LargeWorldModel/LWM-Text-Chat-1M · Hugging Face

### Details

Similarity score: 0.86 - [ ] [LargeWorldModel/LWM-Text-Chat-1M · Hugging Face](https://huggingface.co/LargeWorldModel/LWM-Text-Chat-1M)

LargeWorldModel/LWM-Text-Chat-1M · Hugging Face

DESCRIPTION:

LWM-Text-1M-Chat Model Card

Model details

Model type: LWM-Text-1M-Chat is an open-source model trained from LLaMA-2 on a subset of Books3 filtered data. It is an auto-regressive language model, based on the transformer architecture.

Model date: LWM-Text-1M-Chat was trained in December 2023.

Paper or resources for more information: https://largeworldmodel.github.io/

URL: https://huggingface.co/LargeWorldModel/LWM-Text-Chat-1M

Suggested labels

{'label-name': 'Open-source Models', 'label-description': 'Models that are publicly available and open-source for usage and exploration.', 'gh-repo': 'huggingfaceco/LargeWorldModel/LWM-Text-Chat-1M', 'confidence': 56.11}

#625: unsloth/README.md at main · unslothai/unsloth

### Details

Similarity score: 0.86 - [ ] [unsloth/README.md at main · unslothai/unsloth](https://github.com/unslothai/unsloth/blob/main/README.md?plain=1)

unsloth/README.md at main · unslothai/unsloth

Finetune Mistral, Gemma, Llama 2-5x faster with 70% less memory!

✨ Finetune for Free

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

Unsloth supports	Free Notebooks	Performance	Memory use
Gemma 7b	▶️ Start on Colab	2.4x faster	58% less
Mistral 7b	▶️ Start on Colab	2.2x faster	62% less
Llama-2 7b	▶️ Start on Colab	2.2x faster	43% less
TinyLlama	▶️ Start on Colab	3.9x faster	74% less
CodeLlama 34b A100	▶️ Start on Colab	1.9x faster	27% less
Mistral 7b 1xT4	▶️ Start on Kaggle	5x faster*	62% less
DPO - Zephyr	▶️ Start on Colab	1.9x faster	19% less

This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

🦥 Unsloth.ai News

📣 Gemma 7b on 6T tokens now works. And Gemma 2b notebook
📣 Added conversational notebooks and raw text notebooks
📣 2x faster inference added for all our models
📣 DPO support is now included. More info on DPO
📣 We did a blog with 🤗Hugging Face and are in their official docs! Check out the SFT docs and DPO docs
📣 Download models 4x faster from 🤗Hugging Face. Eg: unsloth/mistral-7b-bnb-4bit

🔗 Links and Resources

Type	Links
📚 Wiki & FAQ	Read Our Wiki
📜 Documentation	Read The Doc
💾 Installation	unsloth/README.md
Twitter (aka X)	Follow us on X
🥇 Benchmarking	Performance Tables
🌐 Released Models	Unsloth Releases
✍️ Blog	Read our Blogs

⭐ Key Features

All kernels written in OpenAI's Triton language. Manual backprop engine.
0% loss in accuracy - no approximation methods - all exact.
No change of hardware. Supports NVIDIA GPUs since 2018+. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40 etc) Check your GPU! GTX 1070, 1080 works, but is slow.
Works on Linux and Windows via WSL.
Supports 4bit and 16bit QLoRA / LoRA finetuning via bitsandbytes.
Open source trains 5x faster - see Unsloth Pro for 30x faster training!
If you trained a model with 🦥Unsloth, you can use this cool sticker!

🥇 Performance Benchmarking

For the full list of reproducable benchmarking tables, go to our website

1 A100 40GB	🤗Hugging Face	Flash Attention	🦥Unsloth Open Source	🦥Unsloth Pro
Alpaca	1x	1.04x	1.98x	15.64x
LAION Chip2	1x	0.92x	1.61x	20.73x
OASST	1x	1.19x	2.17x	14.83x
Slim Orca	1x	1.18x	2.22x	14.82x

Benchmarking table below was conducted by 🤗Hugging Face.

Free Colab T4	Dataset	🤗Hugging Face	Pytorch 2.1.1	🦥Unsloth	🦥 VRAM reduction
Llama-2 7b	OASST	1x	1.19x	1.95x	-43.3%
Mistral 7b	Alpaca	1x	1.07x	1.56x	-13.7%
Tiny Llama 1.1b	Alpaca	1x	2.06x	3.87x	-73.8%
DPO with Zephyr	Ultra Chat	1x	1.09x	1.55x	-18.6%

View on GitHub

Suggested labels

#386: SciPhi/AgentSearch-V1 · Datasets at Hugging Face

### Details

Similarity score: 0.85 - [ ] [SciPhi/AgentSearch-V1 · Datasets at Hugging Face](https://huggingface.co/datasets/SciPhi/AgentSearch-V1)

Getting Started

The AgentSearch-V1 dataset is a comprehensive collection of over one billion embeddings, produced using jina-v2-base. It includes more than 50 million high-quality documents and over 1 billion passages, covering a vast range of content from sources such as Arxiv, Wikipedia, Project Gutenberg, and includes carefully filtered Creative Commons (CC) data. Our team is dedicated to continuously expanding and enhancing this corpus to improve the search experience. We welcome your thoughts and suggestions – please feel free to reach out with your ideas!

To access and utilize the AgentSearch-V1 dataset, you can stream it via HuggingFace with the following Python code:

from datasets import load_dataset
import json
import numpy as np

# To stream the entire dataset:
ds = load_dataset("SciPhi/AgentSearch-V1", data_files="**/*", split="train", streaming=True)

# Optional, stream just the "arxiv" dataset
# ds = load_dataset("SciPhi/AgentSearch-V1", data_files="**/*", split="train", data_files="arxiv/*", streaming=True)

# To process the entries:
for entry in ds:
    embeddings = np.frombuffer(
        entry['embeddings'], dtype=np.float32
    ).reshape(-1, 768)
    text_chunks = json.loads(entry['text_chunks'])
    metadata = json.loads(entry['metadata'])
    print(f'Embeddings:\n{embeddings}\n\nChunks:\n{text_chunks}\n\nMetadata:\n{metadata}')
    break

A full set of scripts to recreate the dataset from scratch can be found here. Further, you may check the docs for details on how to perform RAG over AgentSearch.

Languages

English.

Dataset Structure

The raw dataset structure is as follows:

{
    "url": ...,
    "title": ...,
    "metadata": {"url": "...", "timestamp": "...", "source": "...", "language": "..."},
    "text_chunks": ...,
    "embeddings": ...,
    "dataset": "book" | "arxiv" | "wikipedia" | "stack-exchange" | "open-math" | "RedPajama-Data-V2"
}

Dataset Creation

This dataset was created as a step towards making humanities most important knowledge openly searchable and LLM optimal. It was created by filtering, cleaning, and augmenting locally publicly available datasets.

To cite our work, please use the following:

@software{SciPhi2023AgentSearch,
author = {SciPhi},
title = {AgentSearch [ΨΦ]: A Comprehensive Agent-First Framework and Dataset for Webscale Search},
year = {2023},
url = {https://github.com/SciPhi-AI/agent-search}
}

Source Data

@online{wikidump,
author = "Wikimedia Foundation",
title = "Wikimedia Downloads",
url = "https://dumps.wikimedia.org"
}

@misc{paster2023openwebmath,
title={OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text},
author={Keiran Paster and Marco Dos Santos and Zhangir Azerbayev and Jimmy Ba},
year={2023},
eprint={2310.06786},
archivePrefix={arXiv},
primaryClass={cs.AI}
}

@software{together2023redpajama,
author = {Together Computer},
title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
month = April,
year = 2023,
url = {https://github.com/togethercomputer/RedPajama-Data}
}

License

Please refer to the licenses of the data subsets you use.

Open-Web (Common Crawl Foundation Terms of Use)
Books: the_pile_books3 license and pg19 license
ArXiv Terms of Use
Wikipedia License
StackExchange license on the Internet Archive

Suggested labels

{ "key": "knowledge-dataset", "value": "A dataset with one billion embeddings from various sources, such as Arxiv, Wikipedia, Project Gutenberg, and carefully filtered Creative Commons data" }

#626: classifiers/README.md at main · blockentropy/classifiers

### Details

Similarity score: 0.85 - [ ] [classifiers/README.md at main · blockentropy/classifiers](https://github.com/blockentropy/classifiers/blob/main/README.md?plain=1)

classifiers/README.md

Fast Classifiers for Prompt Routing

Routing and controlling the information flow is a core component in optimizing machine learning tasks. While some architectures focus on internal routing of data within a model, we focus on the external routing of data between models. This enables the combination of open source, proprietary, API based, and software based approaches to work together behind a smart router. We investigate three different ways of externally routing the prompt - cosine similarity via embeddings, zero-shot classification, and small classifiers.

Implementation of Fast Classifiers

The code-class.ipynb Jupyter notebook walks through the process of creating a fast prompt classifier for smart routing. For the fast classifiers, we utilize the model DistilBERT, a smaller language representation model designed for efficient on-the-edge operation and training under computational constraints. DistilBERT is not only less costly to pre-train but also well-suited for on-device computations, as demonstrated through experiments and comparative studies.

We quantize the model using Optimum, enabling the model to run extremely fast on a CPU router. Each classifier takes 5-8ms to run. An ensemble of 8 prompt classifiers takes about 50ms in total. Thus, each endpoint can route about 20 requests per second.

In the example code-class, we are deciding between prompts of code and not code prompts. The two datasets used are the 52K instruction-following data generated by GPT-4 with prompts in Alpaca. And the 20K instruction-following data used for fine-tuning the Code Alpaca model.

Train test split of 80/20 yields an accuracy of 95.49% and f1 score of 0.9227.

Comparison vs other Routing methods

The most popular alternative to routing is via embedding similarity. For example, if one were to try to route a programming question, one might set up the set of target classes as ["coding", "not coding"]. Each one of these strings is then transformed into an embedding and compared against a prompt query like, "write a bubble sort in python". Given the computed pair-wise cosine similarity between the query and class, we can then label the prompt as a coding question and route the prompt to a coding-specific model. These do not scale well with larger numbers of embeddings. Nor are they able to capture non-semantic type classes (like is the response likely to be more or less than 200 tokens). However, they are adaptable and comparably fast and thus provide a good alternative to the trained fast classifiers.

Quantifying different methods of routing in terms of execution time. As the prompt size increases, the query time also increases as shown in (a). There is also a close to linear increase in the time as the number of classes increase as shown in (b). However, the small classifiers do not increase in time as the class examples increase in the number of tokens (c). This is due to the upfront cost of training the binary classifier, reducing cost at inference.

Reproducibility

The timing_tests.js and complexity.js files can be used for reproducibility. Note that only the code classifier is currently available in this repo. One will need to install the appropriate models from the Transformers.js repo.

View on GitHub

Suggested labels

{'label-name': 'Prompt-Routing', 'label-description': 'Focuses on external routing of data between models to optimize machine learning tasks.', 'confidence': 50.24}

#494: Awesome-Efficient-LLM: A curated list for Efficient Large Language Models

### Details

Similarity score: 0.84 - [ ] [horseee/Awesome-Efficient-LLM: A curated list for Efficient Large Language Models](https://github.com/horseee/Awesome-Efficient-LLM#inference-acceleration)

Awesome-Efficient-LLM

A curated list for Efficient Large Language Models:

Knowledge Distillation
Network Pruning
Quantization
Inference Acceleration
Efficient MOE
Text Compression
Low-Rank Decomposition
Hardware/System Tuning
Survey
Leaderboard
🚀 Updates
Contributing

Inference Acceleration

…
Add your paper here, generate the required format, and submit a pull request.

Updates

Sep 27, 2023: Add tag for papers accepted at NeurIPS'23.
Sep 6, 2023: Add a new subdirectory project/ to organize those projects designed for developing a lightweight LLM.
July 11, 2023: Create a new subdirectory efficient_plm/ for papers applicable to PLMs (such as BERT, BART) but have yet to be verified for their effectiveness on LLMs.

Contributing

If you'd like to include your paper or need to update any details, please feel free to submit a pull request. You can generate the required markdown format for each paper by filling in the information in generate_item.py and execute python generate_item.py. We warmly appreciate your contributions to this list. Alternatively, you can email me with the links to your paper and code, and I would add your paper to the list at my earliest convenience.

URL: https://github.com/horseee/Awesome-Efficient-LLM#inference-acceleration

Suggested labels

{ "label-name": "efficient-llm-acceleration", "description": "Inference acceleration techniques for efficient large language models.", "repo": "horseee/Awesome-Efficient-LLM", "confidence": 70.8 }

irthomasthomas added dataset public datasets and embeddings embeddings vector embeddings and related tools github gh tools like cli, Actions, Issues, Pages Models LLM and ML model repos and links Papers Research papers RAG Retrieval Augmented Generation for LLMs labels Feb 27, 2024

irthomasthomas mentioned this issue Mar 6, 2024

TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering #704

Open

1 task

irthomasthomas changed the title ~~blog/mteb.md at main · huggingface/blog~~ MTEB: Massive Text Embedding Benchmark Mar 6, 2024

This was referenced Mar 16, 2024

Transformer Models - a brief guide by Cohere #728

Open

sentence-transformers/README.md at master · liuyukid/sentence-transformers #751

Open

SWE-bench: Can Language Models Resolve Real-World GitHub Issues? #758

Open

ShellLM mentioned this issue Jun 19, 2024

FlagEmbedding: RAG LLMs, Long-Context, Embedding, Reranking models (BGE Reranker.) #838

Open

1 task

MTEB: Massive Text Embedding Benchmark #627

MTEB: Massive Text Embedding Benchmark #627

Comments

irthomasthomas commented Feb 27, 2024

Title: blog/mteb.md at main · huggingface/blog

MTEB: Massive Text Embedding Benchmark

Why Text Embeddings?

MTEB

Models

Benchmark your model

Next steps

Suggested labels

irthomasthomas commented Feb 27, 2024

Related issues

#455: Embeddings - OpenAI API

Embeddings

New embedding models

Suggested labels

null

#552: LargeWorldModel/LWM-Text-Chat-1M · Hugging Face

LargeWorldModel/LWM-Text-Chat-1M · Hugging Face

Suggested labels

{'label-name': 'Open-source Models', 'label-description': 'Models that are publicly available and open-source for usage and exploration.', 'gh-repo': 'huggingfaceco/LargeWorldModel/LWM-Text-Chat-1M', 'confidence': 56.11}

#625: unsloth/README.md at main · unslothai/unsloth

unsloth/README.md at main · unslothai/unsloth

Finetune Mistral, Gemma, Llama 2-5x faster with 70% less memory!

✨ Finetune for Free

🦥 Unsloth.ai News

🔗 Links and Resources

⭐ Key Features

🥇 Performance Benchmarking

Suggested labels

#386: SciPhi/AgentSearch-V1 · Datasets at Hugging Face

Getting Started

Languages

Dataset Structure

Dataset Creation

Source Data

License

Suggested labels

{ "key": "knowledge-dataset", "value": "A dataset with one billion embeddings from various sources, such as Arxiv, Wikipedia, Project Gutenberg, and carefully filtered Creative Commons data" }

#626: classifiers/README.md at main · blockentropy/classifiers

classifiers/README.md

Fast Classifiers for Prompt Routing

Implementation of Fast Classifiers

Comparison vs other Routing methods

Reproducibility

Suggested labels

{'label-name': 'Prompt-Routing', 'label-description': 'Focuses on external routing of data between models to optimize machine learning tasks.', 'confidence': 50.24}

#494: Awesome-Efficient-LLM: A curated list for Efficient Large Language Models

Awesome-Efficient-LLM

Inference Acceleration

Updates

Contributing

Suggested labels

{ "label-name": "efficient-llm-acceleration", "description": "Inference acceleration techniques for efficient large language models.", "repo": "horseee/Awesome-Efficient-LLM", "confidence": 70.8 }