
Defining zero-shot for MTEB #1760

Open
KennethEnevoldsen opened this issue Jan 11, 2025 · 2 comments
Labels: leaderboard (issues related to the leaderboard)

Comments

@KennethEnevoldsen
Contributor

The next version of the MTEB leaderboard will be released soon, and with it a new zero-shot filter. We are currently planning to use the following definition of zero-shot. This issue opens that discussion up to the community to ensure that we consider all relevant views.

Zero Shot
A model is considered zero-shot if it is not trained on other splits of the dataset used to derive the task.
E.g., if a model is trained on Natural Questions, it cannot be considered zero-shot on benchmarks containing the task “NQ”, which is derived from Natural Questions.
This definition creates a few edge cases. For instance, many models are trained on Wikipedia title-body pairs, but we do not define this as leakage on, e.g., “WikipediaRetrievalMultilingual” and “WikiClusteringP2P”, as these datasets are not based on title-body pairs.
Models that are distilled, further fine-tuned, or otherwise derived from other models inherit the training datasets of their parent models.
This definition could change in the future based on community feedback and research findings.
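
For concreteness, here is a minimal sketch of how such a check could be implemented. This is purely illustrative: the metadata fields (`training_datasets`, `derived_from`) and the dataset identifiers are assumptions for this example, not the actual mteb model metadata or leaderboard code.

```python
# Illustrative sketch of the zero-shot check described above. Field names such
# as "training_datasets" and "derived_from" are assumptions for this example,
# not the actual mteb model metadata.

def collect_training_datasets(model: dict, all_models: dict[str, dict]) -> set[str]:
    """Union a model's own training datasets with those inherited from its
    parent models, since distilled/fine-tuned models inherit their parents' data."""
    datasets = set(model.get("training_datasets", []))
    for parent in model.get("derived_from", []):
        datasets |= collect_training_datasets(all_models[parent], all_models)
    return datasets

def is_zero_shot(model: dict, task_source_dataset: str, all_models: dict[str, dict]) -> bool:
    """A model is zero-shot on a task if the dataset the task was derived from
    is not among its (inherited) training datasets."""
    return task_source_dataset not in collect_training_datasets(model, all_models)

# Example: a fine-tune of a model trained on Natural Questions is not
# zero-shot on a task derived from Natural Questions (e.g. "NQ"), but it is
# zero-shot on tasks derived from datasets it has never seen.
models = {
    "base-model": {"training_datasets": ["natural-questions", "ms-marco"]},
    "my-finetune": {"training_datasets": ["in-house-pairs"], "derived_from": ["base-model"]},
}
print(is_zero_shot(models["my-finetune"], "natural-questions", models))    # False
print(is_zero_shot(models["my-finetune"], "wikipedia-title-body", models))  # True
```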

@williambarberjr

williambarberjr commented Jan 11, 2025

How about a "thought experiment":
I'm an enterprise picking which embedding model to use. I have two options:

  1. A retrieval model trained by an AI lab with billions of dollars in funding. It has reportedly been trained on every piece of human text ever written, likely including all the text in the MTEB training sets, but who can say for sure? It sits at the top of the MTEB leaderboard, and reports from other enterprises using this model are that it works quite well for their use cases.
  2. A late-interaction model trained exclusively on MS MARCO that generalizes far more effectively to MTEB data unseen during training than the other models I can isolate with a zero-shot filter. This model represents a major advance in cutting-edge techniques for sample efficiency, better architecture, etc., as demonstrated by its performance against other models trained only on MS MARCO and then tested on MTEB.

Assume the inference cost, storage costs of the embeddings, and complexity of the retrieval implementations are the same for the sake of the thought experiment. Which model do you want to use?

My primary point is this: if the goal is for MTEB to be more useful for researchers trying to determine which training techniques are more sample-efficient and produce models that generalize better, then this zero-shot filter would be helpful. But for businesses trying to decide which model to put in production, the waters are permanently muddied now that competition in model training has pivoted to scaling up self-supervised and supervised learning on the largest, cleanest datasets possible.

I would note, however, that Voyage just benchmarked a model on MTEB that they don't recommend customers use. In effect they are saying: Voyage 3 Large is what you should use because you'll get better performance with it; but we took a checkpoint from that process, trained it on the MTEB training sets, and benchmarked it here, so you can see that if we were interested in playing the game of maximizing MTEB scores, we could win it.

I read that move as saying: ultimately, if you want the best model for the tasks we're seeing customers use our models for (like code search), MTEB score isn't the best way to pick a model. The issue isn't just that MTEB has been overfit to, but also that the benchmark no longer accurately reflects performance on the wide variety of customer use cases.

If the goal is a better benchmark for both researchers and businesses, I think we need the benchmark to continue to grow to cover more real business use cases, as well as to include this kind of zero-shot filter, which is useful for identifying good training techniques (when fine-tuning, etc.) but not necessarily for picking the model you should use off the shelf...

@KennethEnevoldsen added the leaderboard label on Jan 11, 2025
@KennethEnevoldsen
Contributor Author

I would say that businesses and researchers are both interested in finding out how well a model performs in an unbiased way. However, people will still be able to remove the zero-shot filter (and in fact we will probably keep models whose training data is unknown in the filtered view by default, albeit with a warning).

There will be both a looser "allow everything" option and a stricter "zero-shot with known data" option (the latter probably more research-focused).
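
As a rough, purely illustrative sketch of how these options (and the default handling of unknown training data) could behave; the mode names and the convention that `training_datasets = None` means "unknown" are assumptions, not the actual leaderboard implementation:

```python
# Illustrative sketch of the leaderboard filter options described above.
# Mode names and the "None means unknown training data" convention are
# assumptions for this example, not the actual MTEB leaderboard code.

def filter_leaderboard(models: list[dict], benchmark_datasets: set[str], mode: str = "zero_shot") -> list[dict]:
    shown = []
    for model in models:
        trained_on = model.get("training_datasets")  # None = unknown
        if mode == "allow_everything":
            # Looser option: show every model.
            shown.append(model)
        elif mode == "zero_shot":
            # Default: unknown training data is kept, but flagged with a warning.
            if trained_on is None:
                shown.append({**model, "warning": "training data unknown"})
            elif not set(trained_on) & benchmark_datasets:
                shown.append(model)
        elif mode == "zero_shot_known_data":
            # Stricter, research-focused option: require known, non-overlapping data.
            if trained_on is not None and not set(trained_on) & benchmark_datasets:
                shown.append(model)
    return shown
```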

For instance, as it is now, you might have a model that performs well on MTEB (NV-embed-v2, to take an example); however, since it is trained on the training splits of the datasets underlying MTEB's tasks, it overperforms relative to its actual generalization. A business might then choose it over an alternative that would generalize better to the business's use case.

However, we do not wish to remove models from the leaderboard. To remedy this we have, e.g., created an updated benchmark (mteb(eng, beta)) where MS MARCO and NQ are not included, as it is common to train on these.

> competition in model training has pivoted to scaling up self-supervised and supervised learning on the largest, cleanest datasets possible

Our definition as it stands allows you to pre-train on, e.g., Wikipedia data without this being considered leakage into MTEB. Even for open-source models I believe this is quite hard to control for (and controlling for it would remove everything from the leaderboard; no one is interested in that).

> if we were interested in playing the game of maximizing MTEB scores, we could win it

The hope is that you don't need to play a maximizing game (I imagine you have better ways to spend your time).

> I think we need the benchmark to continue to grow to cover more real business use cases

Completely agree. I am actually hoping that some companies would be interested in volunteering data for a closed MTEB evaluation benchmark of real-world use cases (similar to what we have seen with ARC-AGI).
