Defining zero-shot for MTEB #1760
How about a "thought experiment":
Assume, for the sake of the thought experiment, that inference cost, the storage cost of the embeddings, and the complexity of the retrieval implementation are the same. Which model do you want to use?

My primary point is that if the goal is for MTEB to be more useful for researchers trying to determine which training techniques are more sample-efficient and produce models that generalize better, then this zero-shot filter would be helpful. But for businesses trying to decide which model to put into production, the waters are permanently muddied, because competition in model training has pivoted to scaling up self-supervised and supervised learning on the largest, cleanest datasets possible.

I would note, however, that Voyage just benchmarked a model on MTEB that they don't recommend customers use. Their message is essentially: Voyage 3 Large is what you should use, because you'll get better performance with it, but we took a checkpoint from that process, trained it on the MTEB training sets, and benchmarked it here so you can see that if we wanted to play the game of maximizing MTEB scores, we could win it. I read that move as: if you want the best model for the tasks we see customers using our models for (like code search), MTEB score isn't the best way to pick a model. The issue isn't just that MTEB has been overfit to, but also that the benchmark no longer accurately reflects performance across the wide variety of customer use cases.

If the goal is a better benchmark for both researchers and businesses, I think the benchmark needs to keep growing to cover more real business use cases, as well as to include this kind of zero-shot filter for identifying good training techniques when fine-tuning, etc., though not necessarily for identifying the model you should use off the shelf.
I would say that businesses and researchers are both interested in finding out how well a model performs in an unbiased way. However, people will still be able to remove the zero-shot filter (and in fact we will probably keep models whose training data is unknown on the leaderboard by default, albeit with a warning). There will be both a looser "allow everything" option and a stricter "zero-shot with known data" option (probably more research-focused).

For instance, as it is now you might have a model that performs well on MTEB (NV-Embed-v2, to take an example); however, since it is trained on the training splits of MTEB's tasks, it overperforms relative to its actual generalization. A business might then choose it instead of an alternative that would generalize better to the business case. That said, we do not wish to remove models from the leaderboard; to remedy this we have, e.g., created an updated benchmark.
Our definition as it stands allows you to pre-train on, e.g., Wikipedia data without this being considered leakage into MTEB. Even for open-source models I believe this is quite hard to control for (and controlling for it would remove everything from the leaderboard, which no one is interested in).
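To make the distinction concrete, here is a minimal sketch of how such a filter could work: compare a model's declared training datasets against a benchmark's task datasets while excluding general pre-training corpora such as Wikipedia. This is hypothetical and not the actual mteb implementation; the function name, dataset identifiers, and set-based approach are assumptions for illustration only.

```python
# Hypothetical sketch of a zero-shot filter (not the real mteb API).
# General-purpose pre-training corpora that are not counted as leakage.
PRETRAINING_CORPORA = {"wikipedia", "c4", "common-crawl"}


def is_zero_shot(model_training_datasets, benchmark_task_datasets):
    """Return True if no benchmark dataset appears in the model's training data,
    False if there is overlap, and None if the training data is unknown
    (kept on the leaderboard by default, but shown with a warning)."""
    if model_training_datasets is None:
        return None  # unknown training data: only the looser "allow everything" view keeps it
    relevant = set(model_training_datasets) - PRETRAINING_CORPORA
    return relevant.isdisjoint(benchmark_task_datasets)


# Example: fine-tuning on MS MARCO breaks zero-shot status for a benchmark
# that evaluates on MS MARCO, but Wikipedia pre-training does not.
print(is_zero_shot({"wikipedia", "msmarco"}, {"msmarco", "nfcorpus"}))  # False
print(is_zero_shot({"wikipedia"}, {"msmarco", "nfcorpus"}))             # True
print(is_zero_shot(None, {"msmarco"}))                                  # None (unknown)
```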
The hope is that you won't need to play a score-maximizing game (I imagine you have better ways to spend your time).
Completely agree. I am actually hoping that some companies would be interested in volunteering data for a closed MTEB evaluation benchmark covering real-world use cases (similar to what we have seen with ARC-AGI).
The next version of the MTEB leaderboard will soon be released, and with it a new zero-shot filter. We are currently planning to use the following definition of zero-shot; this issue opens that discussion up to the community to ensure that we consider all relevant views.