Replies: 10 comments 28 replies
-
Wow that was awesome and so fast! Thanks team!
-
Just moved this to a discussion. Another discussion on this was in #1953. We have had this up before, but I don't think it is settled, so it is fine that we raise it again. @Muennighoff has also raised this concern, and as far as I know, @x-tabdeveloping is on the side that we should be more restrictive (not allow unknown by default). @orion, I believe, is quite close to where we are now, or potentially a bit closer to @x-tabdeveloping (do correct me if I am wrong). Just to outline the pros and cons for each point and argue for why the default is as it currently is:
current solution
-
Well, to be fair, I can see why model makers like @jxmorris12 would think that the default option is not good, and we've talked about this before, but essentially, the way it's currently set up, our system penalizes honesty, which really shouldn't be the case. In fact, not knowing what the training data is is the worst scenario for the user, since they have no idea where the good/bad scores come from. I'm still of the opinion that the hard zero-shot filter should be the default. That way we:
Especially since we've now made a bunch more new benchmarks, we should really avoid a bunch of people submitting models that will show up on the leaderboard but have an unknown training-data status. I also don't think allowing all is a good option, because then, to be at the top of the leaderboard, model makers will have to keep training on the benchmark datasets, and the zero-shot flag will be some small thing some people might care about, but it won't incentivise the behaviour which we ultimately want (model makers training on other things, so that our benchmark can measure how well their models generalize).
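For concreteness, the three behaviours being debated (a hard zero-shot filter, the current "Allow unknown" default, and "Allow all") roughly amount to a predicate over a per-model zero-shot annotation. The sketch below is purely illustrative and not the leaderboard's actual implementation; the `ZeroShotStatus` enum and `filter_models` helper are hypothetical names.

```python
from enum import Enum


class ZeroShotStatus(Enum):
    """Hypothetical three-way annotation discussed in this thread."""
    ZERO_SHOT = "zero_shot"   # no benchmark task appears in the training data
    IN_DOMAIN = "in_domain"   # at least one benchmark task is known to be in the training data
    UNKNOWN = "unknown"       # training data is not documented


def filter_models(models: dict[str, ZeroShotStatus], mode: str) -> list[str]:
    """Return the model names that would remain visible under a given filter mode.

    mode is one of:
      "zero_shot_only" -- the hard filter argued for above
      "allow_unknown"  -- the current default: hides only confirmed in-domain models
      "allow_all"      -- shows everything, relying on the column/emoji to inform users
    """
    if mode == "zero_shot_only":
        keep = {ZeroShotStatus.ZERO_SHOT}
    elif mode == "allow_unknown":
        keep = {ZeroShotStatus.ZERO_SHOT, ZeroShotStatus.UNKNOWN}
    elif mode == "allow_all":
        keep = set(ZeroShotStatus)
    else:
        raise ValueError(f"unknown filter mode: {mode}")
    return [name for name, status in models.items() if status in keep]
```

The whole disagreement is essentially about which of these three `keep` sets should be the one users see first.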
-
Moved this comment from #2076 to here. I agree with @bwanglzu here. The default of "Allow Unknown" has some pretty negative effects:
In my opinion, the emoji immediately draws attention, so if we update the default to "Allow all", then people will still be informed to be wary of certain models. See for example here, where the Zero-shot column draws the eye pretty quickly:

Beyond that, the "Zero-shot" filter sits among all kinds of model options that rarely need updating, so it's very easy to miss. I've had to explain to at least 4 separate people why they can't find their favourite model on the leaderboard anymore.

In short: I think that users will be more than sufficiently informed about the zero-shot-ness of a model even if we use "Allow all" as the default, and they will be more informed than if we keep using "Allow unknown". Having said that, I do recognize the purpose of the "Allow unknown" default: discouraging overfitting on MTEB.
-
I think we could also add per-task identification for models to indicate whether they are ZeroShot or not.
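Such a per-task flag could in principle be derived from the same training-dataset annotations that power the model-level label: whether a model is zero-shot on a given task is just a set-membership check. A minimal sketch, assuming the annotations are available as plain Python collections; `per_task_zero_shot` and the task names in the example are hypothetical and not mteb's actual API.

```python
def per_task_zero_shot(
    training_datasets: set[str] | None,
    benchmark_tasks: list[str],
) -> dict[str, bool | None]:
    """Map each benchmark task to True (zero-shot), False (seen during training),
    or None (unknown, because no training-data annotation exists)."""
    if training_datasets is None:
        return {task: None for task in benchmark_tasks}
    return {task: task not in training_datasets for task in benchmark_tasks}


# Example: a model annotated as having trained on MS MARCO and NQ.
flags = per_task_zero_shot({"MSMARCO", "NQ"}, ["MSMARCO", "SciFact", "ArguAna"])
# -> {"MSMARCO": False, "SciFact": True, "ArguAna": True}
```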
-
Maybe we should change
-
The default needs to be changed as soon as possible, because it is misleading leaderboard users and creates a perverse incentive against open science. I have spoken about the "zero-shot" topic before in the PR that introduced zero-shot evaluation, and I will include a copy of those comments below for visibility. I have some additional thoughts to share which pertain not just to introducing zero-shot, but also to the current mess of the three-label situation.

The Arctic Embed models are perhaps the most laughable illustration of the problems with the new three-class annotation and default filter choice

Because we published reports on Arctic Embed 1.0 and 2.0, these models have been marked as confirmed to be partially in-domain on the original MTEB retrieval benchmark and are now hidden by default on the new benchmark. Since we only issued a higher-level blogpost with our 1.5 model (a post which, for brevity, did not explicitly clarify that it used largely the same data as our 1.0 model), this model has been labeled as "unknown data" and shows up in the default view. To a reader of the benchmark, it appears as though there is a significant difference between the data used in 1.0 and 1.5, despite this not being the case. This has a direct and negative impact on the utility of the benchmark page to readers who are trying to inform themselves about the embedding model space.

Differentiating between "unknown" and "certified in-domain" appears generally harmful to both model developers and leaderboard readers, no matter what the default choice is

The new MTEB design incorrectly implies there is an important difference between the "unknown data" and "known in-domain data" models that have submitted evaluations to MTEB. In truth, the difference between these models is primarily just a measure of data availability and of whether a work provides rigorous and transparent documentation to "earn" the right of being marked with an ❌. For example, in addition to the Arctic Embed 1.5 case, it currently appears that the Stella models are also tagged as "unknown data", despite the Stella & Jasper report discussing distillation from in-domain models as the training methodology behind both Stella and Jasper. However, only the Jasper models are marked as "known in-domain", because the paper never officially clarifies that an in-domain model was used as the teacher for Stella. As long as the differentiation persists, MTEB is creating a perverse incentive against open science. The only way this could be flipped around is if some punitive treatment were given to "unknown data" submissions that is not given to "known in-domain" submissions, but doing so seems like it would again confuse readers of the leaderboard more than it would help them.

Comments from Jan 7:
-
I don't think I'm the person who should make the last call on this, but to me it seems that the general sentiment indicates that keeping the current default is the worst thing we could do. As far as I can tell, everyone in the above discussion values honesty more than being zero-shot, and I'm inclined to agree. I think there are a number of steps we can take to get to a reasonable compromise:
Lmk what you think. @tomaarsen @KennethEnevoldsen @orionw @Muennighoff
-
So after some discussion with @x-tabdeveloping, we arrived at this as the conclusion (though the coloring and formatting are not 100% set):
This would come with a default of "show all" and have the following options:
We think that this will:
The only issue we currently see is that some models we would not recommend (e.g. zero-shot < 0.80) will appear at the top. We could consider filtering these out. The worst offenders are on MTEB(eng, v1): (note that the voyage exp and nvidia-embed models were missing some annotations, which I have fixed after this screenshot was taken.) And there does seem to be a performance benefit from overfitting:
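The "zero-shot < 0.80" cutoff mentioned above suggests summarising each model with the fraction of benchmark tasks it is zero-shot on. The screenshots referenced in this comment did not carry over, so as a rough sketch only (hypothetical helper names, building on the per-task flags sketched earlier in the thread), such a fraction and cutoff could look like this:

```python
def zero_shot_fraction(per_task_flags: dict[str, bool | None]) -> float | None:
    """Fraction of benchmark tasks the model is zero-shot on.

    Returns None when no training-data annotation exists (all flags are None),
    mirroring the "unknown" label instead of guessing a number."""
    known = [flag for flag in per_task_flags.values() if flag is not None]
    if not known:
        return None
    return sum(known) / len(known)


def hide_from_default_view(fraction: float | None, threshold: float = 0.80) -> bool:
    """One possible reading of "filter out the worst offenders": hide a model only
    when it is *known* to be zero-shot on less than 80% of the benchmark tasks.
    How unannotated (None) models are treated is a separate policy choice; here
    they are kept visible."""
    return fraction is not None and fraction < threshold
```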
-
The new system is now live on the leaderboard.
-
The default value for the ZeroShot filter seems to be confusing people. Maybe we should change it to "Allow all"? #2114

Originally posted by @jxmorris12 in #2076 (comment)
Also issue #1953