Zero-shot default #1953

Muennighoff · 2025-02-04T15:16:54Z

I worry that the Zero-shot default of Allow Unknown in the leaderboard will incentivize people not to share or talk about their training data if it includes any benchmark data. Maybe we could default to Allow all but make the cross a bit clearer? Can we somehow make it red, like ❌?

x-tabdeveloping · 2025-02-04T15:22:31Z

As it is now, we use a UTF-8 emoji, and there is no cross that renders red in Gradio :(

I have also tried a markdown field with emoji names between colons (:cross:) but it doesn't work in the table.

I agree with the incentivization part, but I would prefer not giving up on the default filter. I have heard some people have even stronger opinions about it. It's an easy fix though if we collectively decide that something else is a better move.

KennethEnevoldsen · 2025-02-04T17:52:57Z

We could do ❓: Not known and ⚠️: Not zero-shot

x-tabdeveloping · 2025-02-05T07:55:33Z

I'd prefer the cross, but we can change it if that's the vibe

Muennighoff · 2025-02-06T15:26:53Z

Noting that the Open LLM leaderboard also allows contaminated models by default (https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/). Maybe we do a community vote on this?

Muennighoff · 2025-02-06T15:27:56Z

The leaderboard also looks much more alive with Allow all, as it shows way more models 🤔

KennethEnevoldsen · 2025-02-07T09:44:32Z

> Maybe we do a community vote on this?

I believe we already had this discussion on Slack

> Noting that the Open LLM leaderboard also allows contaminated models by default

Can't seem to find any contaminated models on their leaderboard, only:

We have also had specific complaints from model developers about known non-zero-shot models being on the leaderboard (reducing trust in the scores).

E.g. the NV-Embed-v2 scores aren't comparable to e5. The voyage-3-exp is a quite clear result of this (not recommended for use but is at the top). I would say the ranking should (to the extent possible) reflect something that we would recommend.

I think Allow Unknown makes a reasonable compromise between encouraging good practices, fair comparison and building trust, while not punishing APIs.

However, given this, the ⚠️ might be too harsh, and using the ❓ instead would be more appropriate.

Let me know what you think @Muennighoff

KennethEnevoldsen · 2025-02-20T21:09:03Z

I have converted another thread on this to a discussion: #2119 - so to keep it in one place I will close this one, but seems like this is def. worth discussing

Muennighoff mentioned this issue Feb 4, 2025: Better support for icons/emojis (displaying red cross in gradio doesn't work) gradio-app/gradio#10499 (Open, 1 task)

KennethEnevoldsen added the enhancement and leaderboard labels Feb 4, 2025

Muennighoff mentioned this issue Feb 20, 2025: [v2] Integrate cde models #2076 (Draft, 4 tasks)

KennethEnevoldsen closed this as completed Feb 20, 2025
