Create walkthrough example with data #102
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@           Coverage Diff           @@
##            trunk     #102   +/-   ##
=======================================
  Coverage   96.92%   96.92%
=======================================
  Files          34       34
  Lines        1465     1465
=======================================
  Hits         1420     1420
  Misses         45       45
examples/assets/autojudge.png
Very minor, but considering that these are checked into the source tree, smaller is better — would you mind transcoding to JPEG?
examples/assets/elo_scores.png
Based on the number of votes versus the number of responses, it looks like there are many unmatched prompts between these models. Can we pare down the CSVs to only include matched prompts?
I can pare down the CSVs, but unfortunately there are very few prompts shared amongst all models.
This is unexpected. Opponents shouldn't be allowed to "farm" Elo off of each other (gpt-4-1106-preview is untouchable).
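For reference, paring the data down to matched prompts could look roughly like the sketch below. It is only an illustration: the `prompt` and `response` column names, the `keep_shared_prompts` helper, and the in-memory row format are assumptions, not the layout of the CSVs in this PR.

```python
def keep_shared_prompts(rows_by_model, prompt_key="prompt"):
    """Keep only the rows whose prompt appears for every model.

    rows_by_model maps a model name to a list of row dicts, e.g. rows read
    with csv.DictReader. The "prompt" key is an assumed column name.
    """
    prompt_sets = [
        {row[prompt_key] for row in rows} for rows in rows_by_model.values()
    ]
    # Prompts answered by every model; empty input yields an empty set.
    shared = set.intersection(*prompt_sets) if prompt_sets else set()
    return {
        model: [row for row in rows if row[prompt_key] in shared]
        for model, rows in rows_by_model.items()
    }


# Made-up rows for two hypothetical models; only "p2" is shared.
example = {
    "model_a": [
        {"prompt": "p1", "response": "a1"},
        {"prompt": "p2", "response": "a2"},
    ],
    "model_b": [
        {"prompt": "p2", "response": "b2"},
        {"prompt": "p3", "response": "b3"},
    ],
}
filtered = keep_shared_prompts(example)
```

With very few prompts shared amongst all models, a strict intersection like this leaves little data, so matching prompts per pair of models may be the more practical variant.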
This PR adds a quick walkthrough, including all the data needed to build a useful leaderboard in AutoArena. The notebook explains why AutoArena uses an Elo leaderboard and shows how it works.
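As a rough illustration of the Elo idea the notebook covers, here is a minimal sketch of the textbook Elo update for a single head-to-head result. It is not taken from AutoArena's code: the function names, the K-factor of 32, and the 400-point scale are the conventional defaults, assumed here for illustration.

```python
def expected_score(rating_a, rating_b):
    """Textbook Elo expected score of A against B (assumed 400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    k=32 is an assumed K-factor, not necessarily what AutoArena uses.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two evenly rated models: the winner gains half of K.
even_a, even_b = elo_update(1000.0, 1000.0, score_a=1.0)

# A heavy favourite beating a weak opponent gains far less.
fav_a, fav_b = elo_update(1400.0, 1000.0, score_a=1.0)
```

Under this formula a strong model gains only a few points for beating a much weaker one, which limits how much rating can be accumulated from lopsided matchups.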