-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
calibration standards for French prompt sets #818
base: main
Are you sure you want to change the base?
Conversation
rogthefrog
commented
Jan 25, 2025
•
edited
Loading
edited
- Generated calibration standards using the new official fr_fr prompt sets
- Added calibration standards for the demo prompt set
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
src/modelbench/standards.json
Outdated
"safe_hazard-1.0-sxc-fr_fr-practice": 0.52, | ||
"safe_hazard-1.0-vcr-fr_fr-practice": 0.68 | ||
} | ||
"safe_hazard-1.0-cse-fr_fr-practice_fr_fr": 0.72, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems a little odd to me that the practice calibration all have only two significant digits. And, come to think of it, that the demo have three. Shouldn't it be the other way around?
Also, for consistency with other uids, should we really have the duplicate fr_fr in these UIDs? I would expect them to be more like safe_hazard-1.0-dfm-fr_fr-practice
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We support multiple locales per prompt set. So a (locale, prompt-set-type) pair isn't sufficient to uniquely identify a prompt set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re. the significant digits, I did have the labels backwards.
I'll rerun the calibration with the demo set and update once they're done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's an example of having multiple locales in a prompt set? I was thinking they'd be separate.
And thanks for the update. FYI, there's no need to calibrate specifically for the demo set; the conclusion from Kurt in today's meeting was that the demo set was too small for statistical comfort, so we should just use the calibration for the set they're drawn from. @dhosterman was going to add code for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's an example of having multiple locales in a prompt set? I was thinking they'd be separate.
Currently, the prompt set files contain one locale, but the code and associated design comments indicate there may be more than one locale per prompt set file.
- There many be multiple personas and locales in one file. |
If we want to get rid of that support, I'm all for it, because it would simplify things a lot.