Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

calibration standards for French prompt sets #818

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

rogthefrog
Copy link
Contributor

@rogthefrog rogthefrog commented Jan 25, 2025

  • Generated calibration standards using the new official fr_fr prompt sets
  • Added calibration standards for the demo prompt set

@rogthefrog rogthefrog requested a review from a team as a code owner January 25, 2025 09:18
@rogthefrog rogthefrog temporarily deployed to Scheduled Testing January 25, 2025 09:18 — with GitHub Actions Inactive
@rogthefrog rogthefrog temporarily deployed to Scheduled Testing January 25, 2025 09:18 — with GitHub Actions Inactive
@rogthefrog rogthefrog temporarily deployed to Scheduled Testing January 25, 2025 09:18 — with GitHub Actions Inactive
Copy link

github-actions bot commented Jan 25, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

"safe_hazard-1.0-sxc-fr_fr-practice": 0.52,
"safe_hazard-1.0-vcr-fr_fr-practice": 0.68
}
"safe_hazard-1.0-cse-fr_fr-practice_fr_fr": 0.72,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems a little odd to me that the practice calibration all have only two significant digits. And, come to think of it, that the demo have three. Shouldn't it be the other way around?

Also, for consistency with other uids, should we really have the duplicate fr_fr in these UIDs? I would expect them to be more like safe_hazard-1.0-dfm-fr_fr-practice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We support multiple locales per prompt set. So a (locale, prompt-set-type) pair isn't sufficient to uniquely identify a prompt set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re. the significant digits, I did have the labels backwards.

I'll rerun the calibration with the demo set and update once they're done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's an example of having multiple locales in a prompt set? I was thinking they'd be separate.

And thanks for the update. FYI, there's no need to calibrate specifically for the demo set; the conclusion from Kurt in today's meeting was that the demo set was too small for statistical comfort, so we should just use the calibration for the set they're drawn from. @dhosterman was going to add code for that.

Copy link
Contributor Author

@rogthefrog rogthefrog Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's an example of having multiple locales in a prompt set? I was thinking they'd be separate.

Currently, the prompt set files contain one locale, but the code and associated design comments indicate there may be more than one locale per prompt set file.

- There many be multiple personas and locales in one file.

If we want to get rid of that support, I'm all for it, because it would simplify things a lot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants