[CLI] visualize categorical scores eval results with bars #31
Conversation
I don't think we have a very clear distinction between "app" vs. "benchmark" evals. E.g., both MMLU & SimpleQA are benchmarks. However:
Option 1:
Option 2:
cc @raghotham @ashwinb for thoughts?
For option (1), would you be OK with these being registered from YAMLs in an alternate place? Currently the registry is populated from run.yaml, but we could decide that it is also populated from special .yaml files that a distribution can bake in (e.g., the entry in run.yaml could be a directory path, and then we'd read all .yaml files from that directory and register them all).
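A minimal sketch of that directory-based registration, assuming a plain dict registry and a `register_from_directory` helper; neither is the actual llama-stack API, and the YAML layout is illustrative only:

```python
# Hypothetical sketch only: the helper name, the dict-based registry, and the
# idea that run.yaml points at a directory are assumptions, not llama-stack APIs.
from pathlib import Path

import yaml


def register_from_directory(registry: dict, dir_path: str) -> None:
    """Read every .yaml file under dir_path and register its entries."""
    for yaml_file in sorted(Path(dir_path).glob("*.yaml")):
        with open(yaml_file) as f:
            entries = yaml.safe_load(f) or []
        for entry in entries:
            # Key by identifier so later files can override earlier ones.
            registry[entry["identifier"]] = entry
```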
@ashwinb I have thought about these being populated from special .yaml files on the server side. These would be very similar to having explicit prompt templates in regex_parser_multiple_choice_answer. I think the question is where we put the judge prompts for LLM-as-judge benchmark evals: it depends on the types of users who typically use and interact with these scoring functions, and which option feels more natural to them.
@yanxi0830 do you have a recommendation there for (1) vs (2) -- what would be easiest for now? |
@ashwinb (3) is what I'm currently going with for SimpleQA in this PR. (1) would be easiest and would add less pollution to the run.yaml file: we keep benchmark scoring functions fixed in code.
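As a rough illustration of option (1), the judge prompt for an LLM-as-judge benchmark grader could live next to the scoring function in code rather than in run.yaml. This is a hypothetical sketch: the template text, grade labels, and function name are assumptions, not the actual SimpleQA grader.

```python
# Hypothetical sketch of a benchmark scoring function kept fixed in code.
# The template wording and grade labels are illustrative, not the real grader.
JUDGE_PROMPT_TEMPLATE = """\
You are grading an answer to a factual question.
Question: {question}
Gold answer: {expected_answer}
Model answer: {generated_answer}
Respond with exactly one of: CORRECT, INCORRECT, NOT_ATTEMPTED.
"""


def build_judge_prompt(question: str, expected_answer: str, generated_answer: str) -> str:
    """Fill the fixed template; the caller sends the result to the judge model."""
    return JUDGE_PROMPT_TEMPLATE.format(
        question=question,
        expected_answer=expected_answer,
        generated_answer=generated_answer,
    )
```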
I agree. For benchmarks, let's do (1) |
TL;DR
Test
SimpleQA
MMLU
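For reference, a minimal sketch of the kind of CLI bar visualization the PR title describes: count each categorical score and print one scaled bar per category. The `print_score_bars` helper and the SimpleQA-style grade labels in the usage example are assumptions, not the code in this PR.

```python
# Hypothetical sketch: render categorical eval scores as horizontal bars in the terminal.
from collections import Counter


def print_score_bars(scores: list[str], width: int = 40) -> None:
    """Print one bar per category, scaled to the most frequent category."""
    counts = Counter(scores)
    max_count = max(counts.values(), default=1)
    for category, count in counts.most_common():
        bar = "█" * round(width * count / max_count)
        print(f"{category:>14} | {bar} {count}")


# Example usage with SimpleQA-style grades:
print_score_bars(["CORRECT", "CORRECT", "INCORRECT", "NOT_ATTEMPTED", "CORRECT"])
```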