Conversation

yanxi0830 (Contributor) commented Nov 14, 2024

TL;DR

  • Fixes a typo with the Together API key for the client.
  • Visualize eval results for categorical scores (a minimal sketch of such a summary follows below).
  • Note: SimpleQA has type "app" because we do not pre-register any SimpleQA-specific scoring functions; it uses LLMAsJudge with a prompt template as its scoring function.
[screenshot: visualized eval results table for categorical scores]
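
Not the code added in this PR, just a minimal sketch of what summarizing categorical scores (e.g. the A/B/C judge grades from SimpleQA) into a small table could look like; the function name and layout are illustrative.

from collections import Counter

def summarize_categorical_scores(scores: list[str]) -> str:
    """Render counts and percentages for a list of categorical score labels."""
    counts = Counter(scores)
    total = sum(counts.values()) or 1
    rows = [f"{'score':<10}{'count':>7}{'percent':>10}"]
    for label, count in sorted(counts.items()):
        rows.append(f"{label:<10}{count:>7}{count / total:>10.1%}")
    return "\n".join(rows)

# Example: five SimpleQA judge grades
print(summarize_categorical_scores(["A", "A", "B", "C", "A"]))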

Test

SimpleQA

llama-stack-client eval run_benchmark meta-reference-simpleqa --num-examples 5 --output-dir ./ --eval-task-config ~/eval_task_config_simpleqa.json --visualize
{
    "type": "app",
    "eval_candidate": {
        "type": "model",
        "model": "Llama3.1-405B-Instruct",
        "sampling_params": {
            "strategy": "greedy",
            "temperature": 0,
            "top_p": 0.95,
            "top_k": 0,
            "max_tokens": 0,
            "repetition_penalty": 1.0
        }
    },
    "scoring_params": {
        "llm-as-judge::llm_as_judge_base": {
            "type": "llm_as_judge",
            "judge_model": "Llama3.1-405B-Instruct",
            "prompt_template": "Your job is to look at a question, a gold target ........",
            "judge_score_regexes": [
                "(A|B|C)"
            ]
        }
    }
}
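
A sketch (assumptions only, not the llama-stack implementation) of how the judge_score_regexes entry above could be applied to the judge model's raw output to recover a categorical grade; in SimpleQA's grading convention, A/B/C typically correspond to correct / incorrect / not attempted.

import re

JUDGE_SCORE_REGEXES = [r"(A|B|C)"]  # from the scoring_params above

def extract_judge_score(judge_output: str) -> str | None:
    """Return the first regex capture found in the judge's response, if any."""
    for pattern in JUDGE_SCORE_REGEXES:
        match = re.search(pattern, judge_output)
        if match:
            return match.group(1)
    return None

print(extract_judge_score("Grade: A"))  # -> "A"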

MMLU

llama-stack-client eval run_benchmark meta-reference-mmlu --num-examples 5 --output-dir ./ --eval-task-config ~/eval_task_config.json --visualize
{
    "type": "benchmark",
    "eval_candidate": {
        "type": "model",
        "model": "Llama3.2-3B-Instruct",
        "sampling_params": {
            "strategy": "greedy",
            "temperature": 0,
            "top_p": 0.95,
            "top_k": 0,
            "max_tokens": 0,
            "repetition_penalty": 1.0
        }
    }
}
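
For contrast, a rough sketch of what a fixed (non-parameterized) multiple-choice scorer in the spirit of regex_parser_multiple_choice_answer might do; the regex and function here are illustrative assumptions, not the registered implementation.

import re

# Hypothetical answer-extraction pattern for MMLU-style generations.
ANSWER_PATTERN = re.compile(r"(?:answer is|Answer:)\s*\(?([A-D])\)?", re.IGNORECASE)

def score_multiple_choice(generation: str, expected: str) -> float:
    """Return 1.0 if the extracted letter matches the expected answer, else 0.0."""
    match = ANSWER_PATTERN.search(generation)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).upper() == expected.upper() else 0.0

print(score_multiple_choice("Reasoning... The answer is (C).", "C"))  # -> 1.0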

yanxi0830 marked this pull request as ready for review November 14, 2024 23:52
yanxi0830 (Contributor, Author) commented

I don't think we have a very clear distinction between "app" vs. "benchmark" evals.

  • "benchmark" evals are task_config without parameterized scoring function parameters.
  • "app" evals are task_config with parameterized scoring function parameters.

E.g., both MMLU & SimpleQA are benchmarks. However:

  • MMLU is scored with a fixed scoring function --> "benchmark".
  • SimpleQA is scored with an LLMAsJudge scoring function with a parameterized judge_model/judge_prompt --> "app".

Option 1:

  • We explicitly register a dedicated "llm-as-judge::llm_as_judge_simpleqa". This will bloat our pre-registered run.yaml with a bunch of judge_prompts.

Option 2:

  • We drop the distinction between benchmark and app. We keep _base scoring functions as scoring functions that can be parameterized, so users are able to run benchmarks with different judge prompts (see the sketch below).
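
A hypothetical sketch of what option 2 implies; the dataclass, names, and signature are made up for illustration and are not the llama-stack API. The idea is a single llm_as_judge_base scorer that takes judge_model, prompt_template, and judge_score_regexes as parameters at eval time.

import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class LLMAsJudgeParams:
    judge_model: str
    prompt_template: str              # expects e.g. {question} and {prediction} slots
    judge_score_regexes: list[str]

def llm_as_judge_base(
    row: dict,
    params: LLMAsJudgeParams,
    call_judge: Callable[[str, str], str],  # (judge_model, prompt) -> judge output
) -> str | None:
    """Format the judge prompt from the eval row, call the judge, extract a score."""
    prompt = params.prompt_template.format(**row)
    output = call_judge(params.judge_model, prompt)
    for pattern in params.judge_score_regexes:
        match = re.search(pattern, output)
        if match:
            return match.group(1) if match.lastindex else match.group(0)
    return None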

cc @raghotham @ashwinb for thoughts?

ashwinb (Contributor) commented Nov 15, 2024

For option (1), would you feel OK to have these be registered from yamls from an alternate place? Currently the registry is populated from run.yaml but we could decide that it is also populated from special .yaml files that a distribution can bake in? (e.g., the entry in run.yaml could be a directory path and then we'd go and read all .yaml files from that directory and register them all)

yanxi0830 (Contributor, Author) commented Nov 15, 2024

> For option (1), would you feel OK to have these be registered from yamls from an alternate place? Currently the registry is populated from run.yaml but we could decide that it is also populated from special .yaml files that a distribution can bake in? (e.g., the entry in run.yaml could be a directory path and then we'd go and read all .yaml files from that directory and register them all)

@ashwinb I have thought about having these populated from special .yaml files on the server side. That would be very similar to having the explicit prompt template in regex_parser_multiple_choice_answer.

I think the question is where we put the judge prompts for LLM-as-judge scoring in benchmark evals:
(1) Pre-registered in the distribution via code ScoringFn (e.g. regex_parser_multiple_choice_answer)
(2) Pre-registered in the distribution via special .yaml / run.yaml
(3) On-the-fly from client, following the "app" eval flow.

I think it depends on which type of user usually uses and interacts with these scoring functions, and which option feels more natural to them.

yanxi0830 merged commit ecf6a48 into main Nov 15, 2024
3 checks passed
yanxi0830 deleted the pretty_table branch November 15, 2024 20:49
ashwinb (Contributor) commented Nov 15, 2024

@yanxi0830 do you have a recommendation there for (1) vs (2) -- what would be easiest for now?

yanxi0830 (Contributor, Author) commented Nov 15, 2024

> @yanxi0830 do you have a recommendation there for (1) vs (2) -- what would be easiest for now?

@ashwinb (3) is what I'm currently going with for SimpleQA in this PR. (1) would be easiest and would pollute the run.yaml file the least: we keep benchmark scoring functions fixed in code.
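
As a rough illustration of option (1) — the constant names and dict layout here are hypothetical, not llama-stack's registration API — baking the SimpleQA judge configuration into code so run.yaml never carries the prompt text:

# Hypothetical module-level definition; values mirror the scoring_params JSON above.
SIMPLEQA_JUDGE_PROMPT = "Your job is to look at a question, a gold target ........"  # prompt elided as above

LLM_AS_JUDGE_SIMPLEQA = {
    "identifier": "llm-as-judge::llm_as_judge_simpleqa",
    "judge_model": "Llama3.1-405B-Instruct",
    "prompt_template": SIMPLEQA_JUDGE_PROMPT,
    "judge_score_regexes": [r"(A|B|C)"],
}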

ashwinb (Contributor) commented Nov 15, 2024

I agree. For benchmarks, let's do (1).
