Conversation

yanxi0830 (Contributor) commented Nov 14, 2024

TL;DR

  • Fixes a typo with the Together API key for the client.
  • Visualize eval results for categorical scores (a minimal sketch of such a summary follows below).
  • Note: SimpleQA has type "app" because we do not pre-register any SimpleQA-specific scoring functions; it uses LLMAsJudge with a prompt template as its scoring function.
[screenshot: visualized eval results table for categorical scores]
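
Not the code added in this PR, just a minimal sketch of what summarizing categorical scores (e.g. the A/B/C judge grades from SimpleQA) into a small table could look like; the function name and layout are illustrative.

from collections import Counter

def summarize_categorical_scores(scores: list[str]) -> str:
    """Render counts and percentages for a list of categorical score labels."""
    counts = Counter(scores)
    total = sum(counts.values()) or 1
    rows = [f"{'score':<10}{'count':>7}{'percent':>10}"]
    for label, count in sorted(counts.items()):
        rows.append(f"{label:<10}{count:>7}{count / total:>10.1%}")
    return "\n".join(rows)

# Example: five SimpleQA judge grades
print(summarize_categorical_scores(["A", "A", "B", "C", "A"]))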

Test

SimpleQA

llama-stack-client eval run_benchmark meta-reference-simpleqa --num-examples 5 --output-dir ./ --eval-task-config ~/eval_task_config_simpleqa.json --visualize
{
    "type": "app",
    "eval_candidate": {
        "type": "model",
        "model": "Llama3.1-405B-Instruct",
        "sampling_params": {
            "strategy": "greedy",
            "temperature": 0,
            "top_p": 0.95,
            "top_k": 0,
            "max_tokens": 0,
            "repetition_penalty": 1.0
        }
    },
    "scoring_params": {
        "llm-as-judge::llm_as_judge_base": {
            "type": "llm_as_judge",
            "judge_model": "Llama3.1-405B-Instruct",
            "prompt_template": "Your job is to look at a question, a gold target ........",
            "judge_score_regexes": [
                "(A|B|C)"
            ]
        }
    }
}
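
A sketch (assumptions only, not the llama-stack implementation) of how the judge_score_regexes entry above could be applied to the judge model's raw output to recover a categorical grade; in SimpleQA's grading convention, A/B/C typically correspond to correct / incorrect / not attempted.

import re

JUDGE_SCORE_REGEXES = [r"(A|B|C)"]  # from the scoring_params above

def extract_judge_score(judge_output: str) -> str | None:
    """Return the first regex capture found in the judge's response, if any."""
    for pattern in JUDGE_SCORE_REGEXES:
        match = re.search(pattern, judge_output)
        if match:
            return match.group(1)
    return None

print(extract_judge_score("Grade: A"))  # -> "A"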

MMLU

llama-stack-client eval run_benchmark meta-reference-mmlu --num-examples 5 --output-dir ./ --eval-task-config ~/eval_task_config.json --visualize
{
    "type": "benchmark",
    "eval_candidate": {
        "type": "model",
        "model": "Llama3.2-3B-Instruct",
        "sampling_params": {
            "strategy": "greedy",
            "temperature": 0,
            "top_p": 0.95,
            "top_k": 0,
            "max_tokens": 0,
            "repetition_penalty": 1.0
        }
    }
}
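
For contrast, a rough sketch of what a fixed (non-parameterized) multiple-choice scorer in the spirit of regex_parser_multiple_choice_answer might do; the regex and function here are illustrative assumptions, not the registered implementation.

import re

# Hypothetical answer-extraction pattern for MMLU-style generations.
ANSWER_PATTERN = re.compile(r"(?:answer is|Answer:)\s*\(?([A-D])\)?", re.IGNORECASE)

def score_multiple_choice(generation: str, expected: str) -> float:
    """Return 1.0 if the extracted letter matches the expected answer, else 0.0."""
    match = ANSWER_PATTERN.search(generation)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).upper() == expected.upper() else 0.0

print(score_multiple_choice("Reasoning... The answer is (C).", "C"))  # -> 1.0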

yanxi0830 marked this pull request as ready for review November 14, 2024 23:52
yanxi0830 (Contributor, Author) commented

I don't think we have a very clear distinction between "app" vs. "benchmark" evals.

  • "benchmark" evals are task_config without parameterized scoring function parameters.
  • "app" evals are task_config with parameterized scoring function parameters.

E.g., both MMLU & SimpleQA are benchmarks. However:

  • MMLU is scored with a fixed scoring function --> "benchmark".
  • SimpleQA is scored with an LLMAsJudge scoring function with a parameterized judge_model/judge_prompt --> "app".

Option 1:

  • We explicitly register a dedicated "llm-as-judge::llm_as_judge_simpleqa". This will bloat our pre-registered run.yaml with a bunch of judge_prompts.

Option 2:

  • We drop the distinction between benchmark and app. We keep _base scoring functions as scoring functions that can be parameterized, so users are able to run benchmarks with different judge prompts (see the sketch below).
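
A hypothetical sketch of what option 2 implies; the dataclass, names, and signature are made up for illustration and are not the llama-stack API. The idea is a single llm_as_judge_base scorer that takes judge_model, prompt_template, and judge_score_regexes as parameters at eval time.

import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class LLMAsJudgeParams:
    judge_model: str
    prompt_template: str              # expects e.g. {question} and {prediction} slots
    judge_score_regexes: list[str]

def llm_as_judge_base(
    row: dict,
    params: LLMAsJudgeParams,
    call_judge: Callable[[str, str], str],  # (judge_model, prompt) -> judge output
) -> str | None:
    """Format the judge prompt from the eval row, call the judge, extract a score."""
    prompt = params.prompt_template.format(**row)
    output = call_judge(params.judge_model, prompt)
    for pattern in params.judge_score_regexes:
        match = re.search(pattern, output)
        if match:
            return match.group(1) if match.lastindex else match.group(0)
    return None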

cc @raghotham @ashwinb for thoughts?

ashwinb (Contributor) commented Nov 15, 2024

For option (1), would you feel OK to have these be registered from yamls from an alternate place? Currently the registry is populated from run.yaml but we could decide that it is also populated from special .yaml files that a distribution can bake in? (e.g., the entry in run.yaml could be a directory path and then we'd go and read all .yaml files from that directory and register them all)

yanxi0830 (Contributor, Author) commented Nov 15, 2024

> For option (1), would you feel OK to have these be registered from yamls from an alternate place? Currently the registry is populated from run.yaml but we could decide that it is also populated from special .yaml files that a distribution can bake in? (e.g., the entry in run.yaml could be a directory path and then we'd go and read all .yaml files from that directory and register them all)

@ashwinb I have thought about having these populated from special .yaml files on the server side. That would be very similar to having the explicit prompt template in regex_parser_multiple_choice_answer.

I think the question is where we put the judge prompts for LLM-as-judge scoring in benchmark evals:
(1) Pre-registered in the distribution via code ScoringFn (e.g. regex_parser_multiple_choice_answer)
(2) Pre-registered in the distribution via special .yaml / run.yaml
(3) On-the-fly from client, following the "app" eval flow.

I think it depends on which type of user usually uses and interacts with these scoring functions, and which option feels more natural to them.

yanxi0830 merged commit ecf6a48 into main Nov 15, 2024
3 checks passed
yanxi0830 deleted the pretty_table branch November 15, 2024 20:49
ashwinb (Contributor) commented Nov 15, 2024

@yanxi0830 do you have a recommendation there for (1) vs (2) -- what would be easiest for now?

yanxi0830 (Contributor, Author) commented Nov 15, 2024

> @yanxi0830 do you have a recommendation there for (1) vs (2) -- what would be easiest for now?

@ashwinb (3) is what I'm currently going with for SimpleQA in this PR. (1) would be easiest and would pollute the run.yaml file the least: we keep benchmark scoring functions fixed in code.
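
As a rough illustration of option (1) — the constant names and dict layout here are hypothetical, not llama-stack's registration API — baking the SimpleQA judge configuration into code so run.yaml never carries the prompt text:

# Hypothetical module-level definition; values mirror the scoring_params JSON above.
SIMPLEQA_JUDGE_PROMPT = "Your job is to look at a question, a gold target ........"  # prompt elided as above

LLM_AS_JUDGE_SIMPLEQA = {
    "identifier": "llm-as-judge::llm_as_judge_simpleqa",
    "judge_model": "Llama3.1-405B-Instruct",
    "prompt_template": SIMPLEQA_JUDGE_PROMPT,
    "judge_score_regexes": [r"(A|B|C)"],
}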

ashwinb (Contributor) commented Nov 15, 2024

I agree. For benchmarks, let's do (1).
