Conversation

@SLR722 commented Feb 21, 2025

What does this PR do?

Refine the benchmark eval CLI to provide a better user experience when running evals on standard benchmarks.

The benchmarks need to be defined as resources in the distro template.
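
For illustration, a benchmark can also be registered programmatically through the Python client. The sketch below assumes a local server; the dataset id and scoring function are hypothetical placeholders, not values from this PR.

```python
# A minimal sketch: register a benchmark resource so that
# `eval run-benchmark` can look it up by id.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")  # assumed endpoint

client.benchmarks.register(
    benchmark_id="meta-reference-simpleqa",
    dataset_id="simpleqa",                     # hypothetical dataset resource
    scoring_functions=["llm-as-judge::base"],  # hypothetical scoring function
)
```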

Improvements include:

  • Users no longer need to pass an arbitrary eval-task-config; they only pass the list of benchmarks they'd like to eval, the model id to be evaluated, and the output dir to store the eval results (see the sketch after this list).
  • Aggregate results are written to the output file; aggregate results are typically what users care about most.
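
For context, the new CLI is a thin wrapper over the server's eval API. The sketch below is a rough programmatic equivalent; the benchmark_config shape, sampling params, and endpoint are assumptions for illustration, not values defined by this PR.

```python
# A rough sketch of the equivalent client call (config shape is assumed).
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")  # assumed endpoint

job = client.eval.run_eval(
    benchmark_id="meta-reference-simpleqa",
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "sampling_params": {"temperature": 0.0},  # illustrative
        },
        "num_examples": 5,  # mirrors --num_examples in the CLI
    },
)
print(job.job_id)  # results can then be fetched via the eval jobs API
```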

Test Plan

Spin up a llama stack server with eval benchmarks defined, then run:
llama-stack-client --endpoint xxxx eval run-benchmark "meta-reference-simpleqa" --model_id "meta-llama/Llama-3.1-8B-Instruct" --output_dir "/home/markchen1015/" --num_examples 5

Return:
[screenshot: CLI output, 2025-02-20 4:29 PM]

What's inside the output file:

[screenshots: output file contents, 2025-02-20 4:30 PM and 4:17 PM]
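
A minimal sketch of consuming the output file, assuming (hypothetically) that the CLI writes one JSON results file per benchmark into --output_dir; the filename pattern and the "aggregate_results" key are assumptions, as the PR description does not specify the schema.

```python
# A minimal sketch; filenames and the "aggregate_results" key are assumed.
import json
from pathlib import Path

output_dir = Path("/home/markchen1015/")  # --output_dir from the test plan

for result_file in output_dir.glob("*.json"):
    results = json.loads(result_file.read_text())
    # Aggregate results are the headline numbers most users care about.
    print(result_file.name, results.get("aggregate_results"))
```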

@SLR722 changed the title from "[WIP] refine the benchmark eva; UX" to "[WIP] refine the benchmark eval UX" Feb 21, 2025
@SLR722 marked this pull request as ready for review February 21, 2025 00:23
@SLR722 changed the title from "[WIP] refine the benchmark eval UX" to "refine the benchmark eval UX" Feb 21, 2025
@yanxi0830 left a comment:


Thank you!

@SLR722 merged commit c645726 into main Feb 21, 2025
2 checks passed
@SLR722 deleted the open_benchmark branch February 21, 2025 01:14
