Conversation

@SLR722 commented Feb 21, 2025

What does this PR do?

Refine the benchmark eval CLI to provide a better user experience when running evals on standard benchmarks.

The benchmarks need to be defined as resources in the distro template.
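
For illustration, a benchmark can also be registered programmatically through the Python client. The sketch below assumes a local server; the dataset id and scoring function are hypothetical placeholders, not values from this PR.

```python
# A minimal sketch: register a benchmark resource so that
# `eval run-benchmark` can look it up by id.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")  # assumed endpoint

client.benchmarks.register(
    benchmark_id="meta-reference-simpleqa",
    dataset_id="simpleqa",                     # hypothetical dataset resource
    scoring_functions=["llm-as-judge::base"],  # hypothetical scoring function
)
```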

Improvements include:

  • Users no longer need to pass an arbitrary eval-task-config; they only pass the list of benchmarks they'd like to eval, the model id to be evaluated, and the output dir to store the eval results (see the sketch after this list).
  • Aggregate results are written to the output file; aggregate results are typically what users care about most.
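
For context, the new CLI is a thin wrapper over the server's eval API. The sketch below is a rough programmatic equivalent; the benchmark_config shape, sampling params, and endpoint are assumptions for illustration, not values defined by this PR.

```python
# A rough sketch of the equivalent client call (config shape is assumed).
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")  # assumed endpoint

job = client.eval.run_eval(
    benchmark_id="meta-reference-simpleqa",
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "sampling_params": {"temperature": 0.0},  # illustrative
        },
        "num_examples": 5,  # mirrors --num_examples in the CLI
    },
)
print(job.job_id)  # results can then be fetched via the eval jobs API
```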

Test Plan

Spin up a llama stack server with eval benchmarks defined, then run:
llama-stack-client --endpoint xxxx eval run-benchmark "meta-reference-simpleqa" --model_id "meta-llama/Llama-3.1-8B-Instruct" --output_dir "/home/markchen1015/" --num_examples 5

Return:
[screenshot: CLI output, 2025-02-20 4:29 PM]

What's inside the output file:

[screenshots: output file contents, 2025-02-20 4:30 PM and 4:17 PM]
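
A minimal sketch of consuming the output file, assuming (hypothetically) that the CLI writes one JSON results file per benchmark into --output_dir; the filename pattern and the "aggregate_results" key are assumptions, as the PR description does not specify the schema.

```python
# A minimal sketch; filenames and the "aggregate_results" key are assumed.
import json
from pathlib import Path

output_dir = Path("/home/markchen1015/")  # --output_dir from the test plan

for result_file in output_dir.glob("*.json"):
    results = json.loads(result_file.read_text())
    # Aggregate results are the headline numbers most users care about.
    print(result_file.name, results.get("aggregate_results"))
```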

@SLR722 changed the title from "[WIP] refine the benchmark eva; UX" to "[WIP] refine the benchmark eval UX" Feb 21, 2025
@SLR722 marked this pull request as ready for review February 21, 2025 00:23
@SLR722 changed the title from "[WIP] refine the benchmark eval UX" to "refine the benchmark eval UX" Feb 21, 2025
@yanxi0830 left a comment:


Thank you!

@SLR722 merged commit c645726 into main Feb 21, 2025
2 checks passed
@SLR722 deleted the open_benchmark branch February 21, 2025 01:14
