
yanxi0830 (Contributor) commented:

TL;DR

  • CLI for running eval benchmarks:
llama-stack-client eval run_benchmark --eval-task-id meta-reference-mmlu --num-examples 10 --output-dir ./ --eval-task-config ~/eval_task_config.json
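The --eval-task-config file itself is not shown in this PR, so here is a minimal sketch of what ~/eval_task_config.json could contain for a benchmark run against one of the registered models. The field names and values are illustrative assumptions; check the EvalTaskConfig schema of the server version you are running.

{
  "type": "benchmark",
  "eval_candidate": {
    "type": "model",
    "model": "Llama3.2-3B-Instruct",
    "sampling_params": {
      "strategy": "greedy",
      "temperature": 0.0,
      "top_p": 0.95,
      "max_tokens": 512
    }
  }
}

Passing --num-examples 10 on the CLI keeps this to a small smoke test rather than a full MMLU run.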

Test

  1. Pre-register eval_tasks/datasets/scoring_functions via run.yaml (the full run.yaml used is shown below)
  2. Start the server with llama stack run
  3. llama-stack-client eval run_benchmark --eval-task-id meta-reference-mmlu --num-examples 10 --output-dir ./ --eval-task-config ~/eval_task_config.json
version: '2'
built_at: '2024-11-11T21:59:52.074753'
image_name: fireworks
docker_image: null
conda_env: fireworks
apis:
- inference
- telemetry
- datasetio
- eval
- scoring
providers:
  scoring:
  - provider_id: basic-0
    provider_type: inline::basic
    config: {}
  - provider_id: llm-as-judge-0
    provider_type: inline::llm-as-judge
    config: {}
  - provider_id: braintrust-0
    provider_type: inline::braintrust
    config: {}
  datasetio:
  - provider_id: huggingface-0
    provider_type: remote::huggingface
    config: {}
  - provider_id: localfs-0
    provider_type: inline::localfs
    config: {}
  eval:
  - provider_id: meta-reference-0
    provider_type: inline::meta-reference
    config: {}
  inference:
  - provider_id: fireworks-0
    provider_type: remote::fireworks
    config:
      url: https://api.fireworks.ai/inference
      api_key: null
  telemetry:
  - provider_id: meta-reference-0
    provider_type: inline::meta-reference
    config: {}
metadata_store: null
models: 
  - model_id: Llama3.2-3B-Instruct
    provider_id: fireworks-0
  - model_id: Llama3.1-8B-Instruct
    provider_id: fireworks-0
  - model_id: Llama3.1-405B-Instruct
    provider_id: fireworks-0
datasets:
  - dataset_id: mmlu
    provider_id: huggingface-0
    url:
      uri: https://huggingface.co/datasets/llamastack/evals
    metadata:
      path: llamastack/evals
      name: evals__mmlu__details
      split: train
    dataset_schema:
      input_query:
        type: string
      expected_answer:
        type: string
eval_tasks:
  - eval_task_id: meta-reference-mmlu
    provider_id: meta-reference-0
    dataset_id: mmlu
    scoring_functions:
      - basic::regex_parser_multiple_choice_answer
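As a side note on how meta-reference-mmlu gets graded: basic::regex_parser_multiple_choice_answer scores each row by pulling the letter choice out of the generated answer and comparing it to expected_answer. A rough, self-contained Python sketch of that idea follows; this is not the provider's actual code, and the regex and example row are made up for illustration only.

# Illustrative sketch of regex-based multiple-choice scoring (not the actual
# basic::regex_parser_multiple_choice_answer implementation).
import re

# Hypothetical example row; real rows come from the registered mmlu dataset.
generated = "The correct answer is (B) because ..."
expected_answer = "B"

# Extract the letter choice from the generation and compare to the expected answer.
match = re.search(r"answer is \(?([A-D])\)?", generated, flags=re.IGNORECASE)
parsed = match.group(1).upper() if match else None
score = 1.0 if parsed == expected_answer else 0.0
print(parsed, score)  # B 1.0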
