
yanxi0830 (Contributor) commented:

TL;DR

  • CLI for running eval benchmarks:
llama-stack-client eval run_benchmark --eval-task-id meta-reference-mmlu --num-examples 10 --output-dir ./ --eval-task-config ~/eval_task_config.json
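The --eval-task-config file itself is not shown in this PR, so here is a minimal sketch of what ~/eval_task_config.json could contain for a benchmark run against one of the registered models. The field names and values are illustrative assumptions; check the EvalTaskConfig schema of the server version you are running.

{
  "type": "benchmark",
  "eval_candidate": {
    "type": "model",
    "model": "Llama3.2-3B-Instruct",
    "sampling_params": {
      "strategy": "greedy",
      "temperature": 0.0,
      "top_p": 0.95,
      "max_tokens": 512
    }
  }
}

Passing --num-examples 10 on the CLI keeps this to a small smoke test rather than a full MMLU run.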

Test

  1. Pre-register eval_tasks/datasets/scoring_functions via run.yaml (the full run.yaml used is shown below)
  2. Start the server with llama stack run
  3. llama-stack-client eval run_benchmark --eval-task-id meta-reference-mmlu --num-examples 10 --output-dir ./ --eval-task-config ~/eval_task_config.json
version: '2'
built_at: '2024-11-11T21:59:52.074753'
image_name: fireworks
docker_image: null
conda_env: fireworks
apis:
- inference
- telemetry
- datasetio
- eval
- scoring
providers:
  scoring:
  - provider_id: basic-0
    provider_type: inline::basic
    config: {}
  - provider_id: llm-as-judge-0
    provider_type: inline::llm-as-judge
    config: {}
  - provider_id: braintrust-0
    provider_type: inline::braintrust
    config: {}
  datasetio:
  - provider_id: huggingface-0
    provider_type: remote::huggingface
    config: {}
  - provider_id: localfs-0
    provider_type: inline::localfs
    config: {}
  eval:
  - provider_id: meta-reference-0
    provider_type: inline::meta-reference
    config: {}
  inference:
  - provider_id: fireworks-0
    provider_type: remote::fireworks
    config:
      url: https://api.fireworks.ai/inference
      api_key: null
  telemetry:
  - provider_id: meta-reference-0
    provider_type: inline::meta-reference
    config: {}
metadata_store: null
models: 
  - model_id: Llama3.2-3B-Instruct
    provider_id: fireworks-0
  - model_id: Llama3.1-8B-Instruct
    provider_id: fireworks-0
  - model_id: Llama3.1-405B-Instruct
    provider_id: fireworks-0
datasets:
  - dataset_id: mmlu
    provider_id: huggingface-0
    url:
      uri: https://huggingface.co/datasets/llamastack/evals
    metadata:
      path: llamastack/evals
      name: evals__mmlu__details
      split: train
    dataset_schema:
      input_query:
        type: string
      expected_answer:
        type: string
eval_tasks:
  - eval_task_id: meta-reference-mmlu
    provider_id: meta-reference-0
    dataset_id: mmlu
    scoring_functions:
      - basic::regex_parser_multiple_choice_answer
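As a side note on how meta-reference-mmlu gets graded: basic::regex_parser_multiple_choice_answer scores each row by pulling the letter choice out of the generated answer and comparing it to expected_answer. A rough, self-contained Python sketch of that idea follows; this is not the provider's actual code, and the regex and example row are made up for illustration only.

# Illustrative sketch of regex-based multiple-choice scoring (not the actual
# basic::regex_parser_multiple_choice_answer implementation).
import re

# Hypothetical example row; real rows come from the registered mmlu dataset.
generated = "The correct answer is (B) because ..."
expected_answer = "B"

# Extract the letter choice from the generation and compare to the expected answer.
match = re.search(r"answer is \(?([A-D])\)?", generated, flags=re.IGNORECASE)
parsed = match.group(1).upper() if match else None
score = 1.0 if parsed == expected_answer else 0.0
print(parsed, score)  # B 1.0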
