test: revamp eval related integration tests #1433

yanxi0830 · 2025-03-05T23:11:05Z

What does this PR do?

Revamps and cleans up the datasets/scoring/eval integration tests.
Closes #1396 (Migrate providers/tests into tests/api for evals, datasets, scorings API).

Test Plan

dataset

LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v tests/integration/datasetio/

scoring

LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/scoring --text-model meta-llama/Llama-3.1-8B-Instruct --judge-model meta-llama/Llama-3.1-8B-Instruct

eval

LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/eval --text-model meta-llama/Llama-3.1-8B-Instruct --judge-model meta-llama/Llama-3.1-8B-Instruct
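
The `--text-model` and `--judge-model` flags in the commands above are custom pytest options. As a rough sketch of how options like these could be registered in a `conftest.py` through the standard `pytest_addoption` hook: the defaults, help strings, and the `judge_model_id` helper below are assumptions for illustration and may not match the repository's actual conftest.

```python
# conftest.py (sketch): register the custom CLI flags used in the test plan.

def pytest_addoption(parser):
    # Both defaults are assumptions; the real suite may choose different ones.
    parser.addoption(
        "--text-model",
        action="store",
        default=None,
        help="Model ID used for text inference in the tests",
    )
    parser.addoption(
        "--judge-model",
        action="store",
        default=None,
        help="Model ID used by LLM-as-judge scoring functions",
    )


def judge_model_id(config):
    # Hypothetical helper: fall back to the text model when no judge is given.
    return config.getoption("--judge-model") or config.getoption("--text-model")
```

Inside a test or fixture, `request.config` would be the object passed to a helper like `judge_model_id`.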

yanxi0830 · 2025-03-06T00:26:03Z

llama_stack/providers/inline/scoring/llm_as_judge/scoring_fn/llm_as_judge_scoring_fn.py

-                    "role": "user",
-                    "content": judge_input_msg,
-                }
+                UserMessage(


bug uncovered from unit tests :)
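
For context, the change above swaps a raw dict for a typed `UserMessage`. The thread does not say how the bug surfaced; the sketch below shows one way such a mismatch can fail, using a stand-in `UserMessage` dataclass and a hypothetical `render_prompt` consumer (neither is the llama-stack implementation).

```python
from dataclasses import dataclass


@dataclass
class UserMessage:
    # Stand-in for a typed message model; not the llama-stack class.
    content: str
    role: str = "user"


def render_prompt(messages):
    # Hypothetical consumer that relies on attribute access.
    return "\n".join(f"{m.role}: {m.content}" for m in messages)


judge_input_msg = "Rate the answer from 0 to 10."

# Typed message: attribute access works.
ok = render_prompt([UserMessage(content=judge_input_msg)])

# Raw dict: fails, because a dict has no `.role` attribute.
try:
    render_prompt([{"role": "user", "content": judge_input_msg}])
    dict_failed = False
except AttributeError:
    dict_failed = True
```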

ashwinb · 2025-03-06T17:41:55Z

tests/integration/scoring/test_scoring.py

-                prompt_template=sample_judge_prompt_template,
-                judge_score_regexes=[r"Score: (\d+)"],
+
+    scoring_fn = scoring_fns_list[0]


why the first one?

We test one scoring function per provider because braintrust has 10+ scoring functions (each making multiple LLM calls), and it's slow to loop over all of them.

Will look into adding mocks for scoring as well so that the tests run within a reasonable time.
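
A minimal sketch of the "one scoring function per provider" selection described above. It assumes each listed scoring function exposes a `provider_id`; the dict shape and identifiers are made up for illustration and are not the test suite's actual code.

```python
from collections import defaultdict


def first_scoring_fn_per_provider(scoring_fns):
    """Group scoring functions by provider and keep the first of each group."""
    by_provider = defaultdict(list)
    for fn in scoring_fns:
        by_provider[fn["provider_id"]].append(fn)
    return {provider: fns[0] for provider, fns in by_provider.items()}


# Illustrative data only; these identifiers are hypothetical.
scoring_fns = [
    {"identifier": "braintrust::fn-a", "provider_id": "braintrust"},
    {"identifier": "braintrust::fn-b", "provider_id": "braintrust"},
    {"identifier": "basic::fn-c", "provider_id": "basic"},
]

selected = first_scoring_fn_per_provider(scoring_fns)
```

This keeps the number of LLM-backed calls proportional to the number of providers instead of the number of scoring functions.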

yanxi0830 added 2 commits March 5, 2025 14:59

datasetio pass

7f34968

unregister fix

2091585

facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 5, 2025

yanxi0830 changed the title ~~tests (wip): revamp eval related integration tests~~ test(wip): revamp eval related integration tests Mar 5, 2025

yanxi0830 added 5 commits March 5, 2025 15:18

update provider id

0385309

add registeration test

f246405

fix scoring

546a417

fix scoring

5d43b91

default text model

54abeee

yanxi0830 commented Mar 6, 2025

tmp eval

fd68b0d

yanxi0830 added this to the v0.1.6 milestone Mar 6, 2025

work eval

6e65b92

yanxi0830 changed the title ~~test(wip): revamp eval related integration tests~~ test: revamp eval related integration tests Mar 6, 2025

yanxi0830 marked this pull request as ready for review March 6, 2025 01:35

yanxi0830 requested review from SLR722, ashwinb, dineshyv, dltn, ehhuang, hardikjshah, raghotham, sixianyi0721, terrytangyuan and vladimirivic as code owners March 6, 2025 01:35

yanxi0830 added 4 commits March 5, 2025 17:36

fix eval

2541dcc

fix eval

62a844c

fix eval

9066b2a

merge

72dee96

ehhuang approved these changes Mar 6, 2025

ashwinb reviewed Mar 6, 2025

yanxi0830 merged commit bcb13c4 into main Mar 6, 2025
4 checks passed

yanxi0830 deleted the revive_eval_integration_test branch March 6, 2025 18:51

4 participants
