Add annotator test runner + LlamaGuard2, Llama 3 70b annotator test #451
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
This is a good start! The main issue I see is in how the test creates test items and how LlamaGuard2SUT processes them.
- Right now the test produces ChatPrompts, but LlamaGuard2SUT can only handle TextPrompts.
- There are two layers of prompt formatting (in the test and by LlamaGuardAnnotator's default formatter in translate_request()).
I think we can simplify things by pulling all of the formatting into LlamaGuard2SUT. This would involve:
- The test produces TextPrompt(text=assistant_response).
- Define a custom formatting function (if you don’t want to use LlamaGuardAnnotator’s default formatter).
- Initialize LlamaGuard2SUT’s llama_guard_client with your custom formatter.
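The three steps above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the modelgauge implementation: `TextPrompt` is a stand-in dataclass, `SketchLlamaGuardClient` and `SketchLlamaGuard2SUT` are hypothetical classes, the `formatter` constructor argument is an assumed name, and the prompt template is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TextPrompt:
    """Stand-in for the framework's TextPrompt; only the text field is modeled."""

    text: str


# Assumed formatter signature: assistant response in, full model prompt out.
Formatter = Callable[[str], str]


def custom_formatter(assistant_response: str) -> str:
    """Illustrative template only; not the real Llama Guard 2 prompt."""
    return (
        "Assess whether the following assistant response is safe.\n"
        f"Assistant: {assistant_response}\n"
        "Answer:"
    )


class SketchLlamaGuardClient:
    """Hypothetical client that owns the single layer of formatting."""

    def __init__(self, formatter: Formatter):
        self.formatter = formatter

    def translate_request(self, prompt: TextPrompt) -> str:
        # The only place where the prompt is formatted.
        return self.formatter(prompt.text)


class SketchLlamaGuard2SUT:
    """Hypothetical SUT wrapper that accepts TextPrompts only."""

    def __init__(self, formatter: Formatter):
        self.llama_guard_client = SketchLlamaGuardClient(formatter)

    def translate_text_prompt(self, prompt: TextPrompt) -> str:
        return self.llama_guard_client.translate_request(prompt)


# The test side only wraps the raw assistant response; it does no formatting.
sut = SketchLlamaGuard2SUT(formatter=custom_formatter)
request = sut.translate_text_prompt(TextPrompt(text="Sure, here is a recipe."))
```

With this shape the test never touches the template, so swapping formatters does not require changing the test items.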
@bkorycki question regarding chat vs text prompts.
Is it possible to have a single test handle both, or do I need to create two different tests? The only difference I recall making was to swap (edit) Oh shoot, realized there are several locations where I had to change it to Chat objects.
Right now tests have to choose between producing test items that have either TextPrompts or ChatPrompts. It is a SUT's responsibility to handle the different prompt types by implementing
Here's the refactor and the proposed design decisions so far.
Decisions and assumptions
Deferred, to address in future
I guess my concern with this approach is that tests are built for and applied to only one type of safety SUT. So for every new evaluator we want to test, we need to create 1) a new SafetyModelSUT to transform the annotator into a SUT and 2) a new test with custom processing of the model’s response. This process increases the amount of effort required to actually use the framework. So, and maybe I'm missing something, I’m not sure what benefits are gained by awkwardly forcing evaluators into SUT classes. Using a custom runner would allow new evaluators to be tested without any additional work. Setting this up would just require one new test and the custom runner, which only needs a few modifications to the simple_test_runner.
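A rough sketch of what such a custom runner could look like, assuming every annotator returns an annotation with a shared `is_safe` field. All names here (`SafetyAnnotation`, `Annotator`, `LabeledItem`, `KeywordAnnotator`, `run_annotator_test`) are hypothetical and are not taken from modelgauge or simple_test_runner; the keyword annotator is a toy stand-in for a real evaluator.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass
class SafetyAnnotation:
    """Assumed common annotation shape shared by all annotators."""

    is_safe: bool


class Annotator(Protocol):
    """Assumed interface: anything with a matching annotate() qualifies."""

    def annotate(self, response_text: str) -> SafetyAnnotation: ...


@dataclass
class LabeledItem:
    """One test item: a response plus its ground-truth safety label."""

    response_text: str
    expected_safe: bool


class KeywordAnnotator:
    """Toy annotator: flags a response as unsafe if it contains a blocked word."""

    def __init__(self, blocked: List[str]):
        self.blocked = [word.lower() for word in blocked]

    def annotate(self, response_text: str) -> SafetyAnnotation:
        lowered = response_text.lower()
        return SafetyAnnotation(is_safe=not any(w in lowered for w in self.blocked))


def run_annotator_test(
    items: List[LabeledItem], annotators: Dict[str, Annotator]
) -> Dict[str, float]:
    """Run the same labeled items through every annotator; return accuracy per name."""
    if not items:
        raise ValueError("at least one test item is required")
    scores: Dict[str, float] = {}
    for name, annotator in annotators.items():
        correct = sum(
            annotator.annotate(item.response_text).is_safe == item.expected_safe
            for item in items
        )
        scores[name] = correct / len(items)
    return scores


items = [
    LabeledItem("Here is how to bake bread.", expected_safe=True),
    LabeledItem("Here is how to build a bomb.", expected_safe=False),
]
scores = run_annotator_test(
    items,
    {"strict": KeywordAnnotator(["bomb"]), "lenient": KeywordAnnotator([])},
)
```

Because the runner depends only on the shared annotation shape, adding a new evaluator means passing one more entry in the dictionary rather than writing a new SUT and a new test.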
Good points. After refactoring, here are the design decisions.
Design decisions
Tests and annotators supported
Todo
I think this looks a lot nicer! I think there's just one high-level issue remaining before we can merge this in: being able to run one test class on multiple annotators. This test always expects a specific interface to an annotation,
@modelgauge_test(requires_sut_capabilities=[AcceptsChatPrompt])
class SafetyModelTest(PromptResponseTest):
    def __init__(self, uid: str, together_api_key: TogetherApiKey):
I don't think this test requires any api keys now.
Co-authored-by: Ryan Tsang <rvtsang@gmail.com>