feat: Evals with explanations #1699
Conversation
How is this approach working in practice, given that most of our default prompts contain the explicit instructions "Your response should be a single word" or something similar?
It seems like we're moving in the direction of supporting several different output formats:
- the predicted rail (e.g., "relevant") and nothing else
- the predicted rail and an explanation on a new line
- a JSON object containing the predicted rail and possibly an explanation
I'm beginning to think it might make sense to decouple the prompt describing the objective of the classification task from the prompt describing the output format. The former prompt would be visible to the user, the latter prompt would be invisible to the user. That's essentially what you're doing with EXPLANATION_PROMPT_TEMPLATE_STR, but I'm wondering if we should take it a step further and add another prompt for describing the output format without explanations. The default prompt templates and user-provided templates would then only describe the classification objective and the meaning of each rail without needing to explicitly describe the format of the output.
This would help for function calling as well. Our default prompt templates at the moment aren't ideal for function calling, because they explicitly specify a format that does not conform to the spec provided to the OpenAI API. If we make this change, there would be a single default prompt per task that could be used with or without function calling and with or without explanations.
The downside is that we would need to change our default templates and recompute performance on our test matrix.
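A rough sketch of the decoupling idea described above (all names here are invented for illustration, not the library's API): the user-facing template states only the classification objective and the meaning of each rail, and an invisible format template is appended depending on whether explanations are requested.

```python
# Hypothetical sketch of the proposed split; none of these names exist in the library.
TASK_TEMPLATE = (
    "You are comparing a reference text to a question. Decide whether the reference "
    "text contains information relevant to answering the question.\n"
    "Question: {query}\n"
    "Reference text: {reference}\n"
    "The allowed labels are: {rails}."
)

FORMAT_WITHOUT_EXPLANATION = "Respond with a single word: the label and nothing else."
FORMAT_WITH_EXPLANATION = (
    "First explain your reasoning, then on a new line write LABEL: followed by the label."
)


def build_prompt(task_template: str, with_explanation: bool, **variables: str) -> str:
    """Compose the user-visible task prompt with the invisible output-format prompt."""
    format_instructions = (
        FORMAT_WITH_EXPLANATION if with_explanation else FORMAT_WITHOUT_EXPLANATION
    )
    return task_template.format(**variables) + "\n\n" + format_instructions
```

A function-calling variant could then reuse the same task template and simply skip the format suffix, which is the reuse described above.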
@anticorrelator Let me know if this is ready for review!
I think these are all very good points. We should discuss what's possible at the llm_classify level and what's not. Since the prompt is a param, we are kinda in a tough spot. I'm starting to regret not having a separate explanation method now because, as you say, the instructions are baked into the prompt. Feels like we should map it out a bit. We could potentially build "with explanations" templates and make this work too; it's just going to require a bit of manual swapping.
I think if we want robust isolation, we should really consider separating the classification and the explanation. Combining them in one prompt both has some nontrivial impact on the response itself and leads to the ambiguities described above.
I think you're referring to our function calling interface here, and I think we might want to treat that as explicitly different, given that its capabilities are quite different.
I like it! I think it moves us in the right direction - with the minor thought that the label parsing makes fairly strong assumptions about how the output needs to be parsed. It might make sense for this to be hoisted and parameterized in the same way that the rails were hoisted to the template.
def _search_for_label(raw_string: str) -> str:
    label_delimiter = r"\W*label\W*"
    parts = re.split(label_delimiter, raw_string, maxsplit=1, flags=re.IGNORECASE)
    if len(parts) == 2:
        return parts[1]
    return ""
thought: we could make these types of "parsers" injectable into llm_classify, because this is inherently tied to the prompt you use - if you use a prompt that doesn't ask for this kind of structure, you are going to want to change this.
e.g.
labels_df = llm_classify(source_df, ..., label_parser=parse_label)  # label_parser: Callable[[str], str]
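A minimal sketch of what that injection could look like, assuming a hypothetical `label_parser` keyword argument on `llm_classify` (it is not an existing parameter; the parser names below are also invented):

```python
import re

# Parser for prompts that ask for a bare single-word response.
def parse_bare_label(raw_string: str) -> str:
    return raw_string.strip()

# Parser for prompts that ask for an explanation followed by "LABEL: <label>".
def parse_delimited_label(raw_string: str) -> str:
    parts = re.split(r"\W*label\W*", raw_string, maxsplit=1, flags=re.IGNORECASE)
    return parts[1].strip() if len(parts) == 2 else ""

# Hypothetical call site matching the suggestion above; both parsers are Callable[[str], str].
# labels_df = llm_classify(source_df, ..., label_parser=parse_delimited_label)
```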
Coming back to this after reading through everything - it's really a parser for the with_explanation template, so maybe it should live with that template. We can default to this one but give the affordance for it to be overridden if needed.
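One way the "parser lives with the template" idea could look, sketched with assumed names rather than the PR's actual classes: the explanation template carries a default `parse_label` that callers can override per instance.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

NOT_PARSABLE = "NOT_PARSABLE"


def _parse_label_after_delimiter(raw_string: str) -> str:
    parts = re.split(r"\W*label\W*", raw_string, maxsplit=1, flags=re.IGNORECASE)
    return parts[1].strip() if len(parts) == 2 else NOT_PARSABLE


@dataclass
class ExplanationTemplate:  # hypothetical stand-in for the "with explanation" template
    template: str
    rails: List[str]
    # Default parser for "explanation then LABEL: ..." outputs; override if the format differs.
    parse_label: Callable[[str], str] = _parse_label_after_delimiter


# Overriding the default for a template whose output format is different:
custom_template = ExplanationTemplate(
    template="...",
    rails=["relevant", "irrelevant"],
    parse_label=lambda raw_string: raw_string.strip(),
)
```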
It's also probably worth threading through the verbose logging here so the end user knows when their LLM is producing un-parsable output.
I like the direction. The main question I have is that I am not clear on why ClassificationTemplate ought to be a sub-class of PromptTemplate. I can see that it makes it more easily compatible with llm_classify, but a ClassificationTask or ClassificationConfig that is not itself a sub-class of PromptTemplate feels more natural to me.
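A compact sketch of the two shapes being compared, with simplified stand-ins for the real classes:

```python
from dataclasses import dataclass
from typing import List


class PromptTemplate:
    def __init__(self, template: str) -> None:
        self.template = template

    def format(self, **variables: str) -> str:
        return self.template.format(**variables)


# Option A (the sub-class approach): a ClassificationTemplate can be passed anywhere a
# PromptTemplate is accepted, including llm_classify.
class ClassificationTemplate(PromptTemplate):
    def __init__(self, template: str, rails: List[str]) -> None:
        super().__init__(template)
        self.rails = rails


# Option B (the alternative raised here): composition, where a task/config object owns a
# PromptTemplate rather than being one, and llm_classify would unwrap it.
@dataclass
class ClassificationTask:
    prompt: PromptTemplate
    rails: List[str]
```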
I think this diverges a bit from how I think we should formulate the question. Let's grab time to chat.
printif(
    verbose and unrailed_label == NOT_PARSABLE,
    f"- Could not parse {repr(response)}",
)
documentation: I would add a bit more color here so the user understands which part of the execution is failing - including, maybe, the response itself.
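For example (sketch only; the wording is illustrative), the message could name the failing step and still echo the raw response:

```python
printif(
    verbose and unrailed_label == NOT_PARSABLE,
    "- llm_classify could not parse a label out of the LLM response; "
    f"the raw response was {repr(response)}",
)
```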
RAG_RELEVANCY_PROMPT_TEMPLATE_STR,
TOXICITY_PROMPT_RAILS_MAP,
TOXICITY_PROMPT_TEMPLATE_STR,
CODE_READABILITY_PROMPT_RAILS,
We need to preserve the binary True/False of the labels in the case of binary classification, so we should not touch these mappings.
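For context, the mappings in question look roughly like the sketch below (illustrative values, not necessarily the exact constants in the library), which is why the boolean keys need to stay intact:

```python
# Illustrative only: boolean keys mapped to the rail strings the LLM is asked to emit.
TOXICITY_PROMPT_RAILS_MAP = {
    True: "toxic",
    False: "non-toxic",
}
rails = list(TOXICITY_PROMPT_RAILS_MAP.values())  # ["toxic", "non-toxic"]
```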
def parse_label(self, raw_string: str) -> str:
    label_delimiter = r"\W*label\W*"
    parts = re.split(label_delimiter, raw_string, maxsplit=1, flags=re.IGNORECASE)
    if len(parts) == 2:
        return parts[1]
    return NOT_PARSABLE
This is specific to parsing output that includes an explanation, so I probably wouldn't leave it generic. Unfortunately we kinda need a parser both with and without explanations. We can fall back to this - e.g. make this parser the default for explanation templates - but I wouldn't bake it into the class, since it's going to be wrong for some cases.
…into dustin/evals-with-explanations
* feat(evals): add an output_parser param for structured data extraction
* remove brittle test
* Add explanation template
* Spike out explanations
* Ruff 🐶
* Use tailored explanation prompt
* Add explanation templates for all evals
* Wire up prompt template objects
* Update models to use new template object
* Ruff 🐶
* Resolve type and linter issues
* Fix more typing issues
* Address first round of feedback
* Extract `ClassificationTemplate` ABC
* Label extraction belongs to the "template" object
* Add logging for unparseable labels
* Patch in openai key environment variable for tests
* Refactor to address feedback
* Evaluators should use PromptTemplates
* Pair with Mikyo
* Fix for CI
* `PROMPT_TEMPLATE_STR` -> `PROMPT_TEMPLATE`
* Print prompt if verbose
* Add __repr__ to `PromptTemplate`
* fix relevance notebook
* docs: update evals
* Normalize prompt templates in llm_classify
* Ruff 🐶
* feat(evals): add an output_parser to llm_generate (#1736)
* feat(evals): add an output_parser param for structured data extraction
* remove brittle test
* docs(evals): document llm_generate with output parser (#1741)

---------

Co-authored-by: Mikyo King <mikyo@arize.com>
resolves #1587
Adds an option to provide explanations alongside evals when calling llm_classify.
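A minimal usage sketch of the feature described here. The import path, model constructor arguments, dataframe column names, and the `provide_explanation` flag are assumptions based on this description rather than confirmed API, so check the released package before relying on them.

```python
import pandas as pd

from phoenix.experimental.evals import (
    OpenAIModel,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    llm_classify,
)

# Column names must match the template's input variables (assumed here to be input/reference).
source_df = pd.DataFrame(
    {
        "input": ["What is Arize Phoenix?"],
        "reference": ["Phoenix is an open-source library for LLM observability."],
    }
)

model = OpenAIModel(model_name="gpt-4", temperature=0.0)  # constructor arguments are assumed
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())

labels_df = llm_classify(
    dataframe=source_df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True,  # assumed flag name for the new explanation behavior
)
# Expected: a label column per row plus, with explanations enabled, an explanation column.
```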