
feat: Evals with explanations #1699

Merged: 33 commits from dustin/evals-with-explanations into main on Nov 14, 2023

Conversation

@anticorrelator (Contributor)

resolves #1587

Adds an option to provide explanations alongside evals when calling llm_classify.
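
A minimal sketch of how this might be called, assuming the new option is exposed as a provide_explanation flag on llm_classify; the flag name, the model_name argument, and the dataframe column names below are assumptions for illustration, not a confirmed API:

import pandas as pd
from phoenix.experimental.evals import (
    OpenAIModel,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE_STR,
    llm_classify,
)

# Column names must match the variables of the chosen template.
df = pd.DataFrame({"input": ["..."], "reference": ["..."]})

# provide_explanation is the assumed name of the option added by this PR; when
# enabled, the returned dataframe would carry an explanation alongside each label.
labels_df = llm_classify(
    dataframe=df,
    model=OpenAIModel(model_name="gpt-4"),
    template=RAG_RELEVANCY_PROMPT_TEMPLATE_STR,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)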

@axiomofjoy (Contributor) left a comment

How is this approach working in practice, given that most of our default prompts contain the explicit instructions "Your response should be a single word" or something similar?

It seems like we're moving in the direction of supporting several different output formats:

  • the predicted rail (e.g., "relevant") and nothing else
  • the predicted rail and an explanation on a new line
  • a JSON object containing the predicted rail and possibly an explanation

I'm beginning to think it might make sense to decouple the prompt describing the objective of the classification task from the prompt describing the output format. The former prompt would be visible to the user, the latter prompt would be invisible to the user. That's essentially what you're doing with EXPLANATION_PROMPT_TEMPLATE_STR, but I'm wondering if we should take it a step further and add in another prompt for describing the output format without explanations. The default prompt templates and user-provided templates would then only describe the classification objective and the meaning of each rail without needing to explicitly describe the format of the output.

This would help for function calling as well. Our default prompt templates at the moment aren't ideal for function calling, because they explicitly specify a format that does not conform to the spec provided to the OpenAI API. If we make this change, there would be a single default prompt per task that could be used with or without function calling and with or without explanations.

The downside is that we would need to change our default templates and recompute performance on our test matrix.
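
To make the idea concrete, a hypothetical sketch of the decoupling; all names below, such as RELEVANCE_TASK_TEMPLATE and the two format suffixes, are made up for illustration:

# User-visible template: describes only the classification objective and the
# meaning of each rail, with no mention of the output format.
RELEVANCE_TASK_TEMPLATE = (
    "You are comparing a reference text to a question. Decide whether the "
    "reference text contains information relevant to answering the question.\n"
    "Question: {input}\nReference text: {reference}\n"
    "The possible labels are: relevant, irrelevant."
)

# User-invisible format prompts, selected at call time.
PLAIN_FORMAT_INSTRUCTIONS = "Respond with a single word: the label, and nothing else."
EXPLANATION_FORMAT_INSTRUCTIONS = (
    "First explain your reasoning, then on a new line write LABEL: followed by the label."
)

def build_prompt(task_template: str, with_explanation: bool = False) -> str:
    # Compose the user-visible task prompt with the hidden format prompt.
    suffix = EXPLANATION_FORMAT_INSTRUCTIONS if with_explanation else PLAIN_FORMAT_INSTRUCTIONS
    return f"{task_template}\n\n{suffix}"

With function calling, the format suffix would simply be dropped and the output structure described via the API's function/tool schema instead.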

@axiomofjoy (Contributor)

@anticorrelator Let me know if this is ready for review!

@mikeldking (Contributor)

(quoting @axiomofjoy's comment above)

I think these are all very good points. We should discuss what is and isn't possible at the llm_classify level. Since the prompt is a param, we're kind of in a tough spot. I'm starting to regret not having a separate explanation method now because, as you say, the instructions are baked into the prompt. Feels like we should map it out a bit. We could also build with-explanation templates and make this work, but it's going to require a bit of manual swapping.

@anticorrelator (Contributor, Author)

I think if we want robust isolation, we should really consider separating the classification and the explanation. Combining them in one prompt both has a nontrivial impact on the response itself and leads to the ambiguities described above.

Regarding "a JSON object containing the predicted rail and possibly an explanation": I think you're referring to our function calling interface here, and we might want to treat that as explicitly different, given that its capabilities are quite different.

@mikeldking (Contributor) left a comment

I like it! I think it moves us in the right direction, with the minor thought that the label parsing makes fairly strong assumptions about how the output needs to be parsed. It might make sense for this to be hoisted and parameterized in the same way that the rails were hoisted to the template.

Comment on lines 343 to 348
def _search_for_label(raw_string: str) -> str:
    label_delimiter = r"\W*label\W*"
    parts = re.split(label_delimiter, raw_string, maxsplit=1, flags=re.IGNORECASE)
    if len(parts) == 2:
        return parts[1]
    return ""
Contributor

thought: we could make these types of "parsers" injectable into llm_classify, because this is inherently tied to what prompt you use; if you use a prompt that doesn't ask for structure like this, you are going to want to change this.

e.g.

labels_df = llm_classify(source_df, ..., label_parser=lambda raw_str: ...)
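
A rough sketch of what an injectable parser could look like; _classify_responses, _default_label_parser, and the label_parser parameter are hypothetical names used for illustration, not the current API:

from typing import Callable, List, Optional

NOT_PARSABLE = "NOT_PARSABLE"

def _default_label_parser(raw_string: str) -> str:
    # Default for prompts that ask for the bare label: strip and return as-is.
    return raw_string.strip()

def _classify_responses(
    responses: List[str],
    rails: List[str],
    label_parser: Optional[Callable[[str], str]] = None,
) -> List[str]:
    # Stand-in for the dataframe plumbing inside llm_classify.
    parse = label_parser or _default_label_parser
    labels = []
    for response in responses:
        label = parse(response)
        labels.append(label if label in rails else NOT_PARSABLE)
    return labels

# A caller whose prompt puts the label on the last line could then pass:
# _classify_responses(responses, rails, label_parser=lambda s: s.strip().splitlines()[-1])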

Contributor

Coming back to this after reading through everything - it's really a parser for the with_explanation template so maybe it should live with that template. We can default to this one but give the affordance for it to be overridden if needed.

Contributor

It's also probably worth threading through the verbose logging here so the end user knows when their LLM is producing un-parsable output.

@axiomofjoy (Contributor) left a comment

I like the direction. My main question is that I'm not clear on why ClassificationTemplate ought to be a subclass of PromptTemplate. I can see that it makes it more easily compatible with llm_classify, but a ClassificationTask or ClassificationConfig that is not itself a subclass of PromptTemplate feels more natural to me.
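
Sketched with hypothetical names, the alternative would be composition rather than inheritance: a ClassificationTask that holds a PromptTemplate instead of being one.

from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class PromptTemplate:
    text: str

    def format(self, **variables: str) -> str:
        return self.text.format(**variables)

@dataclass
class ClassificationTask:
    # The task owns its template (plus, optionally, an explanation template
    # and a label parser) rather than subclassing PromptTemplate.
    template: PromptTemplate
    rails: List[str]
    explanation_template: Optional[PromptTemplate] = None
    label_parser: Optional[Callable[[str], str]] = None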

@anticorrelator marked this pull request as ready for review on November 10, 2023 at 02:57
@mikeldking (Contributor) left a comment

I think this diverges a bit from how I think we should formulate the question. Let's grab time to chat.

Comment on lines +112 to +115
printif(
    verbose and unrailed_label == NOT_PARSABLE,
    f"- Could not parse {repr(response)}",
)
Contributor

documentation: I would add a bit more color here so the user understands what part of the execution is failing, maybe including the response.
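
One possible wording, reusing the names from the snippet above; the exact message is just a suggestion:

printif(
    verbose and unrailed_label == NOT_PARSABLE,
    f"- Label extraction failed: could not find a label matching the rails "
    f"in the LLM response {repr(response)}",
)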

RAG_RELEVANCY_PROMPT_TEMPLATE_STR,
TOXICITY_PROMPT_RAILS_MAP,
TOXICITY_PROMPT_TEMPLATE_STR,
CODE_READABILITY_PROMPT_RAILS,
Contributor

we need to preserve the binary True/False of the labels in the case of binary classification so we should not touch these mappings.
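
For reference, these rails maps are keyed on booleans, roughly of this shape (the exact label strings here are illustrative, not copied from the source):

# Illustrative shape only; the real values live in the templates module.
TOXICITY_PROMPT_RAILS_MAP = {True: "toxic", False: "non-toxic"}
# Downstream code relies on the True/False keys to recover a binary outcome,
# which is why the mappings themselves should stay untouched.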

Comment on lines 81 to 86
def parse_label(self, raw_string: str) -> str:
    label_delimiter = r"\W*label\W*"
    parts = re.split(label_delimiter, raw_string, maxsplit=1, flags=re.IGNORECASE)
    if len(parts) == 2:
        return parts[1]
    return NOT_PARSABLE
Contributor

This is specific to the output parsing of the explanation, so I probably wouldn't leave it generic. Unfortunately we kind of need a parser both with and without explanations. We can fall back to this one, e.g. make this parser the default for explanation templates, but I wouldn't bake it into the class since it's going to be wrong for some cases.
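
One possible shape for that fallback, sketched with hypothetical names (parse_label and NOT_PARSABLE come from the diff above; ExplanationTemplate and its parser argument are made up):

import re
from typing import Callable, Optional

NOT_PARSABLE = "NOT_PARSABLE"

def _parse_label_after_explanation(raw_string: str) -> str:
    # Default parser for explanation-style output: take whatever follows "label".
    parts = re.split(r"\W*label\W*", raw_string, maxsplit=1, flags=re.IGNORECASE)
    return parts[1] if len(parts) == 2 else NOT_PARSABLE

class ExplanationTemplate:
    def __init__(self, text: str, parser: Optional[Callable[[str], str]] = None) -> None:
        self.text = text
        # Fall back to the shared default, but allow a per-template override.
        self._parser = parser or _parse_label_after_explanation

    def parse_label(self, raw_string: str) -> str:
        return self._parser(raw_string)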



@mikeldking merged commit 2db8141 into main on Nov 14, 2023
9 checks passed
@mikeldking deleted the dustin/evals-with-explanations branch on November 14, 2023 at 20:07
mikeldking added a commit that referenced this pull request Nov 15, 2023
* Add explanation template

* Spike out explanations

* Ruff 🐶

* Use tailored explanation prompt

* Add explanation templates for all evals

* Wire up prompt template objects

* Update models to use new template object

* Ruff 🐶

* Resolve type and linter issues

* Fix more typing issues

* Address first round of feedback

* Extract `ClassificationTemplate` ABC

* Label extraction belongs to the "template" object

* Add logging for unparseable labels

* Patch in openai key environment variable for tests

* Refactor to address feedback

* Evaluators should use PromptTemplates

* Pair with Mikyo

* Fix for CI

* `PROMPT_TEMPLATE_STR` -> `PROMPT_TEMPLATE`

* Print prompt if verbose

* Add __repr__ to `PromptTemplate`

* fix relevance notebook

* docs: update evals

* Normalize prompt templates in llm_classify

* Ruff 🐶

* feat(evals): add an output_parser to llm_generate (#1736)

* feat(evals): add an output_parser param for structured data extraction

* remove brittle test

* docs(evals): document llm_generate with output parser (#1741)

---------

Co-authored-by: Mikyo King <mikyo@arize.com>
Successfully merging this pull request may close these issues: [ENHANCEMENT] Evals with Explanation