
feat: Evals with explanations #1699

Merged: 33 commits from dustin/evals-with-explanations into main on Nov 14, 2023

Conversation

@anticorrelator (Contributor)

resolves #1587

Adds an option to provide explanations alongside evals when calling llm_classify.
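
A minimal sketch of how this might be called, assuming the new option is exposed as a provide_explanation flag on llm_classify; the flag name, the model_name argument, and the dataframe column names below are assumptions for illustration, not a confirmed API:

import pandas as pd
from phoenix.experimental.evals import (
    OpenAIModel,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE_STR,
    llm_classify,
)

# Column names must match the variables of the chosen template.
df = pd.DataFrame({"input": ["..."], "reference": ["..."]})

# provide_explanation is the assumed name of the option added by this PR; when
# enabled, the returned dataframe would carry an explanation alongside each label.
labels_df = llm_classify(
    dataframe=df,
    model=OpenAIModel(model_name="gpt-4"),
    template=RAG_RELEVANCY_PROMPT_TEMPLATE_STR,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)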

@axiomofjoy (Contributor) left a comment

How is this approach working in practice, given that most of our default prompts contain the explicit instructions "Your response should be a single word" or something similar?

It seems like we're moving in the direction of supporting several different output formats:

  • the predicted rail (e.g., "relevant") and nothing else
  • the predicted rail and an explanation on a new line
  • a JSON object containing the predicted rail and possibly an explanation

I'm beginning to think it might make sense to decouple the prompt describing the objective of the classification task from the prompt describing the output format. The former prompt would be visible to the user, the latter prompt would be invisible to the user. That's essentially what you're doing with EXPLANATION_PROMPT_TEMPLATE_STR, but I'm wondering if we should take it a step further and add in another prompt for describing the output format without explanations. The default prompt templates and user-provided templates would then only describe the classification objective and the meaning of each rail without needing to explicitly describe the format of the output.

This would help for function calling as well. Our default prompt templates at the moment aren't ideal for function calling, because they explicitly specify a format that does not conform to the spec provided to the OpenAI API. If we make this change, there would be a single default prompt per task that could be used with or without function calling and with or without explanations.

The downside is that we would need to change our default templates and recompute performance on our test matrix.
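
To make the idea concrete, a hypothetical sketch of the decoupling; all names below, such as RELEVANCE_TASK_TEMPLATE and the two format suffixes, are made up for illustration:

# User-visible template: describes only the classification objective and the
# meaning of each rail, with no mention of the output format.
RELEVANCE_TASK_TEMPLATE = (
    "You are comparing a reference text to a question. Decide whether the "
    "reference text contains information relevant to answering the question.\n"
    "Question: {input}\nReference text: {reference}\n"
    "The possible labels are: relevant, irrelevant."
)

# User-invisible format prompts, selected at call time.
PLAIN_FORMAT_INSTRUCTIONS = "Respond with a single word: the label, and nothing else."
EXPLANATION_FORMAT_INSTRUCTIONS = (
    "First explain your reasoning, then on a new line write LABEL: followed by the label."
)

def build_prompt(task_template: str, with_explanation: bool = False) -> str:
    # Compose the user-visible task prompt with the hidden format prompt.
    suffix = EXPLANATION_FORMAT_INSTRUCTIONS if with_explanation else PLAIN_FORMAT_INSTRUCTIONS
    return f"{task_template}\n\n{suffix}"

With function calling, the format suffix would simply be dropped and the output structure described via the API's function/tool schema instead.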

@axiomofjoy (Contributor)

@anticorrelator Let me know if this is ready for review!

@mikeldking (Contributor)

(quoting @axiomofjoy's comment above)

I think these are all very good points. We should discuss what is and isn't possible at the llm_classify level. Since the prompt is a param, we're kind of in a tough spot. I'm starting to regret not having a separate explanation method now because, as you say, the instructions are baked into the prompt. Feels like we should map it out a bit. We could also build with-explanation templates and make this work, but it's going to require a bit of manual swapping.

@anticorrelator (Contributor, Author)

I think if we want robust isolation, we should really consider separating the classification and the explanation. Combining them in one prompt both has a nontrivial impact on the response itself and leads to the ambiguities described above.

Regarding "a JSON object containing the predicted rail and possibly an explanation": I think you're referring to our function calling interface here, and we might want to treat that as explicitly different, given that its capabilities are quite different.

@mikeldking (Contributor) left a comment

I like it! I think it moves us in the right direction, with the minor thought that the label parsing makes fairly strong assumptions about how the output needs to be parsed. It might make sense for this to be hoisted and parameterized in the same way that the rails were hoisted to the template.

Comment on lines 343 to 348
def _search_for_label(raw_string: str) -> str:
    label_delimiter = r"\W*label\W*"
    parts = re.split(label_delimiter, raw_string, maxsplit=1, flags=re.IGNORECASE)
    if len(parts) == 2:
        return parts[1]
    return ""
Contributor

thought: we could make these types of "parsers" injectable into llm_classify, because this is inherently tied to what prompt you use; if you use a prompt that doesn't ask for structure like this, you are going to want to change this.

e.g.

labels_df = llm_classify(source_df, ..., label_parser=lambda raw_str: ...)
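
A rough sketch of what an injectable parser could look like; _classify_responses, _default_label_parser, and the label_parser parameter are hypothetical names used for illustration, not the current API:

from typing import Callable, List, Optional

NOT_PARSABLE = "NOT_PARSABLE"

def _default_label_parser(raw_string: str) -> str:
    # Default for prompts that ask for the bare label: strip and return as-is.
    return raw_string.strip()

def _classify_responses(
    responses: List[str],
    rails: List[str],
    label_parser: Optional[Callable[[str], str]] = None,
) -> List[str]:
    # Stand-in for the dataframe plumbing inside llm_classify.
    parse = label_parser or _default_label_parser
    labels = []
    for response in responses:
        label = parse(response)
        labels.append(label if label in rails else NOT_PARSABLE)
    return labels

# A caller whose prompt puts the label on the last line could then pass:
# _classify_responses(responses, rails, label_parser=lambda s: s.strip().splitlines()[-1])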

Contributor

Coming back to this after reading through everything - it's really a parser for the with_explanation template so maybe it should live with that template. We can default to this one but give the affordance for it to be overridden if needed.

Contributor

It's also probably worth threading through the verbose logging here so the end user knows when their LLM is producing un-parsable output.

@axiomofjoy (Contributor) left a comment

I like the direction. My main question is that I'm not clear on why ClassificationTemplate ought to be a subclass of PromptTemplate. I can see that it makes it more easily compatible with llm_classify, but a ClassificationTask or ClassificationConfig that is not itself a subclass of PromptTemplate feels more natural to me.
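
Sketched with hypothetical names, the alternative would be composition rather than inheritance: a ClassificationTask that holds a PromptTemplate instead of being one.

from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class PromptTemplate:
    text: str

    def format(self, **variables: str) -> str:
        return self.text.format(**variables)

@dataclass
class ClassificationTask:
    # The task owns its template (plus, optionally, an explanation template
    # and a label parser) rather than subclassing PromptTemplate.
    template: PromptTemplate
    rails: List[str]
    explanation_template: Optional[PromptTemplate] = None
    label_parser: Optional[Callable[[str], str]] = None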

@anticorrelator marked this pull request as ready for review on November 10, 2023 at 02:57
@mikeldking (Contributor) left a comment

I think this diverges a bit from how I think we should formulate the question. Let's grab time to chat.

Comment on lines +112 to +115
printif(
    verbose and unrailed_label == NOT_PARSABLE,
    f"- Could not parse {repr(response)}",
)
Contributor

documentation: I would add a bit more color here so the user understands what part of the execution is failing, maybe including the response.
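
One possible wording, reusing the names from the snippet above; the exact message is just a suggestion:

printif(
    verbose and unrailed_label == NOT_PARSABLE,
    f"- Label extraction failed: could not find a label matching the rails "
    f"in the LLM response {repr(response)}",
)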

RAG_RELEVANCY_PROMPT_TEMPLATE_STR,
TOXICITY_PROMPT_RAILS_MAP,
TOXICITY_PROMPT_TEMPLATE_STR,
CODE_READABILITY_PROMPT_RAILS,
Contributor

we need to preserve the binary True/False of the labels in the case of binary classification so we should not touch these mappings.
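
For reference, these rails maps are keyed on booleans, roughly of this shape (the exact label strings here are illustrative, not copied from the source):

# Illustrative shape only; the real values live in the templates module.
TOXICITY_PROMPT_RAILS_MAP = {True: "toxic", False: "non-toxic"}
# Downstream code relies on the True/False keys to recover a binary outcome,
# which is why the mappings themselves should stay untouched.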

Comment on lines 81 to 86
def parse_label(self, raw_string: str) -> str:
    label_delimiter = r"\W*label\W*"
    parts = re.split(label_delimiter, raw_string, maxsplit=1, flags=re.IGNORECASE)
    if len(parts) == 2:
        return parts[1]
    return NOT_PARSABLE
Contributor

This is specific to the output parsing of the explanation, so I probably wouldn't leave it generic. Unfortunately we kind of need a parser both with and without explanations. We can fall back to this one, e.g. make this parser the default for explanation templates, but I wouldn't bake it into the class since it's going to be wrong for some cases.
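
One possible shape for that fallback, sketched with hypothetical names (parse_label and NOT_PARSABLE come from the diff above; ExplanationTemplate and its parser argument are made up):

import re
from typing import Callable, Optional

NOT_PARSABLE = "NOT_PARSABLE"

def _parse_label_after_explanation(raw_string: str) -> str:
    # Default parser for explanation-style output: take whatever follows "label".
    parts = re.split(r"\W*label\W*", raw_string, maxsplit=1, flags=re.IGNORECASE)
    return parts[1] if len(parts) == 2 else NOT_PARSABLE

class ExplanationTemplate:
    def __init__(self, text: str, parser: Optional[Callable[[str], str]] = None) -> None:
        self.text = text
        # Fall back to the shared default, but allow a per-template override.
        self._parser = parser or _parse_label_after_explanation

    def parse_label(self, raw_string: str) -> str:
        return self._parser(raw_string)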



@mikeldking merged commit 2db8141 into main on Nov 14, 2023
9 checks passed
@mikeldking deleted the dustin/evals-with-explanations branch on November 14, 2023 at 20:07
mikeldking added a commit that referenced this pull request Nov 15, 2023
* Add explanation template

* Spike out explanations

* Ruff 🐶

* Use tailored explanation prompt

* Add explanation templates for all evals

* Wire up prompt template objects

* Update models to use new template object

* Ruff 🐶

* Resolve type and linter issues

* Fix more typing issues

* Address first round of feedback

* Extract `ClassificationTemplate` ABC

* Label extraction belongs to the "template" object

* Add logging for unparseable labels

* Patch in openai key environment variable for tests

* Refactor to address feedback

* Evaluators should use PromptTemplates

* Pair with Mikyo

* Fix for CI

* `PROMPT_TEMPLATE_STR` -> `PROMPT_TEMPLATE`

* Print prompt if verbose

* Add __repr__ to `PromptTemplate`

* fix relevance notebook

* docs: update evals

* Normalize prompt templates in llm_classify

* Ruff 🐶

* feat(evals): add an output_parser to llm_generate (#1736)

* feat(evals): add an output_parser param for structured data extraction

* remove brittle test

* docs(evals): document llm_generate with output parser (#1741)

---------

Co-authored-by: Mikyo King <mikyo@arize.com>
Successfully merging this pull request may close these issues: [ENHANCEMENT] Evals with Explanation