
feat: Verbose evals #1558

Merged Oct 6, 2023 · 33 commits

Conversation

@anticorrelator (Contributor) commented Oct 4, 2023

Adds a verbose flag that can be passed to the evals functions llm_generate and llm_eval_binary. When set, these functions print informative messages to stdout.

For this PR we're adding verbose logging to these parts:

  • the tenacity wrappers to indicate when model calls fail and must be retried
  • the BaseEvalModel base class
  • model-specific messages that show invocation parameters in both OpenAI and VertexAI implementations
  • additional messages indicating the status of snapping LLM evals to rails

For example:

Generating responses for 4 prompts...
OpenAI invocation parameters: {'model': 'gpt-4', 'temperature': 0.0, 'max_tokens': 256, 'frequency_penalty': 0, 'presence_penalty': 0, 'top_p': 1, 'n': 1, 'request_timeout': None}
Snapping 4 responses to rails: {'relevant', 'irrelevant'}
- Snapped 'relevant' to rail: relevant
- Snapped 'irrelevant' to rail: irrelevant
- Snapped '\nrelevant ' to rail: relevant
- Cannot snap 'unparsable' to rails: {'relevant', 'irrelevant'}

closes #1480
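The rail-snapping behavior in the example output above could be sketched roughly like this. This is a hypothetical simplification for illustration, not the PR's actual implementation; the function name and normalization details are assumptions:

```python
from typing import Optional, Set

def snap_to_rail(response: str, rails: Set[str], verbose: bool = False) -> Optional[str]:
    # Normalize surrounding whitespace and case, then check membership in
    # the rail set. A simplified stand-in for the PR's snapping logic.
    snapped = response.strip().lower()
    if snapped in rails:
        if verbose:
            print(f"- Snapped {response!r} to rail: {snapped}")
        return snapped
    if verbose:
        print(f"- Cannot snap {response!r} to rails: {rails}")
    return None
```

With verbose=True, this produces per-response messages like the "Snapped ... to rail" lines shown in the example output.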

@anticorrelator changed the title from "feature: Verbose evals" to "feat: Verbose evals" Oct 4, 2023
@dataclass
class BaseEvalModel(ABC):
    _verbose: bool = False

    def retry(
Contributor Author:

This is a rough attempt to clean up an abstraction: the create_base_retry_decorator function was imported from the base module into both of our concrete implementations. After attaching verbose state to the model, we also needed to feed a property of the base model back into this function, so I'm moving the decorator directly onto the model as an instance method. Please let me know if this feels unpleasant.

I also think it makes sense to use the factory directly as a decorator.
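The refactor described in this comment can be sketched as follows. The real code wraps model calls with tenacity's retry machinery; this hand-rolled version only illustrates the decorator-factory-as-instance-method pattern, and everything except the BaseEvalModel/_verbose names is a hypothetical simplification:

```python
import functools

class BaseEvalModel:
    def __init__(self, verbose: bool = False):
        self._verbose = verbose

    # Decorator factory as an instance method: it can read self._verbose
    # directly, so concrete models no longer need to feed the flag back
    # into a module-level create_base_retry_decorator.
    def retry(self, max_attempts: int = 3):
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                for attempt in range(1, max_attempts + 1):
                    try:
                        return fn(*args, **kwargs)
                    except Exception as exc:
                        if self._verbose:
                            print(f"Attempt {attempt} failed: {exc}; retrying")
                        if attempt == max_attempts:
                            raise
            return wrapper
        return decorator
```

Since retry is bound to an instance, it can be used directly as a decorator on functions that call that model, e.g. `@model.retry(max_attempts=3)`.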

@@ -120,6 +125,7 @@ def run_relevance_eval(
be parsed.
"""

model._verbose = verbose
Contributor:

I'm not sure I like the way this field is being set - it seems inevitable that the flag gets left set. It would be cleaner to parameterize it as a kwarg on the relevant calls; that way there's no magic. Other code paths will also need to set this, and doing it via parameters seems more scalable.

Contributor Author:

Ah yeah, that's a good point. If we want to be able to ask the model whether it should emit verbose messages, maybe we can hold the state in a context manager? Passing the argument around is also a fine idea, but a flag that's threaded through many calls can be hard to keep track of.
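The context-manager idea floated here might look something like the sketch below. This is an illustration of the suggestion, not code from the PR; the Model class and verbose() method names are assumptions:

```python
from contextlib import contextmanager

class Model:
    def __init__(self) -> None:
        self._verbose = False

    @contextmanager
    def verbose(self):
        # Temporarily enable verbose output, restoring the previous state
        # on exit (even on error), so the flag can never be "left set".
        previous = self._verbose
        self._verbose = True
        try:
            yield self
        finally:
            self._verbose = previous
```

Callers would wrap an eval invocation, e.g. `with model.verbose(): ...`, and any code holding the model can still ask it whether to emit messages.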

@axiomofjoy (Contributor) left a comment:

Thanks for adding tests!

Comment on lines 3 to 5
def printif(condition: bool, *args, **kwargs):
    if condition:
        print(*args, **kwargs)
Contributor:

@mikeldking What are your thoughts on the eventual packaging strategy for evals and other Phoenix sub-modules such as our tracers? Are we going to deploy them as distinct packages, e.g., arize-evals or phoenix-evals? If so, we should be careful about introducing dependencies between the sub-modules and the rest of the codebase.

@anticorrelator This is a non-blocking comment. We can always move things if needed.

Contributor:

Yeah, ideally it doesn't sit in Phoenix long term, so treating it more as a sub-module could be a benefit. I think the verbose logging ask could be evals-specific, so it might make more sense sitting under evals. Either way, this would be a trivial change if we do split things out, so I'm not concerned.

@anticorrelator marked this pull request as ready for review October 5, 2023 05:36
pyproject.toml (resolved)
@@ -47,18 +48,22 @@ def llm_eval_binary(

system_instruction (Optional[str], optional): An optional system message.

verbose (bool, optional): If True, prints detailed info to stdout. Default False.
Contributor:

Suggestion: Give an example of what kind of information is being printed, e.g., prompts and prompt templates.

Contributor:

E.g., "If True, prints detailed information including invocation parameters, formatted prompts, etc., to stdout."

src/phoenix/experimental/evals/functions/binary.py (resolved, outdated)
src/phoenix/experimental/evals/functions/binary.py (resolved, outdated)
src/phoenix/experimental/evals/functions/binary.py (resolved, outdated)
src/phoenix/experimental/evals/models/base.py (resolved)
src/phoenix/experimental/evals/models/base.py (resolved, outdated)
Comment on lines +213 to +214
else:
    printif(verbose, f"- Snapped {repr(string)} to rail: {rail}")
Contributor:

Suggested change (drop the else and dedent):

- else:
-     printif(verbose, f"- Snapped {repr(string)} to rail: {rail}")
+ printif(verbose, f"- Snapped {repr(string)} to rail: {rail}")

@anticorrelator merged commit 50e765b into main Oct 6, 2023
9 checks passed
@anticorrelator deleted the dustin/verbose-evals branch October 6, 2023 20:00
@github-actions bot locked and limited conversation to collaborators Oct 6, 2023
Development

Successfully merging this pull request may close these issues.

[evals][logging] verbose logging of the evals function calls
4 participants