
Add evaluation notebook #31

Merged: 1 commit into redhat-et:main on Feb 5, 2024

Conversation

@hemajv (Collaborator) commented Jan 30, 2024

This addresses #24

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks. (Powered by ReviewNB)

@aakankshaduggal (Member) left a comment


Thanks @hemajv for adding the notebook. It presents good metrics for evaluation. Would we want all of these evaluation metrics added to the pipeline and UI, or do we want to select which ones to ship further?

@oindrillac (Contributor) commented

> Thanks @hemajv for adding the notebook. It presents good metrics for evaluation. Would we want all of these evaluation metrics added to the pipeline and UI, or do we want to select which ones to ship further?

We have already added some of these to the UI, but they may not actually be relevant or well performing. What we could look into from here is:

A quantitative eval of how often these scores are valid.

 - We can look at cases where we know the generated output is deliberately wrong and see how the assigned scores perform
 - Do this over a number of outputs (say 10 outputs) for each relevant default criterion and some custom criteria

That will help us determine which of the LangChain eval criteria are doing well, and also whether or not this is working most of the time.

We can do the same evaluation of the custom GPT-3 prompt evaluation and see how well that is performing.
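
For illustration, here is a minimal sketch of the check described above: scoring a deliberately wrong answer with a LangChain criteria evaluator. It assumes a LangChain version where `load_evaluator` and `ChatOpenAI` are importable as shown and an `OPENAI_API_KEY` is set; the question, answers, and criterion are made-up examples, not the notebook's actual data.

```python
# Minimal sketch: does a LangChain criteria evaluator penalize a deliberately
# wrong answer? (Illustrative only; not the notebook's code.)
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import load_evaluator

eval_llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# "labeled_criteria" compares the prediction against a reference answer.
evaluator = load_evaluator("labeled_criteria", criteria="correctness", llm=eval_llm)

result = evaluator.evaluate_strings(
    input="What is the capital of France?",
    prediction="The capital of France is Berlin.",  # deliberately wrong output
    reference="Paris",
)

# result is a dict with keys like "score" (0/1), "value" ("Y"/"N"), "reasoning";
# a well-behaved criterion should give this wrong answer a score of 0.
print(result["score"], result["reasoning"])
```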

@oindrillac (Contributor) left a comment


Thanks for the notebook @hemajv 🎉 This is a great start!

I added a comment on how we can follow this up with some quantitative eval to get a sense of which criteria we should include and whether this is actually working most of the time.

@hemajv (Collaborator, Author) commented Feb 5, 2024

> > Thanks @hemajv for adding the notebook. It presents good metrics for evaluation. Would we want all of these evaluation metrics added to the pipeline and UI, or do we want to select which ones to ship further?
>
> We have already added some of these to the UI, but they may not actually be relevant or well performing. What we could look into from here is:
>
> A quantitative eval of how often these scores are valid.
>
>  - We can look at cases where we know the generated output is deliberately wrong and see how the assigned scores perform
>  - Do this over a number of outputs (say 10 outputs) for each relevant default criterion and some custom criteria
>
> That will help us determine which of the LangChain eval criteria are doing well, and also whether or not this is working most of the time.
>
> We can do the same evaluation of the custom GPT-3 prompt evaluation and see how well that is performing.

Yes, definitely agree on this 👍 It seems like some of the LangChain criteria are doing a pretty decent job of evaluating the outputs accurately. As you mentioned, we need to try this out with a certain set of cases to get a better understanding of how well it's doing.

@oindrillac (Contributor) commented

> Yes, definitely agree on this 👍 It seems like some of the LangChain criteria are doing a pretty decent job of evaluating the outputs accurately. As you mentioned, we need to try this out with a certain set of cases to get a better understanding of how well it's doing.

I started creating a framework to quantitatively evaluate this as an extension to this notebook. Will also add that in a separate PR.
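
As a rough illustration of what such a framework could measure, here is a hypothetical sketch (not the code from the follow-up PR): run a criteria evaluator over a small set of labeled examples, half of them deliberately wrong, and count how often the evaluator's verdict agrees with the label. The test cases, criterion, and model choice below are assumptions for the example.

```python
# Hypothetical sketch of a quantitative check: how often does the evaluator's
# score agree with a known good/bad label? (Not the follow-up PR's code.)
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import load_evaluator

eval_llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
evaluator = load_evaluator("labeled_criteria", criteria="correctness", llm=eval_llm)

# (input, prediction, reference, expected_score) -- deliberately mixed good/bad.
test_cases = [
    ("What is the capital of France?", "Paris is the capital of France.", "Paris", 1),
    ("What is the capital of France?", "The capital of France is Berlin.", "Paris", 0),
    ("What is 2 + 2?", "2 + 2 equals 4.", "4", 1),
    ("What is 2 + 2?", "2 + 2 equals 5.", "4", 0),
]

agreements = 0
for question, prediction, reference, expected in test_cases:
    result = evaluator.evaluate_strings(
        input=question, prediction=prediction, reference=reference
    )
    agreements += int(result["score"] == expected)

# Repeating this per criterion (say, over ~10 outputs each) gives a rough
# agreement rate for deciding which criteria are worth shipping.
print(f"agreement: {agreements}/{len(test_cases)}")
```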

@hemajv (Collaborator, Author) commented Feb 5, 2024

@oindrillac addressed the comments and pushed new commits; ready for your review.

@oindrillac (Contributor) left a comment


Thanks for the changes @hemajv LGTM 👍

@oindrillac merged commit 406fcec into redhat-et:main on Feb 5, 2024
1 check failed