
Add evaluation notebook #31

Merged: 1 commit into redhat-et:main on Feb 5, 2024

Conversation

@hemajv (Collaborator) commented Jan 30, 2024

This addresses #24

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks. (Powered by ReviewNB)

@aakankshaduggal (Member) left a comment


Thanks @hemajv for adding the notebook. It presents good metrics for evaluation. Would we want all of these evaluation metrics added to the pipeline and UI, or do we want to select which ones to ship further?

@oindrillac (Contributor) commented

> Thanks @hemajv for adding the notebook. It presents good metrics for evaluation. Would we want all of these evaluation metrics added to the pipeline and UI, or do we want to select which ones to ship further?

We have already added some of these to the UI, but they may not actually be relevant or well performing. What we could look into from here is:

A quantitative eval of how often these scores are valid.

 - We can look at cases where we know the generated output is deliberately wrong and see how the assigned scores perform
 - Do this over a number of outputs (say 10 outputs) for each relevant default criterion and some custom criteria

That will help us determine which of the LangChain eval criteria are doing well, and also whether or not this is working most of the time.

We can do the same evaluation of the custom GPT-3 prompt evaluation and see how well that is performing.
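
For illustration, here is a minimal sketch of the check described above: scoring a deliberately wrong answer with a LangChain criteria evaluator. It assumes a LangChain version where `load_evaluator` and `ChatOpenAI` are importable as shown and an `OPENAI_API_KEY` is set; the question, answers, and criterion are made-up examples, not the notebook's actual data.

```python
# Minimal sketch: does a LangChain criteria evaluator penalize a deliberately
# wrong answer? (Illustrative only; not the notebook's code.)
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import load_evaluator

eval_llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# "labeled_criteria" compares the prediction against a reference answer.
evaluator = load_evaluator("labeled_criteria", criteria="correctness", llm=eval_llm)

result = evaluator.evaluate_strings(
    input="What is the capital of France?",
    prediction="The capital of France is Berlin.",  # deliberately wrong output
    reference="Paris",
)

# result is a dict with keys like "score" (0/1), "value" ("Y"/"N"), "reasoning";
# a well-behaved criterion should give this wrong answer a score of 0.
print(result["score"], result["reasoning"])
```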

@oindrillac (Contributor) left a comment


Thanks for the notebook @hemajv 🎉 This is a great start!

I added a comment on how we can follow this up with some quantitative eval to get a sense of which criteria we should include and whether this is actually working most of the time.

@hemajv (Collaborator, Author) commented Feb 5, 2024

> > Thanks @hemajv for adding the notebook. It presents good metrics for evaluation. Would we want all of these evaluation metrics added to the pipeline and UI, or do we want to select which ones to ship further?
>
> We have already added some of these to the UI, but they may not actually be relevant or well performing. What we could look into from here is:
>
> A quantitative eval of how often these scores are valid.
>
>  - We can look at cases where we know the generated output is deliberately wrong and see how the assigned scores perform
>  - Do this over a number of outputs (say 10 outputs) for each relevant default criterion and some custom criteria
>
> That will help us determine which of the LangChain eval criteria are doing well, and also whether or not this is working most of the time.
>
> We can do the same evaluation of the custom GPT-3 prompt evaluation and see how well that is performing.

Yes, definitely agree on this 👍 It seems like some of the LangChain criteria are doing a pretty decent job of evaluating the outputs accurately. As you mentioned, we need to try this out with a certain set of cases to get a better understanding of how well it's doing.

@oindrillac (Contributor) commented

> Yes, definitely agree on this 👍 It seems like some of the LangChain criteria are doing a pretty decent job of evaluating the outputs accurately. As you mentioned, we need to try this out with a certain set of cases to get a better understanding of how well it's doing.

I started creating a framework to quantitatively evaluate this as an extension to this notebook. Will also add that in a separate PR.
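
As a rough illustration of what such a framework could measure, here is a hypothetical sketch (not the code from the follow-up PR): run a criteria evaluator over a small set of labeled examples, half of them deliberately wrong, and count how often the evaluator's verdict agrees with the label. The test cases, criterion, and model choice below are assumptions for the example.

```python
# Hypothetical sketch of a quantitative check: how often does the evaluator's
# score agree with a known good/bad label? (Not the follow-up PR's code.)
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import load_evaluator

eval_llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
evaluator = load_evaluator("labeled_criteria", criteria="correctness", llm=eval_llm)

# (input, prediction, reference, expected_score) -- deliberately mixed good/bad.
test_cases = [
    ("What is the capital of France?", "Paris is the capital of France.", "Paris", 1),
    ("What is the capital of France?", "The capital of France is Berlin.", "Paris", 0),
    ("What is 2 + 2?", "2 + 2 equals 4.", "4", 1),
    ("What is 2 + 2?", "2 + 2 equals 5.", "4", 0),
]

agreements = 0
for question, prediction, reference, expected in test_cases:
    result = evaluator.evaluate_strings(
        input=question, prediction=prediction, reference=reference
    )
    agreements += int(result["score"] == expected)

# Repeating this per criterion (say, over ~10 outputs each) gives a rough
# agreement rate for deciding which criteria are worth shipping.
print(f"agreement: {agreements}/{len(test_cases)}")
```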

@hemajv (Collaborator, Author) commented Feb 5, 2024

@oindrillac addressed the comments and pushed new commits; ready for your review.

@oindrillac (Contributor) left a comment


Thanks for the changes @hemajv LGTM 👍

@oindrillac merged commit 406fcec into redhat-et:main on Feb 5, 2024
1 check failed