Evaluation strategy #15
Quick braindump: I just worked on an evaluation strategy for my Meedan/Rockefeller project. Basically we crafted 20 test cases and then an editor reviewed them and gave a gold-standard / human ground truth for all 20. At first I just used a spreadsheet and did it manually (a professional editor made the classification judgments). (Screenshot shows: the test of a classifier which responds yes or no, and is sometimes incorrect, with interesting caveats about the surprising behavior of the LLM. The classification is "is this article an example of solutions journalism," which is itself a specific evaluation standard.)

Then I wanted to automate the evaluation. I was able to set up Promptable and run an executable process that could be part of a build pipeline, though it is still "just a spreadsheet" and all the evaluation is still human. I was basically editing JSON instead of Sheets.

Takeaway: we should definitely figure this out and do semi-automated evaluation, but just doing any regular, subjective check for overall quality, tone, etc. seems way more important at this step than having it be completely automated or even having a very large test suite. The interesting stuff is how the model might evade the question, give too much detail, etc. It helps enumerate unexpected behaviors. Defending a single overall metric does not realistically make the model better or prevent regressions unless you have specifically crafted a reliable test for that case.

I do think it is possible to design test cases that are more reliable, but it's going to be a flaky type of test that can't really block the build. Ideally we would have 100+ tests that run in CI and have reasonably deterministic output (less than 5% flakiness in the test). Each test would give a specific input context, ask a question, and the natural-language response would need to contain a certain character string (regex), or a correctly-ranked search result, or some other reliable indicator that it got the query correct (?). If we are trying to test nondeterministic natural language, I hope we don't have to resort to using another LLM call to interpret the result, because then we need a test for THAT model. Slippery stuff.
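To make the "editing JSON instead of Sheets" step concrete, here is a minimal sketch of that kind of gold-standard harness. The file layout, field names, and the `classify()` import are illustrative assumptions, not the actual Promptable setup:

```python
# Minimal sketch: the editor's gold labels live in a JSON file, and a tiny
# script tallies agreement with the model's yes/no classification.
# The file name, field names, and the classify() import are hypothetical.
import json

from classifier import classify  # hypothetical "is this solutions journalism?" call

with open("test_cases.json") as f:
    cases = json.load(f)  # e.g. [{"article": "...", "gold": "yes"}, ...]

correct = 0
for case in cases:
    prediction = classify(case["article"])  # expected to return "yes" or "no"
    correct += prediction == case["gold"]

print(f"{correct}/{len(cases)} predictions match the editor's gold labels")
```

The gold labels stay human-authored; the script only tallies agreement, which is the semi-automated middle ground described above.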
Promptable shut down in July 🤦
Note there is a question closely related to evaluation, the core user journey — #11 |
[Merging in a thread from Slack here] @Quantisan I realize you are asking about "benchmarking" and I am talking about "evaluation." As I think through the relationship between these terms, the HF How-To Guide has three high-level categories of metrics.
So for me that was a helpful clarification of terminology. Benchmarks are one of three categories of evaluation: they are the "dataset-specific" type. (My experience as a UX researcher makes me want to also add "qualitative evaluation" to the list, but the main focus here is on automated evaluation.)

Based on what I am learning from the HF docs, there are aspects of our pipeline that can be tested with existing metrics directly, with no custom code, like NER. But these are probably the easiest things to test and the least useful, since they are already tested at the model level (assuming our RAG system will not cause regressions on NER benchmarks for the models we use). More likely we want to test recognition of things that are in our custom corpus.

Also, I don't think we want to run a large benchmark, which would probably take forever to set up and run. For example, I'm looking at SuperGLUE and it is impressive but intimidating in scope: https://gluebenchmark.com/leaderboard

As I was saying to Paul this morning in Slack, I think we just need a GitHub Action that asks a few questions and verifies the results are accurate.
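As a side note on the "existing metrics, no custom code" point: loading an off-the-shelf NER metric from the Hugging Face `evaluate` library really is only a couple of lines. The tag sequences in this sketch are placeholders, not output from our pipeline:

```python
# Minimal sketch of scoring NER output with an off-the-shelf metric from
# the Hugging Face `evaluate` library. The prediction/reference sequences
# below are placeholders, not real pipeline output.
import evaluate

seqeval = evaluate.load("seqeval")

predictions = [["O", "B-ORG", "I-ORG", "O"]]  # hypothetical model tags
references = [["O", "B-ORG", "I-ORG", "O"]]   # hypothetical gold tags

results = seqeval.compute(predictions=predictions, references=references)
print(results["overall_f1"])
```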
I agreed with Paul's instinct to keep it simple.
On reflection, just taking notes, it seems like the key test we need at first is a list of, say, 10 questions which cleverly exercise the RAG function. I am seeing several phases of evaluation.
For an initial test we can generate the questions ourselves, because we are the experts in describing what the system can do. At this stage we can track our questions in markdown and just review the 10 questions ourselves as a checklist; let's not get stuck on automation. We don't need to defend against regressions yet since we have nothing to regress; most of our learning will probably come from open-ended exploration.

This list of verifiable questions is very similar to the task analysis we need to do with the team. Of course, the tasks they do should inform what we provide, and these questions need to be transparent to Sonothera. As they find use cases which the system cannot answer, we want to capture those as this type of basic accuracy metric. But on the other hand, they are depending on us to tell them what this system can do, so I think we
One thing I am getting my head around: correct answers would be nondeterministic, so if we want to automate this, we would probably need to regex for entity matches or make the output executable somehow. The questions would have to be designed to elicit an entity or terminology that can be reliably detected by "something dumber than an LLM." It has to output something that is verifiably right. (I think this will be interesting and mildly difficult to do in a way that is useful as a test; it will probably be easy to think of questions that always pass or always fail, but we want to find the tractable "edge" of functionality where our changes sometimes cause failures if we do it wrong. We want tests that fail if we don't have the right documents loaded and parsed correctly. That is, stock GPT-4 should fail the entire suite.)
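A rough sketch of what that could look like as a pytest suite, assuming a hypothetical `rag_answer()` entry point into our system; the questions and regex patterns below are illustrative placeholders, chosen so they should only match if the right documents were loaded and parsed:

```python
# Sketch of "regex for entity matches" as a CI test suite (pytest).
# `rag_answer` is a hypothetical import of our query entry point, and the
# question/pattern pairs are placeholders: each pattern should only match
# if the relevant corpus documents were ingested correctly, so a stock
# model with no retrieval should fail the whole suite.
import re

import pytest

from rag_app import rag_answer  # hypothetical: question -> answer string

CASES = [
    ("Which protocol document covers the dosing schedule?", r"(?i)protocol\s+\d+"),
    ("Who is named as the study lead in the onboarding deck?", r"(?i)\bDr\.?\s+\w+"),
]


@pytest.mark.parametrize("question,pattern", CASES)
def test_answer_contains_expected_entity(question, pattern):
    answer = rag_answer(question)
    assert re.search(pattern, answer), (
        f"expected /{pattern}/ in answer, got: {answer[:200]!r}"
    )
```

Re-running the same suite several times against an unchanged build would also give a rough flakiness number to compare against the less-than-5% target mentioned above.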
This looks promising: https://github.com/explodinggradients/ragas
Apparently the Promptfoo effort is not shut down; they just archived one of their repos. I just found this "factuality evaluation" section in their docs, which explains how to test with question/answer pairs and an OpenAI model that does the evaluation.
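For reference, here is the general shape of that kind of model-graded factuality check, sketched directly against the openai Python client rather than through promptfoo; the grading prompt, letter rubric, and model choice are loose assumptions, not promptfoo's actual implementation:

```python
# Hedged sketch of a model-graded "factuality" check: an OpenAI model
# compares a generated answer against a gold answer for the same question.
# Assumes the openai>=1.0 Python client; the prompt wording, rubric, and
# model name are placeholders, not promptfoo's actual prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GRADER_PROMPT = """You are grading a RAG system's answer.
Question: {question}
Gold answer: {gold}
Submitted answer: {submitted}
Reply with exactly one letter:
A = submitted answer is consistent with the gold answer
B = submitted answer contradicts the gold answer
C = submitted answer does not address the question"""


def grade(question: str, gold: str, submitted: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": GRADER_PROMPT.format(
            question=question, gold=gold, submitted=submitted)}],
    )
    return response.choices[0].message.content.strip()
```

(Of course, this reintroduces the "then we need a test for THAT model" concern from the first comment.)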
https://github.com/openai/evals/blob/main/docs/build-eval.md
https://gpt-index.readthedocs.io/en/v0.6.33/how_to/evaluation/evaluation.html
https://github.com/anyscale/ray-summit-2023-training/blob/main/Ray-LlamaIndex/notebooks/02_evaluation.ipynb This is a useful step-by-step guide to evaluating a RAG workflow. The last part, on "Evaluation without Golden Responses," overlaps with a couple of the concepts Chris shared, but puts them into practice together.
I'm sort of imagining a qualitative process here rather than some sort of automated harness connected to one of the more generic NLP eval metrics. Those are great for LLM-level broad comparisons, but seem pretty useless for specialized use cases like ours.