How to Check LLM's Performance #362
-
I am looking for ways to check or evaluate an LLM's performance across different prompts and inputs. What metrics should be used for this? Also, if we are chunking the dataset and generating embeddings, how do the chunking strategies affect the LLM's performance, and how can that impact be measured?
-
The answer is "it depends on your scenario" ... you will need to pick a metric and define dataset(s) to run against so you can calculate it.

E.g. if you're using the LLM for classification or named entity recognition, you can use simple/classical metrics like accuracy or % match. These usually have a ground truth to compare with.
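As a concrete illustration of that ground-truth case, here is a minimal sketch of accuracy / % match scoring for a classification-style task; the labels and predictions below are made up, and in practice the predictions would come from your LLM calls over an evaluation dataset.

```python
# Minimal sketch: a classical metric when you have ground-truth labels.
# The data below is made up; in practice `predictions` would be the
# LLM's outputs for each item in your evaluation dataset.

ground_truth = ["positive", "negative", "neutral", "positive"]
predictions  = ["positive", "negative", "positive", "positive"]

def accuracy(preds, labels):
    """Fraction of predictions that exactly match the ground truth."""
    correct = sum(p == l for p, l in zip(preds, labels))
    return correct / len(labels)

print(f"accuracy / % match: {accuracy(predictions, ground_truth):.2%}")
```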
If you're using the LLM for more complex scenarios like building a chatbot, then a new set of metrics needs to be defined, things like groundedness and perceived intelligence.

Unlike the traditional metrics, which can be calculated against a ground truth, you usually don't have a ground truth for these. The classical approach is to rely on human review/labeling to judge the performance. Fortunately, models like GPT-4 are good at labeling, meaning that if you can define the metrics with clear instructions (which you need anyway even if you do human labeling), GPT-4 can do a good job. Take a look at these examples: groundedness, perceived intelligence, and how they're used in one of our tutorials, a chatbot answering questions based on a PDF. A minimal sketch of the idea is included at the end of this reply.

Note: these evaluation flows and prompts are just for demo purposes; you will have to craft evaluation prompts for your own app/flow. That's where you may need to invest a lot of effort. Once you have figured these out, prompt flow will be quite helpful for updating your dev workflow to include the evaluations in a CI/CD pipeline, etc. Prompt flow can also help you test multiple prompts easily; check out the variants feature and how it's used in the web-classification example.
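To make the GPT-4-as-judge idea concrete, here is a minimal sketch of a groundedness scorer, assuming the `openai` Python package (v1+) and an `OPENAI_API_KEY` in the environment. The prompt wording and the 1-5 scale are illustrative only, not the prompts from the linked evaluation flows; you should craft your own for your app.

```python
# Minimal sketch of using GPT-4 as a judge for "groundedness":
# does the answer stick to facts present in the retrieved context?
# Assumes the openai package (v1+) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def groundedness_score(context: str, answer: str) -> str:
    """Ask GPT-4 to rate how grounded the answer is in the context (1-5)."""
    instructions = (
        "You are grading a chatbot answer for groundedness.\n"
        "Rate from 1 (contains claims not supported by the context) "
        "to 5 (fully supported by the context).\n"
        "Reply with the number only."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": f"Context:\n{context}\n\nAnswer:\n{answer}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Example usage with made-up data:
print(groundedness_score("The PDF says the warranty lasts 2 years.",
                         "The warranty lasts 2 years."))
```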
-
Thank you so much for your response. I can share my scenario: I am creating a Q&A chatbot, and I want to check whether it answers questions based on facts and the provided context (people may call it groundedness). I follow the usual approach of taking the documents, chunking them, and generating embeddings.
However, I'm interested in creating a testing workflow to assess how each of these stages can impact the response from the LLM. Additionally, I'd like to measure that impact using specific performance parameters.
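One way such a testing workflow could be structured (a hedged sketch, not prompt flow's API): treat each stage setting, e.g. chunk size and overlap, as an experiment parameter, rebuild the index per setting, run the same question set through the chatbot, and score every run with the same evaluator. The functions `build_index`, `answer_question`, and `evaluate_groundedness` below are hypothetical placeholders for your own ingestion, retrieval + LLM, and evaluation steps.

```python
# Hedged sketch of a testing workflow over chunking settings.
# build_index / answer_question / evaluate_groundedness are hypothetical
# placeholders for your own ingestion, retrieval+LLM, and evaluation steps.
from statistics import mean

CHUNK_SETTINGS = [
    {"chunk_size": 256, "overlap": 0},
    {"chunk_size": 512, "overlap": 64},
    {"chunk_size": 1024, "overlap": 128},
]

def run_experiment(documents, questions, settings):
    results = []
    for cfg in settings:
        # Rebuild the index with this chunking strategy.
        index = build_index(documents, **cfg)                       # hypothetical
        scores = []
        for q in questions:
            answer, context = answer_question(index, q)             # hypothetical
            scores.append(evaluate_groundedness(context, answer))   # hypothetical
        results.append({**cfg, "mean_groundedness": mean(scores)})
    return results

# Example usage (once the placeholders are wired to your pipeline):
# for row in run_experiment(docs, test_questions, CHUNK_SETTINGS):
#     print(row)
```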
-
To evaluate LLM performance across prompts or inputs, consider metrics like perplexity, BLEU score, or F1 score, depending on your task. Chunking strategies can affect performance; assess this by comparing results across different chunking approaches. To measure it, compute the metrics on held-out validation or test sets.
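As one concrete instance of the F1 score mentioned above, here is a minimal, self-contained sketch of token-level F1 and exact match in the style used for extractive QA evaluation; the predicted and reference answers are made up.

```python
# Minimal sketch: token-level F1 and exact match for QA-style outputs,
# computed against reference answers (made-up data below).
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between the two strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

pred = "the warranty lasts two years"
ref = "the warranty is valid for two years"
print(f"EM: {exact_match(pred, ref)}, F1: {token_f1(pred, ref):.2f}")
```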