How to Check LLM's Performance #362
-
I am looking for ways to check or evaluate an LLM's performance across different prompts and inputs. What metrics should be used for this? Also, if we are chunking the dataset and generating embeddings, how do the chunking strategies affect the LLM's performance, and how can that impact be measured?
-
The answer is "it depends on your scenario" ... you will need to pick a metric and define dataset(s) to run against so you can calculate it.

E.g. if you're using the LLM for classification or named entity recognition, you can use simple/classical metrics like accuracy or % match. These usually have a ground truth to compare with.
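As a concrete illustration of that ground-truth case, here is a minimal sketch of accuracy / % match scoring for a classification-style task; the labels and predictions below are made up, and in practice the predictions would come from your LLM calls over an evaluation dataset.

```python
# Minimal sketch: a classical metric when you have ground-truth labels.
# The data below is made up; in practice `predictions` would be the
# LLM's outputs for each item in your evaluation dataset.

ground_truth = ["positive", "negative", "neutral", "positive"]
predictions  = ["positive", "negative", "positive", "positive"]

def accuracy(preds, labels):
    """Fraction of predictions that exactly match the ground truth."""
    correct = sum(p == l for p, l in zip(preds, labels))
    return correct / len(labels)

print(f"accuracy / % match: {accuracy(predictions, ground_truth):.2%}")
```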
If you're using the LLM for more complex scenarios like building a chatbot, then a new set of metrics needs to be defined, things like groundedness and perceived intelligence.

Unlike the traditional metrics, which can be calculated against a ground truth, you usually don't have a ground truth for these. The classical approach is to rely on human review/labeling to judge the performance. Fortunately, models like GPT-4 are good at labeling, meaning that if you can define the metrics with clear instructions (which you need anyway even if you do human labeling), GPT-4 can do a good job. Take a look at these examples: groundedness, perceived intelligence, and how they're used in one of our tutorials, a chatbot answering questions based on a PDF. A minimal sketch of the idea is included at the end of this reply.

Note: these evaluation flows and prompts are just for demo purposes; you will have to craft evaluation prompts for your own app/flow. That's where you may need to invest a lot of effort. Once you have figured these out, prompt flow will be quite helpful for updating your dev workflow to include the evaluations in a CI/CD pipeline, etc. Prompt flow can also help you test multiple prompts easily; check out the variants feature and how it's used in the web-classification example.
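To make the GPT-4-as-judge idea concrete, here is a minimal sketch of a groundedness scorer, assuming the `openai` Python package (v1+) and an `OPENAI_API_KEY` in the environment. The prompt wording and the 1-5 scale are illustrative only, not the prompts from the linked evaluation flows; you should craft your own for your app.

```python
# Minimal sketch of using GPT-4 as a judge for "groundedness":
# does the answer stick to facts present in the retrieved context?
# Assumes the openai package (v1+) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def groundedness_score(context: str, answer: str) -> str:
    """Ask GPT-4 to rate how grounded the answer is in the context (1-5)."""
    instructions = (
        "You are grading a chatbot answer for groundedness.\n"
        "Rate from 1 (contains claims not supported by the context) "
        "to 5 (fully supported by the context).\n"
        "Reply with the number only."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": f"Context:\n{context}\n\nAnswer:\n{answer}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Example usage with made-up data:
print(groundedness_score("The PDF says the warranty lasts 2 years.",
                         "The warranty lasts 2 years."))
```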
-
Thank you so much for your response. I can share my scenario: I am creating a Q&A chatbot, and I want to check whether it answers questions based on facts and the provided context (people may call it groundedness). I follow the usual approach of taking the documents, chunking them, and generating embeddings.
However, I'm interested in creating a testing workflow to assess how each of these stages can impact the response from the LLM. Additionally, I'd like to measure that impact using specific performance parameters.
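One way such a testing workflow could be structured (a hedged sketch, not prompt flow's API): treat each stage setting, e.g. chunk size and overlap, as an experiment parameter, rebuild the index per setting, run the same question set through the chatbot, and score every run with the same evaluator. The functions `build_index`, `answer_question`, and `evaluate_groundedness` below are hypothetical placeholders for your own ingestion, retrieval + LLM, and evaluation steps.

```python
# Hedged sketch of a testing workflow over chunking settings.
# build_index / answer_question / evaluate_groundedness are hypothetical
# placeholders for your own ingestion, retrieval+LLM, and evaluation steps.
from statistics import mean

CHUNK_SETTINGS = [
    {"chunk_size": 256, "overlap": 0},
    {"chunk_size": 512, "overlap": 64},
    {"chunk_size": 1024, "overlap": 128},
]

def run_experiment(documents, questions, settings):
    results = []
    for cfg in settings:
        # Rebuild the index with this chunking strategy.
        index = build_index(documents, **cfg)                       # hypothetical
        scores = []
        for q in questions:
            answer, context = answer_question(index, q)             # hypothetical
            scores.append(evaluate_groundedness(context, answer))   # hypothetical
        results.append({**cfg, "mean_groundedness": mean(scores)})
    return results

# Example usage (once the placeholders are wired to your pipeline):
# for row in run_experiment(docs, test_questions, CHUNK_SETTINGS):
#     print(row)
```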
-
To evaluate LLM performance across prompts or inputs, consider metrics like perplexity, BLEU score, or F1 score, depending on your task. Chunking strategies can affect performance; assess this by comparing results across different chunking approaches. To measure it, compute the metrics on held-out validation or test sets.
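As one concrete instance of the F1 score mentioned above, here is a minimal, self-contained sketch of token-level F1 and exact match in the style used for extractive QA evaluation; the predicted and reference answers are made up.

```python
# Minimal sketch: token-level F1 and exact match for QA-style outputs,
# computed against reference answers (made-up data below).
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between the two strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

pred = "the warranty lasts two years"
ref = "the warranty is valid for two years"
print(f"EM: {exact_match(pred, ref)}, F1: {token_f1(pred, ref):.2f}")
```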