Fix llm-based evaluation metrics #572

Merged · 21 commits · May 24, 2024

Commits

a8810db
Fix experiment name
kcortinas May 23, 2024
d96aff4
Add retrieved contexts to queryoutput
kcortinas May 23, 2024
6790166
Fix context precision metric
kcortinas May 23, 2024
3875b2e
Remove log of run_metrics dict
kcortinas May 23, 2024
9912b8d
Fix context recall metric
kcortinas May 23, 2024
8ccdc26
Update documentation context recall metric
kcortinas May 23, 2024
aedab7e
Fix llm_answer_relevance inputs
kcortinas May 23, 2024
2c55ec8
Update output handler unit tests
kcortinas May 23, 2024
91817d2
Update evaluation documentation (#571)
kcortinas May 23, 2024
07a4c2c
Bump azure-ai-ml from 1.15.0 to 1.16.0 (#569)
dependabot[bot] May 23, 2024
71e457e
Bump matplotlib from 3.8.4 to 3.9.0 (#567)
dependabot[bot] May 23, 2024
6ce5cfe
Bump tiktoken from 0.6.0 to 0.7.0 (#565)
dependabot[bot] May 23, 2024
336240e
Bump pytest from 8.2.0 to 8.2.1 (#562)
dependabot[bot] May 23, 2024
7617f7d
Bump langchain-community from 0.0.38 to 0.2.0 (#568)
dependabot[bot] May 23, 2024
4b9a02b
Merge branch 'development' into karina/update-eval-metrics
kcortinas May 23, 2024
ec87f2b
Bump lxml from 5.2.1 to 5.2.2 (#566)
dependabot[bot] May 23, 2024
fda5e9d
Merge branch 'development' into karina/update-eval-metrics
kcortinas May 23, 2024
93f7dbe
Merge branch 'prerelease' into karina/update-eval-metrics
kcortinas May 24, 2024
99c73c8
fix conflicts with prerelease
kcortinas May 24, 2024
ca562a1
fix context precision conflicts with prerelease
kcortinas May 24, 2024
ec6e2f0
Fix context precision conflicts with prerelease
kcortinas May 24, 2024
22 changes: 20 additions & 2 deletions README.md
@@ -45,7 +45,7 @@ The custom loader resorts to the simpler 'prebuilt-layout' API model as a fallback

1. **Re-Ranking**: The query responses from Azure AI Search are re-evaluated using an LLM and ranked according to the relevance between the query and the context.

1. **Metrics and Evaluation**: You can define custom evaluation metrics, which enable precise and granular assessment of search algorithm performance. It includes distance-based, cosine, semantic similarity, and more metrics out of the box.
1. **Metrics and Evaluation**: The accelerator supports end-to-end metrics that compare the generated answers (actual) against the ground-truth answers (expected), including distance-based, cosine, and semantic-similarity metrics. It also includes component-based metrics that assess retrieval and generation performance using LLMs as judges, such as context recall or answer relevance, as well as retrieval metrics to assess search results (e.g. MAP@k).

1. **Report Generation**: The **RAG Experiment Accelerator** automates the process of report generation, complete with visualizations that make it easy to analyze and share experiment findings.

@@ -253,8 +253,9 @@ Alternatively, you can run the above steps (apart from `02_qa_generation.py`)
"crossencoder_model" :"determines the model used for cross-encoding re-ranking step. Valid value is cross-encoder/stsb-roberta-base",
"search_types" : "determines the search types used for experimentation. Valid value are search_for_match_semantic, search_for_match_Hybrid_multi, search_for_match_Hybrid_cross, search_for_match_text, search_for_match_pure_vector, search_for_match_pure_vector_multi, search_for_match_pure_vector_cross, search_for_manual_hybrid. e.g. ['search_for_manual_hybrid', 'search_for_match_Hybrid_multi','search_for_match_semantic' ]",
"retrieve_num_of_documents": "determines the number of chunks to retrieve from the search index",
"metric_types" : "determines the metrics used for evaluation purpose. Valid value are lcsstr, lcsseq, cosine, jaro_winkler, hamming, jaccard, levenshtein, fuzzy, bert_all_MiniLM_L6_v2, bert_base_nli_mean_tokens, bert_large_nli_mean_tokens, bert_large_nli_stsb_mean_tokens, bert_distilbert_base_nli_stsb_mean_tokens, bert_paraphrase_multilingual_MiniLM_L12_v2, llm_answer_relevance, llm_context_precision. e.g ['fuzzy','bert_all_MiniLM_L6_v2','cosine','bert_distilbert_base_nli_stsb_mean_tokens']",
"metric_types" : "determines the metrics used for evaluation (end-to-end or component-wise metrics using LLMs). Valid values for end-to-end metrics are lcsstr, lcsseq, cosine, jaro_winkler, hamming, jaccard, levenshtein, fuzzy, bert_all_MiniLM_L6_v2, bert_base_nli_mean_tokens, bert_large_nli_mean_tokens, bert_large_nli_stsb_mean_tokens, bert_distilbert_base_nli_stsb_mean_tokens, bert_paraphrase_multilingual_MiniLM_L12_v2. Valid values for component-wise LLM-based metrics are llm_answer_relevance, llm_context_precision and llm_context_recall. e.g ['fuzzy','bert_all_MiniLM_L6_v2','cosine','bert_distilbert_base_nli_stsb_mean_tokens', 'llm_answer_relevance']",
"azure_oai_chat_deployment_name": "determines the Azure OpenAI deployment name",
"azure_oai_eval_deployment_name": "determines the Azure OpenAI deployment name used for evaluation",
"embedding_model_name": "embedding model name",
"openai_temperature": "determines the OpenAI temperature. Valid value ranges from 0 to 1.",
"search_relevancy_threshold": "the similarity threshold to determine if a doc is relevant. Valid ranges are from 0.0 to 1.0",
@@ -333,10 +334,12 @@ default value for `min_query_expansion_related_question_similarity_score` is set
The solution integrates with Azure Machine Learning and uses MLFlow to manage experiments, jobs, and artifacts. You can view the following reports as part of the evaluation process:

### Metric Comparison
`all_metrics_current_run.html` shows average scores across questions and search types for each selected metric:

![Metric Comparison](./images/metric_comparison.png)

### Metric Analysis
The computed value of each metric and the fields used for evaluation are tracked for each question and search type in the output CSV file:

![Metric Analysis](./images/metric_analysis.png)
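
As a rough illustration, the per-question output can be inspected with pandas. The column names below match the fields written by the evaluation step in this PR (`question`, `retrieved_contexts`, `actual`, `expected`, `search_type`, plus one column per selected metric); the file path is only a placeholder:

```python
import pandas as pd

# Placeholder path: the evaluation step writes its per-question CSV under
# config.eval_data_location with a run-specific name.
df = pd.read_csv("eval_output/metrics_current_run.csv")

# One row per question/search-type pair, one column per selected metric.
print(df.columns.tolist())

# Mirror the report by averaging a metric per search type.
print(df.groupby("search_type")["llm_answer_relevance"].mean())
```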

@@ -345,13 +348,20 @@ The solution integrates with Azure Machine Learning and uses MLFlow to manage experiments, jobs, and artifacts.
![Hyper Parameters](./images/hyper_parameters.png)

### Sample Metrics
Metrics can be compared across runs:

![Sample Metrics](./images/sample_metric.png)

### Search evaluation
Metrics can be compared across different search strategies:

![Search evaluation](./images/search_chart.png)

### Retrieval evaluation
Mean average precision (MAP) scores are tracked, and average MAP scores can be compared across search types:

![Retrieval evaluation](./images/map_at_k.png)
![Retrieval evaluation](./images/map_scores.png)
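
For reference, a minimal sketch of how MAP@k can be computed from binary relevance judgments; this is a generic illustration rather than the accelerator's exact implementation:

```python
def average_precision_at_k(relevances: list[int], k: int) -> float:
    """Average precision@k for one question, given binary relevance judgments (1 = relevant)."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevances[:k], start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at this cut-off
    return sum(precisions) / hits if hits else 0.0


def mean_average_precision_at_k(per_question_relevances: list[list[int]], k: int) -> float:
    """MAP@k: the mean of the per-question average precision scores."""
    return sum(average_precision_at_k(r, k) for r in per_question_relevances) / len(per_question_relevances)


# Two questions, three retrieved chunks each, judged relevant (1) or not (0).
print(mean_average_precision_at_k([[1, 0, 1], [0, 1, 1]], k=3))
```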

### Pitfalls

@@ -389,6 +399,14 @@ Possible Causes:
**Content Filtering:** Azure OpenAI has content filters in place. If the input text or generated responses are deemed inappropriate, it could lead to errors.
**API Limitations:** The Azure OpenAI service has token and rate limits that can affect the output.

#### Evaluation step

##### Description
**End-to-end evaluation metrics:** not all metrics that compare the generated and ground-truth answers capture differences in semantics. For example, metrics such as `levenshtein` or `jaro_winkler` only measure edit distance. The `cosine` metric doesn't capture semantics either: it uses the *textdistance* token-based implementation built on term-frequency vectors. To measure the semantic similarity between the generated answers and the expected responses, consider using embedding-based metrics such as the BERT scores (`bert_*`).
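
A small illustration of this pitfall, assuming the *textdistance* package is available (exact scores depend on its tokenisation settings):

```python
import textdistance

expected = "The cat sat on the mat."
actual = "A feline was resting on the rug."  # same meaning, different wording

# Edit-distance and term-frequency metrics treat these as very different strings,
# even though the meaning is close.
print(textdistance.levenshtein.normalized_similarity(expected, actual))
print(textdistance.jaro_winkler(expected, actual))
print(textdistance.cosine(expected.lower().split(), actual.lower().split()))
```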

**Component-wise evaluation metrics:** LLM-as-judge evaluation metrics aren't deterministic. The `llm_` metrics included in the accelerator use the model set in the `azure_oai_eval_deployment_name` config field. The evaluation instruction prompts can be adjusted; they are defined in `prompts.py` (`llm_answer_relevance_instruction`, `llm_context_recall_instruction`, `llm_context_precision_instruction`).
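
For reference, a hedged configuration excerpt selecting the LLM-based metrics; the field names come from the README section above, while the deployment name is a placeholder:

```json
{
  "metric_types": ["llm_answer_relevance", "llm_context_precision", "llm_context_recall"],
  "azure_oai_eval_deployment_name": "<your-evaluation-deployment>"
}
```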

**Retrieval-based metrics:** MAP scores are computed by comparing each retrieved chunk against the question and the chunk used to generate the QnA pair. To assess whether a retrieved chunk is relevant, its similarity to the concatenation of the end-user question and the chunk used in the QnA step (`02_qa_generation.py`) is computed with the SpacyEvaluator. spaCy similarity defaults to the average of the token vectors, so the computation is insensitive to word order. By default, the similarity threshold is set to 80% (`spacy_evaluator.py`).
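
A minimal sketch of that relevance check, assuming a spaCy model with word vectors is installed; the function below mirrors the description rather than the exact `SpacyEvaluator` implementation:

```python
import spacy

nlp = spacy.load("en_core_web_md")  # assumes a vector model has been downloaded


def is_relevant(retrieved_chunk: str, question: str, qna_chunk: str, threshold: float = 0.8) -> bool:
    """Relevance = similarity between the retrieved chunk and the question + QnA source chunk."""
    reference = nlp(question + " " + qna_chunk)
    candidate = nlp(retrieved_chunk)
    # spaCy similarity averages token vectors, so word order is ignored.
    return candidate.similarity(reference) >= threshold
```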

## Contributing

4 changes: 2 additions & 2 deletions dev-requirements.txt
@@ -1,6 +1,6 @@
promptflow==1.10.1
promptflow==1.11.0
promptflow-tools==1.4.0
pytest==8.2.0
pytest==8.2.1
pytest-cov==5.0.0
flake8==7.0.0
pre-commit==3.7.1
19 changes: 14 additions & 5 deletions docs/evaluation-metrics.md
@@ -29,7 +29,8 @@ You can choose which metrics should be calculated in your experiment by updating
"bert_distilbert_base_nli_stsb_mean_tokens",
"bert_paraphrase_multilingual_MiniLM_L12_v2",
"llm_answer_relevance",
"llm_context_precision"
"llm_context_precision",
"llm_context_recall"
]
```

@@ -54,13 +55,13 @@ Calculates the longest common substring (LCS) similarity score between two strings.

Computes the longest common subsequence (LCS) similarity score between two input strings.

### Cosine similarity
### Cosine similarity (Ochiai coefficient)

| Configuration Key | Calculation Base | Possible Values |
| ----------------- | -------------------- | ------------------ |
| `cosine` | `actual`, `expected` | Percentage (0-100) |

The cosine similarity is calculated as the cosine of the angle between two strings represented as vectors.
This coefficient is calculated as the intersection of the term-frequency vectors of the generated answer (actual) and the ground-truth answer (expected) divided by the geometric mean of the sizes of these vectors.
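
For illustration, a simplified sketch of the Ochiai coefficient over term-frequency vectors; the accelerator relies on the *textdistance* implementation, and this stand-alone version only mirrors the definition above:

```python
from collections import Counter
from math import sqrt


def ochiai(actual: str, expected: str) -> float:
    """Ochiai coefficient between two strings over term-frequency vectors, as a percentage."""
    a, b = Counter(actual.lower().split()), Counter(expected.lower().split())
    intersection = sum((a & b).values())  # overlap of the term-frequency vectors
    denominator = sqrt(sum(a.values()) * sum(b.values()))  # geometric mean of the vector sizes
    return 100 * intersection / denominator if denominator else 0.0


print(ochiai("the cat sat on the mat", "the cat slept on the mat"))
```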

### Jaro-Winkler distance

@@ -148,6 +149,14 @@ information is penalized.

| Configuration Key | Calculation Base | Possible Values |
| ------------------- | ------------------- | ----------------------------------------------------------------- |
| `llm_context_precision` | `actual`, `context` | 1 (yes) or 0 (no) depending on if the context is relevant or not. |
| `llm_context_precision` | `question`, `retrieved_contexts` | Percentage (0-100) |

Checks whether or not the context generated by the RAG solution is useful for answering a question.
The proportion of retrieved contexts that are relevant to the question. Evaluates whether or not each retrieved context is useful for answering the question.
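
Expressed as a ratio, matching the updated `llm_context_precision` implementation in this PR, where the LLM judge gives each retrieved context a yes/no relevance verdict:

```math
\text{precision} = 100 \times \frac{\text{retrieved contexts judged relevant}}{\text{retrieved contexts judged}}
```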

### LLM Context recall

| Configuration Key | Calculation Base | Possible Values |
| ------------------- | ------------------- | ----------------------------------------------------------------- |
| `llm_context_recall` | `question`, `expected`, `retrieved_contexts` | Percentage (0-100) |

Estimates context recall by classifying each sentence of the annotated answer (ground truth) as attributable to the retrieved contexts (TP) or not (FN). In an ideal scenario, all sentences in the ground-truth answer should be attributable to the retrieved context.
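
Expressed as a ratio, where a ground-truth sentence counts as a true positive (TP) if it is attributable to the retrieved context and as a false negative (FN) otherwise:

```math
\text{recall} = 100 \times \frac{TP}{TP + FN}
```
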
Binary file added images/map_at_k.png
Binary file added images/map_scores.png
Binary file modified images/metric_comparison.png
@@ -67,6 +67,7 @@ def test_save(mock_artifact_handler_save_dict):
search_type="search_type1",
search_evals=[],
context="context1",
retrieved_contexts=["retrievedcontext1"],
question="question1",
)

@@ -94,6 +95,7 @@ def test_load(mock_artifact_handler_load):
search_type="search_type1",
search_evals=[],
context="context1",
retrieved_contexts=["retrievedcontext1"],
question="question1",
)

@@ -119,6 +121,7 @@ def test_load(mock_artifact_handler_load):
assert d.search_type == data.search_type
assert d.search_evals == data.search_evals
assert d.context == data.context
assert d.retrieved_contexts == data.retrieved_contexts


@patch(
5 changes: 4 additions & 1 deletion rag_experiment_accelerator/artifact/models/query_output.py
@@ -14,7 +14,8 @@ class QueryOutput:
expected (str): The expected output.
search_type (str): The type of search.
search_evals (list): The evaluations for search.
context (str): The context of the query.
context (str): The qna context of the query.
retrieved_contexts (list): The list of retrieved contexts of the query.
question (str): The question of the query.
"""

@@ -32,6 +33,7 @@ def __init__(
search_type: str,
search_evals: list,
context: str,
retrieved_contexts: list,
question: str,
):
self.rerank = rerank
@@ -46,4 +48,5 @@ def __init__(
self.search_type = search_type
self.search_evals = search_evals
self.context = context
self.retrieved_contexts = retrieved_contexts
self.question = question
113 changes: 74 additions & 39 deletions rag_experiment_accelerator/evaluation/eval.py
@@ -65,6 +65,22 @@ def remove_spaces(text):
return text.strip()


def lower_and_strip(text):
"""
Converts the input to lowercase and strips surrounding whitespace, or returns an empty string if the input is None.

Args:
text (str): The string to format.

Returns:
str: The formatted input string.
"""
if text is None:
return ""
else:
return text.lower().strip()


# https://huggingface.co/spaces/evaluate-metric/bleu
def bleu(predictions, references):
bleu = evaluate.load("bleu")
@@ -212,7 +228,10 @@ def jaro_winkler(value1, value2):

def cosine(value1, value2):
"""
Calculates the cosine similarity between two vectors.
Calculates the cosine similarity (Ochiai coefficient) between two strings
using token-frequency vectors

https://en.wikipedia.org/wiki/Cosine_similarity.

Args:
value1 (list): The first vector.
@@ -288,40 +307,51 @@ def llm_answer_relevance(


def llm_context_precision(
response_generator: ResponseGenerator, question, context
response_generator: ResponseGenerator, question, retrieved_contexts
) -> float:
"""
Verifies whether or not a given context is useful for answering a question.
Computes precision by assessing whether each retrieved context is useful for answering a question.
Only considers the presence of relevant chunks in the retrieved contexts, but doesn't take into
account their ranking order.

Args:
question (str): The question being asked.
context (str): The given context.
retrieved_contexts (list[str]): The list of retrieved contexts for the query.

Returns:
int: 1 or 0 depending on if the context is relevant or not.
float: The percentage of retrieved chunks judged relevant to the question (0-100), or -1 if no judgments could be generated.
"""
result: str | None = response_generator.generate_response(
llm_context_precision_instruction,
context=context,
question=question,
)
if result is None:
logger.warning("Unable to generate context precision score")
return 0.0
relevancy_scores = []

# Since we're only asking for one response, the result is always a boolean 1 or 0
if result.lower().strip() == "yes":
return 100.0

return 0
for context in retrieved_contexts:
result: str | None = response_generator.generate_response(
llm_context_precision_instruction,
context=context,
question=question,
)
llm_judge_response = lower_and_strip(result)
# The judge is asked for a single verdict, so the response should be 'yes' or 'no'
if llm_judge_response == "yes":
relevancy_scores.append(1)
elif llm_judge_response == "no":
relevancy_scores.append(0)
else:
logger.warning("Unable to generate context precision score")

logger.debug(relevancy_scores)

if not relevancy_scores:
logger.warning("Unable to compute average context precision")
return -1
else:
return (sum(relevancy_scores) / len(relevancy_scores)) * 100


def llm_context_recall(
response_generator: ResponseGenerator,
question,
groundtruth_answer,
context,
temperature: int,
retrieved_contexts,
):
"""
Estimates context recall by estimating TP and FN using annotated answer (ground truth) and retrieved context.
@@ -338,21 +368,24 @@ def llm_context_recall(
Args:
question (str): The question being asked
groundtruth_answer (str): The ground truth ("output_prompt")
context (str): The given context.
temperature (int): Temperature as defined in the config.
retrieved_contexts (list[str]): The list of retrieved contexts for the query

Returns:
double: The context recall score generated between the ground truth (expected) and context.
"""

context = "\n".join(retrieved_contexts)
prompt = (
"\nquestion: "
+ question
+ "\ncontext: "
+ context
+ "\nanswer: "
+ groundtruth_answer
)
result = response_generator.generate_response(
llm_context_recall_instruction,
temperature=temperature,
question=question,
context=context,
answer=groundtruth_answer,
sys_message=llm_context_recall_instruction,
prompt=prompt,
)

good_response = '"Attributed": "1"'
bad_response = '"Attributed": "0"'

@@ -481,7 +514,7 @@ def compute_metrics(
question,
actual,
expected,
context,
retrieved_contexts,
metric_type,
):
"""
@@ -490,10 +523,11 @@
Args:
actual (str): The first string to compare.
expected (str): The second string to compare.
retrieved_contexts (list[str]): The list of retrieved contexts for the query.
metric_type (str): The type of metric to use for comparison. Valid options are:
- "lcsstr": Longest common substring
- "lcsseq": Longest common subsequence
- "cosine": Cosine similarity
- "cosine": Cosine similarity (Ochiai coefficient)
- "jaro_winkler": Jaro-Winkler distance
- "hamming": Hamming distance
- "jaccard": Jaccard similarity
@@ -508,7 +542,6 @@
- "llm_context_precision": Verifies whether or not a given context is useful for answering a question.
- "llm_answer_relevance": Scores the relevancy of the answer according to the given question.
- "llm_context_recall": Scores context recall by estimating TP and FN using annotated answer (ground truth) and retrieved context.
config (Config): The configuration of the experiment.

Returns:
float: The similarity score between the two strings, as determined by the specified metric.
@@ -566,11 +599,13 @@
actual, expected, paraphrase_multilingual_MiniLM_L12_v2
)
elif metric_type == "llm_answer_relevance":
score = llm_answer_relevance(response_generator, actual, expected)
score = llm_answer_relevance(response_generator, question, actual)
elif metric_type == "llm_context_precision":
score = llm_context_precision(response_generator, actual, context)
score = llm_context_precision(response_generator, question, retrieved_contexts)
elif metric_type == "llm_context_recall":
score = llm_context_recall(response_generator, question, expected, context)
score = llm_context_recall(
response_generator, question, expected, retrieved_contexts
)
else:
pass

@@ -597,12 +632,12 @@ def evaluate_single_prompt(
data.question,
actual,
expected,
data.context,
data.retrieved_contexts,
metric_type,
)
metric_dic[metric_type] = score
metric_dic["question"] = data.question
metric_dic["context"] = data.context
metric_dic["retrieved_contexts"] = data.retrieved_contexts
metric_dic["actual"] = actual
metric_dic["expected"] = expected
metric_dic["search_type"] = data.search_type
@@ -710,7 +745,7 @@ def evaluate_prompts(
mean_scores["mean"].append(mean(scores))

run_id = mlflow.active_run().info.run_id
columns_to_remove = ["question", "context", "actual", "expected"]
columns_to_remove = ["question", "retrieved_contexts", "actual", "expected"]
additional_columns_to_remove = ["search_type"]
df = pd.DataFrame(data_list)
df.to_csv(
@@ -762,7 +797,7 @@
os.path.join(config.eval_data_location, f"sum_{name_suffix}.csv")
)
draw_hist_df(sum_df, run_id, mlflow_client)
generate_metrics(config.index_name_prefix, run_id, mlflow_client)
generate_metrics(config.experiment_name, run_id, mlflow_client)


def draw_search_chart(temp_df, run_id, mlflow_client):