Fix llm-based evaluation metrics #572

Merged · 21 commits · May 24, 2024

Commits

a8810db
Fix experiment name
kcortinas May 23, 2024
d96aff4
Add retrieved contexts to queryoutput
kcortinas May 23, 2024
6790166
Fix context precision metric
kcortinas May 23, 2024
3875b2e
Remove log of run_metrics dict
kcortinas May 23, 2024
9912b8d
Fix context recall metric
kcortinas May 23, 2024
8ccdc26
Update documentation context recall metric
kcortinas May 23, 2024
aedab7e
Fix llm_answer_relevance inputs
kcortinas May 23, 2024
2c55ec8
Update output handler unit tests
kcortinas May 23, 2024
91817d2
Update evaluation documentation (#571)
kcortinas May 23, 2024
07a4c2c
Bump azure-ai-ml from 1.15.0 to 1.16.0 (#569)
dependabot[bot] May 23, 2024
71e457e
Bump matplotlib from 3.8.4 to 3.9.0 (#567)
dependabot[bot] May 23, 2024
6ce5cfe
Bump tiktoken from 0.6.0 to 0.7.0 (#565)
dependabot[bot] May 23, 2024
336240e
Bump pytest from 8.2.0 to 8.2.1 (#562)
dependabot[bot] May 23, 2024
7617f7d
Bump langchain-community from 0.0.38 to 0.2.0 (#568)
dependabot[bot] May 23, 2024
4b9a02b
Merge branch 'development' into karina/update-eval-metrics
kcortinas May 23, 2024
ec87f2b
Bump lxml from 5.2.1 to 5.2.2 (#566)
dependabot[bot] May 23, 2024
fda5e9d
Merge branch 'development' into karina/update-eval-metrics
kcortinas May 23, 2024
93f7dbe
Merge branch 'prerelease' into karina/update-eval-metrics
kcortinas May 24, 2024
99c73c8
fix conflicts with prerelease
kcortinas May 24, 2024
ca562a1
fix context precision conflicts with prerelease
kcortinas May 24, 2024
ec6e2f0
Fix context precision conflicts with prerelease
kcortinas May 24, 2024
22 changes: 20 additions & 2 deletions README.md
@@ -45,7 +45,7 @@ The custom loader resorts to the simpler 'prebuilt-layout' API model as a fallback

1. **Re-Ranking**: The query responses from Azure AI Search are re-evaluated using an LLM and ranked according to the relevance between the query and the context.

1. **Metrics and Evaluation**: You can define custom evaluation metrics, which enable precise and granular assessment of search algorithm performance. It includes distance-based, cosine, semantic similarity, and more metrics out of the box.
1. **Metrics and Evaluation**: The accelerator supports end-to-end metrics that compare the generated answers (actual) against the ground-truth answers (expected), including distance-based, cosine, and semantic-similarity metrics. It also includes component-based metrics that assess retrieval and generation performance using LLMs as judges, such as context recall or answer relevance, as well as retrieval metrics to assess search results (e.g. MAP@k).

1. **Report Generation**: The **RAG Experiment Accelerator** automates the process of report generation, complete with visualizations that make it easy to analyze and share experiment findings.

@@ -253,8 +253,9 @@ Alternatively, you can run the above steps (apart from `02_qa_generation.py`)
"crossencoder_model" :"determines the model used for cross-encoding re-ranking step. Valid value is cross-encoder/stsb-roberta-base",
"search_types" : "determines the search types used for experimentation. Valid value are search_for_match_semantic, search_for_match_Hybrid_multi, search_for_match_Hybrid_cross, search_for_match_text, search_for_match_pure_vector, search_for_match_pure_vector_multi, search_for_match_pure_vector_cross, search_for_manual_hybrid. e.g. ['search_for_manual_hybrid', 'search_for_match_Hybrid_multi','search_for_match_semantic' ]",
"retrieve_num_of_documents": "determines the number of chunks to retrieve from the search index",
"metric_types" : "determines the metrics used for evaluation purpose. Valid value are lcsstr, lcsseq, cosine, jaro_winkler, hamming, jaccard, levenshtein, fuzzy, bert_all_MiniLM_L6_v2, bert_base_nli_mean_tokens, bert_large_nli_mean_tokens, bert_large_nli_stsb_mean_tokens, bert_distilbert_base_nli_stsb_mean_tokens, bert_paraphrase_multilingual_MiniLM_L12_v2, llm_answer_relevance, llm_context_precision. e.g ['fuzzy','bert_all_MiniLM_L6_v2','cosine','bert_distilbert_base_nli_stsb_mean_tokens']",
"metric_types" : "determines the metrics used for evaluation (end-to-end or component-wise metrics using LLMs). Valid values for end-to-end metrics are lcsstr, lcsseq, cosine, jaro_winkler, hamming, jaccard, levenshtein, fuzzy, bert_all_MiniLM_L6_v2, bert_base_nli_mean_tokens, bert_large_nli_mean_tokens, bert_large_nli_stsb_mean_tokens, bert_distilbert_base_nli_stsb_mean_tokens, bert_paraphrase_multilingual_MiniLM_L12_v2. Valid values for component-wise LLM-based metrics are llm_answer_relevance, llm_context_precision and llm_context_recall. e.g ['fuzzy','bert_all_MiniLM_L6_v2','cosine','bert_distilbert_base_nli_stsb_mean_tokens', 'llm_answer_relevance']",
"azure_oai_chat_deployment_name": "determines the Azure OpenAI deployment name",
"azure_oai_eval_deployment_name": "determines the Azure OpenAI deployment name used for evaluation",
"embedding_model_name": "embedding model name",
"openai_temperature": "determines the OpenAI temperature. Valid value ranges from 0 to 1.",
"search_relevancy_threshold": "the similarity threshold to determine if a doc is relevant. Valid ranges are from 0.0 to 1.0",
@@ -333,10 +334,12 @@ default value for `min_query_expansion_related_question_similarity_score` is set
The solution integrates with Azure Machine Learning and uses MLFlow to manage experiments, jobs, and artifacts. You can view the following reports as part of the evaluation process:

### Metric Comparison
`all_metrics_current_run.html` shows average scores across questions and search types for each selected metric:

![Metric Comparison](./images/metric_comparison.png)

### Metric Analysis
The computed value of each metric and the fields used for evaluation are tracked for each question and search type in the output CSV file:

![Metric Analysis](./images/metric_analysis.png)
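
As a rough illustration, the per-question output can be inspected with pandas. The column names below match the fields written by the evaluation step in this PR (`question`, `retrieved_contexts`, `actual`, `expected`, `search_type`, plus one column per selected metric); the file path is only a placeholder:

```python
import pandas as pd

# Placeholder path: the evaluation step writes its per-question CSV under
# config.eval_data_location with a run-specific name.
df = pd.read_csv("eval_output/metrics_current_run.csv")

# One row per question/search-type pair, one column per selected metric.
print(df.columns.tolist())

# Mirror the report by averaging a metric per search type.
print(df.groupby("search_type")["llm_answer_relevance"].mean())
```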

@@ -345,13 +348,20 @@ The solution integrates with Azure Machine Learning and uses MLFlow to manage experiments, jobs, and artifacts.
![Hyper Parameters](./images/hyper_parameters.png)

### Sample Metrics
Metrics can be compared across runs:

![Sample Metrics](./images/sample_metric.png)

### Search evaluation
Metrics can be compared across different search strategies:

![Search evaluation](./images/search_chart.png)

### Retrieval evaluation
Mean average precision (MAP) scores are tracked, and average MAP scores can be compared across search types:

![Retrieval evaluation](./images/map_at_k.png)
![Retrieval evaluation](./images/map_scores.png)
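
For reference, a minimal sketch of how MAP@k can be computed from binary relevance judgments; this is a generic illustration rather than the accelerator's exact implementation:

```python
def average_precision_at_k(relevances: list[int], k: int) -> float:
    """Average precision@k for one question, given binary relevance judgments (1 = relevant)."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevances[:k], start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at this cut-off
    return sum(precisions) / hits if hits else 0.0


def mean_average_precision_at_k(per_question_relevances: list[list[int]], k: int) -> float:
    """MAP@k: the mean of the per-question average precision scores."""
    return sum(average_precision_at_k(r, k) for r in per_question_relevances) / len(per_question_relevances)


# Two questions, three retrieved chunks each, judged relevant (1) or not (0).
print(mean_average_precision_at_k([[1, 0, 1], [0, 1, 1]], k=3))
```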

### Pitfalls

@@ -389,6 +399,14 @@ Possible Causes:
**Content Filtering:** Azure OpenAI has content filters in place. If the input text or generated responses are deemed inappropriate, it could lead to errors.
**API Limitations:** The Azure OpenAI service has token and rate limits that can affect the output.

#### Evaluation step

##### Description
**End-to-end evaluation metrics:** not all metrics that compare the generated and ground-truth answers capture differences in semantics. For example, metrics such as `levenshtein` or `jaro_winkler` only measure edit distance. The `cosine` metric doesn't capture semantics either: it uses the *textdistance* token-based implementation built on term-frequency vectors. To measure the semantic similarity between the generated answers and the expected responses, consider using embedding-based metrics such as the BERT scores (`bert_*`).
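
A small illustration of this pitfall, assuming the *textdistance* package is available (exact scores depend on its tokenisation settings):

```python
import textdistance

expected = "The cat sat on the mat."
actual = "A feline was resting on the rug."  # same meaning, different wording

# Edit-distance and term-frequency metrics treat these as very different strings,
# even though the meaning is close.
print(textdistance.levenshtein.normalized_similarity(expected, actual))
print(textdistance.jaro_winkler(expected, actual))
print(textdistance.cosine(expected.lower().split(), actual.lower().split()))
```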

**Component-wise evaluation metrics:** LLM-as-judge evaluation metrics aren't deterministic. The `llm_` metrics included in the accelerator use the model set in the `azure_oai_eval_deployment_name` config field. The evaluation instruction prompts can be adjusted; they are defined in `prompts.py` (`llm_answer_relevance_instruction`, `llm_context_recall_instruction`, `llm_context_precision_instruction`).
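
For reference, a hedged configuration excerpt selecting the LLM-based metrics; the field names come from the README section above, while the deployment name is a placeholder:

```json
{
  "metric_types": ["llm_answer_relevance", "llm_context_precision", "llm_context_recall"],
  "azure_oai_eval_deployment_name": "<your-evaluation-deployment>"
}
```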

**Retrieval-based metrics:** MAP scores are computed by comparing each retrieved chunk against the question and the chunk used to generate the QnA pair. To assess whether a retrieved chunk is relevant, its similarity to the concatenation of the end-user question and the chunk used in the QnA step (`02_qa_generation.py`) is computed with the SpacyEvaluator. spaCy similarity defaults to the average of the token vectors, so the computation is insensitive to word order. By default, the similarity threshold is set to 80% (`spacy_evaluator.py`).
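
A minimal sketch of that relevance check, assuming a spaCy model with word vectors is installed; the function below mirrors the description rather than the exact `SpacyEvaluator` implementation:

```python
import spacy

nlp = spacy.load("en_core_web_md")  # assumes a vector model has been downloaded


def is_relevant(retrieved_chunk: str, question: str, qna_chunk: str, threshold: float = 0.8) -> bool:
    """Relevance = similarity between the retrieved chunk and the question + QnA source chunk."""
    reference = nlp(question + " " + qna_chunk)
    candidate = nlp(retrieved_chunk)
    # spaCy similarity averages token vectors, so word order is ignored.
    return candidate.similarity(reference) >= threshold
```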

## Contributing

4 changes: 2 additions & 2 deletions dev-requirements.txt
@@ -1,6 +1,6 @@
promptflow==1.10.1
promptflow==1.11.0
promptflow-tools==1.4.0
pytest==8.2.0
pytest==8.2.1
pytest-cov==5.0.0
flake8==7.0.0
pre-commit==3.7.1
19 changes: 14 additions & 5 deletions docs/evaluation-metrics.md
@@ -29,7 +29,8 @@ You can choose which metrics should be calculated in your experiment by updating
"bert_distilbert_base_nli_stsb_mean_tokens",
"bert_paraphrase_multilingual_MiniLM_L12_v2",
"llm_answer_relevance",
"llm_context_precision"
"llm_context_precision",
"llm_context_recall"
]
```

@@ -54,13 +55,13 @@ Calculates the longest common substring (LCS) similarity score between two strings.

Computes the longest common subsequence (LCS) similarity score between two input strings.

### Cosine similarity
### Cosine similarity (Ochiai coefficient)

| Configuration Key | Calculation Base | Possible Values |
| ----------------- | -------------------- | ------------------ |
| `cosine` | `actual`, `expected` | Percentage (0-100) |

The cosine similarity is calculated as the cosine of the angle between two strings represented as vectors.
This coefficient is calculated as the intersection of the term-frequency vectors of the generated answer (actual) and the ground-truth answer (expected) divided by the geometric mean of the sizes of these vectors.
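
For illustration, a simplified sketch of the Ochiai coefficient over term-frequency vectors; the accelerator relies on the *textdistance* implementation, and this stand-alone version only mirrors the definition above:

```python
from collections import Counter
from math import sqrt


def ochiai(actual: str, expected: str) -> float:
    """Ochiai coefficient between two strings over term-frequency vectors, as a percentage."""
    a, b = Counter(actual.lower().split()), Counter(expected.lower().split())
    intersection = sum((a & b).values())  # overlap of the term-frequency vectors
    denominator = sqrt(sum(a.values()) * sum(b.values()))  # geometric mean of the vector sizes
    return 100 * intersection / denominator if denominator else 0.0


print(ochiai("the cat sat on the mat", "the cat slept on the mat"))
```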

### Jaro-Winkler distance

@@ -148,6 +149,14 @@ information is penalized.

| Configuration Key | Calculation Base | Possible Values |
| ------------------- | ------------------- | ----------------------------------------------------------------- |
| `llm_context_precision` | `actual`, `context` | 1 (yes) or 0 (no) depending on if the context is relevant or not. |
| `llm_context_precision` | `question`, `retrieved_contexts` | Percentage (0-100) |

Checks whether or not the context generated by the RAG solution is useful for answering a question.
The proportion of retrieved contexts that are relevant to the question. Evaluates whether or not each retrieved context is useful for answering the question.
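
Expressed as a ratio, matching the updated `llm_context_precision` implementation in this PR, where the LLM judge gives each retrieved context a yes/no relevance verdict:

```math
\text{precision} = 100 \times \frac{\text{retrieved contexts judged relevant}}{\text{retrieved contexts judged}}
```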

### LLM Context recall

| Configuration Key | Calculation Base | Possible Values |
| ------------------- | ------------------- | ----------------------------------------------------------------- |
| `llm_context_recall` | `question`, `expected`, `retrieved_contexts` | Percentage (0-100) |

Estimates context recall by classifying each sentence of the annotated answer (ground truth) as attributable to the retrieved contexts (TP) or not (FN). In an ideal scenario, all sentences in the ground-truth answer should be attributable to the retrieved context.
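
Expressed as a ratio, where a ground-truth sentence counts as a true positive (TP) if it is attributable to the retrieved context and as a false negative (FN) otherwise:

```math
\text{recall} = 100 \times \frac{TP}{TP + FN}
```
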
Binary file added images/map_at_k.png
Binary file added images/map_scores.png
Binary file modified images/metric_comparison.png
@@ -67,6 +67,7 @@ def test_save(mock_artifact_handler_save_dict):
search_type="search_type1",
search_evals=[],
context="context1",
retrieved_contexts=["retrievedcontext1"],
question="question1",
)

@@ -94,6 +95,7 @@ def test_load(mock_artifact_handler_load):
search_type="search_type1",
search_evals=[],
context="context1",
retrieved_contexts=["retrievedcontext1"],
question="question1",
)

@@ -119,6 +121,7 @@ def test_load(mock_artifact_handler_load):
assert d.search_type == data.search_type
assert d.search_evals == data.search_evals
assert d.context == data.context
assert d.retrieved_contexts == data.retrieved_contexts


@patch(
5 changes: 4 additions & 1 deletion rag_experiment_accelerator/artifact/models/query_output.py
@@ -14,7 +14,8 @@ class QueryOutput:
expected (str): The expected output.
search_type (str): The type of search.
search_evals (list): The evaluations for search.
context (str): The context of the query.
context (str): The qna context of the query.
retrieved_contexts (list): The list of retrieved contexts of the query.
question (str): The question of the query.
"""

@@ -32,6 +33,7 @@ def __init__(
search_type: str,
search_evals: list,
context: str,
retrieved_contexts: list,
question: str,
):
self.rerank = rerank
@@ -46,4 +48,5 @@ def __init__(
self.search_type = search_type
self.search_evals = search_evals
self.context = context
self.retrieved_contexts = retrieved_contexts
self.question = question
113 changes: 74 additions & 39 deletions rag_experiment_accelerator/evaluation/eval.py
@@ -65,6 +65,22 @@ def remove_spaces(text):
return text.strip()


def lower_and_strip(text):
"""
Converts the input to lowercase and strips surrounding whitespace, or returns an empty string if the input is None.

Args:
text (str): The string to format.

Returns:
str: The formatted input string.
"""
if text is None:
return ""
else:
return text.lower().strip()


# https://huggingface.co/spaces/evaluate-metric/bleu
def bleu(predictions, references):
bleu = evaluate.load("bleu")
@@ -212,7 +228,10 @@ def jaro_winkler(value1, value2):

def cosine(value1, value2):
"""
Calculates the cosine similarity between two vectors.
Calculates the cosine similarity (Ochiai coefficient) between two strings
using token-frequency vectors

https://en.wikipedia.org/wiki/Cosine_similarity.

Args:
value1 (list): The first vector.
@@ -288,40 +307,51 @@ def llm_answer_relevance(


def llm_context_precision(
response_generator: ResponseGenerator, question, context
response_generator: ResponseGenerator, question, retrieved_contexts
) -> float:
"""
Verifies whether or not a given context is useful for answering a question.
Computes precision by assessing whether each retrieved context is useful for answering a question.
Only considers the presence of relevant chunks in the retrieved contexts, but doesn't take into
account their ranking order.

Args:
question (str): The question being asked.
context (str): The given context.
retrieved_contexts (list[str]): The list of retrieved contexts for the query.

Returns:
int: 1 or 0 depending on if the context is relevant or not.
float: The percentage of retrieved chunks judged relevant to the question (0-100), or -1 if no judgments could be generated.
"""
result: str | None = response_generator.generate_response(
llm_context_precision_instruction,
context=context,
question=question,
)
if result is None:
logger.warning("Unable to generate context precision score")
return 0.0
relevancy_scores = []

# Since we're only asking for one response, the result is always a boolean 1 or 0
if result.lower().strip() == "yes":
return 100.0

return 0
for context in retrieved_contexts:
result: str | None = response_generator.generate_response(
llm_context_precision_instruction,
context=context,
question=question,
)
llm_judge_response = lower_and_strip(result)
# The judge is asked for a single verdict, so the response should be 'yes' or 'no'
if llm_judge_response == "yes":
relevancy_scores.append(1)
elif llm_judge_response == "no":
relevancy_scores.append(0)
else:
logger.warning("Unable to generate context precision score")

logger.debug(relevancy_scores)

if not relevancy_scores:
logger.warning("Unable to compute average context precision")
return -1
else:
return (sum(relevancy_scores) / len(relevancy_scores)) * 100


def llm_context_recall(
response_generator: ResponseGenerator,
question,
groundtruth_answer,
context,
temperature: int,
retrieved_contexts,
):
"""
Estimates context recall by estimating TP and FN using annotated answer (ground truth) and retrieved context.
@@ -338,21 +368,24 @@ def llm_context_recall(
Args:
question (str): The question being asked
groundtruth_answer (str): The ground truth ("output_prompt")
context (str): The given context.
temperature (int): Temperature as defined in the config.
retrieved_contexts (list[str]): The list of retrieved contexts for the query

Returns:
double: The context recall score generated between the ground truth (expected) and context.
"""

context = "\n".join(retrieved_contexts)
prompt = (
"\nquestion: "
+ question
+ "\ncontext: "
+ context
+ "\nanswer: "
+ groundtruth_answer
)
result = response_generator.generate_response(
llm_context_recall_instruction,
temperature=temperature,
question=question,
context=context,
answer=groundtruth_answer,
sys_message=llm_context_recall_instruction,
prompt=prompt,
)

good_response = '"Attributed": "1"'
bad_response = '"Attributed": "0"'

@@ -481,7 +514,7 @@ def compute_metrics(
question,
actual,
expected,
context,
retrieved_contexts,
metric_type,
):
"""
@@ -490,10 +523,11 @@
Args:
actual (str): The first string to compare.
expected (str): The second string to compare.
retrieved_contexts (list[str]): The list of retrieved contexts for the query.
metric_type (str): The type of metric to use for comparison. Valid options are:
- "lcsstr": Longest common substring
- "lcsseq": Longest common subsequence
- "cosine": Cosine similarity
- "cosine": Cosine similarity (Ochiai coefficient)
- "jaro_winkler": Jaro-Winkler distance
- "hamming": Hamming distance
- "jaccard": Jaccard similarity
@@ -508,7 +542,6 @@
- "llm_context_precision": Verifies whether or not a given context is useful for answering a question.
- "llm_answer_relevance": Scores the relevancy of the answer according to the given question.
- "llm_context_recall": Scores context recall by estimating TP and FN using annotated answer (ground truth) and retrieved context.
config (Config): The configuration of the experiment.

Returns:
float: The similarity score between the two strings, as determined by the specified metric.
@@ -566,11 +599,13 @@
actual, expected, paraphrase_multilingual_MiniLM_L12_v2
)
elif metric_type == "llm_answer_relevance":
score = llm_answer_relevance(response_generator, actual, expected)
score = llm_answer_relevance(response_generator, question, actual)
elif metric_type == "llm_context_precision":
score = llm_context_precision(response_generator, actual, context)
score = llm_context_precision(response_generator, question, retrieved_contexts)
elif metric_type == "llm_context_recall":
score = llm_context_recall(response_generator, question, expected, context)
score = llm_context_recall(
response_generator, question, expected, retrieved_contexts
)
else:
pass

@@ -597,12 +632,12 @@ def evaluate_single_prompt(
data.question,
actual,
expected,
data.context,
data.retrieved_contexts,
metric_type,
)
metric_dic[metric_type] = score
metric_dic["question"] = data.question
metric_dic["context"] = data.context
metric_dic["retrieved_contexts"] = data.retrieved_contexts
metric_dic["actual"] = actual
metric_dic["expected"] = expected
metric_dic["search_type"] = data.search_type
@@ -710,7 +745,7 @@ def evaluate_prompts(
mean_scores["mean"].append(mean(scores))

run_id = mlflow.active_run().info.run_id
columns_to_remove = ["question", "context", "actual", "expected"]
columns_to_remove = ["question", "retrieved_contexts", "actual", "expected"]
additional_columns_to_remove = ["search_type"]
df = pd.DataFrame(data_list)
df.to_csv(
@@ -762,7 +797,7 @@
os.path.join(config.eval_data_location, f"sum_{name_suffix}.csv")
)
draw_hist_df(sum_df, run_id, mlflow_client)
generate_metrics(config.index_name_prefix, run_id, mlflow_client)
generate_metrics(config.experiment_name, run_id, mlflow_client)


def draw_search_chart(temp_df, run_id, mlflow_client):