Update sample documentation and add column_mapping (#140)
* update promptflow-eval dependencies to azure-ai-evaluation

* clear local variables

* fix errors and remove 'question' col from data

* small fix in evaluator config

* update docs and add column mapping

* pre-commit fixes
slister1001 authored Oct 29, 2024
1 parent 933d29e commit af861e2
Showing 8 changed files with 89 additions and 17 deletions.
50 changes: 50 additions & 0 deletions scenarios/evaluate/README.md
@@ -0,0 +1,50 @@
---
page_type: sample
languages:
- python
products:
- ai-services
- azure-openai
description: Samples for evaluating Generative AI models with the azure-ai-evaluation SDK.
---

## Evaluate

### Overview

This tutorial provides a step-by-step guide on how to evaluate Generative AI models with Azure. Each of these samples uses the `azure-ai-evaluation` SDK.

### Objective

The main objective of this tutorial is to help users understand the process of evaluating an AI model in Azure. By the end of this tutorial, you should be able to:

- Simulate interactions with an AI model
- Evaluate both deployed model endpoints and applications
- Evaluate using quantitative NLP metrics, qualitative metrics, and custom metrics

Our samples cover the following tools for evaluation of AI models in Azure:

| Sample name | adversarial | simulator | conversation starter | index | raw text | against model endpoint | against app | qualitative metrics | custom metrics | quantitative NLP metrics |
|----------------------------------------|-------------|-----------|---------------------|-------|----------|-----------------------|-------------|---------------------|----------------|----------------------|
| simulate_adversarial.ipynb | X | X | | | | X | | | | |
| simulate_conversation_starter.ipynb | | X | X | | | X | | | | |
| simulate_input_index.ipynb | | X | | X | | X | | | | |
| simulate_input_text.ipynb | | X | | | X | X | | | | |
| evaluate_endpoints.ipynb | | | | | | X | | X | | |
| evaluate_app.ipynb | | | | | | | X | X | | |
| evaluate_qualitative.ipynb | | | | | | X | | X | | |
| evaluate_custom.ipynb | | | | | | X | | | X | |
| evaluate_quantitative.ipynb | | | | | | X | | | | X |
| evaluate_safety_risk.ipynb | X | | | | | X | | | | |
| simulate_and_evaluate_endpoint.py | | X | | | X | X | | X | | |



### Pre-requisites

To use the `azure-ai-evaluation` SDK, install it with:

```bash
pip install azure-ai-evaluation
```

Python 3.8 or later is required to use this package.

- See our Python reference documentation for the `azure-ai-evaluation` SDK [here](https://aka.ms/azureaieval-python-ref) for more granular details on input/output requirements and usage instructions.
- Check out our GitHub repo for the `azure-ai-evaluation` SDK [here](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/evaluation/azure-ai-evaluation).
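
As a quick orientation, here is a minimal sketch of running a batch evaluation with the SDK; the dataset path and its columns are placeholder assumptions, not files from this repo:

```python
# Minimal sketch (assumed inputs): score a JSONL dataset with an NLP metric.
# Each line of data.jsonl is assumed to contain "response" and "ground_truth" columns.
from azure.ai.evaluation import BleuScoreEvaluator, evaluate

result = evaluate(
    data="data.jsonl",                          # placeholder dataset path
    evaluators={"bleu": BleuScoreEvaluator()},  # NLP metrics need no LLM deployment
)
print(result["metrics"])  # aggregate scores across all rows
```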


### Programming Languages
- Python

### Estimated Runtime: 30 mins
@@ -375,7 +375,13 @@
" \"relevance\": relevance_evaluator,\n",
" },\n",
" evaluator_config={\n",
" \"relevance\": {\"response\": \"${target.response}\", \"context\": \"${data.context}\", \"query\": \"${data.query}\"},\n",
" \"relevance\": {\n",
" \"column_mapping\": {\n",
" \"response\": \"${target.response}\",\n",
" \"context\": \"${data.context}\",\n",
" \"query\": \"${data.query}\",\n",
" },\n",
" },\n",
" },\n",
" )"
]
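
In the diff above, the data bindings for each evaluator now sit under a `column_mapping` key rather than directly on the evaluator name. A hedged sketch of how the surrounding `evaluate()` call plausibly fits together (the model config, target function, and dataset path are stand-ins, not the notebook's actual code):

```python
from azure.ai.evaluation import RelevanceEvaluator, evaluate

# Stand-in model config; the notebook builds its own from environment settings.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<your-deployment>",
}
relevance_evaluator = RelevanceEvaluator(model_config=model_config)

def target(query: str) -> dict:
    """Stand-in application target; its return keys surface as ${target.*} columns."""
    return {"response": f"echo: {query}"}

result = evaluate(
    data="data.jsonl",  # assumed to provide the ${data.*} columns referenced below
    target=target,
    evaluators={"relevance": relevance_evaluator},
    evaluator_config={
        "relevance": {
            "column_mapping": {
                "response": "${target.response}",  # produced by the target at run time
                "context": "${data.context}",      # read from the input dataset
                "query": "${data.query}",
            },
        },
    },
)
```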
@@ -8,16 +8,16 @@ products:
description: Evaluate with quantitative evaluators
---

## Evaluate with math evaluators
## Evaluate with quantitative NLP evaluators

### Overview

This notebook demonstrates how to use math-based evaluators to assess the quality of generated text by comparing it to reference text.
This notebook demonstrates how to use NLP-based evaluators to assess the quality of generated text by comparing it to reference text.

### Objective

The primary goal of this tutorial is to guide users in leveraging the `azure-ai-evaluation` SDK to evaluate datasets using various math metrics. By the end of this tutorial, you'll be able to:
- Understand different math evaluators such as `BleuScoreEvaluator`, `GleuScoreEvaluator`, `MeteorScoreEvaluator`, and `RougeScoreEvaluator`.
The primary goal of this tutorial is to guide users in leveraging the `azure-ai-evaluation` SDK to evaluate datasets using various NLP metrics. By the end of this tutorial, you'll be able to:
- Understand different NLP evaluators such as `BleuScoreEvaluator`, `GleuScoreEvaluator`, `MeteorScoreEvaluator`, and `RougeScoreEvaluator`.
- Evaluate dataset using these evaluators.
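
For a concrete feel of these metrics, here is a small, hedged sketch of calling a few of the evaluators directly on a single response/reference pair (the example strings are made up):

```python
# Hedged sketch: score one generated response against a reference with NLP metrics.
from azure.ai.evaluation import BleuScoreEvaluator, GleuScoreEvaluator, MeteorScoreEvaluator

response = "Paris is the capital of France."       # generated text (made-up example)
ground_truth = "The capital of France is Paris."   # reference text (made-up example)

for name, evaluator in [
    ("bleu", BleuScoreEvaluator()),
    ("gleu", GleuScoreEvaluator()),
    ("meteor", MeteorScoreEvaluator()),
]:
    # Each evaluator returns a dict keyed by its metric name, e.g. {"bleu_score": ...}.
    print(name, evaluator(response=response, ground_truth=ground_truth))
```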

### Programming Languages
@@ -4,11 +4,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluate with quantitative evaluators\n",
"# Evaluate with quantitative NLP evaluators\n",
"\n",
"## Objective\n",
"This notebook demonstrates how to use math-based evaluators to assess the quality of generated text by comparing it to reference text. By the end of this tutorial, you'll be able to:\n",
" - Understand different math evaluators such as `BleuScoreEvaluator`, `GleuScoreEvaluator`, `MeteorScoreEvaluator`, and `RougeScoreEvaluator`.\n",
"This notebook demonstrates how to use NLP-based evaluators to assess the quality of generated text by comparing it to reference text. By the end of this tutorial, you'll be able to:\n",
" - Understand different NLP evaluators such as `BleuScoreEvaluator`, `GleuScoreEvaluator`, `MeteorScoreEvaluator`, and `RougeScoreEvaluator`.\n",
" - Evaluate dataset using these evaluators.\n",
"\n",
"## Time\n",
@@ -34,7 +34,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Math Evaluators"
"## NLP Evaluators"
]
},
{
@@ -256,16 +256,24 @@
" \"similarity\": similarity_evaluator,\n",
" },\n",
" evaluator_config={\n",
" \"content_safety\": {\"query\": \"${data.query}\", \"response\": \"${target.response}\"},\n",
" \"coherence\": {\"response\": \"${target.response}\", \"query\": \"${data.query}\"},\n",
" \"relevance\": {\"response\": \"${target.response}\", \"context\": \"${data.context}\", \"query\": \"${data.query}\"},\n",
" \"content_safety\": {\"column_mapping\": {\"query\": \"${data.query}\", \"response\": \"${target.response}\"}},\n",
" \"coherence\": {\"column_mapping\": {\"response\": \"${target.response}\", \"query\": \"${data.query}\"}},\n",
" \"relevance\": {\n",
" \"column_mapping\": {\"response\": \"${target.response}\", \"context\": \"${data.context}\", \"query\": \"${data.query}\"}\n",
" },\n",
" \"groundedness\": {\n",
" \"response\": \"${target.response}\",\n",
" \"context\": \"${data.context}\",\n",
" \"query\": \"${data.query}\",\n",
" \"column_mapping\": {\n",
" \"response\": \"${target.response}\",\n",
" \"context\": \"${data.context}\",\n",
" \"query\": \"${data.query}\",\n",
" }\n",
" },\n",
" \"fluency\": {\n",
" \"column_mapping\": {\"response\": \"${target.response}\", \"context\": \"${data.context}\", \"query\": \"${data.query}\"}\n",
" },\n",
" \"similarity\": {\n",
" \"column_mapping\": {\"response\": \"${target.response}\", \"context\": \"${data.context}\", \"query\": \"${data.query}\"}\n",
" },\n",
" \"fluency\": {\"response\": \"${target.response}\", \"context\": \"${data.context}\", \"query\": \"${data.query}\"},\n",
" \"similarity\": {\"response\": \"${target.response}\", \"context\": \"${data.context}\", \"query\": \"${data.query}\"},\n",
" },\n",
")"
]
4 changes: 4 additions & 0 deletions scenarios/evaluate/evaluate_safety_risk/README.md
@@ -23,6 +23,10 @@ The main objective of this tutorial is to help users understand how to use the a
- Evaluate the generated dataset for Protected Material and Indirect Attack Jailbreak vulnerability
- Use Azure AI Content Safety filter prompts to mitigate found vulnerabilities

### Basic requirements

To use Azure AI Safety Evaluation for different scenarios (simulation, annotation, etc.), you need an **Azure AI Project**. You should provide the Azure AI project to run your safety evaluations or simulations with. First [create an Azure AI hub](https://learn.microsoft.com/en-us/azure/ai-studio/concepts/ai-resources), then [create an Azure AI project](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/create-projects?tabs=ai-studio). You **do not** need to provide your own LLM deployment, as the Azure AI Safety Evaluation service hosts adversarial models for both simulation and evaluation of harmful content and connects to them via your Azure AI project. Ensure that your Azure AI project is in one of the supported regions for your desired evaluation metric:

#### Region support for evaluations

| Region | Hate and unfairness, sexual, violent, self-harm, XPIA | Groundedness | Protected material |
| - | - | - | - |
| UK South | Will be deprecated 12/1/24 | no | no |
| East US 2 | yes | yes | yes |
| Sweden Central | yes | yes | no |
| US North Central | yes | no | no |
| France Central | yes | no | no |
| Switzerland West | yes | no | no |

For built-in quality and performance metrics, you connect your own deployment of LLMs and can therefore evaluate in any region your deployment is in.

#### Region support for adversarial simulation

| Region | Adversarial simulation |
| - | - |
| UK South | yes |
| East US 2 | yes |
| Sweden Central | yes |
| US North Central | yes |
| France Central | yes |
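
To make the project requirement concrete, here is a hedged sketch of wiring an Azure AI project into a built-in safety evaluator; the subscription, resource group, project name, and example strings are all placeholders:

```python
# Hedged sketch: run a service-hosted safety evaluator through your Azure AI project.
# All identifiers below are placeholders for your own Azure resources.
from azure.ai.evaluation import ViolenceEvaluator
from azure.identity import DefaultAzureCredential

azure_ai_project = {
    "subscription_id": "<your-subscription-id>",
    "resource_group_name": "<your-resource-group>",
    "project_name": "<your-ai-project-name>",
}

violence_evaluator = ViolenceEvaluator(
    credential=DefaultAzureCredential(),
    azure_ai_project=azure_ai_project,  # the hosted evaluation model is reached via this project
)

result = violence_evaluator(
    query="What is the capital of France?",
    response="Paris is the capital of France.",
)
print(result)  # severity label and score for the violence category
```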

### Programming Languages
- Python

4 changes: 4 additions & 0 deletions scenarios/evaluate/simulate_adversarial/README.md
@@ -24,4 +24,8 @@ By the end of this tutorial, you should be able to:
### Programming Languages
- Python

### Basic requirements

To use Azure AI Safety Evaluation for different scenarios (simulation, annotation, etc.), you need an **Azure AI Project**. You should provide the Azure AI project to run your safety evaluations or simulations with. First [create an Azure AI hub](https://learn.microsoft.com/en-us/azure/ai-studio/concepts/ai-resources), then [create an Azure AI project](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/create-projects?tabs=ai-studio). You **do not** need to provide your own LLM deployment, as the Azure AI Safety Evaluation service hosts adversarial models for both simulation and evaluation of harmful content and connects to them via your Azure AI project. Ensure that your Azure AI project is in one of the supported regions for your desired evaluation metric:

#### Region support for evaluations

| Region | Hate and unfairness, sexual, violent, self-harm, XPIA | Groundedness | Protected material |
| - | - | - | - |
| UK South | Will be deprecated 12/1/24 | no | no |
| East US 2 | yes | yes | yes |
| Sweden Central | yes | yes | no |
| US North Central | yes | no | no |
| France Central | yes | no | no |
| Switzerland West | yes | no | no |

For built-in quality and performance metrics, you connect your own deployment of LLMs and can therefore evaluate in any region your deployment is in.

#### Region support for adversarial simulation

| Region | Adversarial simulation |
| - | - |
| UK South | yes |
| East US 2 | yes |
| Sweden Central | yes |
| US North Central | yes |
| France Central | yes |

### Estimated Runtime: 20 mins
