diff --git a/scenarios/evaluate/simulate_adversarial/README.md b/scenarios/evaluate/simulate_adversarial/README.md
index c3193796..ff039bbb 100644
--- a/scenarios/evaluate/simulate_adversarial/README.md
+++ b/scenarios/evaluate/simulate_adversarial/README.md
@@ -26,6 +26,28 @@ By the end of this tutorial, you should be able to:
 
 ### Basic requirements
 
-To use Azure AI Safety Evaluation for different scenarios(simulation, annotation, etc..), you need an **Azure AI Project.** You should provide Azure AI project to run your safety evaluations or simulations with. First[create an Azure AI hub](https://learn.microsoft.com/en-us/azure/ai-studio/concepts/ai-resources)then [create an Azure AI project]( https://learn.microsoft.com/en-us/azure/ai-studio/how-to/create-projects?tabs=ai-studio).You **do not** need to provide your own LLM deployment as the Azure AI Safety Evaluation servicehosts adversarial models for both simulation and evaluation of harmful content andconnects to it via your Azure AI project.Ensure that your Azure AI project is in one of the supported regions for your desiredevaluation metric:#### Region support for evaluations| Region | Hate and unfairness, sexual, violent, self-harm, XPIA | Groundedness | Protected material || - | - | - | - ||UK South | Will be deprecated 12/1/24| no | no ||East US 2 | yes| yes | yes ||Sweden Central | yes| yes | no|US North Central | yes| no | no ||France Central | yes| no | no ||SwitzerlandWest| yes | no |no|For built-in quality and performance metrics, connect your own deployment of LLMs and therefore youcan evaluate in any region your deployment is in.#### Region support for adversarial simulation| Region | Adversarial simulation || - | - ||UK South | yes||East US 2 | yes||Sweden Central | yes||US North Central | yes||France Central | yes|
+To use Azure AI Safety Evaluation for different scenarios (simulation, annotation, etc.), you need an **Azure AI Project.** You should provide an Azure AI project to run your safety evaluations or simulations with. First, [create an Azure AI hub](https://learn.microsoft.com/en-us/azure/ai-studio/concepts/ai-resources) then [create an Azure AI project](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/create-projects?tabs=ai-studio). You **do not** need to provide your own LLM deployment as the Azure AI Safety Evaluation service hosts adversarial models for both simulation and evaluation of harmful content and connects to it via your Azure AI project. Ensure that your Azure AI project is in one of the supported regions for your desired evaluation metric:
+
+#### Region support for evaluations
+
+| Region | Hate and unfairness, sexual, violent, self-harm, XPIA | Groundedness | Protected material |
+| - | - | - | - |
+| UK South | Will be deprecated 12/1/24 | no | no |
+| East US 2 | yes | yes | yes |
+| Sweden Central | yes | yes | no |
+| US North Central | yes | no | no |
+| France Central | yes | no | no |
+| Switzerland West | yes | no | no |
+
+For built-in quality and performance metrics, connect your own deployment of LLMs and therefore you can evaluate in any region your deployment is in.
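+
+For example, the project is supplied to the SDK as a plain dictionary. The snippet below is a minimal sketch of constructing an adversarial simulator against that project; the placeholder values are illustrative, and it assumes the `AdversarialSimulator` class from the `azure-ai-evaluation` package and `DefaultAzureCredential` from `azure-identity`:
+
+```python
+from azure.identity import DefaultAzureCredential
+from azure.ai.evaluation.simulator import AdversarialSimulator
+
+# Scope of the Azure AI project that hosts the safety evaluation service
+azure_ai_project = {
+    "subscription_id": "<your-subscription-id>",
+    "resource_group_name": "<your-resource-group>",
+    "project_name": "<your-ai-project-name>",
+}
+
+# The simulator authenticates to the project with your Azure credential
+simulator = AdversarialSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())
+```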
+ +#### Region support for adversarial simulation +| Region | Adversarial simulation | +| - | - | +|UK South | yes| +|East US 2 | yes| +|Sweden Central | yes| +|US North Central | yes| +|France Central | yes| ### Estimated Runtime: 20 mins \ No newline at end of file diff --git a/scenarios/evaluate/simulate_evaluate_groundedness/README.md b/scenarios/evaluate/simulate_evaluate_groundedness/README.md new file mode 100644 index 00000000..98e1a43d --- /dev/null +++ b/scenarios/evaluate/simulate_evaluate_groundedness/README.md @@ -0,0 +1,53 @@ +--- +page_type: sample +languages: +- python +products: +- ai-services +- azure-openai +description: Simulator and evaluator for assessing groundedness in custom applications using adversarial questions +--- + +## Simulator and Evaluator for Groundedness (simulate_evaluate_groundedness.ipynb) + +### Overview + +This tutorial provides a step-by-step guide on how to use the simulator and evaluator to assess the groundedness of responses in a custom application. + +### Objective + +The main objective of this tutorial is to help users understand the process of creating and using a simulator and evaluator to test the groundedness of responses in a custom application. By the end of this tutorial, you should be able to: +- Use the simulator to generate adversarial questions +- Run the evaluator to assess the groundedness of the responses + +### Programming Languages +- Python + +### Basic Requirements + +To use Azure AI Safety Evaluation for different scenarios (simulation, annotation, etc.), you need an **Azure AI Project.** You should provide an Azure AI project to run your safety evaluations or simulations with. First, [create an Azure AI hub](https://learn.microsoft.com/en-us/azure/ai-studio/concepts/ai-resources) then [create an Azure AI project](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/create-projects?tabs=ai-studio). You **do not** need to provide your own LLM deployment as the Azure AI Safety Evaluation service hosts adversarial models for both simulation and evaluation of harmful content and connects to it via your Azure AI project. Ensure that your Azure AI project is in one of the supported regions for your desired evaluation metric: + +#### Region Support for Evaluations + +| Region | Hate and Unfairness, Sexual, Violent, Self-Harm, XPIA | Groundedness | Protected Material | +| - | - | - | - | +| UK South | Will be deprecated 12/1/24 | no | no | +| East US 2 | yes | yes | yes | +| Sweden Central | yes | yes | no | +| US North Central | yes | no | no | +| France Central | yes | no | no | +| Switzerland West | yes | no | no | + +For built-in quality and performance metrics, connect your own deployment of LLMs and therefore you can evaluate in any region your deployment is in. 
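+
+For example, the project scope and model configuration are passed to the SDK as plain dictionaries. The snippet below is a minimal sketch of the evaluation step from the accompanying notebook; the placeholder values are illustrative, and `ground_sim_output.jsonl` is the query/response file written by the simulator in the notebook:
+
+```python
+from azure.ai.evaluation import GroundednessEvaluator, evaluate
+
+# Scope of the Azure AI project used to log the evaluation
+azure_ai_project = {
+    "subscription_id": "<your-subscription-id>",
+    "resource_group_name": "<your-resource-group>",
+    "project_name": "<your-ai-project-name>",
+}
+
+# Azure OpenAI deployment used by the AI-assisted groundedness metric
+model_config = {
+    "azure_endpoint": "<your-azure-openai-endpoint>",
+    "azure_deployment": "<your-deployment-name>",
+}
+
+groundedness_evaluator = GroundednessEvaluator(model_config=model_config)
+results = evaluate(
+    data="ground_sim_output.jsonl",  # query/response pairs written by the simulator
+    evaluators={"groundedness": groundedness_evaluator},
+    azure_ai_project=azure_ai_project,
+)
+print(results)
+```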
+
+#### Region Support for Adversarial Simulation
+
+| Region | Adversarial Simulation |
+| - | - |
+| UK South | yes |
+| East US 2 | yes |
+| Sweden Central | yes |
+| US North Central | yes |
+| France Central | yes |
+
+### Estimated Runtime: 20 mins
\ No newline at end of file
diff --git a/scenarios/evaluate/simulate_evaluate_groundedness/simulate_evaluate_groundedness.ipynb b/scenarios/evaluate/simulate_evaluate_groundedness/simulate_evaluate_groundedness.ipynb
new file mode 100644
index 00000000..7cdb54af
--- /dev/null
+++ b/scenarios/evaluate/simulate_evaluate_groundedness/simulate_evaluate_groundedness.ipynb
@@ -0,0 +1,326 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Evaluating Model Groundedness with Azure AI Evaluation SDK\n",
+    "\n",
+    "This notebook simulates and evaluates the groundedness of a model endpoint's responses using the Azure AI Evaluation SDK. Groundedness refers to the extent to which the responses generated by a model are based on reliable and verifiable information. Ensuring that a model's outputs are grounded is crucial for maintaining the accuracy and trustworthiness of AI systems.\n",
+    "\n",
+    "In this notebook, we will:\n",
+    "\n",
+    "1. Set up the Azure AI Evaluation SDK.\n",
+    "2. Define the dataset for evaluating groundedness, which will vary based on the specific use case of your model.\n",
+    "3. Simulate the model endpoint and generate responses.\n",
+    "4. Evaluate the groundedness of the model's responses using the Azure AI Evaluation SDK.\n",
+    "\n",
+    "The dataset used for evaluating groundedness will be tailored to the particular application of your model. For instance, if your model is designed for customer support, the dataset might consist of common customer queries and the corresponding accurate responses. If your model is used for medical diagnosis, the dataset would include medical cases and verified diagnostic information.\n",
+    "\n",
+    "By the end of this notebook, you will have a clear understanding of how to assess the groundedness of your model's outputs and ensure that they are based on solid and reliable information.\n",
+    "\n",
+    "This tutorial uses the following Azure AI services:\n",
+    "\n",
+    "- [azure-ai-evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)\n",
+    "\n",
+    "## Time\n",
+    "\n",
+    "You should expect to spend 30 minutes running this sample. \n",
+    "\n",
+    "## About this example\n",
+    "\n",
+    "This example demonstrates how to evaluate a model endpoint's responses against provided prompts using azure-ai-evaluation.\n",
+    "\n",
+    "## Before you begin\n",
+    "\n",
+    "### Installation\n",
+    "\n",
+    "Install the following packages required to execute this notebook. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install azure-ai-evaluation --upgrade\n",
+    "%pip install promptflow-azure\n",
+    "%pip install azure-identity"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Parameters and imports"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Import the packages used in this notebook and set the environment variables for your Azure AI project and Azure OpenAI deployment; these values are used to build the project scope and model configuration below. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "from typing import Any, Dict, List, Optional\n", + "import json\n", + "from pathlib import Path\n", + "\n", + "from azure.ai.evaluation import evaluate\n", + "from azure.ai.evaluation import GroundednessEvaluator\n", + "from azure.ai.evaluation.simulator import Simulator\n", + "from openai import AzureOpenAI\n", + "import importlib.resources as pkg_resources\n", + "from azure.identity import DefaultAzureCredential, get_bearer_token_provider" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "os.environ[\"AZURE_SUBSCRIPTION_ID\"] = \"\"\n", + "os.environ[\"RESOURCE_GROUP\"] = \"\"\n", + "os.environ[\"PROJECT_NAME\"] = \"\"\n", + "os.environ[\"AZURE_OPENAI_ENDPOINT\"] = \"\"\n", + "os.environ[\"AZURE_DEPLOYMENT_NAME\"] = \"\"\n", + "os.environ[\"AZURE_API_VERSION\"] = \"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "project_scope = {\n", + " \"subscription_id\": os.environ.get(\"AZURE_SUBSCRIPTION_ID\"),\n", + " \"resource_group_name\": os.environ.get(\"RESOURCE_GROUP\"),\n", + " \"project_name\": os.environ.get(\"PROJECT_NAME\"),\n", + "}\n", + "\n", + "model_config = {\n", + " \"azure_endpoint\": os.environ.get(\"AZURE_OPENAI_ENDPOINT\"),\n", + " \"azure_deployment\": os.environ.get(\"AZURE_DEPLOYMENT_NAME\"),\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data\n", + "Here we define the data, `grounding.json` on which we will simulate query and response pairs to help us evaluate the groundedness of our model's responses. Based on the use case of your model, the data you use to evaluate groundedness might differ. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "resource_name = \"grounding.json\"\n", + "package = \"azure.ai.evaluation.simulator._data_sources\"\n", + "conversation_turns = []\n", + "\n", + "with pkg_resources.path(package, resource_name) as grounding_file, Path.open(grounding_file, \"r\") as file:\n", + " data = json.load(file)\n", + "\n", + "for item in data:\n", + " conversation_turns.append([item])\n", + " if len(conversation_turns) == 2:\n", + " break" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Target Endpoint\n", + "\n", + "We will use Evaluate API provided by Azure AI Evaluations SDK. It requires a target Application or python Function, which handles a call to LLMs and retrieve responses. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def example_application_response(query: str, context: str) -> str:\n", + " deployment = os.environ.get(\"AZURE_DEPLOYMENT_NAME\")\n", + " endpoint = os.environ.get(\"AZURE_OPENAI_ENDPOINT\")\n", + " token_provider = get_bearer_token_provider(DefaultAzureCredential(), \"https://cognitiveservices.azure.com/.default\")\n", + "\n", + " # Get a client handle for the AOAI model\n", + " client = AzureOpenAI(\n", + " azure_endpoint=endpoint,\n", + " api_version=os.environ.get(\"AZURE_API_VERSION\"),\n", + " azure_ad_token_provider=token_provider,\n", + " )\n", + "\n", + " # Prepare the messages\n", + " messages = [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": f\"You are a user assistant who helps answer questions based on some context.\\n\\nContext: '{context}'\",\n", + " },\n", + " {\"role\": \"user\", \"content\": query},\n", + " ]\n", + " # Call the model\n", + " completion = client.chat.completions.create(\n", + " model=deployment,\n", + " messages=messages,\n", + " max_tokens=800,\n", + " temperature=0.7,\n", + " top_p=0.95,\n", + " frequency_penalty=0,\n", + " presence_penalty=0,\n", + " stop=None,\n", + " stream=False,\n", + " )\n", + "\n", + " message = completion.to_dict()[\"choices\"][0][\"message\"]\n", + " if isinstance(message, dict):\n", + " message = message[\"content\"]\n", + " return message" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Run the simulator\n", + "\n", + "The interactions between your endpoint (in this case, `example_application_response`) and the simulator is managed by a callback method, `custom_simulator_callback` and this method is used to format the request to your endpoint and the response from the endpoint." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "async def custom_simulator_callback(\n", + " messages: List[Dict],\n", + " stream: bool = False,\n", + " session_state: Optional[str] = None,\n", + " context: Optional[Dict[str, Any]] = None,\n", + ") -> dict:\n", + " messages_list = messages[\"messages\"]\n", + " # get last message\n", + " latest_message = messages_list[-1]\n", + " application_input = latest_message[\"content\"]\n", + " context = latest_message.get(\"context\", None)\n", + " # call your endpoint or ai application here\n", + " response = example_application_response(query=application_input, context=context)\n", + " # we are formatting the response to follow the openAI chat protocol format\n", + " message = {\n", + " \"content\": response,\n", + " \"role\": \"assistant\",\n", + " \"context\": context,\n", + " }\n", + " messages[\"messages\"].append(message)\n", + " return {\"messages\": messages[\"messages\"], \"stream\": stream, \"session_state\": session_state, \"context\": context}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "custom_simulator = Simulator(model_config=model_config)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "outputs = await custom_simulator(\n", + " target=custom_simulator_callback,\n", + " conversation_turns=conversation_turns,\n", + " max_conversation_turns=1,\n", + " concurrent_async_tasks=10,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Convert the outputs to a format that can be evaluated" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "output_file = \"ground_sim_output.jsonl\"\n", + "with Path.open(output_file, \"w\") as file:\n", + " for output in outputs:\n", + " file.write(output.to_eval_qr_json_lines())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "## Run the evaluation\n", + "\n", + "In this section, we will run the evaluation using the `GroundednessEvaluator` and the `evaluate` function from the Azure AI Evaluation SDK. The evaluation will assess the groundedness of the model's responses based on the dataset produced by the `Simulator` above." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "groundedness_evaluator = GroundednessEvaluator(model_config=model_config)\n", + "eval_output = evaluate(\n", + " data=output_file,\n", + " evaluators={\n", + " \"groundedness\": groundedness_evaluator,\n", + " },\n", + " azure_ai_project=project_scope,\n", + ")\n", + "print(eval_output)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}