New sample for Remote and Online Evaluation #149

Merged
merged 14 commits on Nov 18, 2024

41 changes: 41 additions & 0 deletions scenarios/evaluate/evaluate_online/README.md
@@ -0,0 +1,41 @@
---
page_type: sample
languages:
- python
products:
- ai-services
- azure-openai
description: Evaluating online
---

## Evaluating in the cloud on a schedule

### Overview

This tutorial provides a step-by-step guide on how to evaluate generative AI applications or LLMs on a schedule using online evaluation.

### Objective

The main objective of this tutorial is to help users understand the process of evaluating models in the cloud on a recurring schedule. This type of evaluation can be used to monitor LLMs and generative AI applications that have been deployed. By the end of this tutorial, you should be able to:

- Learn about evaluations
- Evaluate an LLM using various evaluators from the Azure AI Evaluation SDK online in the cloud (a minimal sketch follows below).
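
For orientation, the sketch below shows the overall flow covered in the accompanying `evaluate_online.ipynb` notebook. The placeholders stand in for your own resource values; see the notebook for the full KQL query and evaluator configuration.

```python
from azure.identity import DefaultAzureCredential
from azure.ai.project import AIProjectClient
from azure.ai.project.models import (
    ApplicationInsightsConfiguration,
    EvaluationSchedule,
    EvaluatorConfiguration,
    RecurrenceTrigger,
)
from azure.ai.evaluation import F1ScoreEvaluator

# Connect to your Azure AI project
project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str="<connection_string>",
)

# Evaluate traced LLM calls queried from Application Insights on a daily schedule
evaluation_schedule = EvaluationSchedule(
    data=ApplicationInsightsConfiguration(
        resource_id="<app_insights_resource_id>",
        query="<kusto_query>",
        service_name="<service_name>",
    ),
    evaluators={"f1_score": EvaluatorConfiguration(id=F1ScoreEvaluator.id)},
    trigger=RecurrenceTrigger(frequency="day", interval=1),
)
project_client.evaluations.create_or_replace_schedule("<schedule_name>", evaluation_schedule)
```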

### Note
All evaluators supported by [Azure AI Evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/concepts/evaluation-metrics-built-in?tabs=warning) are supported by Online Evaluation. For updated documentation, please see [Online Evaluation documentation](https://aka.ms/GenAIMonitoringDoc).

#### Region Support for Evaluations

| Region | Hate and Unfairness, Sexual, Violent, Self-Harm, XPIA | Groundedness Pro | Protected Material |
| - | - | - | - |
| UK South | Will be deprecated 12/1/24 | no | no |
| East US 2 | yes | yes | yes |
| Sweden Central | yes | yes | no |
| US North Central | yes | no | no |
| France Central | yes | no | no |
| Switzerland West | yes | no | no |

### Programming Languages
- Python

### Estimated Runtime: 30 mins
248 changes: 248 additions & 0 deletions scenarios/evaluate/evaluate_online/evaluate_online.ipynb
@@ -0,0 +1,248 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Online Evaluations: Evaluating in the Cloud on a Schedule\n",
"\n",
"## Objective\n",
"\n",
"This tutorial provides a step-by-step guide on how to evaluate data generated by LLMs online on a schedule. \n",
"\n",
"This tutorial uses the following Azure AI services:\n",
"\n",
"- [Azure AI Safety Evaluation](https://aka.ms/azureaistudiosafetyeval)\n",
"- [azure-ai-evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)\n",
"\n",
"## Time\n",
"\n",
"You should expect to spend 30 minutes running this sample. \n",
"\n",
"## About this example\n",
"\n",
"This example demonstrates the online evaluation of a LLM. It is important to have access to AzureOpenAI credentials and an AzureAI project. This example demonstrates: \n",
"\n",
"- Recurring, Online Evaluation (to be used to monitor LLMs once they are deployed)\n",
"\n",
"## Before you begin\n",
"### Prerequesite\n",
"- Configure resources to support Online Evaluation as per [Online Evaluation documentation](https://aka.ms/GenAIMonitoringDoc)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -U azure-identity\n",
"%pip install -U azure-ai-project\n",
"%pip install -U azure-ai-evaluation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.project import AIProjectClient\n",
"from azure.identity import DefaultAzureCredential\n",
"from azure.ai.project.models import (\n",
" ApplicationInsightsConfiguration,\n",
" EvaluatorConfiguration,\n",
" ConnectionType,\n",
" EvaluationSchedule,\n",
" RecurrenceTrigger,\n",
")\n",
"from azure.ai.evaluation import F1ScoreEvaluator, ViolenceEvaluator"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Connect to your Azure Open AI deployment\n",
"To evaluate your LLM-generated data remotely in the cloud, we must connect to your Azure Open AI deployment. This deployment must be a GPT model which supports `chat completion`, such as `gpt-4`. To see the proper value for `conn_str`, navigate to the connection string at the \"Project Overview\" page for your Azure AI project. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"project_client = AIProjectClient.from_connection_string(\n",
" credential=DefaultAzureCredential(),\n",
" conn_str=\"<connection_string>\", # At the moment, it should be in the format \"<Region>.api.azureml.ms;<AzureSubscriptionId>;<ResourceGroup>;<HubName>\" Ex: eastus2.api.azureml.ms;xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxx;rg-sample;sample-project-eastus2\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Please see [Online Evaluation documentation](https://aka.ms/GenAIMonitoringDoc) for configuration of Application Insights. `service_name` is a unique name you provide to define your Generative AI application and identify it within your Application Insights resource. This property will be logged in the `traces` table in Application Insights and can be found in the `customDimensions[\"service.name\"]` field. `evaluation_name` is a unique name you provide for your Online Evaluation schedule. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your Application Insights resource ID\n",
"# At the moment, it should be something in the format \"/subscriptions/<AzureSubscriptionId>/resourceGroups/<ResourceGroup>/providers/Microsoft.Insights/components/<ApplicationInsights>\"\"\n",
"app_insights_resource_id = \"<app_insights_resource_id>\"\n",
"\n",
"# Name of your generative AI application (will be available in trace data in Application Insights)\n",
"service_name = \"<service_name>\"\n",
"\n",
"# Name of your online evaluation schedule\n",
"evaluation_name = \"<evaluation_name>\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is the Kusto Query Language (KQL) query to query data from Application Insights resource. This query is compatible with data logged by the Azure AI Inferencing Tracing SDK (linked in [documentation](https://aka.ms/GenAIMonitoringDoc)). You can modify it depending on your data schema. The KQL query must output several columns: `operation_ID`, `operation_ParentID`, and `gen_ai_response_id`. You can choose which other columns to output as required by the evaluators you are using."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"kusto_query = 'let gen_ai_spans=(dependencies | where isnotnull(customDimensions[\"gen_ai.system\"]) | extend response_id = tostring(customDimensions[\"gen_ai.response.id\"]) | project id, operation_Id, operation_ParentId, timestamp, response_id); let gen_ai_events=(traces | where message in (\"gen_ai.choice\", \"gen_ai.user.message\", \"gen_ai.system.message\") or tostring(customDimensions[\"event.name\"]) in (\"gen_ai.choice\", \"gen_ai.user.message\", \"gen_ai.system.message\") | project id= operation_ParentId, operation_Id, operation_ParentId, user_input = iff(message == \"gen_ai.user.message\" or tostring(customDimensions[\"event.name\"]) == \"gen_ai.user.message\", parse_json(iff(message == \"gen_ai.user.message\", tostring(customDimensions[\"gen_ai.event.content\"]), message)).content, \"\"), system = iff(message == \"gen_ai.system.message\" or tostring(customDimensions[\"event.name\"]) == \"gen_ai.system.message\", parse_json(iff(message == \"gen_ai.system.message\", tostring(customDimensions[\"gen_ai.event.content\"]), message)).content, \"\"), llm_response = iff(message == \"gen_ai.choice\", parse_json(tostring(parse_json(tostring(customDimensions[\"gen_ai.event.content\"])).message)).content, iff(tostring(customDimensions[\"event.name\"]) == \"gen_ai.choice\", parse_json(parse_json(message).message).content, \"\")) | summarize operation_ParentId = any(operation_ParentId), Input = maxif(user_input, user_input != \"\"), System = maxif(system, system != \"\"), Output = maxif(llm_response, llm_response != \"\") by operation_Id, id); gen_ai_spans | join kind=inner (gen_ai_events) on id, operation_Id | project Input, System, Output, operation_Id, operation_ParentId, gen_ai_response_id = response_id'\n",
"\n",
"# AzureMSIClientId is the clientID of the User-assigned managed identity created during set-up - see documentation for how to find it\n",
"properties = {\"AzureMSIClientId\": \"your_client_id\"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Connect to your Application Insights resource\n",
"app_insights_config = ApplicationInsightsConfiguration(\n",
" resource_id=app_insights_resource_id, query=kusto_query, service_name=service_name\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Connect to your AOAI resource, you must use an AOAI GPT model\n",
"deployment_name = \"gpt-4\"\n",
"api_version = \"2024-06-01\"\n",
"default_connection = project_client.connections.get_default(connection_type=ConnectionType.AZURE_OPEN_AI)\n",
"model_config = default_connection.to_evaluator_model_config(deployment_name=deployment_name, api_version=api_version)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure Evaluators to Run\n",
"The code below demonstrates how to configure the evaluators you want to run. In this example, we use the `F1ScoreEvaluator`, `RelevanceEvaluator` and the `ViolenceEvaluator`, but all evaluators supported by [Azure AI Evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/concepts/evaluation-metrics-built-in?tabs=warning) are supported by Online Evaluation and can be configured here. You can either import the classes from the SDK and reference them with the `.id` property, or you can find the fully formed `id` of the evaluator in the AI Studio registry of evaluators, and use it here. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# id for each evaluator can be found in your AI Studio registry - please see documentation for more information\n",
"# init_params is the configuration for the model to use to perform the evaluation\n",
"# data_mapping is used to map the output columns of your query to the names required by the evaluator\n",
"evaluators = {\n",
" \"f1_score\": EvaluatorConfiguration(\n",
" id=F1ScoreEvaluator.id,\n",
" ),\n",
" \"relevance\": EvaluatorConfiguration(\n",
" id=\"azureml://registries/azureml-staging/models/Relevance-Evaluator/versions/4\",\n",
" init_params={\"model_config\": model_config},\n",
" data_mapping={\"query\": \"${data.Input}\", \"response\": \"${data.Output}\"},\n",
" ),\n",
" \"violence\": EvaluatorConfiguration(\n",
" id=ViolenceEvaluator.id,\n",
" init_params={\"azure_ai_project\": project_client.scope},\n",
" data_mapping={\"query\": \"${data.Input}\", \"response\": \"${data.Output}\"},\n",
" ),\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Evaluate in the Cloud on a Schedule with Online Evaluation\n",
"\n",
"You can configure the `RecurrenceTrigger` based on the class definition [here](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.entities.recurrencetrigger?view=azure-python)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Frequency to run the schedule\n",
"recurrence_trigger = RecurrenceTrigger(frequency=\"day\", interval=1)\n",
"\n",
"# Configure the online evaluation schedule\n",
"evaluation_schedule = EvaluationSchedule(\n",
" data=app_insights_config,\n",
" evaluators=evaluators,\n",
" trigger=recurrence_trigger,\n",
" description=f\"{service_name} evaluation schedule\",\n",
" properties=properties,\n",
")\n",
"\n",
"# Create the online evaluation schedule\n",
"created_evaluation_schedule = project_client.evaluations.create_or_replace_schedule(service_name, evaluation_schedule)\n",
"print(\n",
" f\"Successfully submitted the online evaluation schedule creation request - {created_evaluation_schedule.name}, currently in {created_evaluation_schedule.provisioning_state} state.\"\n",
")"
]
},
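{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check the schedule status (optional)\n",
"\n",
"The cell below is a minimal sketch for inspecting the schedule you just created. It assumes the preview SDK exposes a `get_schedule` operation on `project_client.evaluations` alongside `create_or_replace_schedule`; verify the exact operation names in the SDK reference for your installed version."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# NOTE: assumed preview API surface - confirm that get_schedule exists in your SDK version\n",
"fetched_schedule = project_client.evaluations.get_schedule(created_evaluation_schedule.name)\n",
"print(f\"Schedule {fetched_schedule.name} is currently in {fetched_schedule.provisioning_state} state.\")"
]
},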
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Next steps \n",
"\n",
"Navigate to the \"Tracing\" tab in [Azure AI Studio](https://ai.azure.com/) to view your logged trace data alongside the evaluations produced by the Online Evaluation schedule. You can use the reference link provided in the \"Tracing\" tab to navigate to a comprehensive workbook in Application Insights for more details on how your application is performing. "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "azureai-samples313",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
41 changes: 41 additions & 0 deletions scenarios/evaluate/evaluate_remotely/README.md
@@ -0,0 +1,41 @@
---
page_type: sample
languages:
- python
products:
- ai-services
- azure-openai
description: Evaluating remotely
---

## Evaluating in the cloud

### Overview

This tutorial provides a step-by-step guide on how to evaluate generative AI applications or LLMs remotely in the cloud using a triggered evaluation.

### Objective

The main objective of this tutorial is to help users understand the process of evaluating models remotely in the cloud by triggering an evaluation. This type of evaluation can be used for pre-deployment testing. By the end of this tutorial, you should be able to:

- Learn about evaluations
- Evaluate an LLM using various evaluators from the Azure AI Evaluation SDK remotely in the cloud (see the sketch below).
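
The sketch below shows the general shape of a triggered cloud evaluation. It assumes the preview `azure-ai-project` SDK exposes `Evaluation` and `Dataset` models and a `project_client.evaluations.create` operation, and that your evaluation dataset has already been uploaded to the project; check the Azure AI Evaluation SDK documentation for the exact API surface of your installed version.

```python
from azure.identity import DefaultAzureCredential
from azure.ai.project import AIProjectClient
# NOTE: Evaluation and Dataset are assumed model names - verify them against your SDK version
from azure.ai.project.models import Evaluation, Dataset, EvaluatorConfiguration
from azure.ai.evaluation import F1ScoreEvaluator

# Connect to your Azure AI project
project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str="<connection_string>",
)

# Reference a JSONL dataset already uploaded to your Azure AI project
evaluation = Evaluation(
    display_name="Remote evaluation",
    description="Pre-deployment evaluation of generated data",
    data=Dataset(id="<dataset_id>"),
    evaluators={"f1_score": EvaluatorConfiguration(id=F1ScoreEvaluator.id)},
)

# Trigger the evaluation; results appear in your Azure AI project
evaluation_response = project_client.evaluations.create(evaluation)
```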

### Note
Remote evaluations do not support `Retrieval-Evaluator`, `ContentSafetyEvaluator`, and `QAEvaluator`.

#### Region Support for Evaluations

| Region | Hate and Unfairness, Sexual, Violent, Self-Harm, XPIA | Groundedness Pro | Protected Material |
| - | - | - | - |
| UK South | Will be deprecated 12/1/24 | no | no |
| East US 2 | yes | yes | yes |
| Sweden Central | yes | yes | no |
| US North Central | yes | no | no |
| France Central | yes | no | no |
| Switzerland West | yes | no | no |

### Programming Languages
- Python

### Estimated Runtime: 20 mins