Converter from AI Service threads/runs to evaluator-compatible schema #40047

thecsw · 2025-03-12T18:14:16Z

Description

This pull request introduces a new converter, AIAgentConverter, that translates agentic conversations into an evaluator-friendly schema. Given the thread_id and run_id of a conversation from AI Projects/Services, it converts that interaction into a schema compatible with evaluator SDKs, OpenAI schemas, etc.

Intended usage pattern:

from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import AIAgentConverter

# Create your instance to talk to AI services.
project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str="<CONNECTION_STRING>"
)

# Initialize the converter that will be backed by the project.
converter = AIAgentConverter(project_client)

# These can be retrieved from the UI or after running `project_client.agents.create_run`
thread_id = "<THREAD_ID>"
run_id = "<RUN_ID>"

# Convert the agent run to a format suitable for the OpenAI API.
converted_data: dict = converter.convert(thread_id, run_id)

This returns a dictionary that captures the system message, user messages, tool interactions, and the assistant's response.

The top-level structure is as follows:

query (List[dict]): the conversation history and context, named query for backward compatibility with Azure AI Foundry evaluations. It holds every interaction in the thread that precedes the requested run's agent response: the system message, earlier assistant messages (including tool calls and tool results), and user queries. Retrieving previous runs' tool calls incurs a performance hit, which can be avoided by setting exclude_tool_calls_previous_runs=True.
response (List[dict]): the assistant's response, containing any number of tool calls, their results, and the assistant's formatted responses.
tool_definitions (List[dict]): definitions of the tools that the assistant can use in this run.
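As a rough illustration of the three top-level keys, the sketch below walks a converted dictionary and reports the roles in query and response along with the declared tool names. The helper summarize_converted and the sample data are hypothetical and not part of this PR; the sketch only assumes the key names described above.

```python
from typing import Any, Dict, List


def summarize_converted(converted: Dict[str, List[Dict[str, Any]]]) -> Dict[str, List[str]]:
    """Hypothetical helper: list message roles and tool names in a converted run."""
    return {
        # Everything that happened before the requested run's response.
        "query_roles": [m["role"] for m in converted.get("query", [])],
        # Tool calls, tool results, and assistant text produced by the run.
        "response_roles": [m["role"] for m in converted.get("response", [])],
        # Names of the tools the assistant was allowed to use.
        "tool_names": [t["name"] for t in converted.get("tool_definitions", [])],
    }


# Hand-written sample in the documented shape (not real service output).
sample = {
    "query": [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": [{"type": "text", "text": "Hi"}]},
    ],
    "response": [
        {"role": "assistant", "content": [{"type": "text", "text": "Hello!"}]},
    ],
    "tool_definitions": [
        {"name": "fetch_weather", "description": "Fetches the weather.", "parameters": {}},
    ],
}

summary = summarize_converted(sample)
print(summary)
```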

Similarly, there is an API that reads a local file offline and produces the same output:

from azure.ai.evaluation import AIAgentConverter

converted_data: dict = AIAgentConverter.convert_from_file("FILE_PATH", "<RUN_ID>")

Here is a non-trivial example with multiple tools, multiple tool calls and results, and multiple turns in the conversation:

{
  "query": [
    {
      "role": "system",
      "content": "You are a helpful assistant"
    },
    {
      "createdAt": "2025-03-12T20:03:19Z",
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Hello, what is the weather in New York? Make sure to give it to me in Fahrenheit."
        }
      ]
    },
    {
      "createdAt": "2025-03-12T20:03:21Z",
      "run_id": "run_drO1YX3nw1GLp0ZYtIi2nGb3",
      "role": "assistant",
      "content": [
        {
          "type": "tool_call",
          "tool_call": {
            "id": "call_bWZiARx69AevDyjpjP1o8vkx",
            "type": "function",
            "function": {
              "name": "fetch_weather",
              "arguments": {
                "location": "New York"
              }
            }
          }
        }
      ]
    },
    {
      "createdAt": "2025-03-12T20:03:22Z",
      "run_id": "run_drO1YX3nw1GLp0ZYtIi2nGb3",
      "tool_call_id": "call_bWZiARx69AevDyjpjP1o8vkx",
      "role": "tool",
      "content": [
        {
          "type": "tool_result",
          "tool_result": {
            "weather": "Sunny, 25°C"
          }
        }
      ]
    },
    {
      "createdAt": "2025-03-12T20:03:26Z",
      "run_id": "run_drO1YX3nw1GLp0ZYtIi2nGb3",
      "role": "assistant",
      "content": [
        {
          "type": "tool_call",
          "tool_call": {
            "id": "call_oBJAKB3vLZ5wpMkCg2kfLm2F",
            "type": "function",
            "function": {
              "name": "in_farenheit",
              "arguments": {
                "celsius": 25
              }
            }
          }
        }
      ]
    },
    {
      "createdAt": "2025-03-12T20:03:28Z",
      "run_id": "run_drO1YX3nw1GLp0ZYtIi2nGb3",
      "tool_call_id": "call_oBJAKB3vLZ5wpMkCg2kfLm2F",
      "role": "tool",
      "content": [
        {
          "type": "tool_result",
          "tool_result": "77.0°F"
        }
      ]
    },
    {
      "createdAt": "2025-03-12T20:03:29Z",
      "run_id": "run_drO1YX3nw1GLp0ZYtIi2nGb3",
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "The current weather in New York is sunny with a temperature of 77°F."
        }
      ]
    },
    {
      "createdAt": "2025-03-12T20:03:32Z",
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Actually, nevermind, I need the weather in Tokyo. Since they use Celsius, I want it in Celsius too."
        }
      ]
    }
  ],
  "response": [
    {
      "createdAt": "2025-03-12T20:03:33Z",
      "run_id": "run_Y1jbkKYmw5HWtwmK12yfi0Jb",
      "role": "assistant",
      "content": [
        {
          "type": "tool_call",
          "tool_call": {
            "id": "call_eYtq7fMyHxDWIgeG2s26h0lJ",
            "type": "function",
            "function": {
              "name": "fetch_weather",
              "arguments": {
                "location": "Tokyo"
              }
            }
          }
        }
      ]
    },
    {
      "createdAt": "2025-03-12T20:03:34Z",
      "run_id": "run_Y1jbkKYmw5HWtwmK12yfi0Jb",
      "tool_call_id": "call_eYtq7fMyHxDWIgeG2s26h0lJ",
      "role": "tool",
      "content": [
        {
          "type": "tool_result",
          "tool_result": {
            "weather": "Rainy, 22°C"
          }
        }
      ]
    },
    {
      "createdAt": "2025-03-12T20:03:35Z",
      "run_id": "run_Y1jbkKYmw5HWtwmK12yfi0Jb",
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "The current weather in Tokyo is rainy with a temperature of 22°C."
        }
      ]
    }
  ],
  "tool_definitions": [
    {
      "name": "in_farenheit",
      "description": "Converts Celsius to Fahrenheit.",
      "parameters": {
        "type": "object",
        "properties": {
          "celsius": {
            "type": "number",
            "description": "The temperature in Celsius."
          }
        }
      }
    },
    {
      "name": "fetch_weather",
      "description": "Fetches the weather information for the specified location.",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string",
            "description": "The location to fetch weather for."
          }
        }
      }
    }
  ]
}
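One way a consumer might read this output is to match each tool call to its result through tool_call_id. The sketch below does that for the response portion of the example; pair_tool_calls and _content_items are hypothetical helpers written for illustration, and they assume only the message shapes shown in the JSON above.

```python
from typing import Any, Dict, List, Tuple


def _content_items(message: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return list-style content; string or missing content yields no items."""
    content = message.get("content")
    return content if isinstance(content, list) else []


def pair_tool_calls(messages: List[Dict[str, Any]]) -> List[Tuple[str, Any, Any]]:
    """Hypothetical helper: return (tool name, arguments, result) per answered call."""
    # First pass: index every tool call by its id.
    calls: Dict[str, Dict[str, Any]] = {}
    for message in messages:
        if message.get("role") != "assistant":
            continue
        for item in _content_items(message):
            if item.get("type") == "tool_call":
                call = item["tool_call"]
                calls[call["id"]] = call["function"]

    # Second pass: attach each tool result to the call it answers.
    paired: List[Tuple[str, Any, Any]] = []
    for message in messages:
        if message.get("role") != "tool":
            continue
        function = calls.get(message.get("tool_call_id"))
        if function is None:
            continue
        for item in _content_items(message):
            if item.get("type") == "tool_result":
                paired.append((function["name"], function["arguments"], item["tool_result"]))
    return paired


# The "response" list from the example above, trimmed to the relevant fields.
response = [
    {
        "role": "assistant",
        "content": [{
            "type": "tool_call",
            "tool_call": {
                "id": "call_eYtq7fMyHxDWIgeG2s26h0lJ",
                "type": "function",
                "function": {"name": "fetch_weather", "arguments": {"location": "Tokyo"}},
            },
        }],
    },
    {
        "role": "tool",
        "tool_call_id": "call_eYtq7fMyHxDWIgeG2s26h0lJ",
        "content": [{"type": "tool_result", "tool_result": {"weather": "Rainy, 22°C"}}],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "The current weather in Tokyo is rainy with a temperature of 22°C."}],
    },
]

pairs = pair_tool_calls(response)
print(pairs)
```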

This work is actively in progress and sponsored by AI Foundry.

All SDK Contribution checklist:

The pull request does not introduce [breaking changes]
CHANGELOG is updated for new features, bug fixes or other significant changes.
I have read the contribution guidelines.

General Guidelines and Best Practices

Title of the pull request is clear and informative.
There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

Pull request includes test coverage for the included changes.

github-actions · 2025-03-12T18:14:37Z

Thank you for your contribution @thecsw! We will review the pull request and get back to you soon.

Pull Request Overview

This PR introduces the AIAgentConverter to transform agentic conversation data from AI Projects/Services into an evaluator-friendly schema. It defines several Pydantic models for messages and tool definitions, implements a helper function to break tool call data into messages, and adds conversion methods to return different response formats.

Introduces new models for system, user, assistant, and tool messages.
Adds a helper function to split tool calls into separate messages.
Provides the AIAgentConverter class with methods to convert agent runs into various output schemas.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File	Description
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_models.py	New models and helper functions supporting the conversion process
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_ai_services.py	Implements the AIAgentConverter and methods to translate conversation data
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/__init__.py	Added copyright/license header

Comments suppressed due to low confidence (3)

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_models.py:182

The comment at this location suggests that a tool call might be better represented as a tool message, yet an AssistantMessage is used. Verify whether the tool call should be represented with the tool message role to align with the intended schema.

messages.append(AssistantMessage(run_id=run_id, content=[to_dict(content_tool_call)], createdAt=tool_call.created))

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_ai_services.py:126

[nitpick] The sort key uses a boolean expression to order messages with the same timestamp. Consider using an explicit numeric value (e.g., (x.createdAt, 1 if x.role == _AGENT else 0)) for greater clarity.

final_messages.sort(key=lambda x: (x.createdAt, x.role == _AGENT))
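To make the tie-breaking behavior concrete, here is a small self-contained sketch comparing the boolean key with the explicit numeric key. The Message dataclass, the value of _AGENT, and the string timestamps are assumptions for illustration; the real models in _models.py may differ.

```python
from dataclasses import dataclass

_AGENT = "assistant"  # assumption: the constant names the assistant role


@dataclass
class Message:
    createdAt: str  # ISO-8601 strings in one format sort chronologically
    role: str


messages = [
    Message("2025-03-12T20:03:21Z", "assistant"),
    Message("2025-03-12T20:03:21Z", "user"),
    Message("2025-03-12T20:03:19Z", "user"),
]

# Boolean tie-breaker: False sorts before True, so non-agent messages come first.
by_bool = sorted(messages, key=lambda m: (m.createdAt, m.role == _AGENT))

# Explicit numeric tie-breaker: same ordering, with the intent spelled out.
by_int = sorted(messages, key=lambda m: (m.createdAt, 1 if m.role == _AGENT else 0))

print([m.role for m in by_bool])
```

Both keys produce the same order because Python compares False and True as 0 and 1, so the choice is about readability rather than behavior.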

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_models.py:208

The to_dict function converts objects by performing a round-trip JSON serialization. If the object is a Pydantic model, using its built-in .dict() method might be more efficient and direct.

return json.loads(json.dumps(obj))
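For context on what the round trip does, the sketch below applies the same json.loads(json.dumps(...)) pattern to plain Python data using only the standard library. The function name to_dict_roundtrip and the sample payload are hypothetical; the sketch does not reproduce the PR's to_dict or any Pydantic behavior.

```python
import json
from typing import Any


def to_dict_roundtrip(obj: Any) -> Any:
    """Serialize to JSON text and parse it back, as in the line quoted above."""
    return json.loads(json.dumps(obj))


original = {"id": "call_1", "args": ("New York",), 1: "integer key"}
converted = to_dict_roundtrip(original)

# The result is a deep copy, but not an identical one: tuples become lists
# and non-string keys become strings, because JSON has neither.
print(converted)
```

The round trip therefore normalizes data to JSON-compatible types, at the cost of an extra serialize and parse step.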

azure-sdk · 2025-03-12T20:05:14Z

API change check

API changes are not detected in this pull request.

luigiw · 2025-03-13T06:52:39Z

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_ai_services.py

+from ._models import break_tool_call_into_messages, convert_message
+
+
+class AIAgentConverter:


Consider marking this as experimental.

This is going to private preview. If we were to mark it, where does the @experimental decorator come from? I see it used sporadically, but this project doesn't seem to have that dependency.

luigiw · 2025-03-13T06:55:59Z

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_ai_services.py

+            # Since this is the conversation of the entire thread and we are interested in a given run, we need to
+            # filter out the messages that came after the run.
+            if single_turn.run_id is not None:
+                if single_turn.run_id == run_id:


Nit: A list comprehension may be a better way to find all messages belonging to a run ID. See here for an example: https://www.w3schools.com/python/python_lists_comprehension.asp

Agreed that list comprehensions are neat, but note that this is a greedy loop where I want to comment on why each filtering step happens, per the business and design decisions we reached. A list comprehension, for all its neatness, would lose that non-trivial context.

I prefer commenting above a statement to commenting inline within a list comprehension.
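The trade-off discussed in this thread can be sketched with a simplified filter. The message dictionaries and the two-step check below are assumptions for illustration; the actual loop in _ai_services.py applies more rules than this.

```python
messages = [
    {"run_id": None, "role": "user"},
    {"run_id": "run_a", "role": "assistant"},
    {"run_id": "run_b", "role": "assistant"},
    {"run_id": "run_a", "role": "tool"},
]
target_run_id = "run_a"

# Compact form: one expression, little room for per-condition commentary.
in_run_comprehension = [m for m in messages if m["run_id"] == target_run_id]

# Loop form: each filtering decision gets its own explanatory comment.
in_run_loop = []
for message in messages:
    # Messages without a run_id (for example, user messages) belong to no run.
    if message["run_id"] is None:
        continue
    # Keep only the messages produced by the run under evaluation.
    if message["run_id"] == target_run_id:
        in_run_loop.append(message)

print(in_run_loop)
```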

thecsw · 2025-03-16T17:07:39Z

Azure:prp/agent_evaluators is not up to date with the most recent main, so the diff shows more changed files than this PR actually touches (only 3). The base branch needs to be rebased to get a clean diff.

stevepon · 2025-03-17T18:08:45Z

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_models.py

+from typing import List, Optional, Union
+
+# Message roles constants.
+_SYSTEM = "system"


Newer APIs use "developer" here (they're essentially interchangeable, just depends on API version). So might want to make sure that's supported too.

Interesting, didn't know that. Which API version should we bind to, or should we set it to developer by default? I'll look into versioning.

I think for output, we can just use system, at least for now. But when reading in messages from threads/etc., you might need to cover the case where it could say "developer". Or maybe we just default to use whatever the thread itself is using, actually.

Actually, for this converter's purpose I don't think it matters. I take it back.

Let's write system for now. I was thinking of possibly capturing the model it ran on, so if it were o1, we could mark the instructions as coming from developer. The thread itself only gives us instructions without explicitly marking them as either.
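If reading "developer" messages ever does become necessary, one possible approach is to normalize the role on input and always write system on output. The sketch below is hypothetical: _DEVELOPER and normalize_role are not part of this PR, and only _SYSTEM appears in the diff.

```python
_SYSTEM = "system"
_DEVELOPER = "developer"  # hypothetical constant, not in the PR


def normalize_role(role: str) -> str:
    """Map the newer "developer" role onto "system"; leave other roles unchanged."""
    return _SYSTEM if role == _DEVELOPER else role


incoming_roles = ["developer", "user", "assistant", "system"]
normalized_roles = [normalize_role(role) for role in incoming_roles]
print(normalized_roles)
```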

stevepon · 2025-03-17T18:18:24Z

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_models.py

+    :param role: The role of the message sender (e.g., system, user, tool, assistant).
+    :type role: str
+    :param content: The content of the message, which can be a string or a list of dictionaries.
+    :type content: Union[str, List[dict]]


Not for this PR; it can be a quick follow-up so we don't need to be blocked, but I've suggested we include name in order to support multi-agent flows. Any objections, @singankit?

Can the user provide a name? I do not see it being generated by the service right now.

…#40047)

* WIP AIAgentConverter
* Added the v1 of the converter
* Updated the AIAgentConverter with different output schemas.
* ruff format
* Update the top schema to have: query, response, tool_definitions
* "agentic" is not a recognized word, change the wording.
* System message always comes first in query with multiple runs.
* Add support for getting inputs from local files with run_ids.
* Export AIAgentConverter through azure.ai.evaluation, local read updates
* Use from ._models import
* Ruff format again.
* For ComputeInstance and AmlCompute update disableLocalAuth property based on ssh_public_access (#39934)
  * add disableLocalAuth for computeInstance
  * fix disableLocalAuthAuth issue for amlCompute
  * update compute instance
  * update recordings
  * temp changes
  * Revert "temp changes" This reverts commit 64e3c38.
  * update recordings
  * fix tests
* Simplify the API by rolling up the static methods and hiding internals.
* Lock the ._converters._ai_services behind an import error.
* Print to install azure-ai-projects if we can't import AIAgentConverter
* By default, include all previous runs' tool calls and results.
* Don't crash if there is no content in historical thread messages.
* Parallelize the calls to get step_details for each run_id.
* Addressing PR comments.
* Use a single underscore to hide internal static members.

---------

Co-authored-by: Prashant Dhote <168401122+pdhotems@users.noreply.github.com>

thecsw added 4 commits March 11, 2025 13:27

WIP AIAgentConverter

03af2ba

Added the v1 of the converter

e24d77f

Updated the AIAgentConverter with different output schemas.

73e3939

ruff format

c53b72f

Copilot bot review requested due to automatic review settings March 12, 2025 18:14

thecsw requested a review from a team as a code owner March 12, 2025 18:14

github-actions bot added the Community Contribution, customer-reported, and Evaluation labels Mar 12, 2025

Copilot AI reviewed Mar 12, 2025

thecsw added 2 commits March 12, 2025 14:37

Update the top schema to have: query, response, tool_definitions

149a4cc

"agentic" is not a recognized word, change the wording.

8d87168

thecsw changed the title ~~[DRAFT] Converter to translate agentic conversations into evaluator-friendly schema~~ Converter from AI Service threads/runs to evaluator-compatible schema Mar 12, 2025

thecsw added 5 commits March 12, 2025 15:24

System message always comes first in query with multiple runs.

b3f5ef2

Add support for getting inputs from local files with run_ids.

465c1c7

Export AIAgentConverter through azure.ai.evaluation, local read updates

5219616

Use from ._models import

eed1375

Ruff format again.

3758eae

luigiw reviewed Mar 13, 2025

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_ai_services.py (outdated)

luigiw reviewed Mar 13, 2025

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_ai_services.py

luigiw reviewed Mar 13, 2025

pdhotems and others added 6 commits March 13, 2025 12:48

Simplify the API by rolling up the static methods and hiding internals.

1741980

Merge branch 'main' into sandy/ai_services_evaluator

55294ae

Lock the ._converters._ai_services behind an import error.

a30abdf

Print to install azure-ai-projects if we can't import AIAgentConverter

7ec8c35

By default, include all previous runs' tool calls and results.

50e819f

thecsw added 2 commits March 14, 2025 13:38

Don't crash if there is no content in historical thread messages.

7637357

Parallelize the calls to get step_details for each run_id.

8bb6cd3

thecsw changed the base branch from main to prp/agent_evaluators March 15, 2025 16:04

thecsw requested review from paulshealy1, achauhan-scc, kingernupur and jayesh-tanna as code owners March 15, 2025 16:04

Merge branch 'prp/agent_evaluators' into sandy/ai_services_evaluator

6deb358

stevepon reviewed Mar 17, 2025

singankit removed the customer-reported, Community Contribution, and Evaluation labels Mar 17, 2025

stevepon reviewed Mar 17, 2025

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_models.py (outdated)

stevepon reviewed Mar 17, 2025
