Converter from AI Service threads/runs to evaluator-compatible schema #40047

Merged

Conversation

@thecsw thecsw commented Mar 12, 2025

Description

This pull request introduces a new converter, AIAgentConverter, that translates agentic conversations into an evaluator-friendly schema. Given the thread_id and run_id of a conversation from AI Projects/Services, it converts that interaction into a schema that is compatible with evaluator SDKs, OpenAI schemas, etc.

Intended usage pattern:

from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import AIAgentConverter

# Create your instance to talk to AI services.
project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str="<CONNECTION_STRING>"
)

# Initialize the converter that will be backed by the project.
converter = AIAgentConverter(project_client)

# These can be retrieved from the UI or after running `project_client.agents.create_run`
thread_id = "<THREAD_ID>"
run_id = "<RUN_ID>"

# Convert the agent run to a format suitable for the OpenAI API.
converted_data: dict = converter.convert(thread_id, run_id)

This returns a dictionary containing the system message, user messages, tool interactions, and the assistant's response.

The top-level structure is as follows:

  • query (List[dict]): the conversation history and context. It is named query for backward compatibility with Azure AI Foundry evaluations. This field holds every interaction in the thread that precedes the agent's response for the requested run: the system message, any previous assistant messages (including tool calls and tool results), and user queries. Including previous runs' tool calls incurs a performance hit; they can be excluded with exclude_tool_calls_previous_runs=True.
  • response (List[dict]): the assistant's response, containing any number of tool calls, their results, and the assistant's formatted responses.
  • tool_definitions (List[dict]): definitions of the tools that the assistant can use in this run.
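
As a rough illustration of the three top-level keys, the sketch below checks that a converted dictionary has the expected shape and counts messages by role. The helper name `summarize_converted` and the sample data are hypothetical; neither is part of the SDK or real converter output.

```python
from collections import Counter
from typing import Dict, List


def summarize_converted(converted: Dict[str, List[dict]]) -> Dict[str, Counter]:
    """Count messages per role in `query` and `response` (hypothetical helper)."""
    # All three top-level keys described above are expected to be present.
    for key in ("query", "response", "tool_definitions"):
        if key not in converted:
            raise KeyError(f"missing top-level key: {key}")
    return {
        "query": Counter(message["role"] for message in converted["query"]),
        "response": Counter(message["role"] for message in converted["response"]),
    }


# Tiny hand-written sample, not real converter output.
sample = {
    "query": [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": [{"type": "text", "text": "Hi"}]},
    ],
    "response": [
        {"role": "assistant", "content": [{"type": "text", "text": "Hello!"}]},
    ],
    "tool_definitions": [],
}

summary = summarize_converted(sample)
print(summary["query"])
print(summary["response"])
```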

Similarly, there is an API that reads a local file offline and produces the same output:

from azure.ai.evaluation import AIAgentConverter

converted_data: dict = AIAgentConverter.convert_from_file("FILE_PATH", "<RUN_ID>")

Here is a non-trivial example with multiple tools, multiple tool calls/results, and multiple turns in the given conversation:

{
  "query": [
    {
      "role": "system",
      "content": "You are a helpful assistant"
    },
    {
      "createdAt": "2025-03-12T20:03:19Z",
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Hello, what is the weather in New York? Make sure to give it to me in Fahrenheit."
        }
      ]
    },
    {
      "createdAt": "2025-03-12T20:03:21Z",
      "run_id": "run_drO1YX3nw1GLp0ZYtIi2nGb3",
      "role": "assistant",
      "content": [
        {
          "type": "tool_call",
          "tool_call": {
            "id": "call_bWZiARx69AevDyjpjP1o8vkx",
            "type": "function",
            "function": {
              "name": "fetch_weather",
              "arguments": {
                "location": "New York"
              }
            }
          }
        }
      ]
    },
    {
      "createdAt": "2025-03-12T20:03:22Z",
      "run_id": "run_drO1YX3nw1GLp0ZYtIi2nGb3",
      "tool_call_id": "call_bWZiARx69AevDyjpjP1o8vkx",
      "role": "tool",
      "content": [
        {
          "type": "tool_result",
          "tool_result": {
            "weather": "Sunny, 25°C"
          }
        }
      ]
    },
    {
      "createdAt": "2025-03-12T20:03:26Z",
      "run_id": "run_drO1YX3nw1GLp0ZYtIi2nGb3",
      "role": "assistant",
      "content": [
        {
          "type": "tool_call",
          "tool_call": {
            "id": "call_oBJAKB3vLZ5wpMkCg2kfLm2F",
            "type": "function",
            "function": {
              "name": "in_farenheit",
              "arguments": {
                "celsius": 25
              }
            }
          }
        }
      ]
    },
    {
      "createdAt": "2025-03-12T20:03:28Z",
      "run_id": "run_drO1YX3nw1GLp0ZYtIi2nGb3",
      "tool_call_id": "call_oBJAKB3vLZ5wpMkCg2kfLm2F",
      "role": "tool",
      "content": [
        {
          "type": "tool_result",
          "tool_result": "77.0°F"
        }
      ]
    },
    {
      "createdAt": "2025-03-12T20:03:29Z",
      "run_id": "run_drO1YX3nw1GLp0ZYtIi2nGb3",
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "The current weather in New York is sunny with a temperature of 77°F."
        }
      ]
    },
    {
      "createdAt": "2025-03-12T20:03:32Z",
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Actually, nevermind, I need the weather in Tokyo. Since they use Celsius, I want it in Celsius too."
        }
      ]
    }
  ],
  "response": [
    {
      "createdAt": "2025-03-12T20:03:33Z",
      "run_id": "run_Y1jbkKYmw5HWtwmK12yfi0Jb",
      "role": "assistant",
      "content": [
        {
          "type": "tool_call",
          "tool_call": {
            "id": "call_eYtq7fMyHxDWIgeG2s26h0lJ",
            "type": "function",
            "function": {
              "name": "fetch_weather",
              "arguments": {
                "location": "Tokyo"
              }
            }
          }
        }
      ]
    },
    {
      "createdAt": "2025-03-12T20:03:34Z",
      "run_id": "run_Y1jbkKYmw5HWtwmK12yfi0Jb",
      "tool_call_id": "call_eYtq7fMyHxDWIgeG2s26h0lJ",
      "role": "tool",
      "content": [
        {
          "type": "tool_result",
          "tool_result": {
            "weather": "Rainy, 22°C"
          }
        }
      ]
    },
    {
      "createdAt": "2025-03-12T20:03:35Z",
      "run_id": "run_Y1jbkKYmw5HWtwmK12yfi0Jb",
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "The current weather in Tokyo is rainy with a temperature of 22°C."
        }
      ]
    }
  ],
  "tool_definitions": [
    {
      "name": "in_farenheit",
      "description": "Converts Celsius to Fahrenheit.",
      "parameters": {
        "type": "object",
        "properties": {
          "celsius": {
            "type": "number",
            "description": "The temperature in Celsius."
          }
        }
      }
    },
    {
      "name": "fetch_weather",
      "description": "Fetches the weather information for the specified location.",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string",
            "description": "The location to fetch weather for."
          }
        }
      }
    }
  ]
}
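
One way a consumer might read this output is to match each tool call to its result through tool_call_id. The sketch below assumes only the message shape shown in the example above; `pair_tool_calls` is a hypothetical helper, not something this PR provides, and the `response` list is copied by hand from the example with timestamps and run IDs omitted.

```python
from typing import Dict, List, Tuple


def pair_tool_calls(messages: List[dict]) -> List[Tuple[str, dict, object]]:
    """Return (tool name, arguments, result) triples (hypothetical helper)."""
    calls: Dict[str, dict] = {}
    pairs = []
    for message in messages:
        content = message.get("content")
        # The system message carries a plain string, so skip non-list content.
        if not isinstance(content, list):
            continue
        for item in content:
            if item.get("type") == "tool_call":
                call = item["tool_call"]
                calls[call["id"]] = call["function"]
            elif item.get("type") == "tool_result":
                # Tool messages point back at the originating call by ID.
                function = calls.get(message.get("tool_call_id"))
                if function is not None:
                    pairs.append((function["name"], function["arguments"], item["tool_result"]))
    return pairs


# Abbreviated copy of the "response" list from the example above.
response = [
    {
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call": {
                    "id": "call_eYtq7fMyHxDWIgeG2s26h0lJ",
                    "type": "function",
                    "function": {"name": "fetch_weather", "arguments": {"location": "Tokyo"}},
                },
            }
        ],
    },
    {
        "tool_call_id": "call_eYtq7fMyHxDWIgeG2s26h0lJ",
        "role": "tool",
        "content": [{"type": "tool_result", "tool_result": {"weather": "Rainy, 22°C"}}],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "The current weather in Tokyo is rainy with a temperature of 22°C."}],
    },
]

pairs = pair_tool_calls(response)
print(pairs)
```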

This work is actively in progress and sponsored by AI Foundry.

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

@Copilot Copilot bot review requested due to automatic review settings March 12, 2025 18:14
@thecsw thecsw requested a review from a team as a code owner March 12, 2025 18:14
@github-actions github-actions bot added the Community Contribution, customer-reported, and Evaluation labels Mar 12, 2025
Thank you for your contribution @thecsw! We will review the pull request and get back to you soon.

Pull Request Overview

This PR introduces the AIAgentConverter to transform agentic conversation data from AI Projects/Services into an evaluator-friendly schema. It defines several Pydantic models for messages and tool definitions, implements a helper function to break tool call data into messages, and adds conversion methods to return different response formats.

  • Introduces new models for system, user, assistant, and tool messages.
  • Adds a helper function to split tool calls into separate messages.
  • Provides the AIAgentConverter class with methods to convert agent runs into various output schemas.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_models.py New models and helper functions supporting the conversion process
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_ai_services.py Implements the AIAgentConverter and methods to translate conversation data
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/__init__.py Added copyright/license header
Comments suppressed due to low confidence (3)

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_models.py:182

  • The comment at this location suggests that a tool call might be better represented as a tool message, yet an AssistantMessage is used. Verify whether the tool call should be represented with the tool message role to align with the intended schema.
messages.append(AssistantMessage(run_id=run_id, content=[to_dict(content_tool_call)], createdAt=tool_call.created))

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_ai_services.py:126

  • [nitpick] The sort key uses a boolean expression to order messages with the same timestamp. Consider using an explicit numeric value (e.g., (x.createdAt, 1 if x.role == _AGENT else 0)) for greater clarity.
final_messages.sort(key=lambda x: (x.createdAt, x.role == _AGENT))
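
To illustrate the nitpick, the sketch below compares the boolean tie-breaker with the explicit numeric one. Because Python orders False before True, both keys yield the same ordering. The `Msg` dataclass, the integer timestamps, and the value assigned to `_AGENT` are assumptions made for this example; the real constant and message models are not shown in this thread.

```python
from dataclasses import dataclass

_AGENT = "assistant"  # assumed value, used only for this illustration


@dataclass
class Msg:
    createdAt: int
    role: str


messages = [Msg(2, "assistant"), Msg(1, "assistant"), Msg(1, "user")]

# Boolean tie-breaker: False sorts before True, so non-agent messages come first.
implicit = sorted(messages, key=lambda m: (m.createdAt, m.role == _AGENT))

# Explicit numeric tie-breaker, as suggested in the review comment.
explicit = sorted(messages, key=lambda m: (m.createdAt, 1 if m.role == _AGENT else 0))

print([(m.createdAt, m.role) for m in implicit])
print(implicit == explicit)
```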

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_models.py:208

  • The to_dict function converts objects by performing a round-trip JSON serialization. If the object is a Pydantic model, using its built-in .dict() method might be more efficient and direct.
return json.loads(json.dumps(obj))
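
For context, here is a small sketch of what a JSON round trip does to plain Python data: it produces a deep copy, turns tuples into lists, and raises TypeError for values that json cannot serialize. It does not reproduce the PR's to_dict, whose input types are not shown here; `to_dict_roundtrip` and the sample values are hypothetical.

```python
import json


def to_dict_roundtrip(obj):
    """Copy JSON-compatible data by serializing and parsing it again."""
    return json.loads(json.dumps(obj))


original = {"arguments": {"celsius": 25}, "tags": ("a", "b")}
copied = to_dict_roundtrip(original)

# Nested containers are new objects, and the tuple comes back as a list.
print(copied["arguments"] is original["arguments"])
print(copied["tags"])

# Values that json cannot serialize raise TypeError.
try:
    to_dict_roundtrip({"value": object()})
    failed = False
except TypeError:
    failed = True
print(failed)
```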
@thecsw thecsw changed the title [DRAFT] Converter to translate agentic conversations into evaluator-friendly schema Converter from AI Service threads/runs to evaluator-compatible schema Mar 12, 2025
@azure-sdk
Collaborator

API change check

API changes are not detected in this pull request.

from ._models import break_tool_call_into_messages, convert_message


class AIAgentConverter:
Contributor

consider marking this as experimental

Author

This is going to private preview. If we were to mark it, where does the @experimental decorator come from? I see it used sporadically, but this project doesn't seem to have that dependency.

# Since this is the conversation of the entire thread and we are interested in a given run, we need to
# filter out the messages that came after the run.
if single_turn.run_id is not None:
    if single_turn.run_id == run_id:
Contributor

@luigiw luigiw Mar 13, 2025

Nit: A list comprehension may be a better way to find all messages belonging to a run id. See here for an example: https://www.w3schools.com/python/python_lists_comprehension.asp

Author

Agreed that list comprehension is neat, but note that this is a greedy loop in which I want to comment on why each filtering step happens, per the business and design decisions reached. A list comprehension, for all its neatness, would lose that non-trivial context.

I prefer commenting above a statement to commenting in-line within a list comprehension.
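
To make the trade-off concrete, here is a minimal sketch of a loop in which each filtering decision carries its own comment. It does not reproduce the PR's actual loop: `split_thread`, the dict-based messages, and the assumption that messages arrive in chronological order are all hypothetical.

```python
from typing import List, Optional, Tuple


def split_thread(messages: List[dict], run_id: str) -> Tuple[List[dict], List[dict]]:
    """Split chronologically ordered messages into (query, response) for one run."""
    query: List[dict] = []
    response: List[dict] = []
    seen_target_run = False
    for message in messages:
        message_run_id: Optional[str] = message.get("run_id")

        # Messages produced by the requested run form the response.
        if message_run_id == run_id:
            seen_target_run = True
            response.append(message)
            continue

        # Anything that comes after the requested run is outside the
        # window being evaluated, so stop scanning.
        if seen_target_run:
            break

        # Everything earlier (user messages, previous runs) is context.
        query.append(message)
    return query, response


thread = [
    {"role": "user"},
    {"role": "assistant", "run_id": "run_a"},
    {"role": "user"},
    {"role": "assistant", "run_id": "run_b"},
    {"role": "user"},
]

query, response = split_thread(thread, "run_b")
print(len(query), len(response))
```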

@thecsw thecsw changed the base branch from main to prp/agent_evaluators March 15, 2025 16:04
@thecsw
Author

thecsw commented Mar 16, 2025

Azure:prp/agent_evaluators is not up to date with the most recent main, so there are more files changed than there really are—only 3. Base branch needs to be rebased to get a clean diff.

from typing import List, Optional, Union

# Message roles constants.
_SYSTEM = "system"

Newer APIs use "developer" here (they're essentially interchangeable, just depends on API version). So might want to make sure that's supported too.

Author

Interesting, didn't know that. Which API version should we be bound to, or should we set it to developer by default? I'll look into versioning.

I think for output, we can just use system, at least for now. But when reading in messages from threads/etc., you might need to cover the case where it could say "developer". Or maybe we just default to use whatever the thread itself is using, actually.

Actually, for this converter's purpose I don't think it matters. I take it back.

Author

Let's write system for now—I was thinking of possibly capturing the model it ran on, so if it were o1, we could mark the instructions as coming from developer. The thread itself only gives us instructions without an explicit marking for either.
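
The idea floated here could look roughly like the sketch below, which picks the role for the instructions message from the model name. This is not something the PR implements: `instruction_role`, the `_DEVELOPER` constant, and the "o1" prefix check are assumptions for illustration only.

```python
from typing import Optional

_SYSTEM = "system"
_DEVELOPER = "developer"  # hypothetical constant, not defined in this PR


def instruction_role(model_name: Optional[str]) -> str:
    """Pick the role for the instructions message (hypothetical helper)."""
    # Assumption: only models whose name starts with "o1" get the developer role.
    if model_name and model_name.lower().startswith("o1"):
        return _DEVELOPER
    # Default to system, which is what the converter writes for now.
    return _SYSTEM


print(instruction_role("o1"))
print(instruction_role("gpt-4o"))
print(instruction_role(None))
```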

@singankit singankit removed the customer-reported, Community Contribution, and Evaluation labels Mar 17, 2025
:param role: The role of the message sender (e.g., system, user, tool, assistant).
:type role: str
:param content: The content of the message, which can be a string or a list of dictionaries.
:type content: Union[str, List[dict]]

Not for this PR; it can be a quick follow-up so we don't need to be blocked, but I've suggested we include name in order to support multi-agent flows. Any objections? @singankit?

Contributor

Can the user provide a name? I do not see it being generated by the service right now.

@singankit singankit merged commit d3829f4 into Azure:prp/agent_evaluators Mar 17, 2025
3 of 4 checks passed
singankit pushed a commit that referenced this pull request Mar 24, 2025
…#40047)

* WIP AIAgentConverter

* Added the v1 of the converter

* Updated the AIAgentConverter with different output schemas.

* ruff format

* Update the top schema to have: query, response, tool_definitions

* "agentic" is not a recognized word, change the wording.

* System message always comes first in query with multiple runs.

* Add support for getting inputs from local files with run_ids.

* Export AIAgentConverter through azure.ai.evaluation, local read updates

* Use from ._models import

* Ruff format again.

* For ComputeInstance and AmlCompute update disableLocalAuth property based on ssh_public_access (#39934)

* add disableLocalAuth for computeInstance

* fix disableLocalAuthAuth issue for amlCompute

* update compute instance

* update recordings

* temp changes

* Revert "temp changes"

This reverts commit 64e3c38.

* update recordings

* fix tests

* Simplify the API by rolling up the static methods and hiding internals.

* Lock the ._converters._ai_services behind an import error.

* Print to install azure-ai-projects if we can't import AIAgentConverter

* By default, include all previous runs' tool calls and results.

* Don't crash if there is no content in historical thread messages.

* Parallelize the calls to get step_details for each run_id.

* Addressing PR comments.

* Use a single underscore to hide internal static members.

---------

Co-authored-by: Prashant Dhote <168401122+pdhotems@users.noreply.github.com>
singankit added a commit that referenced this pull request Mar 25, 2025
* Tool Call Accuracy Evaluator (#40068)

* Tool Call Accuracy Evaluator

* Review comments

* Updating score key and output structure

* Tool Call Accuracy Evaluator

* Review comments

* Updating score key and output structure

* Updating prompt

* Renaming parameter

* Converter from AI Service threads/runs to evaluator-compatible schema (#40047)

* WIP AIAgentConverter

* Added the v1 of the converter

* Updated the AIAgentConverter with different output schemas.

* ruff format

* Update the top schema to have: query, response, tool_definitions

* "agentic" is not a recognized word, change the wording.

* System message always comes first in query with multiple runs.

* Add support for getting inputs from local files with run_ids.

* Export AIAgentConverter through azure.ai.evaluation, local read updates

* Use from ._models import

* Ruff format again.

* For ComputeInstance and AmlCompute update disableLocalAuth property based on ssh_public_access (#39934)

* add disableLocalAuth for computeInstance

* fix disableLocalAuthAuth issue for amlCompute

* update compute instance

* update recordings

* temp changes

* Revert "temp changes"

This reverts commit 64e3c38.

* update recordings

* fix tests

* Simplify the API by rolling up the static methods and hiding internals.

* Lock the ._converters._ai_services behind an import error.

* Print to install azure-ai-projects if we can't import AIAgentConverter

* By default, include all previous runs' tool calls and results.

* Don't crash if there is no content in historical thread messages.

* Parallelize the calls to get step_details for each run_id.

* Addressing PR comments.

* Use a single underscore to hide internal static members.

---------

Co-authored-by: Prashant Dhote <168401122+pdhotems@users.noreply.github.com>

* Adding intent_resolution_evaluator to prp/agent_evaluators branch (#40065)

* Add intent resolution evaluator

* updated intent_resolution evaluator logic

* Remove spurious print statements

* Address reviewers feedback

* add threshold key, update result to pass/fail rather than True/False

* Add example + remove repeated fields

* Harden check_score_is_valid function

* Add Task Adherence and Completeness  (#40098)

* Agentic Evaluator - Response Completeness

* Added Change Log for Response Completeness Agentic Evaluator

* Task Adherence Agentic Evaluator

* Add Task Adherence Evaluator to changelog

* fixing contracts for Completeness and Task Adherence Evaluators

* Enhancing Contract for Task Adherence and Response Completeness Agentic Evaluator

* update completeness implementation.

* update the completeness evaluator response to include threshold comparison.

* updating the implementation for completeness.

* updating the type for completeness score.

* updating the parsing logic for llm output of completeness.

* updating the response dict for completeness.

* Adding Task adherence

* Adding Task Adherence evaluator with samples

* Delete old files

* updating the exception for completeness evaluator.

* Changing docstring

* Adding changelog

* Use _result_key

* Add admonition

---------

Co-authored-by: Shiprajain01 <shiprajain01@microsoft.com>
Co-authored-by: ShipraJain01 <103409614+ShipraJain01@users.noreply.github.com>
Co-authored-by: Chandra Sekhar Gupta Aravapalli <caravapalli@microsoft.com>

* Adding bug bash sample and instructions (#40125)

* Adding bug bash sample and instructions

* Updating instructions

* Update instructions.md

* Adding instructions and evaluator to agent evaluation sample

* add bug bash sample notebook for response completeness evaluator. (#40139)

* add bug bash sample notebook for response completeness evaluator.

* update the notebook for completeness.

---------

Co-authored-by: Chandra Sekhar Gupta Aravapalli <caravapalli@microsoft.com>

* Sample specific for tool call accuracy evaluator (#40135)

* Update instructions.md

* Add IntentResolution evaluator bug bash notebook (#40144)

* Add intent resolution evaluator

* updated intent_resolution evaluator logic

* Remove spurious print statements

* Address reviewers feedback

* add threshold key, update result to pass/fail rather than True/False

* Add example + remove repeated fields

* Harden check_score_is_valid function

* Sample notebook to demo intent_resolution evaluator

* Add synthetic data and section on how to test data from disk

* Update instructions.md

* Update _tool_call_accuracy.py

* Improve task adherence prompt and add sample notebook for bugbash (#40146)

* For ComputeInstance and AmlCompute update disableLocalAuth property based on ssh_public_access (#39934)

* add disableLocalAuth for computeInstance

* fix disableLocalAuthAuth issue for amlCompute

* update compute instance

* update recordings

* temp changes

* Revert "temp changes"

This reverts commit 64e3c38.

* update recordings

* fix tests

* Add resource prefix for safe secret standard alerts (#40028)

Add the prefix to identify RGs that we are creating in our TME
tenant to identify them as potentially using local auth and violating
our safe secret standards.

Co-authored-by: Wes Haggard <Wes.Haggard@microsoft.com>

* Add examples to task_adherence prompt. Add Task Adherence sample notebook

* Undo changes to New-TestResources.ps1

* Add sample .env file

---------

Co-authored-by: Prashant Dhote <168401122+pdhotems@users.noreply.github.com>
Co-authored-by: Azure SDK Bot <53356347+azure-sdk@users.noreply.github.com>
Co-authored-by: Wes Haggard <Wes.Haggard@microsoft.com>

* [AIAgentConverter] Added support for converting entire threads. (#40178)

* Implemented prepare_evaluation_data

* Add support for retrieving multiple threads into the same file.

* Parallelize thread preparing across threads.

* Set the maximum number of workers in thread pools to 10.

* Users/singankit/tool call accuracy evaluator tests (#40190)

* Raising error when tool call not found

* Adding unit tests for tool call accuracy evaluator

* Updating sample

* update output of converter for tool calls

* add built-ins

* handle file search

* remove extra files

* revert

* revert

---------

Co-authored-by: Ankit Singhal <30610298+singankit@users.noreply.github.com>
Co-authored-by: Sandy <16922860+thecsw@users.noreply.github.com>
Co-authored-by: Prashant Dhote <168401122+pdhotems@users.noreply.github.com>
Co-authored-by: Jose Santos <jcsantos@microsoft.com>
Co-authored-by: ghyadav <103428325+ghyadav@users.noreply.github.com>
Co-authored-by: Shiprajain01 <shiprajain01@microsoft.com>
Co-authored-by: ShipraJain01 <103409614+ShipraJain01@users.noreply.github.com>
Co-authored-by: Chandra Sekhar Gupta Aravapalli <caravapalli@microsoft.com>
Co-authored-by: Ankit Singhal <anksing@microsoft.com>
Co-authored-by: Chandra Sekhar Gupta <38103118+guptha23@users.noreply.github.com>
Co-authored-by: Azure SDK Bot <53356347+azure-sdk@users.noreply.github.com>
Co-authored-by: Wes Haggard <Wes.Haggard@microsoft.com>
Co-authored-by: spon <stevenpon@microsoft.com>
6 participants