
Tool Call Accuracy Evaluator #40068

Merged

Conversation

@singankit (Contributor) commented Mar 13, 2025

Description

Adding the ToolCallAccuracyEvaluator

  • Single Tool Call Example
tool_call_accuracy_evaluator = ToolCallAccuracyEvaluator(model_config=model_config)
response = tool_call_accuracy_evaluator(
    query="How is the weather in New York?",
    response="The weather in New York is sunny.",
    tool_calls={
        "type": "tool_call",
        "tool_call": {
            "id": "call_eYtq7fMyHxDWIgeG2s26h0lJ",
            "type": "function",
            "function": {
                "name": "fetch_weather",
                "arguments": {
                    "location": "New York"
                }
            }
        }
    },
    tool_definitions={
        "id": "fetch_weather",
        "name": "fetch_weather",
        "description": "Fetches the weather information for the specified location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The location to fetch weather for."
                }
            }
        }
    }
)

Output

{
    "tool_call_accuracy": 1.0,
    "per_tool_call_details": [
        {
            "tool_call_accurate": true,
            "tool_call_accurate_reason": "The input Data should get a Score of 1 because the TOOL CALL is directly relevant to the user's query about the weather in New York, has the correct parameters, and the values are accurately extracted from the CONVERSATION.",
            "tool_call_id": "call_eYtq7fMyHxDWIgeG2s26h0lJ"
        }
    ]
}
  • Multiple Function Tool Calls Example
tool_call_accuracy_evaluator = ToolCallAccuracyEvaluator(model_config=model_config)
response = tool_call_accuracy_evaluator(
    query="How is the weather in New York?",
    tool_calls=[
        {
            "type": "tool_call",
            "tool_call": {
                "id": "call_eYtq7fMyHxDWIgeG2s26h0lJ",
                "type": "function",
                "function": {
                    "name": "fetch_weather",
                    "arguments": {
                        "location": "Seattle"
                    }
                }
            }
        },
        {
            "type": "tool_call",
            "tool_call": {
                "id": "call_eYtq7fMyHxDWIgeG2s26h0lJ",
                "type": "function",
                "function": {
                    "name": "fetch_weather",
                    "arguments": {
                        "location": "New York"
                    }
                }
            }
        }
    ],
    tool_definitions={
        "id": "fetch_weather",
        "name": "fetch_weather",
        "description": "Fetches the weather information for the specified location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The location to fetch weather for."
                }
            }
        }
    }
)

Output

{
    "tool_call_accuracy": 0.5,
    "per_tool_call_details": [
        {
            "tool_call_accurate": false,
            "tool_call_accurate_reason": "The TOOL CALL is irrelevant to the user's inquiry about the weather in New York, as it attempts to fetch weather information for Seattle instead. The parameters used do not match the user's request, and the values are not appropriate or correct. Thus, the score is 0.",
            "tool_call_id": "call_eYtq7fMyHxDWIgeG2s26h0lJ"
        },
        {
            "tool_call_accurate": true,
            "tool_call_accurate_reason": "The TOOL CALL is directly relevant to the user's inquiry about the weather in New York, has the correct parameters and values, and is likely to provide useful information. Thus, it deserves a score of 1 for being relevant and useful.",
            "tool_call_id": "call_eYtq7fMyHxDWIgeG2s26h0lJ"
        }
    ]
}
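
Note that the top-level tool_call_accuracy here is just the fraction of individual calls judged accurate: 1 of 2 calls, hence 0.5. A minimal sketch of that aggregation, assuming the per_tool_call_details shape shown above (the helper name aggregate_accuracy is illustrative, not part of the SDK):

def aggregate_accuracy(per_tool_call_details):
    # Fraction of tool calls the LLM judge marked accurate, in [0.0, 1.0].
    if not per_tool_call_details:
        return 0.0
    accurate = sum(1 for d in per_tool_call_details if d["tool_call_accurate"])
    return accurate / len(per_tool_call_details)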

If an SDK is being regenerated based on a new swagger spec, a link to the pull request containing these swagger spec changes has been included above.

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.


@singankit singankit requested a review from a team as a code owner March 13, 2025 21:43
@github-actions github-actions bot added the Evaluation Issues related to the client library for Azure AI Evaluation label Mar 13, 2025
@@ -294,7 +294,7 @@ def parse_quality_evaluator_reason_score(llm_output: str) -> Tuple[float, str]:
     reason = ""
     if llm_output:
         try:
-            score_pattern = r"<S2>\D*?([1-5]).*?</S2>"
+            score_pattern = rf"<S2>\D*?({score_range}).*?</S2>"
Contributor

Is this supposed to be a common util? What if the score is a continuous float 0-1?

Contributor Author

It can be overridden for each evaluator if needed. I made this configurable here to make it reusable where possible. All existing evaluators have scores that are whole numbers with no decimals, but that may not hold for new evaluators.

Contributor

I guess the only thing I'd recommend then is a different variable name -- "valid_score_regex" or something. But this is not a big deal, feel free to ignore.
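
To make this thread concrete, here is a minimal sketch of how the configurable score_range pattern behaves, including a float-friendly variant for the continuous 0-1 case raised above (extract_score is an illustrative stand-in for the SDK's parse_quality_evaluator_reason_score):

import re

def extract_score(llm_output: str, score_range: str = "[1-5]") -> float:
    # Pull the score out of the <S2>...</S2> tag emitted by the prompt.
    score_pattern = rf"<S2>\D*?({score_range}).*?</S2>"
    match = re.search(score_pattern, llm_output, re.DOTALL)
    return float(match.group(1)) if match else float("nan")

print(extract_score("<S1>relevant and correct</S1><S2>Score: 3</S2>"))  # 3.0
# A continuous 0-1 score would need a widened pattern, e.g.:
print(extract_score("<S2>0.75</S2>", score_range=r"[01](?:\.\d+)?"))    # 0.75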

    tool_calls.extend([content for content in message.get("content")
                       if content.get("type") == "tool_call" and content.get("tool_call").get("type") == "function"])
else:
    raise EvaluationException(
Contributor

Are we sure we want this to be an exception? As a user, I could see just wanting to call this evaluator after every agent run. I don't want to have to pre-filter on the client side, and adding try-catch also seems a bit cumbersome. Maybe instead there is a benign "no tools were called" response we could give?

Contributor Author

This check has to happen either before the call or after it, to know whether the evaluator was given something valid to evaluate. I would prefer to enforce it on the input side.
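
For comparison, the benign alternative suggested above could look roughly like this sketch; the returned keys mirror the evaluator's output, but this contract is an assumption for illustration, not what was merged:

import math

def evaluate_tool_calls(query, tool_calls):
    # Hypothetical alternative to raising EvaluationException: return a
    # clearly-marked "not applicable" result when no tool calls were made.
    if not tool_calls:
        return {
            "tool_call_accuracy": math.nan,  # NaN so downstream averages can skip it
            "per_tool_call_details": [],
            "reason": "No tool calls were made in this response.",
        }
    raise NotImplementedError("per-call judging elided in this sketch")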

# Instruction
## Goal
### You are an expert in evaluating the accuracy of a tool call, considering relevance and potential usefulness, including the syntactic and semantic correctness of a proposed tool call from an intelligent system, based on the provided definition and data. Your goal will involve answering the questions below using the information provided.
- **Definition**: You are given a definition of the communication trait that is being evaluated to help guide your Score.
Contributor

This line is confusing to me. Has it been used successfully elsewhere?

Contributor Author

Existing evaluators use it. I built this on top of the prompt from existing evaluators.

@changliu2 (Member) commented, quoting the single tool call example from the description (whose output at the time was):

{
    "tool_call_accuracy": 1.0,
    "tool_call_accuracy_reason": "The input Data should get a Score of 1 because the TOOL CALL is directly relevant to the user's inquiry about the weather in New York, uses the correct parameters, and the parameter values are accurate and contextually appropriate."
}

can you also provide an example of 2 function calls evaluated?


# Ratings
## [Tool Call Accuracy: 0] (Irrelevant)
**Definition:** The tool call is not relevant and will not help resolve the user's need, the TOOL CALL includes information that is not present in the CONVERSATION, or the TOOL CALL has parameters that are not present in the TOOL DEFINITION.
Contributor

Do you want to say that the "TOOL CALL parameters include information that is not presented in the CONVERSATION"?

Contributor Author

Yes, and also if a tool was called with a parameter that is not in the tool definition.
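
To illustrate both failure modes in one hypothetical example (the id and the extra "units" parameter are invented for illustration): the conversation only asks about New York, and the tool definition only declares a location parameter, so the call below would score 0 under this rubric.

bad_tool_call = {
    "type": "tool_call",
    "tool_call": {
        "id": "call_hypothetical_0",
        "type": "function",
        "function": {
            "name": "fetch_weather",
            "arguments": {
                "location": "Paris",   # value not present in the CONVERSATION
                "units": "imperial"    # parameter not in the TOOL DEFINITION
            }
        }
    }
}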

- **Score**: based on your previous analysis, provide your Score. The Score you give MUST be an integer score (i.e., "0", "1") based on the levels of the definitions.


## Please provide your answers between the tags: <S0>your chain of thoughts</S0>, <S1>your explanation</S1>, <S2>your Score</S2>.
Contributor

I made the same comment on Jose's... that if we want to move toward JSON mode or structured output we might need to revise this approach, right? (And also that a potential optimization is to remove the chain-of-thought piece, since the explanation can essentially play the same role, and then just have the LLM write the explanation field first in the JSON response.)

Contributor Author (@singankit, Mar 13, 2025)

While I was writing the prompt, the chain of thought was helpful for me to debug. But agreed: if we do not expose it to the customer, should we even ask the LLM to generate it? Once we move to structured output, yes, we will need to revise it.
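
A rough sketch of what the structured-output variant discussed here might parse instead of the tagged sections; the field names and the explanation-first ordering are assumptions drawn from this thread, not a settled contract:

import json

def parse_structured_judgment(llm_output: str):
    # With JSON mode, the model would write the explanation first (standing in
    # for the chain of thought) and the score after it, e.g.:
    #   {"explanation": "...", "score": 1}
    data = json.loads(llm_output)
    return float(data["score"]), data["explanation"]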

@singankit (Contributor Author) commented Mar 14, 2025

@changliu2 To help close on the keys for the output of this evaluator:

Current

{
    "tool_call_accuracy": 0.5,
    "per_tool_call_details": [
        {
            "tool_call_accurate": false,
            "tool_call_accurate_reason": "The TOOL CALL is irrelevant to the user's inquiry about the weather in New York, as it attempts to fetch weather information for Seattle instead. The parameters used do not match the user's request, and the values are not appropriate or correct. Thus, the score is 0.",
            "tool_call_id": "call_eYtq7fMyHxDWIgeG2s26h0lJ"
        },
        {
            "tool_call_accurate": true,
            "tool_call_accurate_reason": "The TOOL CALL is directly relevant to the user's inquiry about the weather in New York, has the correct parameters and values, and is likely to provide useful information. Thus, it deserves a score of 1 for being relevant and useful.",
            "tool_call_id": "call_eYtq7fMyHxDWIgeG2s26h0lJ"
        }
    ]
}

Proposed

{
    "percentage_tool_call_accurate": 0.5,
    "tool_call_correctness_label": True  # Chang to help close of this name 
    "per_tool_call_details": [
        {
            "tool_call_accurate": false,
            "tool_call_accurate_reason": "The TOOL CALL is irrelevant to the user's inquiry about the weather in New York, as it attempts to fetch weather information for Seattle instead. The parameters used do not match the user's request, and the values are not appropriate or correct. Thus, the score is 0.",
            "tool_call_id": "call_eYtq7fMyHxDWIgeG2s26h0lJ"
        },
        {
            "tool_call_accurate": true,
            "tool_call_accurate_reason": "The TOOL CALL is directly relevant to the user's inquiry about the weather in New York, has the correct parameters and values, and is likely to provide useful information. Thus, it deserves a score of 1 for being relevant and useful.",
            "tool_call_id": "call_eYtq7fMyHxDWIgeG2s26h0lJ"
        }
    ]
}

@changliu2 (Member) commented, quoting the Current/Proposed options above:

@singankit, let's stick to the current convention pattern (key + "_label"):

{
    "tool_call_accuracy": 0.5,
    "tool_call_accuracy_label": True,
    "per_tool_call_details": [
        {
            "tool_call_accurate": false,
            "tool_call_accurate_reason": "The TOOL CALL is irrelevant to the user's inquiry about the weather in New York, as it attempts to fetch weather information for Seattle instead. The parameters used do not match the user's request, and the values are not appropriate or correct. Thus, the score is 0.",
            "tool_call_id": "call_eYtq7fMyHxDWIgeG2s26h0lJ"
        },
        {
            "tool_call_accurate": true,
            "tool_call_accurate_reason": "The TOOL CALL is directly relevant to the user's inquiry about the weather in New York, has the correct parameters and values, and is likely to provide useful information. Thus, it deserves a score of 1 for being relevant and useful.",
            "tool_call_id": "call_eYtq7fMyHxDWIgeG2s26h0lJ"
        }
    ]
}
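
A minimal sketch of the agreed key + "_label" convention; the 0.8 threshold is purely illustrative, since no default threshold is stated in this thread:

def build_result(per_tool_call_details, threshold=0.8):
    # key + "_label" convention: the label is the thresholded aggregate score.
    accuracy = sum(
        d["tool_call_accurate"] for d in per_tool_call_details
    ) / len(per_tool_call_details)
    return {
        "tool_call_accuracy": accuracy,
        "tool_call_accuracy_label": accuracy >= threshold,
        "per_tool_call_details": per_tool_call_details,
    }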

@singankit singankit merged commit ce7eced into prp/agent_evaluators Mar 14, 2025
3 of 4 checks passed
@singankit singankit deleted the users/singankit/tool_call_accuracy_evaluator branch March 14, 2025 23:56
singankit added a commit that referenced this pull request Mar 24, 2025
* Tool Call Accuracy Evaluator

* Review comments

* Updating score key and output structure

* Tool Call Accuracy Evaluator

* Review comments

* Updating score key and output structure

* Updating prompt

* Renaming parameter
singankit added a commit that referenced this pull request Mar 25, 2025

singankit added a commit that referenced this pull request Mar 25, 2025

singankit added a commit that referenced this pull request Mar 25, 2025
