Spon/update evals converter (#40204)

stevepon · singankit · thecsw · web-flow · commit 840ccfd06365 · 2025-03-24T22:54:16.000-07:00
* Tool Call Accuracy Evaluator (#40068)
* Tool Call Accuracy Evaluator
* Review comments
* Updating score key and output structure
* Tool Call Accuracy Evaluator
* Review comments
* Updating score key and output structure
* Updating prompt
* Renaming parameter
* Converter from AI Service threads/runs to evaluator-compatible schema (#40047)
* WIP AIAgentConverter
* Added the v1 of the converter
* Updated the AIAgentConverter with different output schemas.
* ruff format
* Update the top schema to have: query, response, tool_definitions
* "agentic" is not a recognized word, change the wording.
* System message always comes first in query with multiple runs.
* Add support for getting inputs from local files with run_ids.
* Export AIAgentConverter through azure.ai.evaluation, local read updates
* Use from ._models import
* Ruff format again.
* For ComputeInstance and AmlCompute update disableLocalAuth property based on ssh_public_access (#39934)
* add disableLocalAuth for computeInstance
* fix disableLocalAuthAuth issue for amlCompute
* update compute instance
* update recordings
* temp changes
* Revert "temp changes" This reverts commit 64e3c38.
* update recordings
* fix tests
* Simplify the API by rolling up the static methods and hiding internals.
* Lock the ._converters._ai_services behind an import error.
* Print to install azure-ai-projects if we can't import AIAgentConverter
* By default, include all previous runs' tool calls and results.
* Don't crash if there is no content in historical thread messages.
* Parallelize the calls to get step_details for each run_id.
* Addressing PR comments.
* Use a single underscore to hide internal static members.

---------

Co-authored-by: Prashant Dhote <168401122+pdhotems@users.noreply.github.com>

* Adding intent_resolution_evaluator to prp/agent_evaluators branch (#40065)
* Add intent resolution evaluator
* updated intent_resolution evaluator logic
* Remove spurious print statements
* Address reviewers feedback
* add threshold key, update result to pass/fail rather than True/False
* Add example + remove repeated fields
* Harden check_score_is_valid function
* Add Task Adherence and Completeness (#40098)
* Agentic Evaluator - Response Completeness
* Added Change Log for Response Completeness Agentic Evaluator
* Task Adherence Agentic Evaluator
* Add Task Adherence Evaluator to changelog
* fixing contracts for Completeness and Task Adherence Evaluators
* Enhancing Contract for Task Adherence and Response Completeness Agentic Evaluator
* update completeness implementation.
* update the completeness evaluator response to include threshold comparison.
* updating the implementation for completeness.
* updating the type for completeness score.
* updating the parsing logic for llm output of completeness.
* updating the response dict for completeness.
* Adding Task adherence
* Adding Task Adherence evaluator with samples
* Delete old files
* updating the exception for completeness evaluator.
* Changing docstring
* Adding changelog
* Use _result_key
* Add admonition

---------

Co-authored-by: Shiprajain01 <shiprajain01@microsoft.com>
Co-authored-by: ShipraJain01 <103409614+ShipraJain01@users.noreply.github.com>
Co-authored-by: Chandra Sekhar Gupta Aravapalli <caravapalli@microsoft.com>

* Adding bug bash sample and instructions (#40125)
* Adding bug bash sample and instructions
* Updating instructions
* Update instructions.md
* Adding instructions and evaluator to agent evaluation sample
* add bug bash sample notebook for response completeness evaluator. (#40139)
* add bug bash sample notebook for response completeness evaluator.
* update the notebook for completeness.

---------

Co-authored-by: Chandra Sekhar Gupta Aravapalli <caravapalli@microsoft.com>

* Sample specific for tool call accuracy evaluator (#40135)
* Update instructions.md
* Add IntentResolution evaluator bug bash notebook (#40144)
* Add intent resolution evaluator
* updated intent_resolution evaluator logic
* Remove spurious print statements
* Address reviewers feedback
* add threshold key, update result to pass/fail rather than True/False
* Add example + remove repeated fields
* Harden check_score_is_valid function
* Sample notebook to demo intent_resolution evaluator
* Add synthetic data and section on how to test data from disk
* Update instructions.md
* Update _tool_call_accuracy.py
* Improve task adherence prompt and add sample notebook for bugbash (#40146)
* For ComputeInstance and AmlCompute update disableLocalAuth property based on ssh_public_access (#39934)
* add disableLocalAuth for computeInstance
* fix disableLocalAuthAuth issue for amlCompute
* update compute instance
* update recordings
* temp changes
* Revert "temp changes" This reverts commit 64e3c38.
* update recordings
* fix tests
* Add resource prefix for safe secret standard alerts (#40028)
  Add the prefix to identify RGs that we are creating in our TME tenant to identify them as potentially using local auth and violating our safe secret standards.
  Co-authored-by: Wes Haggard <Wes.Haggard@microsoft.com>
* Add examples to task_adherence prompt. Add Task Adherence sample notebook
* Undo changes to New-TestResources.ps1
* Add sample .env file

---------

Co-authored-by: Prashant Dhote <168401122+pdhotems@users.noreply.github.com>
Co-authored-by: Azure SDK Bot <53356347+azure-sdk@users.noreply.github.com>
Co-authored-by: Wes Haggard <Wes.Haggard@microsoft.com>

* [AIAgentConverter] Added support for converting entire threads. (#40178)
* Implemented prepare_evaluation_data
* Add support for retrieving multiple threads into the same file.
* Parallelize thread preparing across threads.
* Set the maximum number of workers in thread pools to 10.
* Users/singankit/tool call accuracy evaluator tests (#40190)
* Raising error when tool call not found
* Adding unit tests for tool call accuracy evaluator
* Updating sample
* update output of converter for tool calls
* add built-ins
* handle file search
* remove extra files
* revert
* revert

---------

Co-authored-by: Ankit Singhal <30610298+singankit@users.noreply.github.com>
Co-authored-by: Sandy <16922860+thecsw@users.noreply.github.com>
Co-authored-by: Prashant Dhote <168401122+pdhotems@users.noreply.github.com>
Co-authored-by: Jose Santos <jcsantos@microsoft.com>
Co-authored-by: ghyadav <103428325+ghyadav@users.noreply.github.com>
Co-authored-by: Shiprajain01 <shiprajain01@microsoft.com>
Co-authored-by: ShipraJain01 <103409614+ShipraJain01@users.noreply.github.com>
Co-authored-by: Chandra Sekhar Gupta Aravapalli <caravapalli@microsoft.com>
Co-authored-by: Ankit Singhal <anksing@microsoft.com>
Co-authored-by: Chandra Sekhar Gupta <38103118+guptha23@users.noreply.github.com>
Co-authored-by: Azure SDK Bot <53356347+azure-sdk@users.noreply.github.com>
Co-authored-by: Wes Haggard <Wes.Haggard@microsoft.com>
Co-authored-by: spon <stevenpon@microsoft.com>
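
Illustration of the "update output of converter for tool calls" and "add built-ins" items: a minimal sketch of the flattened tool-call content that the diff below produces, not the SDK code itself. It assumes that `_TOOL_CALL` resolves to the string "tool_call", uses `json.loads` in place of the repository's `safe_loads` helper, and stands in `SimpleNamespace` objects for the azure-ai-projects run-step models. The helper name `tool_call_content` is hypothetical.

```python
import json
from types import SimpleNamespace


def tool_call_content(details):
    """Build the flat tool-call content dict; return None for unsupported tools."""
    if hasattr(details, "function"):
        # User-defined function tool: name and JSON-encoded arguments come from the call.
        name = details.function.name
        arguments = json.loads(details.function.arguments)
    elif details.type == "code_interpreter":
        # Built-in tool: the tool type doubles as the tool name.
        name = details.type
        arguments = {"input": details.code_interpreter.input}
    elif details.type == "bing_grounding":
        name = details.type
        arguments = {"requesturl": details.bing_grounding.requesturl}
    else:
        # Unknown tool type: skip instead of failing.
        return None
    return {
        "type": "tool_call",  # assumed value of _TOOL_CALL
        "tool_call_id": details.id,
        "name": name,
        "arguments": arguments,
    }


# Stand-in objects (hypothetical data) shaped like the attributes read above.
function_call = SimpleNamespace(
    id="call_1",
    type="function",
    function=SimpleNamespace(name="get_weather", arguments='{"city": "Seattle"}'),
)
code_call = SimpleNamespace(
    id="call_2",
    type="code_interpreter",
    code_interpreter=SimpleNamespace(input="print(1 + 1)"),
)

print(tool_call_content(function_call))
print(tool_call_content(code_call))
```

Compared with the previous nested `{"type": ..., "tool_call": {"id": ..., "function": {...}}}` shape, every tool call now carries `tool_call_id`, `name`, and `arguments` at the top level, so function tools and built-in tools share one schema.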
diff --git a/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_ai_services.py b/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_ai_services.py
@@ -30,6 +30,50 @@
 # Maximum number of workers allowed to make API calls at the same time.
 _MAX_WORKERS = 10
 
+# Built-in tool descriptions and parameters are hidden, but we include basic descriptions
+# for evaluation purposes
+_BUILT_IN_DESCRIPTIONS = {
+    "code_interpreter": "Use code interpreter to read and interpret information from datasets, "
+    "generate code, and create graphs and charts using your data. Supports up to 20 files.",
+    "bing_grounding": "Enhance model output with web data.",
+    "file_search": "Search for data across uploaded files.",
+}
+
+_BUILT_IN_PARAMS = {"code_interpreter": {"type": "object",
+                                         "properties": {
+                                             "input": {
+                                                "type": "string",
+                                                "description": "Generated code to be executed."
+                                                }
+                                             }
+                                         },
+                    "bing_grounding": {"type": "object",
+                                         "properties": {
+                                             "requesturl": {
+                                                "type": "string",
+                                                "description": "URL used in Bing Search API."
+                                                }
+                                             }
+                                         },
+                    "file_search": {"type": "object",
+                                       "properties": {
+                                           "ranking_options": {
+                                               "type": "object",
+                                               "properties": {
+                                                    "ranker": {
+                                                         "type": "string",
+                                                         "description": "Ranking algorithm to use."
+                                                         },
+                                                    "score_threshold": {
+                                                         "type": "number",
+                                                         "description": "Threshold for search results."
+                                                         }
+                                                    },
+                                               "description": "Ranking options for search results."
+                                           }
+                                       }
+                                    },
+                    }
 class AIAgentConverter:
     """
     A converter for AI agent data.
@@ -163,6 +207,20 @@ def _extract_function_tool_definitions(thread_run: ThreadRun) -> List[ToolDefini
                         parameters=parameters,
                     )
                 )
+            else:
+                # Add limited support for built-in tools. Their descriptions and parameters
+                # are not published, so we include placeholders instead.
+                try:
+                    final_tools.append(
+                        ToolDefinition(
+                            name=tool.type,
+                            description=_BUILT_IN_DESCRIPTIONS[tool.type],
+                            parameters=_BUILT_IN_PARAMS[tool.type]
+                        )
+                    )
+                except KeyError:
+                    # if we run into an unknown tool, don't fail
+                    pass
         return final_tools
 
     @staticmethod
diff --git a/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_models.py b/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_models.py
@@ -169,30 +169,63 @@ def break_tool_call_into_messages(tool_call: ToolCall, run_id: str) -> List[Mess
     # all in most of the cases, and bing would only show the API URL, without arguments or results.
     # Bing grounding would have "bing_grounding" in details with "requesturl" that will just be the API path with query.
     # TODO: Work with AI Services to add converter support for BingGrounding and CodeInterpreter.
-    if not hasattr(tool_call.details, _FUNCTION):
-        return messages
-
-    # This is the internals of the content object that will be included with the tool call.
-    tool_call_id = tool_call.details.id
-    content_tool_call = {
-        "type": _TOOL_CALL,
-        _TOOL_CALL: {
-            "id": tool_call_id,
-            "type": _FUNCTION,
-            _FUNCTION: {
-                "name": tool_call.details.function.name,
-                "arguments": safe_loads(tool_call.details.function.arguments),
-            },
-        },
-    }
+    if hasattr(tool_call.details, _FUNCTION):
+        # This is the internals of the content object that will be included with the tool call.
+        tool_call_id = tool_call.details.id
+        content_tool_call = {
+            "type": _TOOL_CALL,
+            "tool_call_id": tool_call_id,
+            "name": tool_call.details.function.name,
+            "arguments": safe_loads(tool_call.details.function.arguments),
+        }
+    else:
+        # Handle built-in tools separately: each has its own object model, so every case is handled explicitly.
+        # For now, plain dicts are built here instead of custom serializers, for simplicity.
+        # Skip, rather than fail, on a tool type that has not been seen before.
+        if tool_call.details.type == "code_interpreter":
+            arguments = {"input": tool_call.details.code_interpreter.input}
+        elif tool_call.details.type == "bing_grounding":
+            arguments = {"requesturl": tool_call.details.bing_grounding.requesturl}
+        elif tool_call.details.type == "file_search":
+            options = tool_call.details.file_search.ranking_options
+            arguments = {"ranking_options": {"ranker": options.ranker, "score_threshold": options.score_threshold}}
+        else:
+            # unsupported tool type, skip
+            return messages
+        try:
+            tool_call_id = tool_call.details.id
+            content_tool_call = {
+                "type": _TOOL_CALL,
+                "tool_call_id": tool_call_id,
+                "name": tool_call.details.type,
+                "arguments": arguments,
+            }
+        except Exception:
+            return messages
 
     # We format it into an assistant message, where the content is a singleton list of the content object.
     # It should be a tool message, since this is the call, but the given schema treats this message as
     # assistant's action of calling the tool.
     messages.append(AssistantMessage(run_id=run_id, content=[to_dict(content_tool_call)], createdAt=tool_call.created))
 
+    if hasattr(tool_call.details, _FUNCTION):
+        output = safe_loads(tool_call.details.function.output)
+    else:
+        try:
+            # Some built-ins have output and others do not. If none is found, skip the tool-result
+            # message. File search results are converted to plain dicts for easy serialization.
+            if tool_call.details.type == "code_interpreter":
+                output = tool_call.details.code_interpreter.outputs
+            elif tool_call.details.type == "bing_grounding":
+                return messages  # output is not available from the bing grounding tool yet
+            elif tool_call.details.type == "file_search":
+                output = [{"file_id": r.file_id, "file_name": r.file_name, "score": r.score, "content": r.content}
+                          for r in tool_call.details.file_search.results]
+        except Exception:
+            return messages
+
     # Now, onto the tool result, which only includes the result of the function call.
-    content_tool_call_result = {"type": _TOOL_RESULT, _TOOL_RESULT: safe_loads(tool_call.details.function.output)}
+    content_tool_call_result = {"type": _TOOL_RESULT, _TOOL_RESULT: output}
 
     # Since this is a tool's action of returning, we put it as a tool message.
     messages.append(