
Commit 840ccfd

stevepon, singankit, thecsw, pdhotems, and JoseCSantos authored
Spon/update evals converter (#40204)
* Tool Call Accuracy Evaluator (#40068)
* Tool Call Accuracy Evaluator
* Review comments
* Updating score key and output structure
* Tool Call Accuracy Evaluator
* Review comments
* Updating score key and output structure
* Updating prompt
* Renaming parameter
* Converter from AI Service threads/runs to evaluator-compatible schema (#40047)
* WIP AIAgentConverter
* Added the v1 of the converter
* Updated the AIAgentConverter with different output schemas.
* ruff format
* Update the top schema to have: query, response, tool_definitions
* "agentic" is not a recognized word, change the wording.
* System message always comes first in query with multiple runs.
* Add support for getting inputs from local files with run_ids.
* Export AIAgentConverter through azure.ai.evaluation, local read updates
* Use from ._models import
* Ruff format again.
* For ComputeInstance and AmlCompute update disableLocalAuth property based on ssh_public_access (#39934)
* add disableLocalAuth for computeInstance
* fix disableLocalAuthAuth issue for amlCompute
* update compute instance
* update recordings
* temp changes
* Revert "temp changes" This reverts commit 64e3c38.
* update recordings
* fix tests
* Simplify the API by rolling up the static methods and hiding internals.
* Lock the ._converters._ai_services behind an import error.
* Print to install azure-ai-projects if we can't import AIAgentConverter
* By default, include all previous runs' tool calls and results.
* Don't crash if there is no content in historical thread messages.
* Parallelize the calls to get step_details for each run_id.
* Addressing PR comments.
* Use a single underscore to hide internal static members.
---------
Co-authored-by: Prashant Dhote <168401122+pdhotems@users.noreply.github.com>
* Adding intent_resolution_evaluator to prp/agent_evaluators branch (#40065)
* Add intent resolution evaluator
* updated intent_resolution evaluator logic
* Remove spurious print statements
* Address reviewers feedback
* add threshold key, update result to pass/fail rather than True/False
* Add example + remove repeated fields
* Harden check_score_is_valid function
* Add Task Adherence and Completeness (#40098)
* Agentic Evaluator - Response Completeness
* Added Change Log for Response Completeness Agentic Evaluator
* Task Adherence Agentic Evaluator
* Add Task Adherence Evaluator to changelog
* fixing contracts for Completeness and Task Adherence Evaluators
* Enhancing Contract for Task Adherence and Response Completeness Agentic Evaluator
* update completeness implementation.
* update the completeness evaluator response to include threshold comparison.
* updating the implementation for completeness.
* updating the type for completeness score.
* updating the parsing logic for llm output of completeness.
* updating the response dict for completeness.
* Adding Task adherence
* Adding Task Adherence evaluator with samples
* Delete old files
* updating the exception for completeness evaluator.
* Changing docstring
* Adding changelog
* Use _result_key
* Add admonition
---------
Co-authored-by: Shiprajain01 <shiprajain01@microsoft.com>
Co-authored-by: ShipraJain01 <103409614+ShipraJain01@users.noreply.github.com>
Co-authored-by: Chandra Sekhar Gupta Aravapalli <caravapalli@microsoft.com>
* Adding bug bash sample and instructions (#40125)
* Adding bug bash sample and instructions
* Updating instructions
* Update instructions.md
* Adding instructions and evaluator to agent evaluation sample
* add bug bash sample notebook for response completeness evaluator. (#40139)
* add bug bash sample notebook for response completeness evaluator.
* update the notebook for completeness.
---------
Co-authored-by: Chandra Sekhar Gupta Aravapalli <caravapalli@microsoft.com>
* Sample specific for tool call accuracy evaluator (#40135)
* Update instructions.md
* Add IntentResolution evaluator bug bash notebook (#40144)
* Add intent resolution evaluator
* updated intent_resolution evaluator logic
* Remove spurious print statements
* Address reviewers feedback
* add threshold key, update result to pass/fail rather than True/False
* Add example + remove repeated fields
* Harden check_score_is_valid function
* Sample notebook to demo intent_resolution evaluator
* Add synthetic data and section on how to test data from disk
* Update instructions.md
* Update _tool_call_accuracy.py
* Improve task adherence prompt and add sample notebook for bugbash (#40146)
* For ComputeInstance and AmlCompute update disableLocalAuth property based on ssh_public_access (#39934)
* add disableLocalAuth for computeInstance
* fix disableLocalAuthAuth issue for amlCompute
* update compute instance
* update recordings
* temp changes
* Revert "temp changes" This reverts commit 64e3c38.
* update recordings
* fix tests
* Add resource prefix for safe secret standard alerts (#40028) Add the prefix to identify RGs that we are creating in our TME tenant to identify them as potentially using local auth and violating our safe secret standards.
Co-authored-by: Wes Haggard <Wes.Haggard@microsoft.com>
* Add examples to task_adherence prompt. Add Task Adherence sample notebook
* Undo changes to New-TestResources.ps1
* Add sample .env file
---------
Co-authored-by: Prashant Dhote <168401122+pdhotems@users.noreply.github.com>
Co-authored-by: Azure SDK Bot <53356347+azure-sdk@users.noreply.github.com>
Co-authored-by: Wes Haggard <Wes.Haggard@microsoft.com>
* [AIAgentConverter] Added support for converting entire threads. (#40178)
* Implemented prepare_evaluation_data
* Add support for retrieving multiple threads into the same file.
* Parallelize thread preparing across threads.
* Set the maximum number of workers in thread pools to 10.
* Users/singankit/tool call accuracy evaluator tests (#40190)
* Raising error when tool call not found
* Adding unit tests for tool call accuracy evaluator
* Updating sample
* update output of converter for tool calls
* add built-ins
* handle file search
* remove extra files
* revert
* revert
---------
Co-authored-by: Ankit Singhal <30610298+singankit@users.noreply.github.com>
Co-authored-by: Sandy <16922860+thecsw@users.noreply.github.com>
Co-authored-by: Prashant Dhote <168401122+pdhotems@users.noreply.github.com>
Co-authored-by: Jose Santos <jcsantos@microsoft.com>
Co-authored-by: ghyadav <103428325+ghyadav@users.noreply.github.com>
Co-authored-by: Shiprajain01 <shiprajain01@microsoft.com>
Co-authored-by: ShipraJain01 <103409614+ShipraJain01@users.noreply.github.com>
Co-authored-by: Chandra Sekhar Gupta Aravapalli <caravapalli@microsoft.com>
Co-authored-by: Ankit Singhal <anksing@microsoft.com>
Co-authored-by: Chandra Sekhar Gupta <38103118+guptha23@users.noreply.github.com>
Co-authored-by: Azure SDK Bot <53356347+azure-sdk@users.noreply.github.com>
Co-authored-by: Wes Haggard <Wes.Haggard@microsoft.com>
Co-authored-by: spon <stevenpon@microsoft.com>
1 parent 0ae34eb commit 840ccfd
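
The commit message mentions parallelizing the calls that fetch step details per run ID and capping thread pools at 10 workers. A minimal sketch of that pattern follows, assuming a bounded `ThreadPoolExecutor`; `fetch_all` and `fake_fetch` are hypothetical names used only for illustration and are not taken from the SDK.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Same cap as the _MAX_WORKERS constant visible in the diff.
_MAX_WORKERS = 10


def fetch_all(run_ids: List[str], fetch_one: Callable[[str], dict]) -> Dict[str, dict]:
    """Call fetch_one once per run ID using a bounded thread pool (hypothetical helper)."""
    if not run_ids:
        return {}
    # Never start more threads than there are items to fetch.
    workers = min(_MAX_WORKERS, len(run_ids))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        results = list(pool.map(fetch_one, run_ids))
    return dict(zip(run_ids, results))


def fake_fetch(run_id: str) -> dict:
    """Hypothetical stand-in for a network call that returns step details."""
    return {"run_id": run_id, "steps": []}


details = fetch_all(["run_1", "run_2", "run_3"], fake_fetch)
```

Capping the pool keeps the number of simultaneous API calls predictable regardless of how many runs a thread contains.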

2 files changed: +108 -17 lines changed

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_ai_services.py

+58
@@ -30,6 +30,50 @@
 # Maximum number of workers allowed to make API calls at the same time.
 _MAX_WORKERS = 10

+# Built-in tool descriptions and parameters are hidden, but we include basic descriptions
+# for evaluation purposes
+_BUILT_IN_DESCRIPTIONS = {"code_interpreter": "Use code interpreter to read and interpret information from datasets, "
+                                              "generate code, and create graphs and charts using your data. Supports "
+                                              "up to 20 files.",
+                          "bing_grounding": "Enhance model output with web data.",
+                          "file_search": "Search for data across uploaded files.",
+                          }
+
+_BUILT_IN_PARAMS = {"code_interpreter": {"type": "object",
+                                         "properties": {
+                                             "input": {
+                                                 "type": "string",
+                                                 "description": "Generated code to be executed."
+                                             }
+                                         }
+                                         },
+                    "bing_grounding": {"type": "object",
+                                       "properties": {
+                                           "requesturl": {
+                                               "type": "string",
+                                               "description": "URL used in Bing Search API."
+                                           }
+                                       }
+                                       },
+                    "file_search": {"type": "object",
+                                    "properties": {
+                                        "ranking_options": {
+                                            "type": "object",
+                                            "properties": {
+                                                "ranker": {
+                                                    "type": "string",
+                                                    "description": "Ranking algorithm to use."
+                                                },
+                                                "score_threshold": {
+                                                    "type": "number",
+                                                    "description": "Threshold for search results."
+                                                }
+                                            },
+                                            "description": "Ranking options for search results."
+                                        }
+                                    }
+                                    },
+                    }
 class AIAgentConverter:
     """
     A converter for AI agent data.
@@ -163,6 +207,20 @@ def _extract_function_tool_definitions(thread_run: ThreadRun) -> List[ToolDefini
                         parameters=parameters,
                     )
                 )
+            else:
+                # add limited support for built-in tools. Descriptions and parameters
+                # are not published, but we'll include placeholders.
+                try:
+                    final_tools.append(
+                        ToolDefinition(
+                            name=tool.type,
+                            description=_BUILT_IN_DESCRIPTIONS[tool.type],
+                            parameters=_BUILT_IN_PARAMS[tool.type]
+                        )
+                    )
+                except:
+                    # if we run into an unknown tool, don't fail
+                    pass
         return final_tools

     @staticmethod
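
The hunk above skips unknown built-in tools by wrapping the dictionary lookups in a bare `except:`. As an illustrative variation (not the commit's code), the sketch below narrows the handler to `KeyError`, so only a missing tool type is silenced. `ToolDefinition` here is a hypothetical dataclass standing in for the SDK model, `built_in_definition` is an invented helper name, and the two tables are trimmed to a single entry.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ToolDefinition:
    """Hypothetical stand-in for the SDK's ToolDefinition model."""
    name: str
    description: str
    parameters: Dict[str, Any]


# Trimmed copies of the lookup tables from the diff, kept short for the example.
_BUILT_IN_DESCRIPTIONS = {"file_search": "Search for data across uploaded files."}
_BUILT_IN_PARAMS = {"file_search": {"type": "object", "properties": {}}}


def built_in_definition(tool_type: str) -> Optional[ToolDefinition]:
    """Return a placeholder definition for a known built-in tool, else None."""
    try:
        return ToolDefinition(
            name=tool_type,
            description=_BUILT_IN_DESCRIPTIONS[tool_type],
            parameters=_BUILT_IN_PARAMS[tool_type],
        )
    except KeyError:
        # Unknown tool type: skip it rather than fail the whole conversion.
        return None


known = built_in_definition("file_search")
unknown = built_in_definition("some_new_tool")
```

Catching only `KeyError` keeps the "don't fail on unknown tools" behavior while letting unrelated errors surface.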

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_models.py

+50 -17
@@ -169,30 +169,63 @@ def break_tool_call_into_messages(tool_call: ToolCall, run_id: str) -> List[Mess
     # all in most of the cases, and bing would only show the API URL, without arguments or results.
     # Bing grounding would have "bing_grounding" in details with "requesturl" that will just be the API path with query.
     # TODO: Work with AI Services to add converter support for BingGrounding and CodeInterpreter.
-    if not hasattr(tool_call.details, _FUNCTION):
-        return messages
-
-    # This is the internals of the content object that will be included with the tool call.
-    tool_call_id = tool_call.details.id
-    content_tool_call = {
-        "type": _TOOL_CALL,
-        _TOOL_CALL: {
-            "id": tool_call_id,
-            "type": _FUNCTION,
-            _FUNCTION: {
-                "name": tool_call.details.function.name,
-                "arguments": safe_loads(tool_call.details.function.arguments),
-            },
-        },
-    }
+    if hasattr(tool_call.details, _FUNCTION):
+        # This is the internals of the content object that will be included with the tool call.
+        tool_call_id = tool_call.details.id
+        content_tool_call = {
+            "type": _TOOL_CALL,
+            "tool_call_id": tool_call_id,
+            "name": tool_call.details.function.name,
+            "arguments": safe_loads(tool_call.details.function.arguments),
+        }
+    else:
+        # Treat built-in tools separately. Object models may be unique so handle each case separately
+        # Just converting to dicts here rather than custom serializers for simplicity for now.
+        # Don't fail if we run into a newly seen tool, just skip
+        if tool_call.details.type == "code_interpreter":
+            arguments = {"input": tool_call.details.code_interpreter.input}
+        elif tool_call.details.type == "bing_grounding":
+            arguments = {"requesturl": tool_call.details.bing_grounding.requesturl}
+        elif tool_call.details.type == "file_search":
+            options = tool_call.details.file_search.ranking_options
+            arguments = {"ranking_options": {"ranker": options.ranker, "score_threshold": options.score_threshold}}
+        else:
+            # unsupported tool type, skip
+            return messages
+        try:
+            tool_call_id = tool_call.details.id
+            content_tool_call = {
+                "type": _TOOL_CALL,
+                "tool_call_id": tool_call_id,
+                "name": tool_call.details.type,
+                "arguments": arguments,
+            }
+        except:
+            return messages

     # We format it into an assistant message, where the content is a singleton list of the content object.
     # It should be a tool message, since this is the call, but the given schema treats this message as
     # assistant's action of calling the tool.
     messages.append(AssistantMessage(run_id=run_id, content=[to_dict(content_tool_call)], createdAt=tool_call.created))

+    if hasattr(tool_call.details, _FUNCTION):
+        output = safe_loads(tool_call.details.function.output)
+    else:
+        try:
+            # Some built-ins may have output, others may not
+            # Try to retrieve it, but if we don't find anything, skip adding the message
+            # Just manually converting to dicts for easy serialization for now rather than custom serializers
+            if tool_call.details.type == "code_interpreter":
+                output = tool_call.details.code_interpreter.outputs
+            elif tool_call.details.type == "bing_grounding":
+                return messages  # not supported yet from bing grounding tool
+            elif tool_call.details.type == "file_search":
+                output = [{"file_id": result.file_id, "file_name": result.file_name, "score": result.score, "content": result.content} for result in tool_call.details.file_search.results]
+        except:
+            return messages
+
     # Now, onto the tool result, which only includes the result of the function call.
-    content_tool_call_result = {"type": _TOOL_RESULT, _TOOL_RESULT: safe_loads(tool_call.details.function.output)}
+    content_tool_call_result = {"type": _TOOL_RESULT, _TOOL_RESULT: output}

     # Since this is a tool's action of returning, we put it as a tool message.
     messages.append(
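
The hunk above flattens the tool-call content to `type`, `tool_call_id`, `name`, and `arguments`, and derives `arguments` per built-in tool type. The sketch below restates that mapping as a standalone function so it can be run without the SDK. It makes several assumptions: the value of `_TOOL_CALL` is guessed to be `"tool_call"`, the helper names `built_in_arguments` and `tool_call_content` are hypothetical, and `SimpleNamespace` objects stand in for the service's tool-call detail models.

```python
from types import SimpleNamespace
from typing import Any, Dict, Optional

_TOOL_CALL = "tool_call"  # assumed value; the diff only shows the constant's name


def built_in_arguments(details: Any) -> Optional[Dict[str, Any]]:
    """Map a built-in tool's details to an arguments dict; None if unsupported."""
    if details.type == "code_interpreter":
        return {"input": details.code_interpreter.input}
    if details.type == "bing_grounding":
        return {"requesturl": details.bing_grounding.requesturl}
    if details.type == "file_search":
        options = details.file_search.ranking_options
        return {"ranking_options": {"ranker": options.ranker,
                                    "score_threshold": options.score_threshold}}
    return None  # unsupported tool type, caller should skip it


def tool_call_content(details: Any) -> Optional[Dict[str, Any]]:
    """Build the flattened tool-call content object shown in the new schema."""
    arguments = built_in_arguments(details)
    if arguments is None:
        return None
    return {
        "type": _TOOL_CALL,
        "tool_call_id": details.id,
        "name": details.type,
        "arguments": arguments,
    }


# Stand-in for a code interpreter tool call returned by the service.
details = SimpleNamespace(
    id="call_1",
    type="code_interpreter",
    code_interpreter=SimpleNamespace(input="print(1 + 1)"),
)
content = tool_call_content(details)
skipped = tool_call_content(SimpleNamespace(id="call_2", type="unknown_tool"))
```

Returning `None` for an unsupported type mirrors the diff's early `return messages`, which leaves the message list unchanged.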
