
Tool Call Accuracy Evaluator #40068

Merged

Conversation

@singankit (Contributor) commented Mar 13, 2025

Description

Adding the ToolCallAccuracyEvaluator

  • Single Tool Call Example
tool_call_accuracy_evaluator = ToolCallAccuracyEvaluator(model_config=model_config)
response = tool_call_accuracy_evaluator(
    query="How is the weather in New York?",
    response="The weather in New York is sunny.",
    tool_calls={
        "type": "tool_call",
        "tool_call": {
            "id": "call_eYtq7fMyHxDWIgeG2s26h0lJ",
            "type": "function",
            "function": {
                "name": "fetch_weather",
                "arguments": {
                    "location": "New York"
                }
            }
        }
    },
    tool_definitions={
        "id": "fetch_weather",
        "name": "fetch_weather",
        "description": "Fetches the weather information for the specified location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The location to fetch weather for."
                }
            }
        }
    }
)

Output

{
    "tool_call_accuracy": 1.0,
    "per_tool_call_details": [
        {
            "tool_call_accurate": true,
            "tool_call_accurate_reason": "The input Data should get a Score of 1 because the TOOL CALL is directly relevant to the user's query about the weather in New York, has the correct parameters, and the values are accurately extracted from the CONVERSATION.",
            "tool_call_id": "call_eYtq7fMyHxDWIgeG2s26h0lJ"
        }
    ]
}
  • Multiple Function Tool Calls Example
tool_call_accuracy_evaluator = ToolCallAccuracyEvaluator(model_config=model_config)
response = tool_call_accuracy_evaluator(
    query="How is the weather in New York?",
    tool_calls=[
        {
            "type": "tool_call",
            "tool_call": {
                "id": "call_eYtq7fMyHxDWIgeG2s26h0lJ",
                "type": "function",
                "function": {
                    "name": "fetch_weather",
                    "arguments": {
                        "location": "Seattle"
                    }
                }
            }
        },
        {
            "type": "tool_call",
            "tool_call": {
                "id": "call_eYtq7fMyHxDWIgeG2s26h0lJ",
                "type": "function",
                "function": {
                    "name": "fetch_weather",
                    "arguments": {
                        "location": "New York"
                    }
                }
            }
        }
    ],
    tool_definitions={
        "id": "fetch_weather",
        "name": "fetch_weather",
        "description": "Fetches the weather information for the specified location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The location to fetch weather for."
                }
            }
        }
    }
)

Output

{
    "tool_call_accuracy": 0.5,
    "per_tool_call_details": [
        {
            "tool_call_accurate": false,
            "tool_call_accurate_reason": "The TOOL CALL is irrelevant to the user's inquiry about the weather in New York, as it attempts to fetch weather information for Seattle instead. The parameters used do not match the user's request, and the values are not appropriate or correct. Thus, the score is 0.",
            "tool_call_id": "call_eYtq7fMyHxDWIgeG2s26h0lJ"
        },
        {
            "tool_call_accurate": true,
            "tool_call_accurate_reason": "The TOOL CALL is directly relevant to the user's inquiry about the weather in New York, has the correct parameters and values, and is likely to provide useful information. Thus, it deserves a score of 1 for being relevant and useful.",
            "tool_call_id": "call_eYtq7fMyHxDWIgeG2s26h0lJ"
        }
    ]
}
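
Note that the top-level tool_call_accuracy here is just the fraction of individual calls judged accurate: 1 of 2 calls, hence 0.5. A minimal sketch of that aggregation, assuming the per_tool_call_details shape shown above (the helper name aggregate_accuracy is illustrative, not part of the SDK):

def aggregate_accuracy(per_tool_call_details):
    # Fraction of tool calls the LLM judge marked accurate, in [0.0, 1.0].
    if not per_tool_call_details:
        return 0.0
    accurate = sum(1 for d in per_tool_call_details if d["tool_call_accurate"])
    return accurate / len(per_tool_call_details)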

If an SDK is being regenerated based on a new swagger spec, a link to the pull request containing these swagger spec changes has been included above.

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.


@singankit singankit requested a review from a team as a code owner March 13, 2025 21:43
@github-actions github-actions bot added the Evaluation Issues related to the client library for Azure AI Evaluation label Mar 13, 2025
@@ -294,7 +294,7 @@ def parse_quality_evaluator_reason_score(llm_output: str) -> Tuple[float, str]:
     reason = ""
     if llm_output:
         try:
-            score_pattern = r"<S2>\D*?([1-5]).*?</S2>"
+            score_pattern = rf"<S2>\D*?({score_range}).*?</S2>"
Contributor

Is this supposed to be a common util? What if the score is a continuous float 0-1?

Contributor Author

It can be overridden for each evaluator if needed. I made this configurable here to make it reusable where possible. All existing evaluators have scores that are whole numbers with no decimals, but that may not hold for new evaluators.

Contributor

I guess the only thing I'd recommend then is a different variable name -- "valid_score_regex" or something. But this is not a big deal, feel free to ignore.
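
To make this thread concrete, here is a minimal sketch of how the configurable score_range pattern behaves, including a float-friendly variant for the continuous 0-1 case raised above (extract_score is an illustrative stand-in for the SDK's parse_quality_evaluator_reason_score):

import re

def extract_score(llm_output: str, score_range: str = "[1-5]") -> float:
    # Pull the score out of the <S2>...</S2> tag emitted by the prompt.
    score_pattern = rf"<S2>\D*?({score_range}).*?</S2>"
    match = re.search(score_pattern, llm_output, re.DOTALL)
    return float(match.group(1)) if match else float("nan")

print(extract_score("<S1>relevant and correct</S1><S2>Score: 3</S2>"))  # 3.0
# A continuous 0-1 score would need a widened pattern, e.g.:
print(extract_score("<S2>0.75</S2>", score_range=r"[01](?:\.\d+)?"))    # 0.75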

    tool_calls.extend([content for content in message.get("content")
                       if content.get("type") == "tool_call" and content.get("tool_call").get("type") == "function"])
else:
    raise EvaluationException(
Contributor

Are we sure we want this to be an exception? As a user, I could see just wanting to call this evaluator after every agent run. I don't want to have to pre-filter on the client side, and adding try-catch also seems a bit cumbersome. Maybe instead there is a benign "no tools were called" response we could give?

Contributor Author

This check has to happen either before the call or after it, to know whether the evaluator was given something valid to evaluate. I would prefer to enforce it on the input side.
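
For comparison, the benign alternative suggested above could look roughly like this sketch; the returned keys mirror the evaluator's output, but this contract is an assumption for illustration, not what was merged:

import math

def evaluate_tool_calls(query, tool_calls):
    # Hypothetical alternative to raising EvaluationException: return a
    # clearly-marked "not applicable" result when no tool calls were made.
    if not tool_calls:
        return {
            "tool_call_accuracy": math.nan,  # NaN so downstream averages can skip it
            "per_tool_call_details": [],
            "reason": "No tool calls were made in this response.",
        }
    raise NotImplementedError("per-call judging elided in this sketch")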

# Instruction
## Goal
### You are an expert in evaluating the accuracy of a tool call, considering relevance and potential usefulness, including the syntactic and semantic correctness of a proposed tool call from an intelligent system, based on the provided definition and data. Your goal will involve answering the questions below using the information provided.
- **Definition**: You are given a definition of the communication trait that is being evaluated to help guide your Score.
Contributor

This line is confusing to me. Has it been used successfully elsewhere?

Contributor Author

Existing evaluators use it. I built this on top of the prompt from existing evaluators.

@changliu2 (Member) commented, quoting the single tool call example from the description (whose output at the time was):

{
    "tool_call_accuracy": 1.0,
    "tool_call_accuracy_reason": "The input Data should get a Score of 1 because the TOOL CALL is directly relevant to the user's inquiry about the weather in New York, uses the correct parameters, and the parameter values are accurate and contextually appropriate."
}

can you also provide an example of 2 function calls evaluated?


# Ratings
## [Tool Call Accuracy: 0] (Irrelevant)
**Definition:** The tool call is not relevant and will not help resolve the user's need, the TOOL CALL includes information that is not present in the CONVERSATION, or the TOOL CALL has parameters that are not present in the TOOL DEFINITION.
Contributor

Do you want to say that the "TOOL CALL parameters include information that is not presented in the CONVERSATION"?

Contributor Author

Yes, and also if a tool was called with a parameter that is not in the tool definition.
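
To illustrate both failure modes in one hypothetical example (the id and the extra "units" parameter are invented for illustration): the conversation only asks about New York, and the tool definition only declares a location parameter, so the call below would score 0 under this rubric.

bad_tool_call = {
    "type": "tool_call",
    "tool_call": {
        "id": "call_hypothetical_0",
        "type": "function",
        "function": {
            "name": "fetch_weather",
            "arguments": {
                "location": "Paris",   # value not present in the CONVERSATION
                "units": "imperial"    # parameter not in the TOOL DEFINITION
            }
        }
    }
}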

- **Score**: based on your previous analysis, provide your Score. The Score you give MUST be an integer score (i.e., "0", "1") based on the levels of the definitions.


## Please provide your answers between the tags: <S0>your chain of thoughts</S0>, <S1>your explanation</S1>, <S2>your Score</S2>.
Contributor

I made the same comment on Jose's... that if we want to move toward JSON mode or structured output we might need to revise this approach, right? (And also that a potential optimization is to remove the chain-of-thought piece, since the explanation can essentially play the same role, and then just have the LLM write the explanation field first in the JSON response.)

Contributor Author (@singankit, Mar 13, 2025)

While I was writing the prompt, the chain of thought was helpful for me to debug. But agreed: if we do not expose it to the customer, should we even ask the LLM to generate it? Once we move to structured output, yes, we will need to revise it.
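
A rough sketch of what the structured-output variant discussed here might parse instead of the tagged sections; the field names and the explanation-first ordering are assumptions drawn from this thread, not a settled contract:

import json

def parse_structured_judgment(llm_output: str):
    # With JSON mode, the model would write the explanation first (standing in
    # for the chain of thought) and the score after it, e.g.:
    #   {"explanation": "...", "score": 1}
    data = json.loads(llm_output)
    return float(data["score"]), data["explanation"]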

@singankit (Contributor Author) commented Mar 14, 2025

@changliu2 To help close on the keys for the output of this evaluator:

Current

{
    "tool_call_accuracy": 0.5,
    "per_tool_call_details": [
        {
            "tool_call_accurate": false,
            "tool_call_accurate_reason": "The TOOL CALL is irrelevant to the user's inquiry about the weather in New York, as it attempts to fetch weather information for Seattle instead. The parameters used do not match the user's request, and the values are not appropriate or correct. Thus, the score is 0.",
            "tool_call_id": "call_eYtq7fMyHxDWIgeG2s26h0lJ"
        },
        {
            "tool_call_accurate": true,
            "tool_call_accurate_reason": "The TOOL CALL is directly relevant to the user's inquiry about the weather in New York, has the correct parameters and values, and is likely to provide useful information. Thus, it deserves a score of 1 for being relevant and useful.",
            "tool_call_id": "call_eYtq7fMyHxDWIgeG2s26h0lJ"
        }
    ]
}

Proposed

{
    "percentage_tool_call_accurate": 0.5,
    "tool_call_correctness_label": True  # Chang to help close of this name 
    "per_tool_call_details": [
        {
            "tool_call_accurate": false,
            "tool_call_accurate_reason": "The TOOL CALL is irrelevant to the user's inquiry about the weather in New York, as it attempts to fetch weather information for Seattle instead. The parameters used do not match the user's request, and the values are not appropriate or correct. Thus, the score is 0.",
            "tool_call_id": "call_eYtq7fMyHxDWIgeG2s26h0lJ"
        },
        {
            "tool_call_accurate": true,
            "tool_call_accurate_reason": "The TOOL CALL is directly relevant to the user's inquiry about the weather in New York, has the correct parameters and values, and is likely to provide useful information. Thus, it deserves a score of 1 for being relevant and useful.",
            "tool_call_id": "call_eYtq7fMyHxDWIgeG2s26h0lJ"
        }
    ]
}

@changliu2 (Member) commented, quoting the Current/Proposed options above:

@singankit, let's stick to the current convention pattern (key + "_label"):

{
    "tool_call_accuracy": 0.5,
    "tool_call_accuracy_label": True,
    "per_tool_call_details": [
        {
            "tool_call_accurate": false,
            "tool_call_accurate_reason": "The TOOL CALL is irrelevant to the user's inquiry about the weather in New York, as it attempts to fetch weather information for Seattle instead. The parameters used do not match the user's request, and the values are not appropriate or correct. Thus, the score is 0.",
            "tool_call_id": "call_eYtq7fMyHxDWIgeG2s26h0lJ"
        },
        {
            "tool_call_accurate": true,
            "tool_call_accurate_reason": "The TOOL CALL is directly relevant to the user's inquiry about the weather in New York, has the correct parameters and values, and is likely to provide useful information. Thus, it deserves a score of 1 for being relevant and useful.",
            "tool_call_id": "call_eYtq7fMyHxDWIgeG2s26h0lJ"
        }
    ]
}
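
A minimal sketch of the agreed key + "_label" convention; the 0.8 threshold is purely illustrative, since no default threshold is stated in this thread:

def build_result(per_tool_call_details, threshold=0.8):
    # key + "_label" convention: the label is the thresholded aggregate score.
    accuracy = sum(
        d["tool_call_accurate"] for d in per_tool_call_details
    ) / len(per_tool_call_details)
    return {
        "tool_call_accuracy": accuracy,
        "tool_call_accuracy_label": accuracy >= threshold,
        "per_tool_call_details": per_tool_call_details,
    }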

@singankit singankit merged commit ce7eced into prp/agent_evaluators Mar 14, 2025
3 of 4 checks passed
@singankit singankit deleted the users/singankit/tool_call_accuracy_evaluator branch March 14, 2025 23:56
singankit added a commit that referenced this pull request Mar 24, 2025
* Tool Call Accuracy Evaluator

* Review comments

* Updating score key and output structure

* Tool Call Accuracy Evaluator

* Review comments

* Updating score key and output structure

* Updating prompt

* Renaming parameter
singankit added a commit that referenced this pull request Mar 25, 2025

singankit added a commit that referenced this pull request Mar 25, 2025

singankit added a commit that referenced this pull request Mar 25, 2025
