
Prp/agent evaluators #40189

Draft · wants to merge 16 commits into base: main
Conversation

singankit (Contributor)

Description

Please add an informative description that covers the changes made by the pull request and link all relevant issues.

If an SDK is being regenerated based on a new swagger spec, a link to the pull request containing these swagger spec changes has been included above.

All SDK Contribution checklist:

  • The pull request does not introduce breaking changes.
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

singankit and others added 15 commits March 14, 2025 16:56
* Tool Call Accuracy Evaluator

* Review comments

* Updating score key and output structure

* Tool Call Accuracy Evaluator

* Review comments

* Updating score key and output structure

* Updating prompt

* Renaming parameter
…#40047)

* WIP AIAgentConverter

* Added the v1 of the converter

* Updated the AIAgentConverter with different output schemas.

* ruff format

* Update the top schema to have: query, response, tool_definitions

* "agentic" is not a recognized word, change the wording.

* System message always comes first in query with multiple runs.

* Add support for getting inputs from local files with run_ids.

* Export AIAgentConverter through azure.ai.evaluation, local read updates

* Use from ._models import

* Ruff format again.

* For ComputeInstance and AmlCompute update disableLocalAuth property based on ssh_public_access (#39934)

* add disableLocalAuth for computeInstance

* fix disableLocalAuth issue for amlCompute

* update compute instance

* update recordings

* temp changes

* Revert "temp changes"

This reverts commit 64e3c38.

* update recordings

* fix tests

* Simplify the API by rolling up the static methods and hiding internals.

* Lock the ._converters._ai_services behind an import error.

* Print to install azure-ai-projects if we can't import AIAgentConverter

* By default, include all previous runs' tool calls and results.

* Don't crash if there is no content in historical thread messages.

* Parallelize the calls to get step_details for each run_id.

* Addressing PR comments.

* Use a single underscore to hide internal static members.

---------

Co-authored-by: Prashant Dhote <168401122+pdhotems@users.noreply.github.com>
…0065)

* Add intent resolution evaluator

* updated intent_resolution evaluator logic

* Remove spurious print statements

* Address reviewers feedback

* add threshold key, update result to pass/fail rather than True/False

* Add example + remove repeated fields

* Harden check_score_is_valid function
* Agentic Evaluator - Response Completeness

* Added Change Log for Response Completeness Agentic Evaluator

* Task Adherence Agentic Evaluator

* Add Task Adherence Evaluator to changelog

* fixing contracts for Completeness and Task Adherence Evaluators

* Enhancing Contract for Task Adherence and Response Completeness Agentic Evaluator

* update completeness implementation.

* update the completeness evaluator response to include threshold comparison.

* updating the implementation for completeness.

* updating the type for completeness score.

* updating the parsing logic for llm output of completeness.

* updating the response dict for completeness.

* Adding Task adherence

* Adding Task Adherence evaluator with samples

* Delete old files

* updating the exception for completeness evaluator.

* Changing docstring

* Adding changelog

* Use _result_key

* Add admonition

---------

Co-authored-by: Shiprajain01 <shiprajain01@microsoft.com>
Co-authored-by: ShipraJain01 <103409614+ShipraJain01@users.noreply.github.com>
Co-authored-by: Chandra Sekhar Gupta Aravapalli <caravapalli@microsoft.com>
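Several commits in this block add a threshold key and switch evaluator results from True/False to pass/fail. A hedged sketch of that result shape, using a hypothetical `task_adherence` key (the real key names live in the SDK's evaluators):

```python
def build_result(score: float, threshold: float = 3.0) -> dict:
    """Assemble an evaluator result dict with an explicit threshold key.

    Illustrative only: shows the pass/fail-with-threshold contract the
    commits describe, not the SDK's actual implementation.
    """
    return {
        "task_adherence": score,
        "task_adherence_threshold": threshold,
        # Human-readable outcome instead of a raw boolean.
        "task_adherence_result": "pass" if score >= threshold else "fail",
    }
```

Echoing the threshold in the output lets downstream consumers see why a score passed or failed without re-reading the evaluator's configuration.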
* Adding bug bash sample and instructions

* Updating instructions
…0139)

* add bug bash sample notebook for response completeness evaluator.

* update the notebook for completeness.

---------

Co-authored-by: Chandra Sekhar Gupta Aravapalli <caravapalli@microsoft.com>
* Add intent resolution evaluator

* updated intent_resolution evaluator logic

* Remove spurious print statements

* Address reviewers feedback

* add threshold key, update result to pass/fail rather than True/False

* Add example + remove repeated fields

* Harden check_score_is_valid function

* Sample notebook to demo intent_resolution evaluator

* Add synthetic data and section on how to test data from disk
…0146)

* For ComputeInstance and AmlCompute update disableLocalAuth property based on ssh_public_access (#39934)

* add disableLocalAuth for computeInstance

* fix disableLocalAuthAuth issue for amlCompute

* update compute instance

* update recordings

* temp changes

* Revert "temp changes"

This reverts commit 64e3c38.

* update recordings

* fix tests

* Add resource prefix for safe secret standard alerts (#40028)

Add the prefix to identify RGs that we are creating in our TME
tenant to identify them as potentially using local auth and violating
our safe secret standards.

Co-authored-by: Wes Haggard <Wes.Haggard@microsoft.com>

* Add examples to task_adherence prompt. Add Task Adherence sample notebook

* Undo changes to New-TestResources.ps1

* Add sample .env file

---------

Co-authored-by: Prashant Dhote <168401122+pdhotems@users.noreply.github.com>
Co-authored-by: Azure SDK Bot <53356347+azure-sdk@users.noreply.github.com>
Co-authored-by: Wes Haggard <Wes.Haggard@microsoft.com>
* Implemented prepare_evaluation_data

* Add support for retrieving multiple threads into the same file.

* Parallelize thread preparing across threads.

* Set the maximum number of workers in thread pools to 10.
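The commits above parallelize per-run `step_details` retrieval and cap thread pools at 10 workers. A minimal sketch of that pattern, where `fetch_step_details` is a hypothetical stand-in for the real per-run_id call:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_step_details(run_id: str) -> dict:
    # Placeholder for a network call retrieving one run's step details.
    return {"run_id": run_id, "steps": []}


def fetch_all_step_details(run_ids):
    # Cap workers at 10 so large threads don't spawn unbounded requests;
    # pool.map preserves the input ordering of run_ids.
    workers = min(10, max(1, len(run_ids)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_step_details, run_ids))
```

Since the work is I/O-bound (service calls), threads rather than processes are the idiomatic choice here.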
github-actions bot added the Evaluation (Issues related to the client library for Azure AI Evaluation) and Machine Learning labels on Mar 21, 2025
* Raising error when tool call not found

* Adding unit tests for tool call accuracy evaluator

* Updating sample
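The final commits raise an error when a referenced tool call is not found and add unit tests for the tool call accuracy evaluator. A sketch of that lookup behavior, with illustrative names rather than the evaluator's actual API:

```python
def find_tool_call(tool_calls: list, call_id: str) -> dict:
    """Return the tool call with the given id, raising if it is absent.

    Failing loudly here (instead of silently scoring) is the behavior
    the commit describes; names are hypothetical.
    """
    for call in tool_calls:
        if call.get("id") == call_id:
            return call
    raise ValueError(f"Tool call {call_id!r} not found in conversation history.")
```

A unit test for this would assert both the successful lookup and that the missing-id path raises, mirroring the "Adding unit tests" commit.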
5 participants