Update binarization to be individual params #40168

Merged 127 commits on Mar 24, 2025

Commits
7de6367
Update task_query_response.prompty
nagkumar91 Oct 1, 2024
f288b34
Update task_simulate.prompty
nagkumar91 Oct 1, 2024
2a4b6f7
Update task_query_response.prompty
nagkumar91 Oct 2, 2024
c8ce251
Update task_simulate.prompty
nagkumar91 Oct 2, 2024
4522ae4
Merge branch 'Azure:main' into main
nagkumar91 Oct 3, 2024
32e9c1d
Merge branch 'Azure:main' into main
nagkumar91 Oct 7, 2024
76df69d
Merge branch 'Azure:main' into main
nagkumar91 Oct 8, 2024
aeddcb4
Merge branch 'Azure:main' into main
nagkumar91 Oct 8, 2024
65a759c
Merge branch 'Azure:main' into main
nagkumar91 Oct 9, 2024
e4cdd30
Fix the api_key needed
Oct 9, 2024
e3ab026
Merge branch 'Azure:main' into main
nagkumar91 Oct 11, 2024
4fb09c4
Merge branch 'Azure:main' into main
nagkumar91 Oct 15, 2024
e71a52d
Merge branch 'Azure:main' into main
nagkumar91 Oct 15, 2024
87166b3
Merge branch 'Azure:main' into main
nagkumar91 Oct 16, 2024
b478651
Update for release
nagkumar91 Oct 16, 2024
8e5a264
Black fix for file
nagkumar91 Oct 16, 2024
2077d6d
Merge branch 'Azure:main' into main
nagkumar91 Oct 17, 2024
3ab59c8
Merge branch 'Azure:main' into main
nagkumar91 Oct 17, 2024
3a80606
Add original text in global context
Oct 17, 2024
6768f9a
Update test
Oct 17, 2024
f7cc4bb
Update the indirect attack simulator
Oct 18, 2024
07eb466
Black suggested fixes
Oct 18, 2024
942bfd5
Update simulator prompty
Oct 18, 2024
2d4c376
Merge branch 'main' into main
nagkumar91 Oct 18, 2024
98cad97
Update adversarial scenario enum to exclude XPIA
Oct 18, 2024
d510316
Update changelog
Oct 18, 2024
742943e
Black fixes
Oct 18, 2024
12e0615
Remove duplicate import
Oct 18, 2024
de32b50
Fix the mypy error
Oct 19, 2024
4b64132
Mypy please be happy
Oct 21, 2024
1c0b4dd
Updates to non adv simulator
Oct 22, 2024
c4f9111
Merge branch 'Azure:main' into main
nagkumar91 Oct 22, 2024
6de617c
accept context from assistant messages, exclude them when using them …
Oct 23, 2024
1e5d40c
update changelog
Oct 23, 2024
93b29c7
pylint fixes
Oct 23, 2024
8e3ddc3
pylint fixes
Oct 23, 2024
31e0d29
Merge branch 'main' into main
nagkumar91 Oct 23, 2024
4ccc7c8
remove redundant quotes
Oct 23, 2024
bed5196
Fix typo
Oct 23, 2024
0fdd644
pylint fix
Oct 23, 2024
1f695cc
Update broken tests
Oct 23, 2024
3da3a94
Merge branch 'main' into main
nagkumar91 Oct 23, 2024
56c2657
Merge branch 'Azure:main' into main
nagkumar91 Oct 23, 2024
b04b3e6
Merge branch 'Azure:main' into main
nagkumar91 Oct 24, 2024
b9793ca
Merge branch 'Azure:main' into main
nagkumar91 Oct 25, 2024
92c9a6d
Include the grounding json in the manifest
Oct 25, 2024
0673cd5
Fix typo
Oct 25, 2024
7b360fc
Come on package
Oct 25, 2024
e3fd2bb
Merge branch 'Azure:main' into main
nagkumar91 Oct 28, 2024
c9f38c9
Release 1.0.0b5
Oct 28, 2024
bbb78fd
Merge branch 'main' of https://github.com/nagkumar91/azure-sdk-for-py…
Oct 28, 2024
ed7eed1
Notice from Chang
Oct 28, 2024
103f397
Merge branch 'Azure:main' into main
nagkumar91 Oct 28, 2024
3de5b66
Remove adv_conv template parameters from the outputs
Oct 28, 2024
21e3551
Merge branch 'main' of https://github.com/nagkumar91/azure-sdk-for-py…
Oct 28, 2024
78df8c9
Merge branch 'Azure:main' into main
nagkumar91 Oct 28, 2024
2b693bc
Merge branch 'Azure:main' into main
nagkumar91 Oct 29, 2024
f2e95d1
Update changelog
Oct 29, 2024
20b6d47
Merge branch 'Azure:main' into main
nagkumar91 Oct 29, 2024
a920c28
Merge branch 'main' of https://github.com/nagkumar91/azure-sdk-for-py…
Oct 29, 2024
f9ac10c
Experimental tags on adv scenarios
Oct 29, 2024
b570a51
Merge branch 'Azure:main' into main
nagkumar91 Oct 30, 2024
6c81cbb
Readme fix on breaking change
Oct 30, 2024
b48f8ab
Add the category and both user and assistant context to the response …
Oct 30, 2024
d422e05
Update changelog
Oct 30, 2024
de105db
Merge branch 'Azure:main' into main
nagkumar91 Oct 30, 2024
d9b80f7
Merge branch 'Azure:main' into main
nagkumar91 Nov 4, 2024
04823fd
Merge branch 'Azure:main' into main
nagkumar91 Nov 5, 2024
988f2ad
Merge branch 'Azure:main' into main
nagkumar91 Nov 7, 2024
fb12fdd
Rename _kwargs to _options
Nov 7, 2024
d912c52
_options as prefix
Nov 7, 2024
059e767
update troubleshooting for simulator
Nov 7, 2024
f91228f
Rename according to suggestions
Nov 7, 2024
e660918
Merge branch 'Azure:main' into main
nagkumar91 Nov 7, 2024
5ad5a26
Merge branch 'Azure:main' into main
nagkumar91 Nov 11, 2024
cde740c
Clean up readme
Nov 11, 2024
a90c788
more links
Nov 11, 2024
11cf0ba
Merge branch 'Azure:main' into main
nagkumar91 Nov 14, 2024
3050ce7
Merge branch 'Azure:main' into main
nagkumar91 Nov 18, 2024
ae461cc
Merge branch 'Azure:main' into main
nagkumar91 Nov 20, 2024
035881e
Merge branch 'Azure:main' into main
nagkumar91 Nov 22, 2024
87c871c
Merge branch 'Azure:main' into main
nagkumar91 Nov 26, 2024
a1519dd
Merge branch 'Azure:main' into main
nagkumar91 Dec 2, 2024
3ad53d5
Bugfix: zip_longest created null parameters
Dec 2, 2024
e9f3241
Updated changelog
Dec 2, 2024
79c2f0d
zip does the job
Dec 2, 2024
a0bc930
remove unused import
Dec 3, 2024
32b15eb
Merge branch 'Azure:main' into main
nagkumar91 Dec 9, 2024
19c4ea1
Merge branch 'Azure:main' into main
nagkumar91 Dec 11, 2024
95052bd
Merge branch 'Azure:main' into main
nagkumar91 Dec 12, 2024
a03abdf
Merge branch 'Azure:main' into main
nagkumar91 Dec 17, 2024
74d8553
Fix changelog merge
Dec 18, 2024
c78f768
Merge branch 'Azure:main' into main
nagkumar91 Dec 19, 2024
d37d0c3
Merge branch 'Azure:main' into main
nagkumar91 Jan 5, 2025
151f4c4
Merge branch 'Azure:main' into main
nagkumar91 Jan 7, 2025
0a417ae
Merge branch 'Azure:main' into main
nagkumar91 Jan 9, 2025
ede99b8
Remove print statements
Jan 9, 2025
a824f83
Merge branch 'Azure:main' into main
nagkumar91 Jan 13, 2025
7c8eae9
Merge branch 'Azure:main' into main
nagkumar91 Jan 15, 2025
5feeabb
Merge branch 'Azure:main' into main
nagkumar91 Jan 17, 2025
1df3839
Merge branch 'Azure:main' into main
nagkumar91 Jan 20, 2025
4616896
Merge branch 'Azure:main' into main
nagkumar91 Jan 22, 2025
ed5d87c
Merge branch 'Azure:main' into main
nagkumar91 Jan 23, 2025
66c7c5b
Merge branch 'Azure:main' into main
nagkumar91 Jan 24, 2025
4019245
Merge branch 'Azure:main' into main
nagkumar91 Jan 27, 2025
c37b6c5
Merge branch 'Azure:main' into main
nagkumar91 Jan 28, 2025
246ab9b
Merge branch 'Azure:main' into main
nagkumar91 Feb 4, 2025
4767587
Merge branch 'Azure:main' into main
nagkumar91 Feb 11, 2025
f7e6089
Merge branch 'Azure:main' into main
nagkumar91 Feb 17, 2025
5b45900
Merge branch 'Azure:main' into main
nagkumar91 Feb 19, 2025
b394fe2
Merge branch 'Azure:main' into main
nagkumar91 Feb 20, 2025
54602fe
Merge branch 'Azure:main' into main
nagkumar91 Mar 4, 2025
ff36631
Merge branch 'Azure:main' into main
nagkumar91 Mar 5, 2025
f3e1850
Merge branch 'Azure:main' into main
nagkumar91 Mar 6, 2025
16173c3
Merge branch 'Azure:main' into main
nagkumar91 Mar 6, 2025
f856210
Merge branch 'Azure:main' into main
nagkumar91 Mar 10, 2025
602a2e1
Merge branch 'Azure:main' into main
nagkumar91 Mar 11, 2025
747c0db
Merge branch 'Azure:main' into main
nagkumar91 Mar 12, 2025
7741608
Merge branch 'Azure:main' into main
nagkumar91 Mar 13, 2025
5e36ddf
Merge branch 'Azure:main' into main
nagkumar91 Mar 13, 2025
648d45b
Merge branch 'Azure:main' into main
nagkumar91 Mar 17, 2025
b37ba2a
Merge branch 'Azure:main' into main
nagkumar91 Mar 19, 2025
3782341
Merge branch 'Azure:main' into main
nagkumar91 Mar 19, 2025
35682be
Merge branch 'Azure:main' into main
nagkumar91 Mar 20, 2025
c8dd420
Merge branch 'Azure:main' into main
nagkumar91 Mar 20, 2025
d225e2c
Update the threshold to be individual parameters
Mar 20, 2025
d20e915
Remove higher is better as a public ref
Mar 24, 2025
@@ -62,6 +62,7 @@ def __init__(
self._azure_ai_project = validate_azure_ai_project(azure_ai_project)
self._credential = credential
self._threshold = threshold
self._higher_is_better = _higher_is_better

@override
def __call__( # pylint: disable=docstring-missing-param
@@ -25,8 +25,14 @@ class ContentSafetyEvaluator(MultiEvaluatorBase[Union[str, float]]):
:param azure_ai_project: The scope of the Azure AI project.
It contains subscription id, resource group, and project name.
:type azure_ai_project: ~azure.ai.evaluation.AzureAIProject
:param threshold: The threshold for the content safety evaluator. Default is 3.
:type threshold: int
:param violence_threshold: The threshold for violence evaluation. Default is 3.
:type violence_threshold: int
:param sexual_threshold: The threshold for sexual content evaluation. Default is 3.
:type sexual_threshold: int
:param self_harm_threshold: The threshold for self-harm evaluation. Default is 3.
:type self_harm_threshold: int
:param hate_unfairness_threshold: The threshold for hate/unfairness evaluation. Default is 3.
:type hate_unfairness_threshold: int
:param kwargs: Additional arguments to pass to the evaluator.
:type kwargs: Any
:return: A function that evaluates content-safety metrics for "question-answering" scenario.
@@ -40,7 +46,6 @@ class ContentSafetyEvaluator(MultiEvaluatorBase[Union[str, float]]):
:dedent: 8
:caption: Initialize and call a ContentSafetyEvaluator.
# todo: should threshold be a dict like QAEvaluator?
.. admonition:: Example with Threshold:
.. literalinclude:: ../samples/evaluation_samples_threshold.py
@@ -54,12 +59,31 @@ class ContentSafetyEvaluator(MultiEvaluatorBase[Union[str, float]]):
id = "content_safety"
"""Evaluator identifier, experimental and to be used only with evaluation in cloud."""

def __init__(self, credential, azure_ai_project, threshold=3, **kwargs):
def __init__(
self,
credential,
azure_ai_project,
violence_threshold: int = 3,
sexual_threshold: int = 3,
self_harm_threshold: int = 3,
hate_unfairness_threshold: int = 3,
**kwargs
):
# Type checking
for name, value in [
("violence_threshold", violence_threshold),
("sexual_threshold", sexual_threshold),
("self_harm_threshold", self_harm_threshold),
("hate_unfairness_threshold", hate_unfairness_threshold),
]:
if not isinstance(value, int):
raise TypeError(f"{name} must be an int, got {type(value)}")

evaluators = [
ViolenceEvaluator(credential, azure_ai_project, threshold=threshold),
SexualEvaluator(credential, azure_ai_project, threshold=threshold),
SelfHarmEvaluator(credential, azure_ai_project, threshold=threshold),
HateUnfairnessEvaluator(credential, azure_ai_project, threshold=threshold),
ViolenceEvaluator(credential, azure_ai_project, threshold=violence_threshold),
SexualEvaluator(credential, azure_ai_project, threshold=sexual_threshold),
SelfHarmEvaluator(credential, azure_ai_project, threshold=self_harm_threshold),
HateUnfairnessEvaluator(credential, azure_ai_project, threshold=hate_unfairness_threshold),
]
super().__init__(evaluators=evaluators, **kwargs)
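
For reference, a minimal usage sketch of the new per-category signature. The credential, subscription, resource group, and project values below are placeholders, not taken from this PR:

from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import ContentSafetyEvaluator

# Placeholder project scope; fill in real values.
azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<project-name>",
}

# Each harm category now takes its own integer threshold (all default to 3).
content_safety = ContentSafetyEvaluator(
    credential=DefaultAzureCredential(),
    azure_ai_project=azure_ai_project,
    violence_threshold=1,
    sexual_threshold=3,
    self_harm_threshold=3,
    hate_unfairness_threshold=2,
)
result = content_safety(query="What is the capital of France?", response="Paris.")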

@@ -2,7 +2,7 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# ---------------------------------------------------------

from typing import Optional, Union
from typing import Union

from typing_extensions import overload, override

@@ -23,13 +23,18 @@ class QAEvaluator(MultiEvaluatorBase[Union[str, float]]):
:param model_config: Configuration for the Azure OpenAI model.
:type model_config: Union[~azure.ai.evaluation.AzureOpenAIModelConfiguration,
~azure.ai.evaluation.OpenAIModelConfiguration]
:param threshold: Optional dictionary of thresholds for different evaluation metrics.
Keys can be "groundedness", "relevance", "coherence", "fluency", "similarity",
and "f1_score". Default values are 3 for integer metrics and 0.5 for float
metrics. If None or an empty dictionary is provided, default values will be
used for all metrics. If a partial dictionary is provided, default values
will be used for any missing keys.
:type threshold: Optional[dict]
:param groundedness_threshold: The threshold for groundedness evaluation. Default is 3.
:type groundedness_threshold: int
:param relevance_threshold: The threshold for relevance evaluation. Default is 3.
:type relevance_threshold: int
:param coherence_threshold: The threshold for coherence evaluation. Default is 3.
:type coherence_threshold: int
:param fluency_threshold: The threshold for fluency evaluation. Default is 3.
:type fluency_threshold: int
:param similarity_threshold: The threshold for similarity evaluation. Default is 3.
:type similarity_threshold: int
:param f1_score_threshold: The threshold for F1 score evaluation. Default is 0.5.
:type f1_score_threshold: float
:return: A callable class that evaluates and generates metrics for "question-answering" scenario.
:param kwargs: Additional arguments to pass to the evaluator.
:type kwargs: Any
@@ -62,31 +67,36 @@ class QAEvaluator(MultiEvaluatorBase[Union[str, float]]):
id = "qa"
"""Evaluator identifier, experimental and to be used only with evaluation in cloud."""

def __init__(self, model_config, threshold: Optional[dict] = {}, **kwargs):
default_threshold = {
"groundedness": 3,
"relevance": 3,
"coherence": 3,
"fluency": 3,
"similarity": 3,
"f1_score": 0.5,
}
if threshold is None:
threshold = {}
for key in default_threshold.keys():
if key not in threshold:
threshold[key] = default_threshold[key]
if not isinstance(threshold[key], (int, float)):
raise TypeError(
f"Threshold for {key} must be an int or float, got {type(threshold[key])}"
)
def __init__(
self,
model_config,
groundedness_threshold: int = 3,
relevance_threshold: int = 3,
coherence_threshold: int = 3,
fluency_threshold: int = 3,
similarity_threshold: int = 3,
f1_score_threshold: float = 0.5,
**kwargs
):
# Type checking
for name, value in [
("groundedness_threshold", groundedness_threshold),
("relevance_threshold", relevance_threshold),
("coherence_threshold", coherence_threshold),
("fluency_threshold", fluency_threshold),
("similarity_threshold", similarity_threshold),
("f1_score_threshold", f1_score_threshold),
]:
if not isinstance(value, (int, float)):
raise TypeError(f"{name} must be an int or float, got {type(value)}")

evaluators = [
GroundednessEvaluator(model_config, threshold=threshold["groundedness"]),
RelevanceEvaluator(model_config, threshold=threshold["relevance"]),
CoherenceEvaluator(model_config, threshold=threshold["coherence"]),
FluencyEvaluator(model_config, threshold=threshold["fluency"]),
SimilarityEvaluator(model_config, threshold=threshold["similarity"]),
F1ScoreEvaluator(threshold=threshold["f1_score"]),
GroundednessEvaluator(model_config, threshold=groundedness_threshold),
RelevanceEvaluator(model_config, threshold=relevance_threshold),
CoherenceEvaluator(model_config, threshold=coherence_threshold),
FluencyEvaluator(model_config, threshold=fluency_threshold),
SimilarityEvaluator(model_config, threshold=similarity_threshold),
F1ScoreEvaluator(threshold=f1_score_threshold),
]
super().__init__(evaluators=evaluators, **kwargs)
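
A short sketch of the fail-fast validation added above. The type-checking loop runs before any sub-evaluator is built, so the (placeholder) model_config is never touched when a threshold is rejected:

from azure.ai.evaluation import QAEvaluator

# Placeholder Azure OpenAI configuration.
model_config = {
    "azure_endpoint": "<endpoint>",
    "azure_deployment": "<deployment>",
    "api_key": "<api-key>",
}

try:
    # A non-numeric threshold now raises immediately.
    QAEvaluator(model_config=model_config, fluency_threshold="high")
except TypeError as err:
    print(err)  # fluency_threshold must be an int or float, got <class 'str'>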

@@ -54,10 +54,12 @@ class RougeScoreEvaluator(EvaluatorBase):
ROUGE scores range from 0 to 1, with higher scores indicating better quality.
:param rouge_type: The type of ROUGE score to calculate. Default is "rouge1".
:type rouge_type: str
:param threshold: The threshold value to determine if the evaluation passes or fails.
Can be either a float (applied to all metrics) or a dictionary with separate thresholds for each metric
{"precision": float, "recall": float, "f1_score": float}. Default is 0.5.
:type threshold: Union[float, dict]
:param precision_threshold: The threshold value to determine if the precision evaluation passes or fails. Default is 0.5.
:type precision_threshold: float
:param recall_threshold: The threshold value to determine if the recall evaluation passes or fails. Default is 0.5.
:type recall_threshold: float
:param f1_score_threshold: The threshold value to determine if the F1 score evaluation passes or fails. Default is 0.5.
:type f1_score_threshold: float
.. admonition:: Example:
@@ -82,24 +84,31 @@ class RougeScoreEvaluator(EvaluatorBase):
"""Evaluator identifier, experimental and to be used only with evaluation in cloud."""

@override
def __init__(self, rouge_type: RougeType, threshold: dict = {}):
def __init__(
self,
rouge_type: RougeType,
precision_threshold: float = 0.5,
recall_threshold: float = 0.5,
f1_score_threshold: float = 0.5
):
self._rouge_type = rouge_type
self._higher_is_better = True
super().__init__()
default_threshold = {
"precision": 0.5,
"recall": 0.5,
"f1_score": 0.5,

# Type checking for threshold parameters
for name, value in [
("precision_threshold", precision_threshold),
("recall_threshold", recall_threshold),
("f1_score_threshold", f1_score_threshold),
]:
if not isinstance(value, float):
raise TypeError(f"{name} must be a float, got {type(value)}")

self._threshold = {
"precision": precision_threshold,
"recall": recall_threshold,
"f1_score": f1_score_threshold,
}
if not isinstance(threshold, dict):
raise TypeError(
f"Threshold must be a dictionary, got {type(threshold)}"
)
for key in default_threshold.keys():
if key not in threshold:
threshold[key] = default_threshold[key]

self._threshold = threshold

def _get_binary_result(
self,
@@ -130,23 +139,22 @@ def _get_binary_result(
precision_valid = not math.isnan(rouge_precision)
recall_valid = not math.isnan(rouge_recall)
f1_valid = not math.isnan(rouge_f1_score)
if all(key in self._threshold for key in ["precision", "recall", "f1_score"]):
if self._higher_is_better:
if precision_valid:
results["rouge_precision_result"] = (rouge_precision >= self._threshold["precision"])
if recall_valid:
results["rouge_recall_result"] = (rouge_recall >= self._threshold["recall"])
if f1_valid:
results["rouge_f1_score_result"] = (rouge_f1_score >= self._threshold["f1_score"])
else:
if precision_valid:
results["rouge_precision_result"] = (rouge_precision <= self._threshold["precision"])
if recall_valid:
results["rouge_recall_result"] = (rouge_recall <= self._threshold["recall"])
if f1_valid:
results["rouge_f1_score_result"] = (rouge_f1_score <= self._threshold["f1_score"])

if self._higher_is_better:
if precision_valid:
results["rouge_precision_result"] = (rouge_precision >= self._threshold["precision"])
if recall_valid:
results["rouge_recall_result"] = (rouge_recall >= self._threshold["recall"])
if f1_valid:
results["rouge_f1_score_result"] = (rouge_f1_score >= self._threshold["f1_score"])
else:
raise ValueError("Threshold dictionary must contain 'precision', 'recall', and 'f1_score' keys.")
if precision_valid:
results["rouge_precision_result"] = (rouge_precision <= self._threshold["precision"])
if recall_valid:
results["rouge_recall_result"] = (rouge_recall <= self._threshold["recall"])
if f1_valid:
results["rouge_f1_score_result"] = (rouge_f1_score <= self._threshold["f1_score"])

return results

@override
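
Taken together, the new constructor and the flattened _get_binary_result yield per-metric pass/fail entries. A minimal sketch (the result keys mirror the dictionary populated above):

from azure.ai.evaluation import RougeScoreEvaluator, RougeType

rouge = RougeScoreEvaluator(
    rouge_type=RougeType.ROUGE_1,
    precision_threshold=0.6,  # must be floats, per the isinstance check above
    recall_threshold=0.4,
    f1_score_threshold=0.5,
)
scores = rouge(
    response="Paris is the capital of France.",
    ground_truth="France's capital is Paris.",
)
# scores should carry rouge_precision_result, rouge_recall_result, and
# rouge_f1_score_result alongside the raw rouge_precision / rouge_recall /
# rouge_f1_score values.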
@@ -65,18 +65,16 @@ def __init__(
credential,
azure_ai_project,
threshold: int = 5,
_higher_is_better: bool = True,
**kwargs,
):
self.threshold = threshold
self._higher_is_better = _higher_is_better
self._higher_is_better = True
self._output_prefix = "groundedness_pro"
super().__init__(
eval_metric=EvaluationMetrics.GROUNDEDNESS,
azure_ai_project=azure_ai_project,
credential=credential,
threshold=self.threshold,
_higher_is_better=self._higher_is_better,
**kwargs,
)
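
With _higher_is_better no longer part of the public signature, only the integer threshold remains configurable here. A sketch, reusing the placeholder project dict from the ContentSafetyEvaluator example above:

from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import GroundednessProEvaluator

groundedness_pro = GroundednessProEvaluator(
    credential=DefaultAzureCredential(),
    azure_ai_project=azure_ai_project,  # placeholder scope dict from earlier
    threshold=5,                        # default; results use the groundedness_pro prefix
)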

@@ -247,14 +247,15 @@ def evaluation_classes_methods_with_thresholds(self):
"azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
}

qa_eval = QAEvaluator(model_config=model_config, threshold={
"groundedness": 2,
"relevance": 2,
"coherence": 2,
"fluency": 2,
"similarity": 2,
"f1_score": 0.5,
})
qa_eval = QAEvaluator(
model_config=model_config,
groundedness_threshold=2,
relevance_threshold=2,
coherence_threshold=2,
fluency_threshold=2,
similarity_threshold=2,
f1_score_threshold=0.5
)
qa_eval(query="This's the color?", response="Black", ground_truth="gray", context="gray")
# [END threshold_qa_evaluator]

@@ -311,11 +312,9 @@ def evaluation_classes_methods_with_thresholds(self):

rouge_evaluator = RougeScoreEvaluator(
rouge_type=RougeType.ROUGE_4,
threshold={
"precision": 0.5,
"recall": 0.5,
"f1_score": 0.5,
}
precision_threshold=0.5,
recall_threshold=0.5,
f1_score_threshold=0.5
)
rouge_evaluator(response="Paris is the capital of France.", ground_truth="France's capital is Paris.")
# [END threshold_rouge_score_evaluator]
@@ -124,7 +124,7 @@ def test_f1_score_threshold(self, mock_call, threshold, score, should_pass):

@pytest.mark.unittest
class TestRougeThresholdBehavior:
"""Tests for threshold behavior in Rouge evaluators which use dictionary thresholds."""
"""Tests for threshold behavior in Rouge evaluators which use individual threshold parameters."""

def test_rouge_default_threshold(self):
"""Test that default thresholds are set correctly in Rouge evaluator."""
@@ -137,15 +137,11 @@ def test_rouge_default_threshold(self):

def test_rouge_custom_threshold(self):
"""Test that custom thresholds work correctly in Rouge evaluator."""
custom_threshold = {
"precision": 0.9,
"recall": 0.1,
"f1_score": 0.75
}

evaluator = RougeScoreEvaluator(
rouge_type=RougeType.ROUGE_L,
threshold=custom_threshold
precision_threshold=0.9,
recall_threshold=0.1,
f1_score_threshold=0.75
)

# Custom thresholds should be set
@@ -156,15 +152,11 @@ def test_rouge_custom_threshold(self):
@patch("azure.ai.evaluation._evaluators._rouge._rouge.RougeScoreEvaluator.__call__")
def test_rouge_threshold_behavior(self, mock_call):
"""Test threshold behavior with mocked Rouge scores."""
custom_threshold = {
"precision": 0.9,
"recall": 0.1,
"f1_score": 0.75
}

evaluator = RougeScoreEvaluator(
rouge_type=RougeType.ROUGE_L,
threshold=custom_threshold
precision_threshold=0.9,
recall_threshold=0.1,
f1_score_threshold=0.75
)

# Mock results with precision passing, recall failing, and f1_score passing
@@ -200,13 +192,12 @@ def test_rouge_threshold_behavior(self, mock_call):
@patch("azure.ai.evaluation._evaluators._rouge._rouge.RougeScoreEvaluator.__call__")
def test_rouge_different_types(self, mock_call, rouge_type):
"""Test that different Rouge types work correctly with thresholds."""
threshold = {
"precision": 0.5,
"recall": 0.5,
"f1_score": 0.5
}

evaluator = RougeScoreEvaluator(rouge_type=rouge_type, threshold=threshold)
evaluator = RougeScoreEvaluator(
rouge_type=rouge_type,
precision_threshold=0.5,
recall_threshold=0.5,
f1_score_threshold=0.5
)

# Mock scores that all pass the threshold
result = {