Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hardcoding of threshold values for data tests #1352

Open
AlexJonesNLP opened this issue Oct 23, 2024 · 3 comments
Open

Hardcoding of threshold values for data tests #1352

AlexJonesNLP opened this issue Oct 23, 2024 · 3 comments

Comments

@AlexJonesNLP
Copy link

Hello, I was wondering about setting custom thresholds for data tests and I was a bit surprised that there didn't seem to be a straightforward way to do it for many of the tests. Some threshold values appear to be hardcoded, e.g. here:
https://github.com/evidentlyai/evidently/blob/d3e21fb657118f82e5e388223347e57e313c800e/src/evidently/tests/data_integrity_tests.py#L126C75-L126C87

I'm aware that there's a way to make custom drift methods (https://docs.evidentlyai.com/user-guide/customization/options-for-statistical-tests), but it seems very bizarre to me that these thresholds are hardcoded in certain tests rather than simply being optional parameters. Am I missing something here? Why aren't these thresholds always passed as optional parameters?

@elenasamuylova
Copy link
Collaborator

elenasamuylova commented Oct 23, 2024

Hi @AlexJonesNLP, the hardcoded values are only the "sensible defaults". If you do not pass your own test condition, this default will apply and will be derived from reference.

You can set custom test conditions using parameters like gt (greater than), lt (less than), etc.

Check the docs here: https://docs.evidentlyai.com/user-guide/tests-and-reports/run-tests#id-3.-set-test-conditions

@elenasamuylova
Copy link
Collaborator

elenasamuylova commented Oct 23, 2024

Here is a usage example using parameters eq for equal, lte for less than equal, etc.:

tests = TestSuite(
    tests=[
        TestShareOfMissingValues(lte=0.05),
        TestNumberOfConstantColumns(eq=0),
        TestNumberOfEmptyRows(eq=0),
        TestNumberOfEmptyColumns(eq=0),
        TestNumberOfDuplicatedColumns(eq=0),
        generate_column_tests(
            TestColumnDrift,
            columns="all",
            parameters={"stattest_threshold": 0.3, "stattest": "psi", "is_critical": False}
        ),
        TestShareOfDriftedColumns(lte=0.3),
    ]
)

@AlexJonesNLP
Copy link
Author

@elenasamuylova Ahh I see thank you, looking at these lines I get it now:

def get_condition(self) -> TestValueCondition:
if self.condition.has_condition():
return self.condition
reference_stats = getattr(self.metric.get_result(), self.reference_field)
return self.get_condition_from_reference(reference_stats)

I'm still not quite sure why this implementation was chosen over using optional parameters though?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants