
Lack of understandable documentation for Custom Data Validation #246

Closed
sachn1 opened this issue Aug 18, 2023 · 6 comments


sachn1 commented Aug 18, 2023

URL(s) with the issue:

  1. https://www.tensorflow.org/tfx/data_validation/custom_data_validation
  2. https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/validate_statistics

Description of issue (what needs changing):

Clear description:

  1. There is no end-to-end workflow for using custom_validation_config.proto. More clarity is needed on:
    1. How is the protobuf file written?
    2. Where should the SQL statements be written?
    3. How is the custom protobuf file passed as the value of the custom_validation_config argument in tfdv.validate_statistics()? I saw in some places that the protobuf file has to be compiled to a Python module from which ValidationConfig is then imported, but I found no documentation about this in tensorflow/tfdv/tfx.
    4. https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/validate_statistics has a custom_validation_config argument but no documentation on how a custom config can be loaded and passed to it.
  2. What is the difference between using tensorflow-data-validation and tfx.ExampleValidator?

Correct links

The link to custom_validate_statistics on https://www.tensorflow.org/tfx/data_validation/custom_data_validation is broken and returns a 404 (page not found) error.

@singhniraj08

@sachn1, please refer to the examples below for a better understanding of custom data validation.

  1. Example usage of a proto to configure custom validations in TFDV: custom_validation_config.proto
  2. SQL statements are written in the CustomValidationConfig, as shown in the example
  3. You can write the validation config in text format, parse it into a CustomValidationConfig proto as shown here, and pass it to validate_statistics() as shown here
  4. This test function for validate_statistics and this test function for custom_validate_statistics will help you understand how a custom config can be loaded and passed as an argument.
  5. tensorflow-data-validation and tfx.ExampleValidator have the same objective of identifying anomalies in training and serving data. The difference is in how you use them. TFDV can be used outside a TFX pipeline to compute statistics and analyse data, whereas the ExampleValidator component brings that functionality into the TFX pipeline. In fact, ExampleValidator makes extensive use of TensorFlow Data Validation for validating your input data.

PR #241 already addresses the broken links, and I will bump this up internally to make sure the docs are up to date.
Hope this clarifies your doubts; feel free to let us know if you run into any other issues. Thank you!
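To make the text-format approach from step 3 concrete, here is a minimal, hypothetical config of the kind that can be parsed into a CustomValidationConfig with google.protobuf.text_format.Parse. The feature name, statistic, and threshold are invented for illustration (the field names match the ones used in the Python snippet later in this thread):

```proto
feature_validations {
  feature_path { step: "someFeature" }
  validations {
    sql_expression: "feature.string_stats.avg_length = 5"
    description: "someFeature must have an average length of 5"
  }
}
```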


sachn1 commented Aug 28, 2023

@singhniraj08 Thank you! I managed to get the custom validations to work with custom configurations thanks to your help. In the meantime, I also figured out how to do the same entirely through the Python API:

import tensorflow_data_validation as tfdv
# Note: the module path of the generated proto may vary between TFDV versions.
from tensorflow_data_validation.anomalies.proto import custom_validation_config_pb2
from tensorflow_metadata.proto.v0 import path_pb2

# Base validation template targeting the feature to check.
feature_validation_base = custom_validation_config_pb2.FeatureValidation()
feature_validation_base.feature_path.CopyFrom(path_pb2.Path(step=["someFeature"]))

# One SQL rule per dataset slice.
slice_validation_rules = {
    "slice1": {
        "sql_expression": "feature.string_stats.avg_length = 5",
        "description": "someFeature must have a length of 5",
    },
    "slice2": {
        "sql_expression": "feature.string_stats.avg_length = 8",
        "description": "someFeature must have a length of 8",
    },
}

custom_validation_config = custom_validation_config_pb2.CustomValidationConfig()
for slice_key, rule_info in slice_validation_rules.items():
    feature_validation = custom_validation_config_pb2.FeatureValidation()
    feature_validation.CopyFrom(feature_validation_base)
    feature_validation.dataset_name = slice_key  # e.g. "slice1" or "slice2"

    validation_rule = custom_validation_config_pb2.Validation()
    validation_rule.sql_expression = rule_info["sql_expression"]
    validation_rule.description = rule_info["description"]
    feature_validation.validations.append(validation_rule)

    custom_validation_config.feature_validations.append(feature_validation)

# sliced_stats and sliced_schema are the statistics and schema computed earlier.
sliced_anomalies = tfdv.validate_statistics(
    sliced_stats,
    sliced_schema,
    custom_validation_config=custom_validation_config,
)

Just in case anyone is interested!

I still have a few questions though!

  1. Regardless of whether the parser-based approach or the one above is used, in both cases a custom validation is applied to the default statistics (default in the sense of being generated by tfdv's stats functions). How can I create a custom statistic and apply a custom validation to it? Do I need to write a pandas/tensorflow_transform function to apply to the dataset before calling the stats generators?
  2. What is the difference between validate_statistics and custom_validate_statistics?

tensorflow-data-validation and tfx.ExampleValidator have the same objective of identifying anomalies in training and serving data. The difference is in how you use them. TFDV can be used outside a TFX pipeline to compute statistics and analyse data, whereas the ExampleValidator component brings that functionality into the TFX pipeline. In fact, ExampleValidator makes extensive use of TensorFlow Data Validation for validating your input data.

Unfortunately I haven't managed to go through this part yet, but I can certainly open another ticket if I run into problems with it in the future.

@singhniraj08

@sachn1,

#144 (comment) talks about how to update/edit the schema using the standard protocol-buffer API. It is strongly advised to review the inferred schema and refine it as needed. TFDV also provides a few utility methods to make these updates easier. A short example of updating the schema is shown in [tutorial]. Once you have updated your schema, you can apply the same process for custom validation.

validate_statistics runs standard, schema-based data validation along with custom validation; it validates the input statistics against the provided input schema.
custom_validate_statistics, by contrast, runs only custom validation; it validates the input statistics with the user-supplied SQL queries. Additionally, custom_validate_statistics allows generating a custom anomaly with the provided severity and anomaly description. In single-feature validations, the test feature is mapped to feature in the SQL query. In two-feature validations, the test feature is mapped to feature_test in the SQL query, and the base feature is mapped to feature_base.
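To illustrate that mapping, a config could pair a single-feature rule with a two-feature rule roughly as follows. This is a sketch in text format: the feature_pair_validations field and path names reflect my reading of custom_validation_config.proto and may differ between versions, and the features and thresholds are invented:

```proto
# Single-feature rule: the feature's statistics are bound to `feature`.
feature_validations {
  feature_path { step: "age" }
  validations {
    sql_expression: "feature.num_stats.min >= 0"
    description: "age must be non-negative"
  }
}

# Two-feature rule: test statistics bind to `feature_test`,
# baseline statistics bind to `feature_base`.
feature_pair_validations {
  feature_test_path { step: "age" }
  feature_base_path { step: "age" }
  validations {
    sql_expression: "feature_test.num_stats.mean < feature_base.num_stats.mean * 1.1"
    description: "mean age drifted more than 10% above the baseline"
  }
}
```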

Thank you!


sachn1 commented Aug 28, 2023

@singhniraj08

Thanks again for the quick response.

#144 (comment) talks about how to update/edit the schema using the standard protocol-buffer API. It is strongly advised to review the inferred schema and refine it as needed. TFDV also provides a few utility methods to make these updates easier. A short example of updating the schema is shown in [tutorial]. Once you have updated your schema, you can apply the same process for custom validation.

I think you misunderstood my question. I am aware of editing/updating the schema from the docs. But in order to validate something, TFDV first needs the statistics to have been generated, right? My question is: could we create custom statistics other than the ones automatically created by tfdv.generate_statistics_from_*()?

validate_statistics runs standard, schema-based data validation along with custom validation; it validates the input statistics against the provided input schema.
custom_validate_statistics, by contrast, runs only custom validation; it validates the input statistics with the user-supplied SQL queries. Additionally, custom_validate_statistics allows generating a custom anomaly with the provided severity and anomaly description. In single-feature validations, the test feature is mapped to feature in the SQL query. In two-feature validations, the test feature is mapped to feature_test in the SQL query, and the base feature is mapped to feature_base.

So basically validate_statistics is stricter (because of the schema check) and also accepts custom validations. Hmm, I'm still wondering what the use of custom_validate_statistics is, then.

@singhniraj08

@sachn1, tfdv.StatsOptions provides options for generating statistics and can be passed to tfdv.generate_statistics_from_*(), as shown in the example here. Apart from that, I don't think we have other options to customize statistics generation.

custom_validate_statistics allows a greater level of customisation and allows generating custom anomalies, whereas validate_statistics runs standard, schema-based data validation. Even I am not sure why both functions allow custom validations. Thanks.


sachn1 commented Aug 29, 2023

Thanks for clarifying. Closing the issue, as the main objective of raising it has been answered.

@sachn1 sachn1 closed this as completed Aug 29, 2023