
Lack of understandable documentation for Custom Data Validation #246

Closed
sachn1 opened this issue Aug 18, 2023 · 6 comments


sachn1 commented Aug 18, 2023

URL(s) with the issue:

  1. https://www.tensorflow.org/tfx/data_validation/custom_data_validation
  2. https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/validate_statistics

Description of issue (what needs changing):

Clear description:

  1. There is no end-to-end workflow for using custom_validation_config.proto. More clarity is needed on:
    1. How is the protobuf file written?
    2. Where should the SQL statements be written?
    3. How is the custom protobuf file passed as the value of the custom_validation_config argument in tfdv.validate_statistics()? I saw in some places that the protobuf file has to be compiled to a Python module from which ValidationConfig is then imported, but I found no documentation about this in tensorflow/tfdv/tfx.
    4. https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/validate_statistics has a custom_validation_config argument but no documentation on how a custom config can be loaded and passed to it.
  2. What is the difference between using tensorflow-data-validation and tfx.ExampleValidator?

Correct links

The link to custom_validate_statistics on https://www.tensorflow.org/tfx/data_validation/custom_data_validation is broken and returns a 404 (page not found) error.

@singhniraj08

@sachn1, please refer to the examples below for a better understanding of custom data validation.

  1. Example usage of a proto to configure custom validations in TFDV: custom_validation_config.proto
  2. SQL statements are written in the CustomValidationConfig, as shown in the example
  3. You can write the validation config in text format, parse it into a CustomValidationConfig proto as shown here, and pass it to validate_statistics() as shown here
  4. This test function for validate_statistics and this test function for custom_validate_statistics will help you understand how a custom config can be loaded and passed as an argument.
  5. tensorflow-data-validation and tfx.ExampleValidator have the same objective of identifying anomalies in training and serving data. The difference is in how you use them. TFDV can be used outside a TFX pipeline to compute statistics and analyse data, whereas the ExampleValidator component brings that functionality into the TFX pipeline. In fact, ExampleValidator makes extensive use of TensorFlow Data Validation for validating your input data.

PR #241 already addresses the broken links, and I will bump this up internally to make sure the docs are up to date.
Hope this clarifies your doubts; feel free to let us know if you run into any other issues. Thank you!
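To make the text-format approach from step 3 concrete, here is a minimal, hypothetical config of the kind that can be parsed into a CustomValidationConfig with google.protobuf.text_format.Parse. The feature name, statistic, and threshold are invented for illustration (the field names match the ones used in the Python snippet later in this thread):

```proto
feature_validations {
  feature_path { step: "someFeature" }
  validations {
    sql_expression: "feature.string_stats.avg_length = 5"
    description: "someFeature must have an average length of 5"
  }
}
```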


sachn1 commented Aug 28, 2023

@singhniraj08 Thank you! I managed to get the custom validations to work with custom configurations thanks to your help. In the meantime, I also figured out how to do the same entirely through the Python API:

import tensorflow_data_validation as tfdv
# Note: the module path of the generated proto may vary between TFDV versions.
from tensorflow_data_validation.anomalies.proto import custom_validation_config_pb2
from tensorflow_metadata.proto.v0 import path_pb2

# Base validation template targeting the feature to check.
feature_validation_base = custom_validation_config_pb2.FeatureValidation()
feature_validation_base.feature_path.CopyFrom(path_pb2.Path(step=["someFeature"]))

# One SQL rule per dataset slice.
slice_validation_rules = {
    "slice1": {
        "sql_expression": "feature.string_stats.avg_length = 5",
        "description": "someFeature must have a length of 5",
    },
    "slice2": {
        "sql_expression": "feature.string_stats.avg_length = 8",
        "description": "someFeature must have a length of 8",
    },
}

custom_validation_config = custom_validation_config_pb2.CustomValidationConfig()
for slice_key, rule_info in slice_validation_rules.items():
    feature_validation = custom_validation_config_pb2.FeatureValidation()
    feature_validation.CopyFrom(feature_validation_base)
    feature_validation.dataset_name = slice_key  # e.g. "slice1" or "slice2"

    validation_rule = custom_validation_config_pb2.Validation()
    validation_rule.sql_expression = rule_info["sql_expression"]
    validation_rule.description = rule_info["description"]
    feature_validation.validations.append(validation_rule)

    custom_validation_config.feature_validations.append(feature_validation)

# sliced_stats and sliced_schema are the statistics and schema computed earlier.
sliced_anomalies = tfdv.validate_statistics(
    sliced_stats,
    sliced_schema,
    custom_validation_config=custom_validation_config,
)

Just in case anyone is interested!

I still have a few questions though!

  1. Regardless of whether the parser-based approach or the one above is used, in both cases a custom validation is applied to the default statistics (default in the sense of being generated by tfdv's stats functions). How can I create a custom statistic and apply a custom validation to it? Do I need to write a pandas/tensorflow_transform function to apply to the dataset before calling the stats generators?
  2. What is the difference between validate_statistics and custom_validate_statistics?

tensorflow-data-validation and tfx.ExampleValidator have the same objective of identifying anomalies in training and serving data. The difference is in how you use them. TFDV can be used outside a TFX pipeline to compute statistics and analyse data, whereas the ExampleValidator component brings that functionality into the TFX pipeline. In fact, ExampleValidator makes extensive use of TensorFlow Data Validation for validating your input data.

Unfortunately I haven't managed to go through this part yet, but I can certainly open another ticket if I run into problems with it in the future.

@singhniraj08

@sachn1,

#144 (comment) talks about how to update/edit the schema using the standard protocol-buffer API. It is strongly advised to review the inferred schema and refine it as needed. TFDV also provides a few utility methods to make these updates easier. A short example of updating the schema is shown in [tutorial]. Once you have updated your schema, you can apply the same process for custom validation.

validate_statistics runs standard, schema-based data validation along with custom validation; it validates the input statistics against the provided input schema.
custom_validate_statistics, by contrast, runs only custom validation; it validates the input statistics with the user-supplied SQL queries. Additionally, custom_validate_statistics allows generating a custom anomaly with the provided severity and anomaly description. In single-feature validations, the test feature is mapped to feature in the SQL query. In two-feature validations, the test feature is mapped to feature_test in the SQL query, and the base feature is mapped to feature_base.
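To illustrate that mapping, a config could pair a single-feature rule with a two-feature rule roughly as follows. This is a sketch in text format: the feature_pair_validations field and path names reflect my reading of custom_validation_config.proto and may differ between versions, and the features and thresholds are invented:

```proto
# Single-feature rule: the feature's statistics are bound to `feature`.
feature_validations {
  feature_path { step: "age" }
  validations {
    sql_expression: "feature.num_stats.min >= 0"
    description: "age must be non-negative"
  }
}

# Two-feature rule: test statistics bind to `feature_test`,
# baseline statistics bind to `feature_base`.
feature_pair_validations {
  feature_test_path { step: "age" }
  feature_base_path { step: "age" }
  validations {
    sql_expression: "feature_test.num_stats.mean < feature_base.num_stats.mean * 1.1"
    description: "mean age drifted more than 10% above the baseline"
  }
}
```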

Thank you!


sachn1 commented Aug 28, 2023

@singhniraj08

Thanks again for the quick response.

#144 (comment) talks about how to update/edit the schema using the standard protocol-buffer API. It is strongly advised to review the inferred schema and refine it as needed. TFDV also provides a few utility methods to make these updates easier. A short example of updating the schema is shown in [tutorial]. Once you have updated your schema, you can apply the same process for custom validation.

I think you misunderstood my question. I am aware of editing/updating the schema from the docs. But in order to validate something, TFDV first needs the statistics to have been generated, right? My question is: could we create custom statistics other than the ones automatically created by tfdv.generate_statistics_from_*()?

validate_statistics runs standard, schema-based data validation along with custom validation; it validates the input statistics against the provided input schema.
custom_validate_statistics, by contrast, runs only custom validation; it validates the input statistics with the user-supplied SQL queries. Additionally, custom_validate_statistics allows generating a custom anomaly with the provided severity and anomaly description. In single-feature validations, the test feature is mapped to feature in the SQL query. In two-feature validations, the test feature is mapped to feature_test in the SQL query, and the base feature is mapped to feature_base.

So basically validate_statistics is stricter (because of the schema check) and also accepts custom validations. Hmm, I'm still wondering what the use of custom_validate_statistics is, then.

@singhniraj08

@sachn1, tfdv.StatsOptions provides options for generating statistics and can be passed to tfdv.generate_statistics_from_*(), as shown in the example here. Apart from that, I don't think we have other options to customize statistics generation.

custom_validate_statistics allows a greater level of customisation and allows generating custom anomalies, whereas validate_statistics runs standard, schema-based data validation. Even I am not sure why both functions allow custom validations. Thanks.


sachn1 commented Aug 29, 2023

Thanks for clarifying. Closing the issue, as the main objective of raising it has been answered.

@sachn1 sachn1 closed this as completed Aug 29, 2023