Data Validation is a package developed by Members of Scientific Staff's MetaDQ team. It is intended to be used by developers to check data quality. The package requires information about the data, such as attribute names (column names), datatypes and datatype limits, collectively called metadata (an illustrative example follows the list below). Based on this information, it runs generic and datatype-specific validations and generates a Data Quality Report at the most granular level. This report can be utilized in multiple ways, for example:
- Data owners and data stewards can fix quality issues
- Developers can run analytics and create dashboards for decision-making
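For orientation, the snippet below shows one way such metadata could be represented in Python. The field names and layout are illustrative assumptions only, not the package's actual metadata schema.

```python
# Hypothetical metadata for a two-column table. The kinds of information shown
# (attribute names, datatypes, datatype limits, nullability) follow the description
# above, but the exact field names and layout are assumptions for illustration only.
metadata = [
    {"column_name": "customer_id", "datatype": "integer", "nullable": False, "min": 1},
    {"column_name": "customer_name", "datatype": "varchar", "nullable": True, "max_length": 255},
]
```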
- Generic Rulebook - This module performs an initial examination of the data to check for completeness (nulls, missing values, mandatory fields), uniqueness (data duplication), accuracy (attribute/column names in data vs. metadata) and validity (range of values, unique values).
- Datatype Specific Rulebook - This module checks data for conformity with the standard definitions (datatype, size, format, etc.) specified in the metadata. Supported datatype validations: numeric, integer, short, long, float, double, string, varchar.
- Data Quality Report - This module combines the results from the generic and datatype-specific validations, generates a Data Quality Report, and saves the report to S3. The report contains the following columns (a minimal sketch showing how such report rows could be produced follows the column list):
- DQ_REPORT_ID - Unique identifier of report
- AWS_ACCOUNT_NAME - AWS account that contains data to be validated
- S3_BUCKET - Bucket that contains data to be validated
- TABLE_NAME - Table that is being validated
- COLUMN_NAME - Column from the table that failed the validation check
- VALIDATION_CATEGORY - Category of data validation - Generic, Datatype Specific, Sensitive Data
- VALIDATION_ID - Unique ID assigned to each validation check
- VALIDATION_MESSAGE - Validation error message
- PRIMARY_KEY_COLUMN - Column that is the dataset's Primary Key or unique identifier. If the table has no Primary Key, the module assigns ROW_ID as the table's Primary Key
- PRIMARY_KEY_VALUE - Primary Key value of the record that failed the validation check
- TIMESTAMP - Timestamp of when data validation was performed
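To make the flow concrete, here is a minimal, self-contained sketch of how a few generic and datatype-specific checks could be applied to a pandas DataFrame and turned into report rows with the columns listed above. The function names, validation IDs, messages and metadata layout are illustrative assumptions, not the package's actual API; the account, bucket and table names are placeholders.

```python
import uuid
from datetime import datetime, timezone

import pandas as pd

# Report identifier shared by every row produced in one run.
REPORT_ID = str(uuid.uuid4())

# Illustrative metadata, using the same assumed layout as the earlier sketch.
METADATA = [
    {"column_name": "customer_id", "datatype": "integer", "nullable": False},
    {"column_name": "customer_name", "datatype": "varchar", "nullable": True, "max_length": 255},
]


def report_row(column, category, validation_id, message, pk_column, pk_value):
    """Assemble one Data Quality Report row with the columns listed above."""
    return {
        "DQ_REPORT_ID": REPORT_ID,
        "AWS_ACCOUNT_NAME": "example-account",   # placeholder
        "S3_BUCKET": "example-bucket",           # placeholder
        "TABLE_NAME": "customers",               # placeholder
        "COLUMN_NAME": column,
        "VALIDATION_CATEGORY": category,
        "VALIDATION_ID": validation_id,
        "VALIDATION_MESSAGE": message,
        "PRIMARY_KEY_COLUMN": pk_column,
        "PRIMARY_KEY_VALUE": pk_value,
        "TIMESTAMP": datetime.now(timezone.utc).isoformat(),
    }


def validate(df: pd.DataFrame, metadata: list) -> pd.DataFrame:
    """Run a handful of generic and datatype-specific checks and return report rows."""
    rows = []
    df = df.copy()
    # No primary key is declared in this sketch, so assign ROW_ID as described above.
    pk_column = "ROW_ID"
    df[pk_column] = range(len(df))

    # Generic / accuracy: every column named in the metadata must exist in the data.
    for missing in {m["column_name"] for m in metadata} - set(df.columns):
        rows.append(report_row(missing, "Generic", "GEN_ACCURACY_01",
                               "Column present in metadata but missing from data",
                               pk_column, None))

    # Generic / uniqueness: fully duplicated records.
    for pk in df.loc[df.drop(columns=[pk_column]).duplicated(), pk_column]:
        rows.append(report_row(None, "Generic", "GEN_UNIQUENESS_01",
                               "Duplicate record", pk_column, pk))

    for meta in metadata:
        col = meta["column_name"]
        if col not in df.columns:
            continue

        # Generic / completeness: nulls in a mandatory (non-nullable) field.
        if not meta.get("nullable", True):
            for pk in df.loc[df[col].isna(), pk_column]:
                rows.append(report_row(col, "Generic", "GEN_COMPLETENESS_01",
                                       "Null value in mandatory field", pk_column, pk))

        # Datatype specific / conformity: values must match the declared datatype.
        if meta["datatype"] == "integer":
            values = df[col].dropna()
            numeric = pd.to_numeric(values, errors="coerce")
            bad = values.index[numeric.isna() | (numeric % 1 != 0)]
            for pk in df.loc[bad, pk_column]:
                rows.append(report_row(col, "Datatype Specific", "DT_INTEGER_01",
                                       "Value does not conform to integer datatype",
                                       pk_column, pk))
        elif meta["datatype"] == "varchar" and "max_length" in meta:
            too_long = df[col].dropna().astype(str).str.len() > meta["max_length"]
            for pk in df.loc[too_long[too_long].index, pk_column]:
                rows.append(report_row(col, "Datatype Specific", "DT_VARCHAR_01",
                                       "Value exceeds declared varchar size",
                                       pk_column, pk))

    return pd.DataFrame(rows)


if __name__ == "__main__":
    sample = pd.DataFrame({"customer_id": [1, 2, 2, "x"],
                           "customer_name": [None, "A", "A", "B"]})
    print(validate(sample, METADATA))
```

The resulting DataFrame can be written out with `to_csv`; the actual package runs many more checks and, as described below, writes the combined report to S3.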
The package takes the following inputs:
- S3 object filepath of the data to be validated
- S3 object filepath of the corresponding metadata
- AWS Account ID of the vendor whose data is being validated
- Name of the bucket where the data and metadata are stored

The package produces the following output:
- Data Quality Report in CSV format, saved in the S3 bucket
- Upload the validation folder into Amazon SageMaker from the AWS account that stores the data and metadata.
- Install the package dependencies by running the following command in the SageMaker Terminal:
pip install -r requirements.txt
- Use the package to run data quality checks on your data by running the following command in the SageMaker Terminal. Replace data_filepath, metadata_filepath, account_id and bucket_name with the corresponding S3 object filepaths, the vendor's AWS account ID and the bucket that stores the data. When using an S3 URL in data_filepath and metadata_filepath, add an 'a' after 's3' in the URL, for example: 's3a://<bucket_name>/<filepath_in_bucket>'. An end-to-end example with placeholder values appears after this list.
python run.py <data_filepath> <metadata_filepath> <account_id> <bucket_name>
- Quality Report will be saved in the `qualityreport` folder at the root level of `<bucket_name>`.
- Logs will be saved in the `logfile` file in the `dp_data_quality` folder.
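As an end-to-end illustration, an invocation might look like the following; the bucket name, file paths, file extensions and account ID are all placeholders, not values taken from this package:

python run.py s3a://example-bucket/vendor/data.csv s3a://example-bucket/vendor/metadata.csv 123456789012 example-bucket

Once the run completes, the CSV report in the `qualityreport` folder can be inspected programmatically. The sketch below is only an assumption about how you might load it: the report filename is not specified here, so the path is a placeholder, and reading s3:// paths with pandas requires the s3fs package to be installed.

```python
import pandas as pd

# Placeholder path: substitute the actual bucket name and the report object
# written to the qualityreport folder.
report = pd.read_csv("s3://example-bucket/qualityreport/<report_file>.csv")

# Show which columns failed which checks.
print(report[["TABLE_NAME", "COLUMN_NAME", "VALIDATION_ID", "VALIDATION_MESSAGE"]].head())
```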