Data Validation is a package developed by Members of Scientific Staff's MetaDQ team. It is intended to be used by developers to check data quality. The package requires information about the data, such as attribute names (column names), datatypes and datatype limits, collectively called metadata (an illustrative example follows the list below). Based on this information, it runs generic and datatype-specific validations and generates a Data Quality Report at the most granular level. This report can be utilized in multiple ways, for example:
- Data owners and data stewards can fix quality issues
- Developers can run analytics and create dashboards for decision-making
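For orientation, the snippet below shows one way such metadata could be represented in Python. The field names and layout are illustrative assumptions only, not the package's actual metadata schema.

```python
# Hypothetical metadata for a two-column table. The kinds of information shown
# (attribute names, datatypes, datatype limits, nullability) follow the description
# above, but the exact field names and layout are assumptions for illustration only.
metadata = [
    {"column_name": "customer_id", "datatype": "integer", "nullable": False, "min": 1},
    {"column_name": "customer_name", "datatype": "varchar", "nullable": True, "max_length": 255},
]
```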
- Generic Rulebook - This module performs an initial examination of the data to check for completeness (nulls, missing values, mandatory fields), uniqueness (data duplication), accuracy (attribute/column names in data vs. metadata) and validity (range of values, unique values).
- Datatype Specific Rulebook - This module checks data for conformity with the standard definitions (datatype, size, format, etc.) specified in the metadata. Supported datatype validations: numeric, integer, short, long, float, double, string, varchar.
- Data Quality Report - This module combines the results from the generic and datatype-specific validations, generates a Data Quality Report, and saves the report to S3. The report contains the following columns (a minimal sketch showing how such report rows could be produced follows the column list):
- DQ_REPORT_ID - Unique identifier of report
- AWS_ACCOUNT_NAME - AWS account that contains data to be validated
- S3_BUCKET - Bucket that contains data to be validated
- TABLE_NAME - Table that is being validated
- COLUMN_NAME - Column from the table that failed the validation check
- VALIDATION_CATEGORY - Category of data validation - Generic, Datatype Specific, Sensitive Data
- VALIDATION_ID - Unique ID assigned to each validation check
- VALIDATION_MESSAGE - Validation error message
- PRIMARY_KEY_COLUMN - Column that is the dataset's Primary Key or unique identifier. If the table has no Primary Key, the module assigns ROW_ID as the table's Primary Key
- PRIMARY_KEY_VALUE - Primary Key value of the record that failed the validation check
- TIMESTAMP - Timestamp of when data validation was performed
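To make the flow concrete, here is a minimal, self-contained sketch of how a few generic and datatype-specific checks could be applied to a pandas DataFrame and turned into report rows with the columns listed above. The function names, validation IDs, messages and metadata layout are illustrative assumptions, not the package's actual API; the account, bucket and table names are placeholders.

```python
import uuid
from datetime import datetime, timezone

import pandas as pd

# Report identifier shared by every row produced in one run.
REPORT_ID = str(uuid.uuid4())

# Illustrative metadata, using the same assumed layout as the earlier sketch.
METADATA = [
    {"column_name": "customer_id", "datatype": "integer", "nullable": False},
    {"column_name": "customer_name", "datatype": "varchar", "nullable": True, "max_length": 255},
]


def report_row(column, category, validation_id, message, pk_column, pk_value):
    """Assemble one Data Quality Report row with the columns listed above."""
    return {
        "DQ_REPORT_ID": REPORT_ID,
        "AWS_ACCOUNT_NAME": "example-account",   # placeholder
        "S3_BUCKET": "example-bucket",           # placeholder
        "TABLE_NAME": "customers",               # placeholder
        "COLUMN_NAME": column,
        "VALIDATION_CATEGORY": category,
        "VALIDATION_ID": validation_id,
        "VALIDATION_MESSAGE": message,
        "PRIMARY_KEY_COLUMN": pk_column,
        "PRIMARY_KEY_VALUE": pk_value,
        "TIMESTAMP": datetime.now(timezone.utc).isoformat(),
    }


def validate(df: pd.DataFrame, metadata: list) -> pd.DataFrame:
    """Run a handful of generic and datatype-specific checks and return report rows."""
    rows = []
    df = df.copy()
    # No primary key is declared in this sketch, so assign ROW_ID as described above.
    pk_column = "ROW_ID"
    df[pk_column] = range(len(df))

    # Generic / accuracy: every column named in the metadata must exist in the data.
    for missing in {m["column_name"] for m in metadata} - set(df.columns):
        rows.append(report_row(missing, "Generic", "GEN_ACCURACY_01",
                               "Column present in metadata but missing from data",
                               pk_column, None))

    # Generic / uniqueness: fully duplicated records.
    for pk in df.loc[df.drop(columns=[pk_column]).duplicated(), pk_column]:
        rows.append(report_row(None, "Generic", "GEN_UNIQUENESS_01",
                               "Duplicate record", pk_column, pk))

    for meta in metadata:
        col = meta["column_name"]
        if col not in df.columns:
            continue

        # Generic / completeness: nulls in a mandatory (non-nullable) field.
        if not meta.get("nullable", True):
            for pk in df.loc[df[col].isna(), pk_column]:
                rows.append(report_row(col, "Generic", "GEN_COMPLETENESS_01",
                                       "Null value in mandatory field", pk_column, pk))

        # Datatype specific / conformity: values must match the declared datatype.
        if meta["datatype"] == "integer":
            values = df[col].dropna()
            numeric = pd.to_numeric(values, errors="coerce")
            bad = values.index[numeric.isna() | (numeric % 1 != 0)]
            for pk in df.loc[bad, pk_column]:
                rows.append(report_row(col, "Datatype Specific", "DT_INTEGER_01",
                                       "Value does not conform to integer datatype",
                                       pk_column, pk))
        elif meta["datatype"] == "varchar" and "max_length" in meta:
            too_long = df[col].dropna().astype(str).str.len() > meta["max_length"]
            for pk in df.loc[too_long[too_long].index, pk_column]:
                rows.append(report_row(col, "Datatype Specific", "DT_VARCHAR_01",
                                       "Value exceeds declared varchar size",
                                       pk_column, pk))

    return pd.DataFrame(rows)


if __name__ == "__main__":
    sample = pd.DataFrame({"customer_id": [1, 2, 2, "x"],
                           "customer_name": [None, "A", "A", "B"]})
    print(validate(sample, METADATA))
```

The resulting DataFrame can be written out with `to_csv`; the actual package runs many more checks and, as described below, writes the combined report to S3.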
The package takes the following inputs:
- S3 object filepath of the data to be validated
- S3 object filepath of the corresponding metadata
- AWS Account ID of the vendor whose data is being validated
- Name of the bucket where the data and metadata are stored

The package produces the following output:
- Data Quality Report in CSV format, saved in the S3 bucket
- Upload the validation folder into Amazon SageMaker from the AWS account that stores the data and metadata.
- Install the package dependencies by running the following command in the SageMaker Terminal:
pip install -r requirements.txt
- Use the package to run data quality checks on your data by running the following command in the SageMaker Terminal. Replace data_filepath, metadata_filepath, account_id and bucket_name with the corresponding S3 object filepaths, the vendor's AWS account ID and the bucket that stores the data. When using an S3 URL in data_filepath and metadata_filepath, add an 'a' after 's3' in the URL, for example: 's3a://<bucket_name>/<filepath_in_bucket>'. An end-to-end example with placeholder values appears after this list.
python run.py <data_filepath> <metadata_filepath> <account_id> <bucket_name>
- Quality Report will be saved in the `qualityreport` folder at the root level of `<bucket_name>`.
- Logs will be saved in the `logfile` file in the `dp_data_quality` folder.
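As an end-to-end illustration, an invocation might look like the following; the bucket name, file paths, file extensions and account ID are all placeholders, not values taken from this package:

python run.py s3a://example-bucket/vendor/data.csv s3a://example-bucket/vendor/metadata.csv 123456789012 example-bucket

Once the run completes, the CSV report in the `qualityreport` folder can be inspected programmatically. The sketch below is only an assumption about how you might load it: the report filename is not specified here, so the path is a placeholder, and reading s3:// paths with pandas requires the s3fs package to be installed.

```python
import pandas as pd

# Placeholder path: substitute the actual bucket name and the report object
# written to the qualityreport folder.
report = pd.read_csv("s3://example-bucket/qualityreport/<report_file>.csv")

# Show which columns failed which checks.
print(report[["TABLE_NAME", "COLUMN_NAME", "VALIDATION_ID", "VALIDATION_MESSAGE"]].head())
```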