This project builds an automated data pipeline for detecting anomalies in various NYC Taxi datasets. The pipeline is built with AWS services: AWS Glue for ETL (Extract, Transform, Load) operations and the AWS CDK (Cloud Development Kit) for infrastructure as code (IaC). It handles four datasets, yellow_tripdata, green_tripdata, fhv_tripdata, and fhvh_tripdata, each with its own schema and data processing requirements. The AWS Glue job reads the input data, processes it according to the specified dataset, runs data quality checks, and stores the processed data in S3. The job also dynamically updates the AWS Glue Data Catalog tables, making the processed data available for further analysis.
├── README.md
├── cdk
│   ├── bin
│   │   └── app.ts
│   ├── lib
│   │   └── data-anomaly-detection-stack.ts
│   ├── scripts
│   │   └── anomaly_detection_blog_data_generator_job.py
│   └── data
│       ├── yellow_tripdata_2024-05.parquet
│       ├── green_tripdata_2024-05.parquet
│       ├── fhv_tripdata_2024-05.parquet
│       └── fhvh_tripdata_2024-05.parquet
└── cdk.json
- cdk/lib/data-anomaly-detection-stack.ts: Contains the CDK code for creating AWS infrastructure, including S3 buckets, Glue jobs, Glue tables, and IAM roles.
- cdk/scripts/anomaly_detection_blog_data_generator_job.py: The Python script executed by AWS Glue to process datasets, apply data quality checks, and store the data.
- cdk/data: Contains sample datasets for yellow, green, fhv, and fhvh tripdata.
Note: The fhvh_tripdata and yellow_tripdata files are not included in the data folder because their large size causes errors when deploying or cloning the repository. Download these two datasets from the link below and place them in the data folder.
https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
- Node.js and npm installed on your machine.
- AWS CLI configured with appropriate credentials.
- AWS CDK installed globally via npm:
npm install -g aws-cdk
git clone https://github.com/Carma-tech/data-anamoly-detection
cd data-anamoly-detection
npm install
cdk bootstrap
cdk synth
cdk deploy DataAnomalyDetectionRuleStack
This will provision the required AWS infrastructure in your AWS account.
The CDK deployment will automatically upload the datasets and the Glue job scripts to the S3 bucket created during the deployment.
If you need to upload the dataset files to the S3 bucket manually, run one of the following commands:
aws s3 cp ./data s3://anomaly-detection-data-bucket/anomaly_detection_blog/data/ --recursive
Or
aws s3 sync ./data s3://anomaly-detection-data-bucket/anomaly_detection_blog/data/
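If you prefer to upload from Python instead of the AWS CLI, here is a minimal boto3 sketch, assuming the same bucket name and key prefix used in the commands above and that the Parquet files sit in cdk/data:

```python
from pathlib import Path

import boto3

s3 = boto3.client("s3")
bucket = "anomaly-detection-data-bucket"   # bucket name used in the CLI examples above
prefix = "anomaly_detection_blog/data/"    # key prefix used in the CLI examples above

# Upload every Parquet file from the local data folder to the bucket.
for path in Path("cdk/data").glob("*.parquet"):
    s3.upload_file(str(path), bucket, prefix + path.name)
    print(f"Uploaded {path.name} to s3://{bucket}/{prefix}{path.name}")
```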
You can just start the Glue job for the default dataset:
aws glue start-job-run --job-name anomaly_detection_blog_data_generator_job
OR
You can start the Glue job for a specific dataset by passing the appropriate PREFIX and TABLE_NAME arguments. For example, to run the job for the green_tripdata dataset:
aws glue start-job-run \
--job-name anomaly_detection_blog_data_generator_job \
--arguments '{"--PREFIX":"s3://anomaly-detection-data-bucket/anomaly_detection_blog/data/green_tripdata_2024-05.parquet","--TABLE_NAME":"green_tripdata","---YEAR":"2024","--MONTH":"5","--DAY":"1"}'
This command starts the Glue job, which processes the green_tripdata dataset, applies data quality checks, and stores the processed data in S3.
You can replace the --PREFIX and --TABLE_NAME values to match the dataset you want to process:
- For yellow_tripdata:
aws glue start-job-run \
--job-name anomaly_detection_blog_data_generator_job \
--arguments '{"--PREFIX":"s3://anomaly-detection-data-bucket/anomaly_detection_blog/data/yellow_tripdata_2024-05.parquet","--TABLE_NAME":"yellow_tripdata","---YEAR":"2024","--MONTH":"5","--DAY":"1"}'
- For fhvh_tripdata:
aws glue start-job-run \
--job-name anomaly_detection_blog_data_generator_job \
--arguments '{"--PREFIX":"s3://anomaly-detection-data-bucket/anomaly_detection_blog/data/fhvh_tripdata_2024-06.parquet","--TABLE_NAME":"fhvh_tripdata","---YEAR":"2024","--MONTH":"5","--DAY":"1"}'
- For fhv_tripdata:
aws glue start-job-run \
--job-name anomaly_detection_blog_data_generator_job \
--arguments '{"--PREFIX":"s3://anomaly-detection-data-bucket/anomaly_detection_blog/data/fvh_tripdata_2024-06.parquet","--TABLE_NAME":"fhv_tripdata","---YEAR":"2024","--MONTH":"5","--DAY":"1"}'
You can monitor the progress of the Glue job via the AWS Glue Console. Once the job completes successfully, the processed data will be available in the specified S3 location, and the Glue Data Catalog will be updated with the new dataset.
To check the data quality (DQ) run results in the AWS console, go to: Glue > ETL jobs > Visual ETL > anomaly_detection_blog_data_generator_job > Data Quality.
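The data quality results can also be retrieved programmatically. Here is a minimal boto3 sketch, assuming the Glue Data Quality APIs are available in your region and that at least one evaluation has already run:

```python
import boto3

glue = boto3.client("glue")

# List recent data quality results and print each rule's outcome.
results = glue.list_data_quality_results(MaxResults=5)
for summary in results["Results"]:
    detail = glue.get_data_quality_result(ResultId=summary["ResultId"])
    print(detail["ResultId"], detail.get("Score"))
    for rule in detail.get("RuleResults", []):
        print(f'  {rule["Name"]}: {rule["Result"]}')
```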
After the Glue job completes, you can query the processed data using AWS Athena, Redshift Spectrum, or any other compatible tool that supports Glue Data Catalog as a data source.
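For example, here is a minimal boto3/Athena sketch; the database name, table name, and query-result location are assumptions, so substitute the values created by your deployment:

```python
import time

import boto3

athena = boto3.client("athena")

# Run a simple count query against a processed table registered in the Glue Data Catalog.
execution = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM green_tripdata",
    QueryExecutionContext={"Database": "anomaly_detection_db"},  # assumed Glue database name
    ResultConfiguration={"OutputLocation": "s3://anomaly-detection-data-bucket/athena-results/"},  # assumed
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
else:
    print(f"Query ended in state {state}")
```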
To avoid incurring unnecessary charges, you can delete the AWS resources created by the CDK stack:
cdk destroy DataAnomalyDetectionRuleStack
https://aws.amazon.com/blogs/big-data/introducing-aws-glue-data-quality-anomaly-detection/