This project builds an automated data pipeline for detecting anomalies in various NYC Taxi datasets. The pipeline is built with AWS services: AWS Glue for ETL (Extract, Transform, Load) operations and the AWS CDK (Cloud Development Kit) for infrastructure as code (IaC). It handles four datasets, yellow_tripdata, green_tripdata, fhv_tripdata, and fhvh_tripdata, each with its own schema and data processing requirements. The AWS Glue job reads the input data, processes it according to the specified dataset, runs data quality checks, and stores the processed data in S3. The job also dynamically updates the AWS Glue Data Catalog tables, making the processed data available for further analysis.
├── README.md
├── cdk
│   ├── bin
│   │   └── app.ts
│   ├── lib
│   │   └── data-anomaly-detection-stack.ts
│   ├── scripts
│   │   └── anomaly_detection_blog_data_generator_job.py
│   └── data
│       ├── yellow_tripdata_2024-05.parquet
│       ├── green_tripdata_2024-05.parquet
│       ├── fhv_tripdata_2024-05.parquet
│       └── fhvh_tripdata_2024-05.parquet
└── cdk.json
- cdk/lib/data-anomaly-detection-stack.ts: Contains the CDK code for creating AWS infrastructure, including S3 buckets, Glue jobs, Glue tables, and IAM roles.
- cdk/scripts/anomaly_detection_blog_data_generator_job.py: The Python script executed by AWS Glue to process datasets, apply data quality checks, and store the data.
- cdk/data: Contains sample datasets for yellow, green, fhv, and fhvh tripdata.
Note: The fhvh_tripdata and yellow_tripdata files are not included in the data folder because their large size causes errors when deploying or cloning the repository. Download these two datasets from the link below and place them in the data folder.
https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
- Node.js and npm installed on your machine.
- AWS CLI configured with appropriate credentials.
- AWS CDK installed globally via npm:
npm install -g aws-cdk
git clone https://github.com/Carma-tech/data-anamoly-detection
cd data-anamoly-detection
npm install
cdk bootstrap
cdk synth
cdk deploy DataAnomalyDetectionRuleStack
This will provision the required AWS infrastructure in your AWS account.
The CDK deployment will automatically upload the datasets and the Glue job scripts to the S3 bucket created during the deployment.
If you need to upload the dataset files to the S3 bucket manually, run one of the following commands:
aws s3 cp ./data s3://anomaly-detection-data-bucket/anomaly_detection_blog/data/ --recursive
Or
aws s3 sync ./data s3://anomaly-detection-data-bucket/anomaly_detection_blog/data/
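If you prefer to upload from Python instead of the AWS CLI, here is a minimal boto3 sketch, assuming the same bucket name and key prefix used in the commands above and that the Parquet files sit in cdk/data:

```python
from pathlib import Path

import boto3

s3 = boto3.client("s3")
bucket = "anomaly-detection-data-bucket"   # bucket name used in the CLI examples above
prefix = "anomaly_detection_blog/data/"    # key prefix used in the CLI examples above

# Upload every Parquet file from the local data folder to the bucket.
for path in Path("cdk/data").glob("*.parquet"):
    s3.upload_file(str(path), bucket, prefix + path.name)
    print(f"Uploaded {path.name} to s3://{bucket}/{prefix}{path.name}")
```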
You can just start the Glue job for the default dataset:
aws glue start-job-run --job-name anomaly_detection_blog_data_generator_job
OR
You can start the Glue job for a specific dataset by passing the appropriate PREFIX and TABLE_NAME arguments. For example, to run the job for the green_tripdata dataset:
aws glue start-job-run \
--job-name anomaly_detection_blog_data_generator_job \
--arguments '{"--PREFIX":"s3://anomaly-detection-data-bucket/anomaly_detection_blog/data/green_tripdata_2024-05.parquet","--TABLE_NAME":"green_tripdata","---YEAR":"2024","--MONTH":"5","--DAY":"1"}'
This command starts the Glue job, which processes the green_tripdata dataset, applies data quality checks, and stores the processed data in S3.
You can replace the --PREFIX and --TABLE_NAME values to match the dataset you want to process:
- For yellow_tripdata:
aws glue start-job-run \
--job-name anomaly_detection_blog_data_generator_job \
--arguments '{"--PREFIX":"s3://anomaly-detection-data-bucket/anomaly_detection_blog/data/yellow_tripdata_2024-05.parquet","--TABLE_NAME":"yellow_tripdata","---YEAR":"2024","--MONTH":"5","--DAY":"1"}'
- For fhvh_tripdata:
aws glue start-job-run \
--job-name anomaly_detection_blog_data_generator_job \
--arguments '{"--PREFIX":"s3://anomaly-detection-data-bucket/anomaly_detection_blog/data/fhvh_tripdata_2024-06.parquet","--TABLE_NAME":"fhvh_tripdata","---YEAR":"2024","--MONTH":"5","--DAY":"1"}'
- For fhv_tripdata:
aws glue start-job-run \
--job-name anomaly_detection_blog_data_generator_job \
--arguments '{"--PREFIX":"s3://anomaly-detection-data-bucket/anomaly_detection_blog/data/fvh_tripdata_2024-06.parquet","--TABLE_NAME":"fhv_tripdata","---YEAR":"2024","--MONTH":"5","--DAY":"1"}'
You can monitor the progress of the Glue job via the AWS Glue Console. Once the job completes successfully, the processed data will be available in the specified S3 location, and the Glue Data Catalog will be updated with the new dataset.
To check the data quality (DQ) run results in the AWS console, go to: Glue > ETL jobs > Visual ETL > anomaly_detection_blog_data_generator_job > Data Quality.
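The data quality results can also be retrieved programmatically. Here is a minimal boto3 sketch, assuming the Glue Data Quality APIs are available in your region and that at least one evaluation has already run:

```python
import boto3

glue = boto3.client("glue")

# List recent data quality results and print each rule's outcome.
results = glue.list_data_quality_results(MaxResults=5)
for summary in results["Results"]:
    detail = glue.get_data_quality_result(ResultId=summary["ResultId"])
    print(detail["ResultId"], detail.get("Score"))
    for rule in detail.get("RuleResults", []):
        print(f'  {rule["Name"]}: {rule["Result"]}')
```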
After the Glue job completes, you can query the processed data using AWS Athena, Redshift Spectrum, or any other compatible tool that supports Glue Data Catalog as a data source.
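For example, here is a minimal boto3/Athena sketch; the database name, table name, and query-result location are assumptions, so substitute the values created by your deployment:

```python
import time

import boto3

athena = boto3.client("athena")

# Run a simple count query against a processed table registered in the Glue Data Catalog.
execution = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM green_tripdata",
    QueryExecutionContext={"Database": "anomaly_detection_db"},  # assumed Glue database name
    ResultConfiguration={"OutputLocation": "s3://anomaly-detection-data-bucket/athena-results/"},  # assumed
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
else:
    print(f"Query ended in state {state}")
```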
To avoid incurring unnecessary charges, you can delete the AWS resources created by the CDK stack:
cdk destroy DataAnomalyDetectionRuleStack
https://aws.amazon.com/blogs/big-data/introducing-aws-glue-data-quality-anomaly-detection/