- Description
- Build Status
- Folder Structure
- Getting Started
- Prerequisites
- Clone repository
- Set-up development environment
- How to run
- How to Unit Test
- How to check coverage
- Data Source
- Automated build setup
- Future Enhancement Plan
TaxiDataPipeLine is an application that demonstrates building a simple data pipeline from Yellow Taxi trip data.
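As a rough illustration of the kind of transformation such a pipeline performs, the sketch below computes the average trip distance from a small inline sample of Yellow Taxi-style CSV data. The helper name and sample rows are assumptions for illustration only and do not come from this repository; the column names follow the TLC trip-record schema.

```python
import csv
import io

# Hypothetical helper (not part of this repository): one pipeline step
# that aggregates the trip_distance column of Yellow Taxi-style CSV data.
def average_trip_distance(csv_text):
    """Return the mean of the trip_distance column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    distances = [float(row["trip_distance"]) for row in reader]
    return sum(distances) / len(distances)

# Tiny inline sample using TLC-style column names.
sample = (
    "VendorID,trip_distance,total_amount\n"
    "1,1.5,9.95\n"
    "2,2.5,16.30\n"
)

print(average_trip_distance(sample))  # 2.0
```

In the real pipeline the input would be the monthly CSV files listed under Data Source rather than an inline string.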
Build & Test | Status
---|---
Windows x64 |
Linux x64 |
Follow these instructions to get the source code and run it on your local machine.
You need Python 3.7.3 (official download link) to run this project.
git clone https://github.com/write2sushma/TaxiDataPipeLine.git
cd TaxiDataPipeLine
In Linux OS
python3 -m venv env
source env/bin/activate
In Windows OS
python -m venv env
env\Scripts\activate
Project dependencies are listed in the requirements.txt
file. Install them with the command below:
pip3 install -r requirements.txt
If dask fails to install from the requirements.txt
file, run the following commands in a command prompt/terminal window:
pip3 install "dask[complete]"
pip3 install dask distributed
Navigate to the TaxiDataPipeLine\taxidata
folder and run data_processor.py:
python data_processor.py
Unit tests are written using Python's unittest library. Run them with either of the commands below:
pytest
or
python -m unittest test\test_data_processor.py
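For orientation, here is a minimal unittest test case in the same style as the repository's tests. The function under test (`is_valid_trip`) is a hypothetical example invented for this sketch, not a function from this repository.

```python
import unittest

# Hypothetical validation helper, assumed for illustration only.
def is_valid_trip(distance, fare):
    """Treat a trip record as valid if both values are positive."""
    return distance > 0 and fare > 0

class TestDataProcessor(unittest.TestCase):
    def test_valid_trip(self):
        self.assertTrue(is_valid_trip(2.5, 12.0))

    def test_rejects_negative_distance(self):
        self.assertFalse(is_valid_trip(-1.0, 12.0))

if __name__ == "__main__":
    # exit=False keeps the interpreter alive after the test run.
    unittest.main(argv=["prog"], exit=False)
```

Both `pytest` and `python -m unittest` can discover and run test classes written this way.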
Run the command below to check code coverage:
python -m coverage run test\test_data_processor.py
Then view the coverage summary and generate an HTML coverage report:
coverage report
coverage html
Here is the list of data source URLs used to build the data pipeline:
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-02.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-03.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-04.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-05.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-06.csv
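The six monthly URLs above follow a single naming pattern, so a pipeline can generate them programmatically instead of hard-coding each one. The sketch below rebuilds exactly the list shown (months 01 through 06 of 2019); no new URLs are introduced.

```python
# Build the six monthly Yellow Taxi data URLs listed above.
BASE = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-{:02d}.csv"

urls = [BASE.format(month) for month in range(1, 7)]
for url in urls:
    print(url)
```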
Azure DevOps Pipelines is used to set up and configure the automated build pipeline.
- Optimize performance using the dask scheduler to enable faster parallel processing.
  - This is already implemented in the 'enhancements' feature branch.
- Scale the pipeline to data volumes that no longer fit on one machine, using multi-node clusters in the cloud (e.g. AWS).
- Set up performance monitoring.
- Automate deployment using Azure DevOps Pipelines.
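The parallel-processing enhancement above targets dask's scheduler. As a dependency-free stand-in, the sketch below fans the six monthly files out across worker threads with the standard library's concurrent.futures, the same independent-task pattern a dask scheduler would manage; `process_month` is a hypothetical work unit, not a function from this repository.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-month work unit, assumed for illustration:
# in the real pipeline this would download and aggregate one month's CSV.
def process_month(month):
    return (month, "yellow_tripdata_2019-{:02d}.csv processed".format(month))

def run_pipeline():
    # Fan the six monthly files out across workers, analogous to
    # a dask scheduler running independent tasks in parallel.
    with ThreadPoolExecutor(max_workers=3) as pool:
        return list(pool.map(process_month, range(1, 7)))

if __name__ == "__main__":
    for month, status in run_pipeline():
        print(status)
```

With dask distributed, the same fan-out would instead be submitted to a `Client`, which adds the multi-node scaling the enhancement list calls for.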