- Description
- Build Status
- Folder Structure
- Getting Started
- Prerequisites
- Clone repository
- Set-up development environment
- How to run
- How to Unit Test
- How to check coverage
- Data Source
- Automated build setup
- Future Enhancement Plan
TaxiDataPipeLine is an application that demonstrates building a simple data pipeline from Yellow Taxi trip data.
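As a rough illustration of the kind of transformation such a pipeline performs, the sketch below computes the average trip distance from a small inline sample of Yellow Taxi-style CSV data. The helper name and sample rows are assumptions for illustration only and do not come from this repository; the column names follow the TLC trip-record schema.

```python
import csv
import io

# Hypothetical helper (not part of this repository): one pipeline step
# that aggregates the trip_distance column of Yellow Taxi-style CSV data.
def average_trip_distance(csv_text):
    """Return the mean of the trip_distance column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    distances = [float(row["trip_distance"]) for row in reader]
    return sum(distances) / len(distances)

# Tiny inline sample using TLC-style column names.
sample = (
    "VendorID,trip_distance,total_amount\n"
    "1,1.5,9.95\n"
    "2,2.5,16.30\n"
)

print(average_trip_distance(sample))  # 2.0
```

In the real pipeline the input would be the monthly CSV files listed under Data Source rather than an inline string.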
Build & Test | Status
---|---
Windows x64 |
Linux x64 |
Follow these instructions to get the source code and run it on your local machine.
You need Python 3.7.3 (official download link) to run this project.
git clone https://github.com/write2sushma/TaxiDataPipeLine.git
cd TaxiDataPipeLine
In Linux OS
python3 -m venv env
source env/bin/activate
In Windows OS
python -m venv env
env\Scripts\activate
Project dependencies are listed in the requirements.txt
file. Install them with the command below:
pip3 install -r requirements.txt
If dask fails to install from the requirements.txt
file, run the following commands in a command prompt/terminal window:
pip3 install "dask[complete]"
pip3 install dask distributed
Navigate to the TaxiDataPipeLine\taxidata
folder and run data_processor.py:
python data_processor.py
Unit tests are written using Python's unittest library. Run them with either of the commands below:
pytest
or
python -m unittest test\test_data_processor.py
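For orientation, here is a minimal unittest test case in the same style as the repository's tests. The function under test (`is_valid_trip`) is a hypothetical example invented for this sketch, not a function from this repository.

```python
import unittest

# Hypothetical validation helper, assumed for illustration only.
def is_valid_trip(distance, fare):
    """Treat a trip record as valid if both values are positive."""
    return distance > 0 and fare > 0

class TestDataProcessor(unittest.TestCase):
    def test_valid_trip(self):
        self.assertTrue(is_valid_trip(2.5, 12.0))

    def test_rejects_negative_distance(self):
        self.assertFalse(is_valid_trip(-1.0, 12.0))

if __name__ == "__main__":
    # exit=False keeps the interpreter alive after the test run.
    unittest.main(argv=["prog"], exit=False)
```

Both `pytest` and `python -m unittest` can discover and run test classes written this way.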
Run the command below to check code coverage:
python -m coverage run test\test_data_processor.py
Then view the coverage summary and generate an HTML coverage report:
coverage report
coverage html
Here is the list of data source URLs used to build the data pipeline:
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-02.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-03.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-04.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-05.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-06.csv
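The six monthly URLs above follow a single naming pattern, so a pipeline can generate them programmatically instead of hard-coding each one. The sketch below rebuilds exactly the list shown (months 01 through 06 of 2019); no new URLs are introduced.

```python
# Build the six monthly Yellow Taxi data URLs listed above.
BASE = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-{:02d}.csv"

urls = [BASE.format(month) for month in range(1, 7)]
for url in urls:
    print(url)
```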
Azure DevOps Pipelines is used to set up and configure the automated build pipeline.
- Optimize performance using the dask scheduler to enable faster parallel processing.
  - This is already implemented in the 'enhancements' feature branch.
- Scale the pipeline to data volumes that no longer fit on one machine, using multi-node clusters in the cloud (e.g. AWS).
- Set up performance monitoring.
- Automate deployment using Azure DevOps Pipelines.
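The parallel-processing enhancement above targets dask's scheduler. As a dependency-free stand-in, the sketch below fans the six monthly files out across worker threads with the standard library's concurrent.futures, the same independent-task pattern a dask scheduler would manage; `process_month` is a hypothetical work unit, not a function from this repository.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-month work unit, assumed for illustration:
# in the real pipeline this would download and aggregate one month's CSV.
def process_month(month):
    return (month, "yellow_tripdata_2019-{:02d}.csv processed".format(month))

def run_pipeline():
    # Fan the six monthly files out across workers, analogous to
    # a dask scheduler running independent tasks in parallel.
    with ThreadPoolExecutor(max_workers=3) as pool:
        return list(pool.map(process_month, range(1, 7)))

if __name__ == "__main__":
    for month, status in run_pipeline():
        print(status)
```

With dask distributed, the same fan-out would instead be submitted to a `Client`, which adds the multi-node scaling the enhancement list calls for.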