Custom pipeline for micro-batch data ingestion into a SQLite database, ingesting file by file and line by line with near-real-time tracking and updating of statistics.
Check the "Files" folder for the initial testing notebooks.
The modularized code lives in the "pipeline" folder and in main.py.
A micro-batch, near-real-time processing pipeline, containerized for easy replication and review in data engineering projects.
Pull the Docker image:
docker pull niconomist98/pragma-microbatch-pipeline-v2:latest
Run the pipeline with the Docker container:
docker run niconomist98/pragma-microbatch-pipeline-v2
- Run the Docker image and the pipeline starts its workflow. The pipeline includes a preprocessing step that handles missing values in the raw data before ingestion (a sketch of this step follows the list).
- Once preprocessing is complete, ingestion starts: rows are inserted line by line while the average, minimum, maximum and row count of the price column are updated without querying the final SQLite table, keeping these figures current in near real time (see the running-stats sketch below).
- Once the micro-batch ingestion is complete, the pipeline runs queries against the SQLite prices table (the final table after ingestion) to verify the reported avg, min, max and count of the price column (see the verification sketch below).
- Once the pipeline for the general datasets is complete, the process starts again for the validation dataset: preprocessing the data to handle null values, inserting rows one by one into the database while updating the stats in real time, and querying the final table to validate the results.
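A minimal sketch of the null-handling idea in the preprocessing step. The column names (price, user_id) and the drop/fill strategy are assumptions for illustration, not necessarily what the code in the "pipeline" folder does.

```python
import pandas as pd

def preprocess(csv_path: str) -> pd.DataFrame:
    """Handle missing values in a raw CSV before it is ingested."""
    df = pd.read_csv(csv_path)
    # A row without a price cannot contribute to the price stats, so drop it.
    df = df.dropna(subset=["price"])
    # Fill other missing fields with a sentinel so inserts never fail on NULLs.
    df["user_id"] = df["user_id"].fillna(-1).astype(int)
    return df
```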
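The running-stats idea can be sketched as follows: each inserted row also updates an in-memory accumulator, so avg, min, max and count are always available without scanning the table. The prices table and price column come from this README; the database file name, CSV path and extra columns are assumptions.

```python
import sqlite3
import pandas as pd

class RunningStats:
    """Tracks count, sum, min and max so avg/min/max/count never need a table scan."""
    def __init__(self):
        self.count, self.total = 0, 0.0
        self.min_, self.max_ = float("inf"), float("-inf")

    def update(self, price: float) -> None:
        self.count += 1
        self.total += price
        self.min_ = min(self.min_, price)
        self.max_ = max(self.max_, price)

    @property
    def avg(self) -> float:
        return self.total / self.count if self.count else 0.0

conn = sqlite3.connect("pipeline.db")  # assumed database file name
conn.execute("CREATE TABLE IF NOT EXISTS prices (timestamp TEXT, price REAL, user_id INTEGER)")

stats = RunningStats()
df = pd.read_csv("data/batch_1.csv")   # assumed micro-batch file name
for row in df.itertuples(index=False):
    conn.execute("INSERT INTO prices VALUES (?, ?, ?)",
                 (row.timestamp, row.price, row.user_id))
    stats.update(row.price)
    print(f"rows={stats.count} avg={stats.avg:.2f} min={stats.min_} max={stats.max_}")
conn.commit()
```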
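The verification step boils down to a single aggregate query over the final prices table, which can then be compared with the running stats (again, the database file name is an assumption):

```python
import sqlite3

conn = sqlite3.connect("pipeline.db")  # assumed database file name
count, avg, min_, max_ = conn.execute(
    "SELECT COUNT(price), AVG(price), MIN(price), MAX(price) FROM prices"
).fetchone()
print(f"count={count} avg={avg:.2f} min={min_} max={max_}")
```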
Distributed under the MIT License. See LICENSE.txt for more information.
Nicolas Restrepo Carvajal https://www.linkedin.com/in/niconomist98/