DataWagon

Automated loading of YouTube Analytics files into a PostgreSQL database

Background

This project was built to replace an existing process that used a bash script to load uncompressed files into a Postgres instance. The original script could not handle compressed files and required manually moving files in and out of an import directory. It had no mechanism to check for duplicate files, and no way to check whether the database was up to date with the files in the import directory. Adding a new type of file to the database also required modifying the script and creating tables by hand.

Goals

Leave the existing process relatively intact: the user still downloads the CSV files and places them in a directory
Use the existing file storage and remove the need to move files in and out of an import directory
Allow for compressed files
Check for duplicate files
Prevent files from being imported more than once
Provide user feedback on the status of the import process
Use proper data types for each column (i.e., don't use floats for revenue tracked in fractions of a cent)
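
The last goal can be illustrated with a small Python sketch. It is not taken from the DataWagon code; the sample values are made up and only show why binary floats are a poor fit for summing fractional-cent amounts, while an exact decimal type (which maps naturally to a Postgres NUMERIC column) is not.

```python
from decimal import Decimal

# Hypothetical revenue values, as they might appear as text in a CSV cell.
rows = ["0.1", "0.2"]

# Summing as binary floats picks up a representation error.
float_total = sum(float(value) for value in rows)

# Summing as Decimal keeps the exact decimal value.
decimal_total = sum(Decimal(value) for value in rows)

print(float_total == 0.3)               # False
print(decimal_total == Decimal("0.3"))  # True
```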

Pipeline Setup

Install Application

Open Terminal.app

Retrieve code from GitHub

git clone https://github.com/jtmcn/datawagon.git

Move into folder

cd datawagon

Setup environment

./update.sh

Update Environment Variables

Three runtime variables are required. They may be passed in as parameters or set as environment variables. An easy way to manage them is to put them in a .env file at the top of the application folder (next to this readme).

Variables:

DW_POSTGRES_DB_URL
DW_DB_SCHEMA
DW_CSV_SOURCE_DIR
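
As a rough illustration of how these three variables might be validated before an import starts, here is a minimal Python sketch. The function name read_settings and the placeholder values are hypothetical and are not part of DataWagon; only the three variable names come from this readme.

```python
import os

# The three variable names listed in this readme.
REQUIRED_VARS = ("DW_POSTGRES_DB_URL", "DW_DB_SCHEMA", "DW_CSV_SOURCE_DIR")


def read_settings(env=None):
    """Return the required settings as a dict, or raise if any are missing.

    `env` defaults to the process environment; a plain dict can be passed
    instead, which is convenient for testing.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


# Placeholder values for illustration only.
example = read_settings(
    {
        "DW_POSTGRES_DB_URL": "postgresql://user:password@localhost:5432/example",
        "DW_DB_SCHEMA": "youtube",
        "DW_CSV_SOURCE_DIR": "/path/to/csv/files",
    }
)
print(sorted(example))
```

Failing early with a list of every missing name is usually friendlier than failing on the first one, since all three values tend to be set together.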

Typical Usage

When new files have been added to the source folder and should be copied to the database, use the following example as a guide:

cd ~/Code/datawagon
./update.sh # this is optional
source .venv/bin/activate
datawagon import

This will check for code updates, activate the Python environment, and begin the import process.

When the datawagon import command is executed, it will:

check for a database connection
prompt to create schema from DW_DB_SCHEMA if necessary
check files in DW_CSV_SOURCE_DIR for duplicates and invalid names
pull a list of files already uploaded from the database
present user with comparison table for existing and new files
prompt and begin upload on confirmation
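
The duplicate check and the comparison between source files and already-imported files can be sketched in Python as follows. This is an illustration, not DataWagon's actual logic: the function name plan_import is hypothetical, and it assumes that a "duplicate" means the same file name appearing more than once under the source directory.

```python
from collections import Counter
from pathlib import Path

# Extensions listed under Usage Notes in this readme.
SUPPORTED_SUFFIXES = (".csv", ".csv.gz", ".csv.zip")


def plan_import(source_files, already_imported):
    """Split source files into duplicated names and names not yet imported.

    Assumption: two files are duplicates when they share a file name,
    regardless of which subfolder they sit in.
    """
    names = [
        Path(path).name
        for path in source_files
        if str(path).endswith(SUPPORTED_SUFFIXES)
    ]
    counts = Counter(names)
    duplicates = sorted(name for name, count in counts.items() if count > 1)
    new_files = sorted(set(names) - set(already_imported) - set(duplicates))
    return duplicates, new_files


# Made-up file names for illustration.
source = [
    "2023/report_1.csv",
    "2024/report_1.csv",
    "2024/report_2.csv.gz",
    "2024/report_3.csv.zip",
    "2024/notes.txt",
]
duplicates, new_files = plan_import(source, already_imported={"report_2.csv.gz"})
print(duplicates)  # ['report_1.csv']
print(new_files)   # ['report_3.csv.zip']
```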

Usage Notes

Supported file extensions: .csv, .csv.gz, .csv.zip
Columns added to each table begin with an underscore, e.g., _content_owner
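
For readers curious how the three supported extensions can be read uniformly, here is a minimal sketch using only the Python standard library. It does not reflect DataWagon's own reader; the function name read_rows is hypothetical, and it assumes UTF-8 text and a single CSV member per .zip archive.

```python
import csv
import gzip
import io
import zipfile


def read_rows(path):
    """Return all rows of a .csv, .csv.gz, or .csv.zip file as lists of strings."""
    path = str(path)
    if path.endswith(".csv.gz"):
        # gzip.open in text mode decompresses and decodes in one step.
        with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
            return list(csv.reader(handle))
    if path.endswith(".csv.zip"):
        with zipfile.ZipFile(path) as archive:
            member = archive.namelist()[0]  # assumption: one CSV per archive
            with archive.open(member) as raw:
                # ZipFile.open yields bytes, so wrap it to get text lines.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                return list(csv.reader(text))
    if path.endswith(".csv"):
        with open(path, encoding="utf-8", newline="") as handle:
            return list(csv.reader(handle))
    raise ValueError(f"Unsupported file extension: {path}")
```

Checking the compressed suffixes before the plain .csv suffix keeps the branches unambiguous, and anything else is rejected rather than guessed at.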
