data-engineering

This repository uses Poetry for package management. Once Poetry is installed, create the virtual environment with the relevant dependencies via poetry install, then execute poetry shell to activate it.

Section 1: Consulting Soft Skills

Please refer to the markdown file soft_skills.md.

Section 2: Database & Python ETL

The relevant files for this section are located in the db_python_etl directory. I chose the Pandas library over PySpark for its first-class Excel support, and over Dask or the pandas API on Spark because the dataset has only 1,000 records, so parallelization would offer no meaningful benefit.

The ETL script executes the following steps:

Connects to the MySQL database created on Azure by Terraform
Ingests the Excel file, which I included in the repository for simplicity's sake
Creates a new boolean column, is_public_ip, via an anonymous function that determines whether the user's ip_address is publicly allocated
Concatenates the first_name and last_name columns to create a new column, full_name
Applies an MD5 hash function to the user's email to create a new column, obfuscated_email
Drops the first_name, last_name, and email columns
Writes the resulting DataFrame to the data_engineering table in the default schema, overwriting it if it already exists
Disposes of the engine's connection pool
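
The per-record transformations in the steps above could look roughly like the following. This is a minimal sketch using only the standard library and plain dictionaries rather than the DataFrame operations in etl.py. The helper names (is_public_ip, obfuscate_email, transform), the space separator in full_name, and the choice of ipaddress's is_global property as the definition of "publicly allocated" are assumptions made for illustration.

```python
import hashlib
import ipaddress


def is_public_ip(ip: str) -> bool:
    """Return True if the address is globally routable.

    Assumption: "publicly allocated" is approximated by ipaddress's
    is_global property; malformed addresses are treated as not public.
    """
    try:
        return ipaddress.ip_address(ip).is_global
    except ValueError:
        return False


def obfuscate_email(email: str) -> str:
    """Return the hexadecimal MD5 digest of the email address."""
    return hashlib.md5(email.encode("utf-8")).hexdigest()


def transform(record: dict) -> dict:
    """Apply the derive-and-drop steps to a single record (hypothetical helper)."""
    dropped = {"first_name", "last_name", "email"}
    out = {key: value for key, value in record.items() if key not in dropped}
    out["is_public_ip"] = is_public_ip(record["ip_address"])
    # Assumption: the two name parts are joined with a single space.
    out["full_name"] = f"{record['first_name']} {record['last_name']}"
    out["obfuscated_email"] = obfuscate_email(record["email"])
    return out
```

Note that MD5 is a fast, unsalted hash, so it obscures the email for casual inspection but should not be relied on as strong anonymization.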

I have also included a small test suite using Pytest. With more time, I would add further tests and automate test execution and coverage reporting via pre-commit hooks and CI.

To reproduce the ETL process:

Create an Azure subscription
Create an App Registration in Azure AD, generate credentials, and assign it the Contributor IAM role on the subscription from step 1
Create a Terraform Cloud workspace and a TF Cloud API token
Create an .env file using .env.example as a template, and fill out the TF-related variables
cd db_python_etl/terraform/ and initialize with dotenv -f <path-to-env-file> run terraform init
Execute Terraform plan with dotenv -f <path-to-env-file> run terraform plan
Glean the MySQL connection details from the Terraform state and populate the remaining environment variables in your new .env file
cd to db_python_etl and execute dotenv -f <path-to-env-file> run python etl.py
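
As an illustration of how the MySQL connection details from the .env file might be combined into a single connection URL, here is a small sketch. The environment variable names (MYSQL_USER, MYSQL_PASSWORD, MYSQL_HOST, MYSQL_PORT, MYSQL_DATABASE), the build_mysql_url helper, and the mysql+pymysql driver prefix are all hypothetical; the actual names are defined in .env.example and etl.py.

```python
import os
from urllib.parse import quote_plus


def build_mysql_url(env=os.environ) -> str:
    """Assemble a SQLAlchemy-style MySQL URL from environment variables.

    All variable names below are assumptions for illustration only.
    """
    # Percent-encode credentials so characters such as '@' or '/' do not
    # break URL parsing.
    user = quote_plus(env["MYSQL_USER"])
    password = quote_plus(env["MYSQL_PASSWORD"])
    host = env["MYSQL_HOST"]
    port = env.get("MYSQL_PORT", "3306")  # 3306 is MySQL's default port
    database = env["MYSQL_DATABASE"]
    return f"mysql+pymysql://{user}:{password}@{host}:{port}/{database}"
```

Encoding the credentials matters here because generated passwords often contain reserved URL characters.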

Section 3: ML API

The relevant files for this section are located in the ml_api directory. I use a simple Flask application to serve a pre-trained image classification model from ImageAI over an API endpoint. This tooling should not be deployed to a production environment since it uses Flask's development web server and does not have artifact management, authentication, monitoring, robust error handling, tests, scaling, et cetera.

To reproduce:

From the repository root, execute docker build --tag ml-api -f ml_api/docker/Dockerfile .
Execute docker run -p 5000:5000 ml-api
Alternatively, Flask can be invoked directly within the ml_api directory via flask run

The API endpoint expects a POST request with a JSON body containing two parameters, image_url and prediction_count:

curl --location 'http://127.0.0.1:5000/' \
--header 'Content-Type: application/json' \
--data '{
    "image_url": "https://www.atlasandboots.com/wp-content/uploads/2019/05/ama-dablam2-most-beautiful-mountains-in-the-world.jpg",
    "prediction_count": 10
}'

It returns the following payload, which includes the original request parameters and key/value pairs for the inferences:

{
    "parameters": {
        "image_url": "https://www.atlasandboots.com/wp-content/uploads/2019/05/ama-dablam2-most-beautiful-mountains-in-the-world.jpg",
        "prediction_count": 10
    },
    "predictions": {
        "alp": 99.8894,
        "cliff": 0.002,
        "dogsled": 0.0007,
        "marmot": 0.0009,
        "mountain tent": 0.0678,
        "recreational vehicle": 0.0008,
        "ski": 0.0019,
        "snowmobile": 0.0033,
        "valley": 0.0049,
        "volcano": 0.0087
    }
}
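
For callers who prefer Python over curl, the request and response handling might be sketched as follows, using only the standard library. The function names (build_prediction_request, request_predictions, top_predictions) and the 60-second timeout are assumptions for illustration; the endpoint, parameter names, and payload shape are taken from the example above.

```python
import json
import urllib.request


def build_prediction_request(image_url, prediction_count,
                             endpoint="http://127.0.0.1:5000/"):
    """Build a POST request with a JSON body, mirroring the curl call."""
    body = json.dumps(
        {"image_url": image_url, "prediction_count": prediction_count}
    ).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def request_predictions(image_url, prediction_count):
    """Send the request and decode the JSON payload (requires a running API)."""
    req = build_prediction_request(image_url, prediction_count)
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


def top_predictions(payload, n=3):
    """Return the n highest-scoring (label, score) pairs from the payload."""
    ranked = sorted(payload["predictions"].items(),
                    key=lambda item: item[1], reverse=True)
    return ranked[:n]
```

Applied to the sample payload above, top_predictions would rank alp first, followed by mountain tent and volcano.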
