This repository uses Poetry for package management. Once Poetry is installed, create the virtual environment with the relevant dependencies via `poetry install`, then execute `poetry shell` to activate it.
Please refer to the markdown file `soft_skills.md`.
The relevant files for this section are located in the `db_python_etl` directory. I elected to use the Pandas library over PySpark due to its first-class Excel support, and over Dask or Pandas-on-Spark since there were only 1,000 records, so parallelization would not have a marked benefit.
The ETL script executes the following steps (a rough code sketch follows the list):

- Connects to the MySQL database created on Azure by Terraform
- Ingests the Excel file, which I included in the repository for simplicity's sake
- Creates a new boolean column, `is_public_ip`, via an anonymous function which determines whether the user's `ip_address` is publicly allocated
- Concatenates the `first_name` and `last_name` columns to create a new column, `full_name`
- Applies an MD5 hash function to the user's `email` to create a new column, `obfuscated_email`
- Drops the `first_name`, `last_name`, and `email` columns
- Writes the resulting DataFrame to the `data_engineering` table in the default schema, overwriting it if it already exists
- Disposes of the engine's connection pool
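In outline, these steps look roughly like the sketch below. The file name, connection string, and exact Pandas calls are illustrative rather than copied from `etl.py`; in particular, the public-IP check in the repository may use a different heuristic.

```python
import hashlib
from ipaddress import ip_address

import pandas as pd
from sqlalchemy import create_engine

# Illustrative connection string; the real values come from environment variables.
engine = create_engine("mysql+pymysql://user:password@host:3306/database")

# Extract: read the Excel file bundled with the repository (file name is illustrative).
df = pd.read_excel("data.xlsx")

# Transform: flag publicly allocated IP addresses via an anonymous function.
df["is_public_ip"] = df["ip_address"].apply(lambda ip: ip_address(ip).is_global)

# Transform: combine first and last names, and obfuscate the email with MD5.
df["full_name"] = df["first_name"] + " " + df["last_name"]
df["obfuscated_email"] = df["email"].apply(
    lambda e: hashlib.md5(e.encode("utf-8")).hexdigest()
)

# Drop the personally identifiable columns.
df = df.drop(columns=["first_name", "last_name", "email"])

# Load: overwrite the target table, then release the connection pool.
df.to_sql("data_engineering", engine, if_exists="replace", index=False)
engine.dispose()
```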
I have also included a small test suite using Pytest. With more time, one should add additional tests and automate test execution and coverage reports via pre-commit hooks and CI processes.
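As an illustration only (the module and function names below are assumptions, not necessarily those in the repository), such a test might look like:

```python
# test_etl.py -- illustrative only; assumes etl.py exposes an is_public_ip() helper.
import pytest

from etl import is_public_ip


@pytest.mark.parametrize(
    "ip, expected",
    [
        ("8.8.8.8", True),        # publicly routable address
        ("192.168.1.10", False),  # RFC 1918 private range
        ("10.0.0.1", False),      # RFC 1918 private range
    ],
)
def test_is_public_ip(ip, expected):
    assert is_public_ip(ip) is expected
```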
To reproduce the ETL process:
- Create an Azure subscription
- Create an App Registration in Azure AD, generate credentials, and assign it the Contributor IAM role on the subscription from the first step
- Create a Terraform Cloud workspace and a TF Cloud API token
- Create an `.env` file using `.env.example` as a template, and fill out the TF-related variables
- `cd db_python_etl/terraform/` and initialize with `dotenv -f <path-to-env-file> run terraform init`
- Execute the Terraform plan with `dotenv -f <path-to-env-file> run terraform plan`
- Glean the MySQL connection details from the Terraform state and populate the remaining environment variables in your new `.env` file (see the sketch after this list)
- `cd` to `db_python_etl` and execute `dotenv -f <path-to-env-file> run python etl.py`
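For orientation, the ETL script builds its database connection from those environment variables. A minimal sketch is shown below; the variable names are assumptions, so check `.env.example` for the real ones.

```python
import os

from sqlalchemy import create_engine

# Hypothetical variable names -- use the ones defined in .env.example.
db_url = (
    f"mysql+pymysql://{os.environ['MYSQL_USER']}:{os.environ['MYSQL_PASSWORD']}"
    f"@{os.environ['MYSQL_HOST']}:{os.environ.get('MYSQL_PORT', '3306')}"
    f"/{os.environ['MYSQL_DATABASE']}"
)
engine = create_engine(db_url)
```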
The relevant files for this section are located in the ml_api directory. I use a simple Flask application to serve a pre-trained image classification model from ImageAI over an API endpoint. This tooling should not be deployed to a production environment since it uses Flask's development web server and does not have artifact management, authentication, monitoring, robust error handling, tests, scaling, et cetera.
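The core of the application is a single POST route that downloads the image, runs the classifier, and echoes the request parameters back alongside the predictions. The sketch below shows that shape only; the `classify_image()` helper is a hypothetical stand-in for the ImageAI model call, not the repository's actual code.

```python
import tempfile

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)


def classify_image(image_path: str, result_count: int) -> dict:
    """Hypothetical stand-in for the pre-trained ImageAI classifier.

    The real application would load the model once at startup and return
    the top ``result_count`` labels with their probabilities.
    """
    return {"placeholder-label": 100.0}


@app.route("/", methods=["POST"])
def predict():
    payload = request.get_json()
    image_url = payload["image_url"]
    prediction_count = int(payload["prediction_count"])

    # Download the image to a temporary file before handing it to the model.
    response = requests.get(image_url, timeout=30)
    response.raise_for_status()
    with tempfile.NamedTemporaryFile(suffix=".jpg") as tmp:
        tmp.write(response.content)
        tmp.flush()
        predictions = classify_image(tmp.name, prediction_count)

    return jsonify(
        {
            "parameters": {
                "image_url": image_url,
                "prediction_count": prediction_count,
            },
            "predictions": predictions,
        }
    )
```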
To reproduce:
- Execute `cd ml_api/docker` and `docker build --tag ml-api -f ml_api/docker/Dockerfile .`
- Execute `docker run -p 5000:5000 ml-api`
- Alternatively, Flask can be invoked directly within the `ml_api` directory via `flask run`
- The API endpoint expects a POST request with two parameters, `image_url` and `prediction_count`:
```bash
curl --location 'http://127.0.0.1:5000/' \
--header 'Content-Type: application/json' \
--data '{
    "image_url": "https://www.atlasandboots.com/wp-content/uploads/2019/05/ama-dablam2-most-beautiful-mountains-in-the-world.jpg",
    "prediction_count": 10
}'
```
It returns the following payload, which includes the original request parameters and key/value pairs for the inferences:
```json
{
    "parameters": {
        "image_url": "https://www.atlasandboots.com/wp-content/uploads/2019/05/ama-dablam2-most-beautiful-mountains-in-the-world.jpg",
        "prediction_count": 10
    },
    "predictions": {
        "alp": 99.8894,
        "cliff": 0.002,
        "dogsled": 0.0007,
        "marmot": 0.0009,
        "mountain tent": 0.0678,
        "recreational vehicle": 0.0008,
        "ski": 0.0019,
        "snowmobile": 0.0033,
        "valley": 0.0049,
        "volcano": 0.0087
    }
}
```
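For reference, the same request can be issued from Python using the `requests` library (a small usage sketch):

```python
import requests

# Send the same payload as the curl example above to the locally running API.
response = requests.post(
    "http://127.0.0.1:5000/",
    json={
        "image_url": "https://www.atlasandboots.com/wp-content/uploads/2019/05/ama-dablam2-most-beautiful-mountains-in-the-world.jpg",
        "prediction_count": 10,
    },
    timeout=60,
)
print(response.json())
```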