In this assignment, you will build an Extract, Transform, and Load (ETL) pipeline with a focus on Test-Driven Development (TDD) by performing the following tasks:

- Read data from a CSV file (`employee_details.csv`)
- Transform the data
- Load the data into a MongoDB database
- Implement the ETL pipeline using Docker Compose
Requirements:

- Python 3.8+
- Docker Compose
- MongoDB
- Any other libraries or tools are allowed as long as they are open-source and free to use
Create a `read_csv()` method that takes a file path as an argument and reads the CSV file. Return an in-memory data structure such as a list of dictionaries or a Pandas DataFrame.
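A minimal sketch of this step, assuming the Pandas route:

```python
import pandas as pd

def read_csv(file_path: str) -> pd.DataFrame:
    """Read the employee CSV file into an in-memory DataFrame."""
    return pd.read_csv(file_path)
```

Under TDD, a failing test for this method (e.g., asserting that the expected columns are present) would come first.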
Create a `transform_data()` method that takes the in-memory data structure from the previous step as an argument and performs the following transformation tasks (a sketch follows this list):

- Convert the `BirthDate` from the format `YYYY-MM-DD` to `DD/MM/YYYY`.
- Clean the `FirstName` and `LastName` columns where needed and remove any leading/trailing spaces.
- Merge the `FirstName` and `LastName` columns into a new column named `FullName`.
- Calculate each employee's age from the `BirthDate` column, using Jan 1st, 2023 as the reference date. Add a new column named `Age` to store the computed age.
- Add a new column named `SalaryBucket` to categorize the employees based on their salary as follows:
  - `A` for employees earning below 50,000
  - `B` for employees earning between 50,000 and 100,000
  - `C` for employees earning above 100,000
- Drop the `FirstName`, `LastName`, and `BirthDate` columns.
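A minimal sketch of these transformations, assuming a Pandas DataFrame input and a salary column named `Salary` (that column name is an assumption, as the CSV schema is not spelled out above):

```python
from datetime import date

import pandas as pd

# Reference date for the age calculation, per the task list
REFERENCE_DATE = pd.Timestamp(date(2023, 1, 1))

def transform_data(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Clean the name columns: remove leading/trailing spaces
    df["FirstName"] = df["FirstName"].str.strip()
    df["LastName"] = df["LastName"].str.strip()

    # Merge the cleaned names into FullName
    df["FullName"] = df["FirstName"] + " " + df["LastName"]

    # Compute Age as of Jan 1st, 2023, accounting for whether the
    # birthday has already occurred by the reference date
    birth_dates = pd.to_datetime(df["BirthDate"], format="%Y-%m-%d")
    df["Age"] = birth_dates.apply(
        lambda b: REFERENCE_DATE.year
        - b.year
        - ((REFERENCE_DATE.month, REFERENCE_DATE.day) < (b.month, b.day))
    )

    # Reformat BirthDate to DD/MM/YYYY (the column is dropped below,
    # per the task list)
    df["BirthDate"] = birth_dates.dt.strftime("%d/%m/%Y")

    # Categorize salaries into buckets A/B/C
    def bucket(salary: float) -> str:
        if salary < 50_000:
            return "A"
        if salary <= 100_000:
            return "B"
        return "C"

    df["SalaryBucket"] = df["Salary"].apply(bucket)

    # Drop the now-redundant columns
    return df.drop(columns=["FirstName", "LastName", "BirthDate"])
```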
Create a `load_data()` method that takes the transformed data as an argument, connects to the MongoDB database, and inserts the given data into a predefined collection. Ensure that your function creates indexes, if required, to improve performance.
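A minimal sketch using PyMongo; the connection URI, database name, collection name, and index field are assumptions (under Docker Compose, the hostname would typically be the MongoDB service name):

```python
import os

import pandas as pd
from pymongo import MongoClient

MONGO_URI = os.environ.get("MONGO_URI", "mongodb://mongo:27017")
DB_NAME = "etl_db"
COLLECTION_NAME = "employees"

def load_data(df: pd.DataFrame) -> None:
    """Insert the transformed records into MongoDB and add an index."""
    client = MongoClient(MONGO_URI)
    try:
        collection = client[DB_NAME][COLLECTION_NAME]
        # MongoDB expects documents, so convert rows to dictionaries
        collection.insert_many(df.to_dict("records"))
        # Example index to speed up name lookups; adjust to your query patterns
        collection.create_index("FullName")
    finally:
        client.close()
```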
Create a `Dockerfile` to build a Docker image of your Python application.
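A minimal sketch of such a `Dockerfile`; the base image tag and the `requirements.txt` file are assumptions:

```dockerfile
FROM python:3.8-slim

WORKDIR /app

# Install dependencies first so Docker can cache this layer
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code, including main.py and the test suites
COPY . .

CMD ["python", "main.py"]
```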
Set up a `docker-compose.yml` file where you will define the services and configure details such as containers, networks, and volumes for your application.
Configure two services: a Python application with your ETL pipeline and a MongoDB database. Ensure that both services can communicate with each other.
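A minimal sketch of such a `docker-compose.yml`; the service names, MongoDB image tag, and volume name are assumptions:

```yaml
services:
  app:
    build: .
    depends_on:
      - mongo
    environment:
      # The app reaches MongoDB via the service name on the Compose network
      - MONGO_URI=mongodb://mongo:27017

  mongo:
    image: mongo:6
    ports:
      - "27017:27017"
    volumes:
      - mongo_data:/data/db  # persist data across container restarts

volumes:
  mongo_data:
```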
The minimum we expect to see in your submission is:

- `main.py`, the main ETL script containing the core methods: `read_csv()`, `transform_data()`, and `load_data()`.
- Test suites for everything you think should be tested (see the sketch after this list).
- A `Dockerfile` to create a Docker image of the Python application.
- A `docker-compose.yml` file defining the required services.
- A `README.md` file with instructions for setting up and running the ETL pipeline using Docker Compose.
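A minimal TDD-style test sketch for the transformation step, using pytest; the sample record and the `main` module name are assumptions:

```python
import pandas as pd

from main import transform_data

def test_transform_data_builds_expected_columns():
    raw = pd.DataFrame([{
        "FirstName": "  Ada ",
        "LastName": " Lovelace ",
        "BirthDate": "1990-06-15",
        "Salary": 75000,
    }])

    result = transform_data(raw)
    row = result.iloc[0]

    assert row["FullName"] == "Ada Lovelace"
    assert row["Age"] == 32  # birthday not yet reached on Jan 1st, 2023
    assert row["SalaryBucket"] == "B"
    for dropped in ("FirstName", "LastName", "BirthDate"):
        assert dropped not in result.columns
```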
Upon completion of the assignment, you should have an ETL pipeline that reads data from a CSV file (`employee_details.csv`), performs the required transformations, and loads the transformed data into a MongoDB database, using Docker Compose and with a focus on Test-Driven Development.
Please submit your assignment as a link to a GitHub repository containing the required files. You can also create a private repository and share it with us.
- We will value your focus on TDD throughout the development process.
- If you're familiar with an ETL framework such as Apache Airflow or Luigi, feel free to use it to implement the ETL pipeline and document your approach accordingly in the `README.md` file.