We are looking for a high-quality data engineer who can deliver comprehensive solutions for our continuity and business growth.
Our team drives the data culture. We want to change how we produce data: from large batches to micro-batching, from daily to near real-time/streaming processing, and from tabular reports to insightful dashboards.
You can be part of an amazing team that works with data all the time, using different processes, tools and technologies.
What follows is a little treasure and a challenge for those keen on joining this amazing company and team.
For a Junior/Mid-level role we expect good basic standards for code and reporting.
We expect the most from you, so show us your top skill level.
The project is to ingest and process data from APIs and generate a basic report from the results.
We are a Python and SQL shop, and we would like to see this project built using just those.
However, we are open to other tools and technologies as long as we can easily reproduce them on our side.
For the database, use a simple and lightweight option.
Please avoid licensed products, as they might block or limit our ability to review your work.
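As a rough illustration, here is a minimal end-to-end sketch of the pattern we have in mind, assuming the Python standard library and SQLite as the lightweight database; the API URL, table name and report query below are placeholders rather than part of the test.

```python
import csv
import json
import sqlite3
from urllib.request import urlopen

# Placeholder names: swap in the real API endpoint and a meaningful report query.
API_URL = "https://example.com/api/items"
DB_PATH = "pipeline.db"


def ingest(url=API_URL, db_path=DB_PATH):
    """Land raw JSON records from an API into a lightweight SQLite database."""
    with urlopen(url) as resp:
        records = json.loads(resp.read())
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS raw_items (payload TEXT)")
        conn.executemany(
            "INSERT INTO raw_items (payload) VALUES (?)",
            [(json.dumps(r),) for r in records],
        )


def report(db_path=DB_PATH, out_path="report.csv"):
    """Run a basic SQL aggregation and write the result out as a CSV report."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute("SELECT COUNT(*) AS total_records FROM raw_items").fetchall()
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["total_records"])
        writer.writerows(rows)


if __name__ == "__main__":
    ingest()
    report()
```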
Fork/copy this repo, build your data processing layer, and follow SDLC best practices. Open a Pull Request and send us a message highlighting that the test is completed.
- It must come with step-by-step instructions to run the code.
- Please be mindful that your code might be moved or deleted after we analyse the PR.
- Don't forget the best practices.
- Be able to explain the whole process from the ground up in a face-to-face interview.
You can choose one of the following exercises:
A. Cat Lover
B. User Data Check
C. The old-fashioned ETL Master
We expect your repo to include the following:
- A high-level summary of the architecture used
- Your code, packaged so that it is rerunnable
- A list of any extra objects required to complete the exercise
- An explanation of how to schedule this process to run multiple times per day
- How you would deploy this project
Bonus: Can you make it run in a container (Docker)?
A. Cat Lover
If you are a cat lover, you will enjoy processing cat facts.
For this exercise, build a comprehensive cat fact dataset with an automated data load.
We expect the following queries and some results (a minimal load-and-query sketch follows this list):
- Can you list the number of words only in the dataset?
- What was the most common Unicode character?
- List the top 20 words based on the number of facts.
- What is the most common geographical country mentioned in the dataset?
- Which fact did you find most interesting?
- Bonus: Can you run any sentiment analysis on those facts?
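As a hedged sketch only: the load below assumes the public catfact.ninja API (the test does not name a source, so any fact API or file would do), and the pagination fields (`data`, `next_page_url`) follow that API's JSON shape.

```python
import json
import sqlite3
from collections import Counter
from urllib.request import urlopen

BASE_URL = "https://catfact.ninja/facts"  # assumed source of cat facts, not mandated by the test
DB_PATH = "cat_facts.db"


def load_facts(db_path=DB_PATH):
    """Page through the API and land every fact in a SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS cat_facts (fact TEXT)")
    url = BASE_URL
    while url:
        with urlopen(url) as resp:
            page = json.loads(resp.read())
        conn.executemany(
            "INSERT INTO cat_facts (fact) VALUES (?)",
            [(row["fact"],) for row in page.get("data", [])],
        )
        url = page.get("next_page_url")  # None on the last page, which ends the loop
    conn.commit()
    conn.close()


def top_words(db_path=DB_PATH, limit=20):
    """Rough top-N word count; a real answer would strip punctuation and stop words."""
    counter = Counter()
    conn = sqlite3.connect(db_path)
    for (fact,) in conn.execute("SELECT fact FROM cat_facts"):
        counter.update(fact.lower().split())
    conn.close()
    return counter.most_common(limit)


if __name__ == "__main__":
    load_facts()
    print(top_words())
```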
B. User Data Check
Start from a random list of users with name, DOB, gender and location. Can you cross-check the data against other sources to compare the age of each person based on their name through Agify.io, the gender based on their name using Genderize.io, and the top 2 nationalities from Nationalize.io?
The names can be generated from randomuser.me if you don't have other sources.
We expect the following queries and some results (a minimal API cross-check sketch follows this list):
- What are the most common ageing discrepancies?
- What is the gender distribution based on the user's gender and the inferred gender?
- What are the most common nationalities?
- Can you flag any discrepancies using those APIs?
- Are there any rich features from those APIs we should look at?
- Bonus: What percentage accuracy do those APIs achieve for your dataset?
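A minimal sketch of the cross-check, assuming the free, keyless tiers of the three APIs named above and randomuser.me for sample users; the JSON field names in the comments reflect those APIs' public documentation, and rate limiting and error handling are left out for brevity.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def _get(url):
    """Fetch a URL and decode its JSON response."""
    with urlopen(url) as resp:
        return json.loads(resp.read())


def random_users(count=10):
    """Pull sample users (name, dob, gender, location) from randomuser.me."""
    return _get(f"https://randomuser.me/api/?results={count}")["results"]


def enrich(first_name):
    """Call the three enrichment APIs for a single first name."""
    q = urlencode({"name": first_name})
    return {
        "agify": _get(f"https://api.agify.io/?{q}"),          # {"name": ..., "age": ..., "count": ...}
        "genderize": _get(f"https://api.genderize.io/?{q}"),  # {"gender": ..., "probability": ...}
        "nationalize": _get(f"https://api.nationalize.io/?{q}"),  # {"country": [{"country_id": ...}, ...]}
    }


if __name__ == "__main__":
    for user in random_users(5):
        first = user["name"]["first"]
        inferred = enrich(first)
        top_two = [c["country_id"] for c in inferred["nationalize"].get("country", [])[:2]]
        print(
            first,
            "| reported gender:", user["gender"],
            "| inferred gender:", inferred["genderize"].get("gender"),
            "| inferred age:", inferred["agify"].get("age"),
            "| top nationalities:", top_two,
        )
```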
C. The old-fashioned ETL Master
- The data for this exercise can be found in the data.zip file. Can you describe the file format?
Super Bonus: generate your own data following the instructions in the encoded file bonus_etl_data_gen.txt. To get the bonus points, please encode the file with the instructions that were used to generate the files.
- Code your scripts to load the data into a database.
- Design a star schema model into which the data should flow (see the schema sketch below).
- Build your process to load the data into the star schema.
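As an illustration of the kind of model we mean (not the required design), here is a hedged star schema sketch in SQLite; every table and column name is an assumption driven by the questions asked in the next section and would need to be adjusted to whatever data.zip actually contains.

```python
import sqlite3

# Illustrative star schema only: adjust the dimensions and facts to the real data.zip contents.
DDL = """
CREATE TABLE IF NOT EXISTS dim_customer (
    customer_key    INTEGER PRIMARY KEY,
    customer_name   TEXT,
    account_balance REAL,
    balance_group   TEXT            -- bonus: low / medium / high classification
);
CREATE TABLE IF NOT EXISTS dim_nation (
    nation_key  INTEGER PRIMARY KEY,
    nation_name TEXT
);
CREATE TABLE IF NOT EXISTS dim_date (
    date_key  INTEGER PRIMARY KEY,  -- e.g. 20240131
    full_date TEXT,
    year      INTEGER,
    month     INTEGER
);
CREATE TABLE IF NOT EXISTS dim_ship_mode (
    ship_mode_key INTEGER PRIMARY KEY,
    ship_mode     TEXT
);
CREATE TABLE IF NOT EXISTS fact_line_item (
    customer_key  INTEGER REFERENCES dim_customer (customer_key),
    nation_key    INTEGER REFERENCES dim_nation (nation_key),
    date_key      INTEGER REFERENCES dim_date (date_key),
    ship_mode_key INTEGER REFERENCES dim_ship_mode (ship_mode_key),
    quantity      REAL,
    revenue       REAL              -- bonus: revenue per line item
);
"""

if __name__ == "__main__":
    with sqlite3.connect("star_schema.db") as conn:
        conn.executescript(DDL)
```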
Bonus points:
- Add a field to classify the customer account balance into 3 groups.
- Add revenue per line item.
- Convert the dates so they are distributed over the last 2 years.
Bonus: What would you do if the data arrived via streaming, in random order and at random times?
Bonus: Would it be a problem if the data from the source system were growing at a rate of 6.1-12.7% per month?
One of the most important aspects of building a DWH is delivering insights to end-users.
Using the designed star schema (or, if you prefer, the raw data), can you generate SQL statements to answer the following questions? (An illustrative example follows the list.)
- What are the top 5 nations in terms of revenue?
- From the top 5 nations, what is the most common shipping mode?
- What are the top-selling months?
- Who are the top customers in terms of revenue and/or quantity?
- Compare the sales revenue of the current period against the previous period.
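Purely as an illustration, here is how the first question might be answered against the star schema sketched earlier; the table and column names carry the same assumptions as that sketch.

```python
import sqlite3

# Assumes the illustrative fact_line_item / dim_nation tables sketched above.
TOP_NATIONS_SQL = """
SELECT n.nation_name,
       SUM(f.revenue) AS total_revenue
FROM fact_line_item AS f
JOIN dim_nation AS n
  ON n.nation_key = f.nation_key
GROUP BY n.nation_name
ORDER BY total_revenue DESC
LIMIT 5;
"""

if __name__ == "__main__":
    with sqlite3.connect("star_schema.db") as conn:
        for nation, revenue in conn.execute(TOP_NATIONS_SQL):
            print(nation, revenue)
```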
Data profiling is a bonus.
What tools or techniques would you use to profile the data?
Which data profiling results could impact your analysis and design?
Author: adilsonmendonca