
iFood Data Engineer test

Introduction

This project proposes a solution for creating an iFood order data processing environment. More information at: iFood Test

The solution architecture was designed with a focus on scalability.

(diagram: solution architecture)

The data source consists of three files, corresponding to the following entities:

  • Order: data about the orders placed by customers
  • Consumer: customer information
  • Restaurant: establishments registered on the iFood platform

We use Delta Lake as a layer on top of our Data Lake (Amazon S3), allowing us to work with ACID transactions and get better performance in frequent interactions with our Data Lake. It also gives us a unified batch and streaming experience.

Delta Lake layers:

  • Bronze: data in its native, raw format
  • Silver: cleaned and transformed data
  • Gold: refined data
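
A minimal sketch of how these layers can be materialized on S3 with Databricks/PySpark is shown below; the bucket name, paths, and column names are illustrative assumptions, not the project's exact layout.

    # Illustrative sketch only (Databricks notebook, PySpark); paths and columns are assumptions.
    # `spark` is the SparkSession provided by the Databricks notebook environment.
    from pyspark.sql import functions as F

    bucket = "s3://YOUR_BUCKET_NAME/delta_lake"

    # Bronze: land the raw order data as-is
    # (the pseudonymization described later happens before this ingestion)
    raw_orders = spark.read.json("s3://YOUR_BUCKET_NAME/raw/order.json.gz")
    raw_orders.write.format("delta").mode("overwrite").save(f"{bucket}/bronze/order")

    # Silver: cleaned and typed version of the same data
    (spark.read.format("delta").load(f"{bucket}/bronze/order")
        .withColumn("order_created_at", F.to_timestamp("order_created_at"))
        .dropDuplicates(["order_id"])
        .write.format("delta").mode("overwrite").save(f"{bucket}/silver/order"))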

Data Flow

Before ingestion into Delta Lake's Bronze layer, the data is pseudonymized, and a new table (lookup table) is generated to store the personal data separately. The information is then saved in its raw state in the Bronze layer. After some transformations, two new tables are generated:

  • Order Item: contains the items of a particular order
  • Order Garnish Items: additional items

Finally, all information related to orders is loaded to the Gold Layer to be consumed by different areas.
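
A rough sketch of these transformations, assuming the order entity carries a nested array of items (the column and field names below are assumptions):

    # Illustrative sketch only: the real schema and column names may differ.
    from pyspark.sql import functions as F

    orders = spark.read.format("delta").load("s3://YOUR_BUCKET_NAME/delta_lake/bronze/order")

    # Order Item: one row per item of a given order
    order_item = (orders
        .select("order_id", F.explode("items").alias("item"))
        .select("order_id", "item.*"))

    # Order Garnish Items: one row per additional (garnish) item of an order item
    order_garnish_item = (order_item
        .select("order_id", F.explode("garnishItems").alias("garnish"))
        .select("order_id", "garnish.*"))

    order_item.write.format("delta").mode("overwrite").save("s3://YOUR_BUCKET_NAME/delta_lake/silver/order_item")
    order_garnish_item.write.format("delta").mode("overwrite").save("s3://YOUR_BUCKET_NAME/delta_lake/silver/order_garnish_item")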

(diagrams: data flow)


Privacy, Compliance and LGPD/GDPR

Pseudonymization

The LGPD defines pseudonymization as: "processing through which data loses the possibility of direct or indirect association with an individual, except by the use of additional information kept separately by the controller in a controlled and secure environment."

(diagram: pseudonymization process)

First, we identified that the database contains personal data attributes (CPF and telephone), which, according to the LGPD, is "all information related to an identified or identifiable natural person". As a result, a policy was created to pseudonymize the data in the following steps, so that personal data is kept and processed separately and cannot be attributed to an identifiable person:

  1. Serialization: We create a serialized key from the combination of the attributes CPF, name, telephone, and DDD (area code); this helps to avoid comparisons against hash lists. We use a hash function (SHA), which applies a mathematical algorithm to map an input string to a fixed-size bit string (the hash). Functions like this are one-way (they cannot be computationally reversed).
  2. Cryptography: One-way hashes alone are insufficient, since they can be reversed using rainbow tables (large dictionaries of pre-computed hashes). Therefore, for each generated hash, we apply a robust encryption algorithm (AES-128) to mitigate the risk of rainbow-table attacks.
  3. Lookup Table: Finally, with the result of the operations above, we generate a new pseudonymized attribute (pseudonymous_id) and store it, together with the corresponding personal data, in a separate mapping table. This way, only pseudonyms (non-personal data) are used in all subsequent data processing, and the lookup table is the only place where a pseudonym can be linked back to an identifiable person. A minimal sketch of these steps is shown below.
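
A minimal sketch of these three steps in Python (it uses hashlib and pycrypto, which is installed on the cluster later in this README; the key, attribute names, and lookup-table layout are assumptions, not the project's exact code):

    # Illustrative sketch of the pseudonymization steps; not the project's exact code.
    import hashlib
    from Crypto.Cipher import AES  # pycrypto==2.6.1

    SECRET_KEY = b"0123456789abcdef"  # assumed 16-byte key (AES-128); keep it in a secrets manager in practice

    def pseudonymize(cpf, name, ddd, phone):
        # 1. Serialization: combine the personal attributes and hash them with SHA-256
        serialized = f"{cpf}|{name}|{ddd}|{phone}".encode("utf-8")
        digest = hashlib.sha256(serialized).digest()

        # 2. Cryptography: encrypt the hash with AES to defeat rainbow-table lookups
        cipher = AES.new(SECRET_KEY, AES.MODE_ECB)
        return cipher.encrypt(digest).hex()  # the 32-byte digest is block-aligned, so no padding is needed

    # 3. Lookup table: store (pseudonymous_id, cpf, name, ddd, phone) separately;
    #    every other table keeps only pseudonymous_id.
    pseudonymous_id = pseudonymize("12345678900", "Jane Doe", "11", "999999999")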

The GDPR encourages us to pseudonymize data as soon as possible, so we chose to perform pseudonymization immediately before data ingestion into Delta Lake.

  • Thus, the data that is manipulated together with the order information contains no personal data. An example for the "consumer" entity is presented below:

(example: pseudonymized consumer data)

  • In turn, the Mapping Table can be stored in isolation.

(example: mapping/lookup table with personal data)

Data Governance

  • Once personal data is isolated, we can easily create access policies, for example blocking user access to the directory where personal data is stored (Delta_lake/Lookup/Consumer_order/), as well as implementing access controls for the tables (which can be linked to identity providers).

  • Delta Lake helps us with retention policies and right-to-be-forgotten requests (one of the rights of the data subject). In traditional data lake environments it is challenging to map data entities completely. In Delta Lake, after a deletion, commands such as VACUUM can be used to remove all files that are no longer referenced by their respective Delta table. In addition, Z-Ordering gives a performance boost to the scan performed during the deletion process. The combination of Delta Lake + pseudonymous_id + Glue Catalog facilitates data mapping and control. A sketch of this flow is shown below.
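
A hedged sketch of such a right-to-be-forgotten flow on Databricks (the table path, retention period, and column name are assumptions):

    # Illustrative sketch only; paths and identifiers are assumptions.
    from delta.tables import DeltaTable

    lookup_path = "s3://YOUR_BUCKET_NAME/Delta_lake/Lookup/Consumer_order"

    # Z-Ordering (run periodically) co-locates related records, speeding up the scan performed by deletes
    spark.sql(f"OPTIMIZE delta.`{lookup_path}` ZORDER BY (pseudonymous_id)")

    # Delete the personal data of a single data subject from the lookup table
    DeltaTable.forPath(spark, lookup_path).delete("pseudonymous_id = 'SOME_PSEUDONYMOUS_ID'")

    # VACUUM removes the underlying files that are no longer referenced by the Delta table
    spark.sql(f"VACUUM delta.`{lookup_path}` RETAIN 168 HOURS")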


ETL Pipeline

Requirements

Creating the Environment

Amazon AWS

We will use CloudFormation to provision the environment on AWS. All template files are in iac/cloudformation/.

  • Comments:

    • The commands below are ready to be executed from the root of this project.
    • We will use the us-west-2 region.
    • Heads up! The environment-creation procedures do not take security best practices into account. They were created only to automate the provisioning of the environment as much as possible so that it can easily be reproduced in test environments. Therefore, adjust the policies if necessary.
  • Edit iac/cloudformation/stack_data.yaml, replacing YOUR_ACCOUNT_ID with your AWS account ID:

    (screenshot: stack_data.yaml)

  • Create a user and attach the policies listed below:

    • AmazonEC2FullAccess
    • IAMFullAccess
    • AmazonS3FullAccess
    • AmazonRedshiftFullAccess
    • AWSCloudFormationFullAccess
  • Set your user's credentials by running aws configure.

    aws configure
  • Create the necessary roles and policies.

    aws cloudformation create-stack --stack-name stack-access --template-body file://iac/cloudformation/stack_access.yaml --output text --capabilities CAPABILITY_NAMED_IAM
  • Wait for the stack provisioning to finish (status: CREATE_COMPLETE). You can follow the status by running:

    aws cloudformation describe-stacks --stack-name stack-access --query "Stacks[0].StackStatus" --output text
  • Now that the necessary policies have been created, attach the "Databricks" policy to your user. (screenshot)

  • In the Databricks account management console, enter your user's credentials.

    (screenshot: Databricks account console)

  • Create an S3 bucket (in the us-west-2 region) on AWS. This bucket will be used by the ETL pipeline and by Redshift Spectrum.

  • Run CloudFormation to create the data stack.

    aws cloudformation create-stack --stack-name stack-data --template-body file://iac/cloudformation/stack_data.yaml --output text --capabilities CAPABILITY_NAMED_IAM
  • Wait for the stack provisioning to finish (status: CREATE_COMPLETE). You can follow the status by running:

    aws cloudformation describe-stacks --stack-name stack-data --query "Stacks[0].StackStatus" --output text
  • After it finishes, run the command below to get the Redshift cluster endpoint.

    aws cloudformation describe-stacks --stack-name stack-data --query "Stacks[0].Outputs[?OutputKey=='ClusterEndpoint'].OutputValue" --output text

Databricks cluster

  • Create the Instance Profile:

    (screenshot: instance profile configuration)

    arn:aws:iam::YOUR_ACCOUNT_ID:instance-profile/Databricks_Glue_S3

  • Create the Spark cluster:

    • Configurations:

      (screenshot: cluster configuration)

    • Select an instance profile:

      (screenshot: instance profile selection)

    • Enable the Glue Catalog:

      (screenshot: Spark configuration)

      spark.databricks.hive.metastore.glueCatalog.enabled true

      Note: Using an external metastore can considerably increase data access latency. To avoid this problem, we can enable the Glue Catalog cache by adding:

      spark.hadoop.aws.glue.cache.table.enable true
      spark.hadoop.aws.glue.cache.table.size 1000
      spark.hadoop.aws.glue.cache.table.ttl-mins 30
    • Install libraries: (screenshot)

      pycrypto==2.6.1

  • Install the Python dependencies:

    sudo apt update
    sudo apt install python3-pip
    sudo apt-get install libpq-dev
    pip3 install psycopg2==2.8.6
  • Create the Redshift tables:

    python iac/redshift/app.py --endpoint-redshift "YOUR_REDSHIFT_ENDPOINT" --aws-account-id "YOUR_ACCOUNT_ID" --aws-s3-bucket "YOUR_BUCKET_NAME"

    Example: python iac/redshift/app.py --endpoint-redshift "stack-data-redshiftcluster-1n18jdnczvt.crmu3ndu7d.us-west-2.redshift.amazonaws.com" --aws-account-id "123456789123" --aws-s3-bucket "bucket-xyz2021"

✅ Running

After provisioning the environment, the notebook containing the ETL pipeline is ready to be executed in Databricks:

👉 Notebook Pipeline ETL

In the notebook, make sure to set the name of the bucket you created.

(screenshot: bucket name parameter in the notebook)

After the ETL pipeline has run, the data is available on Redshift for SQL queries and for consumption through the API.
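
As an illustration of the SQL consumption, a minimal query against Redshift using the psycopg2 dependency installed earlier (the database name, user, password, and table name below are assumptions):

    # Illustrative sketch only; credentials and table name are assumptions.
    import psycopg2

    conn = psycopg2.connect(
        host="YOUR_REDSHIFT_ENDPOINT",
        port=5439,               # default Redshift port
        dbname="YOUR_DATABASE",
        user="YOUR_USER",
        password="YOUR_PASSWORD",
    )
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM order_gold;")  # assumed table name
        print(cur.fetchone())
    conn.close()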


API

  • Pull the Docker image:

    docker pull bfamorim/api-restaurant:1.0
  • Run the image, replacing <REDSHIFT_ENDPOINT> with the Redshift endpoint.

    docker run -p 8082:8082 -e REDSHIFT_ENDPOINT=<REDSHIFT_ENDPOINT> bfamorim/api-restaurant:1.0

    Example: docker run -p 8082:8082 -e REDSHIFT_ENDPOINT=stack-data-redshiftcluster-1n1k153xc0czvt.crdmkp74267d.us-west-2.redshift.amazonaws.com bfamorim/api-restaurant:1.0

  • Requests (a Python example follows this list):

    • Order count per day for the state of São Paulo:
    http://localhost:8082/order/count-per-day-for-city-state?orderDate=2018-12-08&state=SP
    • Order count per day for the state of São Paulo and the city of São Paulo:
    http://localhost:8082/order/count-per-day-for-city-state?orderDate=2018-12-08&state=SP&city=SAO PAULO
    • Top 10 most-consumed restaurants for a given customer:
    http://localhost:8082/order/top-10-consumed-restaurants-per-customer?customerId=2a68e52c-929a-4801-8128-86afb8cacb58
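
The same endpoints can also be called programmatically; a small Python example using the requests library (parameter values taken from the URLs above):

    # Small example of calling the API above; assumes the container is running locally on port 8082.
    import requests

    resp = requests.get(
        "http://localhost:8082/order/count-per-day-for-city-state",
        params={"orderDate": "2018-12-08", "state": "SP", "city": "SAO PAULO"},
    )
    resp.raise_for_status()
    print(resp.json())  # assumes the API returns JSON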

"# ifood-pedidos"
