As the final capstone, we are supposed to run fellow students' projects on our own machines, so I'll summarize here how you can run mine.
You will need a Mac or Linux machine (e.g. Ubuntu 22.04). Windows folks can use Ubuntu 22.04 under WSL2 without problems.
- Access the GCP New Project page by clicking here
- For both Project name and Project ID, use: `br-weather-your-name`
- Go to IAM & Admin >> Service Accounts and create a new service account.
- Put `admin-svc` in Service account name.
- Assign these roles to your new account:
  - BigQuery Admin
  - Compute Admin
  - Storage Admin
  - Storage Object Admin
  - Viewer

  For real-world projects, you should create more granular roles for your service accounts.
- At IAM & Admin >> Service Accounts, click the `admin-svc` account.
- On the new page, click KEYS >> [ADD KEY] >> Create new key >> JSON >> CREATE.
- A new file will be downloaded. Keep it safe and never publish it to public shares (GitHub, Pastebin, etc.).
- For the first use, go to Compute Engine and enable the Compute Engine API on the new page.
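If you prefer the command line, the service-account steps can also be done with `gcloud` once the Google Cloud SDK is installed. This is a sketch: the project ID and key path are placeholders, and the cloud calls are guarded so the snippet does nothing when `gcloud` is not available.

```shell
# Placeholders -- replace with your own values.
PROJECT_ID="br-weather-your-name"
SA_NAME="admin-svc"
SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"
echo "service account: $SA_EMAIL"

# Guarded so the sketch is safe to paste even without the SDK installed.
if command -v gcloud >/dev/null 2>&1; then
  gcloud iam service-accounts create "$SA_NAME" --project "$PROJECT_ID"
  # Same roles as in the list above.
  for ROLE in roles/bigquery.admin roles/compute.admin roles/storage.admin \
              roles/storage.objectAdmin roles/viewer; do
    gcloud projects add-iam-policy-binding "$PROJECT_ID" \
      --member "serviceAccount:${SA_EMAIL}" --role "$ROLE"
  done
  # Download a JSON key -- keep this file out of public repos!
  gcloud iam service-accounts keys create ./gcp-key.json --iam-account "$SA_EMAIL"
  # Enable the Compute Engine API (the console does this on first visit).
  gcloud services enable compute.googleapis.com --project "$PROJECT_ID"
fi
```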
- Set up the Google Cloud SDK on your computer (see Item 4 for instructions)
- Set up Terraform on your computer. Instructions here
- OPTIONAL: configure SSH key authentication in your project:
  - On your PC, run:

        ssh-keygen -t rsa -f ~/.ssh/KEY_FILENAME -C USERNAME -b 2048

  - Go to Compute Engine >> Metadata >> SSH KEYS
  - Click EDIT >> + ADD ITEM
  - Copy/paste the content of your generated file `~/.ssh/KEY_FILENAME.pub` into the SSH Key field
  - To show your VM's External IP:
    - Go back to Compute Engine
    - Click "Column display options..."
    - Select External IP
  - Now you are able to log in to the GCP VM with:

        ssh -i ~/.ssh/KEY_FILENAME USERNAME@VM.EXTERNAL.IP

  - References: GCP Docs
It's time to clone this repository to a folder on your PC. In your shell, run:
cd ~
git clone https://github.com/romiof/brazil-weather.git
cd brazil-weather
We're going to use Terraform to build our cloud infrastructure. In your shell, run this to download all required artifacts:
cd ~/brazil-weather/terraform
terraform init
My recipe sets up three GCP objects in "us-west1" / "us-west1-a", to take advantage of the GCP Free Tier:

- A Google Cloud Storage bucket
- A BigQuery dataset
- An Ubuntu 22.04 VM of type `e2-medium` (which costs about $0.03 per hour)

Keep an eye here to see your free tier limits.
For the VM, a swap file will also be created and all pip requirements will be installed. The Prefect agent will start under the VM's root user. I need this approach because all of us are past the GCP 90-day trial, and now we must pay for some resources 😉
Here's how to plan / apply / destroy your cloud resources:
terraform plan \
  -var="project=your-gcp-project-id" \
  -var="PREFECT_API_KEY=your_prefect_cloud_token_api" \
  -var="PREFECT_WORKSPACE=prefect_cloud/workspace_string"

terraform apply \
  -var="project=your-gcp-project-id" \
  -var="PREFECT_API_KEY=your_prefect_cloud_token_api" \
  -var="PREFECT_WORKSPACE=prefect_cloud/workspace_string"

terraform destroy \
  -var="project=your-gcp-project-id" \
  -var="PREFECT_API_KEY=your_prefect_cloud_token_api" \
  -var="PREFECT_WORKSPACE=prefect_cloud/workspace_string"
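To avoid retyping the three `-var` flags on every run, you could put the same values in a `terraform.tfvars` file, which Terraform loads automatically. A sketch, using the same placeholder values as above:

```hcl
# terraform.tfvars -- picked up automatically by plan / apply / destroy
project           = "your-gcp-project-id"
PREFECT_API_KEY   = "your_prefect_cloud_token_api"
PREFECT_WORKSPACE = "prefect_cloud/workspace_string"
```

Remember that this file now contains your Prefect API key, so keep it out of version control too.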
Use your Prefect Cloud API key and workspace string to fill in the Terraform variables above.
Now on Prefect Cloud, let's create all the needed blocks.
Under the Blocks menu, create four items:

- GCP Credentials / `gcp-login`
  - Put `<your-gcp-project-id>` in Project (Optional)
  - Copy/paste the content of your GCP Service Account JSON key in Service Account Info (Optional)
- GCS / `gcs-prefect`
  - Put `bucket-<your-GCP-bucket-name>/flows` in Bucket Path
  - Copy/paste the content of your GCP Service Account JSON key in Service Account Info (Optional)
- BigQuery Warehouse / `gcp-bq`
  - In the combo box, select `gcp-login`
- GCS Bucket / `gcs-bucket`
  - Put `bucket-<your-GCP-bucket-name>` in Name of Bucket
  - Put `data-lake` in Bucket Folder
  - In the combo box, select `gcp-login`
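One note: block types like GCP Credentials, GCS Bucket, and BigQuery Warehouse come from the `prefect-gcp` collection. If they don't appear in your Blocks menu, you may need to register them from your PC first. A sketch (the guard just makes it safe to run outside the virtualenv):

```shell
# Hypothetical helper step: register prefect-gcp block types with Prefect Cloud
# so they show up in the Blocks menu. Skipped when the prefect CLI is missing.
MODULE="prefect_gcp"
if command -v prefect >/dev/null 2>&1; then
  pip install prefect-gcp
  prefect block register -m "$MODULE"
else
  echo "prefect CLI not found; activate your virtualenv first ($MODULE not registered)"
fi
```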
First of all, on your PC:
- Install Python 3.10.x
- Install VirtualEnv
Then execute these commands to create a new venv:
cd ~/brazil-weather
virtualenv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
A virtualenv was created in your project folder and all pip requirements have been installed. Now log your local Prefect profile in to Prefect Cloud.
cd ~/brazil-weather
prefect cloud login -k <your_prefect_cloud_token_api>
prefect cloud workspace set --workspace <prefect_cloud/workspace_string>
prefect config set PREFECT_API_ENABLE_HTTP2=false
Prefect agents and Cloud servers have been having some trouble with HTTP/2, so I suggest disabling it for now. GitHub Issue
My `.py` files used for Prefect consist of one main file and an extra functions file.
All CSV files used for ELT will be downloaded from https://portal.inmet.gov.br/uploads/dadoshistoricos/. This URL is passed in via the dict key `BASE_URL`.
Now let's deploy our flow to the Cloud workspace.
It will use a sub-folder `/flows/` under the GCS bucket (from the Prefect block) to store our `.py` files.
After that, our Prefect agent will download these files on each flow run.
cd ~/brazil-weather/prefect
prefect deployment build elt_flow.py:main_flow -n brazil-weather-flow -sb gcs/gcs-prefect -q default --cron "0 5 * * *" -o brazil-weather-flow.yaml
My default parameters are JSON/dict attributes, and I couldn't figure out how to include them at `deployment build` time.
So please edit the file `brazil-weather-flow.yaml`, under the `parameters` key, to include this:
parameters:
dict_param:
BASE_URL: https://portal.inmet.gov.br/uploads/dadoshistoricos/
DEST_DIR: ./dump_zips/
FILE_EXT: .zip
START_YEAR: 2013
END_YEAR: 2023
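As an alternative to hand-editing the YAML: depending on your Prefect 2 version, `prefect deployment build` may accept a `--params` option with a JSON string, which would set the same defaults in one shot. This is an untested sketch (check `prefect deployment build --help` for your version), and it skips the build when the CLI is missing:

```shell
# Hypothetical one-shot build with default parameters passed as JSON.
PARAMS_JSON='{"dict_param": {"BASE_URL": "https://portal.inmet.gov.br/uploads/dadoshistoricos/", "DEST_DIR": "./dump_zips/", "FILE_EXT": ".zip", "START_YEAR": 2013, "END_YEAR": 2023}}'
# Sanity-check that the JSON is well formed before using it.
echo "$PARAMS_JSON" | python3 -m json.tool >/dev/null && echo "params OK"

if command -v prefect >/dev/null 2>&1; then
  cd ~/brazil-weather/prefect && \
  prefect deployment build elt_flow.py:main_flow -n brazil-weather-flow \
    -sb gcs/gcs-prefect -q default --cron "0 5 * * *" \
    --params "$PARAMS_JSON" -o brazil-weather-flow.yaml
fi
```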
Now let's apply it to Prefect Cloud, and you will see it in your environment:
prefect deployment apply brazil-weather-flow.yaml
This deployment has a schedule to run once a day, at UTC 05:00 AM.
On the dbt website, you need to set up a connection to GCP BigQuery, and possibly to your own GitHub repo. So you may need to fork my repo and start from it.
I'll describe my steps using dbt Cloud:
- At Account Settings, click Projects.
- I set the name of my project to `Analytics`.
- In Repository, I connected my GitHub repo to my dbt account.
  - Here you should use a fork, because I'm not sure whether you can link to my repo.
- In Connection, I chose BigQuery.
  - Upload your GCP Service Account JSON key
  - For `Timeout`, insert `300`
  - For `Location`, insert `us-west1`
  - Reference
- Project subdirectory
  - Insert `dbt` here (this points to the `dbt` folder in my repo)
- For development in dbt, use a git branch called `dbt-cloud`
Now you must create a dbt Environment, and after that a dbt Job.

- Go to Deploy >> Environments
  - + Create Environment
  - In `Name`, put `Prod`
  - In `Environment Type`, choose `Deployment`
  - In `dbt Version`, choose `1.4 (latest)`
  - In `Deployment Credentials >> Dataset`, put `br_weather`
- Go to Deploy >> Jobs
  - + Create Job
  - In `Job Name`, put `Job_Br_Weather`
  - In `Environment`, choose `Prod`
  - In `Commands`, clear everything and insert two commands:
    - `dbt seed`
    - `dbt run`
  - In `Triggers >> Schedule`, turn on `Run on schedule`
  - In `Schedule Days`, turn on only `Monday`
  - In `At exact intervals`, put `6`
My source is refreshed only once a month, so running dbt once a week is enough for the SLA.
I started my project using dimensional modeling.
But I didn't have enough time to figure out how to connect fact and dim tables in Looker Studio.
So I switched to OBT (One Big Table), using only one final table in my dashboard.
After running the dbt models, the final table is `fact_weather_logs`, with almost 30 million records.
The final dashboard in Looker Studio can be reproduced as follows:

- Set up a new dashboard
- Choose BigQuery as the data source
- Select `<your-gcp-project-id>`
- Select dataset `br_weather`
- Select table `fact_weather_logs`
- Now draw like an artist (not my case 😁)
Dear colleague, I really hope you succeeded in reproducing my project.
It's not that fancy, but I've learned a lot since the beginning of the bootcamp.
All the best to you!