As the final capstone, we are supposed to run fellow students' projects on our own machines, so I'll summarize here how you can run mine.
You will need a Mac or Linux machine (e.g. Ubuntu 22.04). Windows folks can use Ubuntu 22.04 under WSL2 without problems.
- Access the GCP New Project page by clicking here
- For both Project name and Project ID, use: `br-weather-your-name`
- Go to IAM & Admin >> Service Accounts and create a new service account.
- Put `admin-svc` in Service account name.
- Assign these roles to your new account:
  - BigQuery Admin
  - Compute Admin
  - Storage Admin
  - Storage Object Admin
  - Viewer

  For real-world projects, you should create more granular roles for your service accounts.
- At IAM & Admin >> Service Accounts, click the `admin-svc` account.
- On the new page, click KEYS >> [ADD KEY] >> Create new key >> JSON >> CREATE.
- A new file will be downloaded. Keep it safe and never publish it to public shares (GitHub, Pastebin, etc.).
- For the first use, go to Compute Engine and enable the Compute Engine API on the new page.
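If you prefer the command line, the service-account steps can also be done with `gcloud` once the Google Cloud SDK is installed. This is a sketch: the project ID and key path are placeholders, and the cloud calls are guarded so the snippet does nothing when `gcloud` is not available.

```shell
# Placeholders -- replace with your own values.
PROJECT_ID="br-weather-your-name"
SA_NAME="admin-svc"
SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"
echo "service account: $SA_EMAIL"

# Guarded so the sketch is safe to paste even without the SDK installed.
if command -v gcloud >/dev/null 2>&1; then
  gcloud iam service-accounts create "$SA_NAME" --project "$PROJECT_ID"
  # Same roles as in the list above.
  for ROLE in roles/bigquery.admin roles/compute.admin roles/storage.admin \
              roles/storage.objectAdmin roles/viewer; do
    gcloud projects add-iam-policy-binding "$PROJECT_ID" \
      --member "serviceAccount:${SA_EMAIL}" --role "$ROLE"
  done
  # Download a JSON key -- keep this file out of public repos!
  gcloud iam service-accounts keys create ./gcp-key.json --iam-account "$SA_EMAIL"
  # Enable the Compute Engine API (the console does this on first visit).
  gcloud services enable compute.googleapis.com --project "$PROJECT_ID"
fi
```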
- Set up the Google Cloud SDK on your computer (see Item 4 for instructions)
- Set up Terraform on your computer. Instructions here
- OPTIONAL: configure SSH key authentication in your project:
  - On your PC, run:

        ssh-keygen -t rsa -f ~/.ssh/KEY_FILENAME -C USERNAME -b 2048

  - Go to Compute Engine >> Metadata >> SSH KEYS
  - Click EDIT >> + ADD ITEM
  - Copy/paste the content of your generated file `~/.ssh/KEY_FILENAME.pub` into the SSH Key field
  - To show your VM's External IP:
    - Go back to Compute Engine
    - Click "Column display options..."
    - Select External IP
  - Now you are able to log in to the GCP VM with:

        ssh -i ~/.ssh/KEY_FILENAME USERNAME@VM.EXTERNAL.IP

  - References: GCP Docs
It's time to clone this repository to a folder on your PC. In your shell, run:
cd ~
git clone https://github.com/romiof/brazil-weather.git
cd brazil-weather
We're going to use Terraform to build our cloud infrastructure. In your shell, run this to download all required artifacts:
cd ~/brazil-weather/terraform
terraform init
My recipe sets up three GCP objects in "us-west1" / "us-west1-a", to take advantage of the GCP Free Tier:

- A Google Cloud Storage bucket
- A BigQuery dataset
- An Ubuntu 22.04 VM of type `e2-medium` (which costs about $0.03 per hour)

Keep an eye here to see your free tier limits.
For the VM, a swap file will also be created and all pip requirements will be installed. The Prefect agent will start under the VM's root user. I need this approach because all of us are past the GCP 90-day trial, and now we must pay for some resources 😉
Here's how to plan / apply / destroy your cloud resources:
terraform plan \
  -var="project=your-gcp-project-id" \
  -var="PREFECT_API_KEY=your_prefect_cloud_token_api" \
  -var="PREFECT_WORKSPACE=prefect_cloud/workspace_string"

terraform apply \
  -var="project=your-gcp-project-id" \
  -var="PREFECT_API_KEY=your_prefect_cloud_token_api" \
  -var="PREFECT_WORKSPACE=prefect_cloud/workspace_string"

terraform destroy \
  -var="project=your-gcp-project-id" \
  -var="PREFECT_API_KEY=your_prefect_cloud_token_api" \
  -var="PREFECT_WORKSPACE=prefect_cloud/workspace_string"
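To avoid retyping the three `-var` flags on every run, you could put the same values in a `terraform.tfvars` file, which Terraform loads automatically. A sketch, using the same placeholder values as above:

```hcl
# terraform.tfvars -- picked up automatically by plan / apply / destroy
project           = "your-gcp-project-id"
PREFECT_API_KEY   = "your_prefect_cloud_token_api"
PREFECT_WORKSPACE = "prefect_cloud/workspace_string"
```

Remember that this file now contains your Prefect API key, so keep it out of version control too.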
Use your Prefect Cloud API key and workspace string to fill in the Terraform variables above.
Now on Prefect Cloud, let's create all the needed blocks.
Under the Blocks menu, create four items:

- GCP Credentials / `gcp-login`
  - Put `<your-gcp-project-id>` in Project (Optional)
  - Copy/paste the content of your GCP Service Account JSON key in Service Account Info (Optional)
- GCS / `gcs-prefect`
  - Put `bucket-<your-GCP-bucket-name>/flows` in Bucket Path
  - Copy/paste the content of your GCP Service Account JSON key in Service Account Info (Optional)
- BigQuery Warehouse / `gcp-bq`
  - In the combo box, select `gcp-login`
- GCS Bucket / `gcs-bucket`
  - Put `bucket-<your-GCP-bucket-name>` in Name of Bucket
  - Put `data-lake` in Bucket Folder
  - In the combo box, select `gcp-login`
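One note: block types like GCP Credentials, GCS Bucket, and BigQuery Warehouse come from the `prefect-gcp` collection. If they don't appear in your Blocks menu, you may need to register them from your PC first. A sketch (the guard just makes it safe to run outside the virtualenv):

```shell
# Hypothetical helper step: register prefect-gcp block types with Prefect Cloud
# so they show up in the Blocks menu. Skipped when the prefect CLI is missing.
MODULE="prefect_gcp"
if command -v prefect >/dev/null 2>&1; then
  pip install prefect-gcp
  prefect block register -m "$MODULE"
else
  echo "prefect CLI not found; activate your virtualenv first ($MODULE not registered)"
fi
```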
First of all, on your PC:
- Install Python 3.10.x
- Install VirtualEnv
Then execute these commands to create a new venv:
cd ~/brazil-weather
virtualenv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
A virtualenv was created in your project folder and all pip requirements have been installed. Now log your local Prefect profile in to Prefect Cloud.
cd ~/brazil-weather
prefect cloud login -k <your_prefect_cloud_token_api>
prefect cloud workspace set --workspace <prefect_cloud/workspace_string>
prefect config set PREFECT_API_ENABLE_HTTP2=false
Prefect agents and Cloud servers have been having some trouble with HTTP/2, so I suggest disabling it for now. GitHub Issue
My `.py` files used for Prefect consist of one main file and an extra functions file.
All CSV files used for ELT will be downloaded from https://portal.inmet.gov.br/uploads/dadoshistoricos/. This URL is passed in via the dict key `BASE_URL`.
Now let's deploy our flow to the Cloud workspace.
It will use a sub-folder `/flows/` under the GCS bucket (from the Prefect block) to store our `.py` files.
After that, our Prefect agent will download these files on each flow run.
cd ~/brazil-weather/prefect
prefect deployment build elt_flow.py:main_flow -n brazil-weather-flow -sb gcs/gcs-prefect -q default --cron "0 5 * * *" -o brazil-weather-flow.yaml
My default parameters are JSON/dict attributes, and I couldn't figure out how to include them at `deployment build` time.
So please edit the file `brazil-weather-flow.yaml`, under the `parameters` key, to include this:
parameters:
dict_param:
BASE_URL: https://portal.inmet.gov.br/uploads/dadoshistoricos/
DEST_DIR: ./dump_zips/
FILE_EXT: .zip
START_YEAR: 2013
END_YEAR: 2023
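As an alternative to hand-editing the YAML: depending on your Prefect 2 version, `prefect deployment build` may accept a `--params` option with a JSON string, which would set the same defaults in one shot. This is an untested sketch (check `prefect deployment build --help` for your version), and it skips the build when the CLI is missing:

```shell
# Hypothetical one-shot build with default parameters passed as JSON.
PARAMS_JSON='{"dict_param": {"BASE_URL": "https://portal.inmet.gov.br/uploads/dadoshistoricos/", "DEST_DIR": "./dump_zips/", "FILE_EXT": ".zip", "START_YEAR": 2013, "END_YEAR": 2023}}'
# Sanity-check that the JSON is well formed before using it.
echo "$PARAMS_JSON" | python3 -m json.tool >/dev/null && echo "params OK"

if command -v prefect >/dev/null 2>&1; then
  cd ~/brazil-weather/prefect && \
  prefect deployment build elt_flow.py:main_flow -n brazil-weather-flow \
    -sb gcs/gcs-prefect -q default --cron "0 5 * * *" \
    --params "$PARAMS_JSON" -o brazil-weather-flow.yaml
fi
```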
Now let's apply it to Prefect Cloud, and you will see it in your environment:
prefect deployment apply brazil-weather-flow.yaml
This deployment has a schedule to run once a day, at UTC 05:00 AM.
On the dbt website, you need to set up a connection to GCP BigQuery, and possibly to your own GitHub repo. So you may need to fork my repo and start from it.
I'll describe my steps using dbt Cloud:
- At Account Settings, click Projects.
- I set the name of my project to `Analytics`.
- In Repository, I connected my GitHub repo to my dbt account.
  - Here you should use a fork, because I'm not sure whether you can link to my repo.
- In Connection, I chose BigQuery.
  - Upload your GCP Service Account JSON key
  - For `Timeout`, insert `300`
  - For `Location`, insert `us-west1`
  - Reference
- Project subdirectory
  - Insert `dbt` here (this points to the `dbt` folder in my repo)
- For development in dbt, use a git branch called `dbt-cloud`
Now you must create a dbt Environment, and after that a dbt Job.

- Go to Deploy >> Environments
  - + Create Environment
  - In `Name`, put `Prod`
  - In `Environment Type`, choose `Deployment`
  - In `dbt Version`, choose `1.4 (latest)`
  - In `Deployment Credentials >> Dataset`, put `br_weather`
- Go to Deploy >> Jobs
  - + Create Job
  - In `Job Name`, put `Job_Br_Weather`
  - In `Environment`, choose `Prod`
  - In `Commands`, clear everything and insert two commands:
    - `dbt seed`
    - `dbt run`
  - In `Triggers >> Schedule`, turn on `Run on schedule`
  - In `Schedule Days`, turn on only `Monday`
  - In `At exact intervals`, put `6`
My source is refreshed only once a month, so running dbt once a week is enough for the SLA.
I started my project using dimensional modeling.
But I didn't have enough time to figure out how to connect fact and dim tables in Looker Studio.
So I switched to OBT (One Big Table), using only one final table in my dashboard.
After running the dbt models, the final table is `fact_weather_logs`, with almost 30 million records.
The final dashboard in Looker Studio can be reproduced as follows:

- Set up a new dashboard
- Choose BigQuery as the data source
- Select `<your-gcp-project-id>`
- Select dataset `br_weather`
- Select table `fact_weather_logs`
- Now draw like an artist (not my case 😁)
Dear colleague, I really hope you succeeded in reproducing my project.
It's not that fancy, but I've learned a lot since the beginning of the bootcamp.
All the best to you!