
How to execute my project

As the final capstone step, we are supposed to run our fellows' projects on our own machines. So I'll summarize here how you can run my project.

You will need a Mac or Linux machine (like Ubuntu 22.04). Windows folks can use Ubuntu 22.04 on WSL2 without problems.

Setup Google Cloud Plataform for a new project

  1. Access GCP New Project by clicking here
  2. Set both Project name and Project ID to br-weather-your-name
  3. Go to IAM & Admin >> Service Accounts and create a New Service Account.
    1. Enter admin-svc as the Service account name.
    2. Assign these roles to your new account:
      • BigQuery Admin
      • Compute Admin
      • Storage Admin
      • Storage Object Admin
      • Viewer

    For real-world projects, you should create more granular roles for your service accounts.

    1. At IAM & Admin >> Service Accounts, click on the admin-svc account.
    2. On the new page, click KEYS >> ADD KEY >> Create new key >> JSON >> CREATE
    3. A new key file will be downloaded; keep it safe and never publish it to public shares (GitHub, Pastebin, etc.).
  4. On first use, go to Compute Engine and enable the Compute Engine API on the page that opens.
  5. Set up the Google Cloud SDK on your computer (Item 4 for instructions); an example of authenticating it with your key file follows this list.
  6. Set up Terraform on your computer. Instructions here
  7. OPTIONAL: configure SSH key authentication for your project:
    1. On your PC, run this:
      ssh-keygen -t rsa -f ~/.ssh/KEY_FILENAME -C USERNAME -b 2048
    2. Go to Compute Engine >> Metadata >> SSH KEYS
    3. Click EDIT >> + ADD ITEM
    4. Paste the content of your generated file ~/.ssh/KEY_FILENAME.pub into the new SSH Key field.
    5. To show your VM's External IP:
      • Go back to Compute Engine

      • Click Column display options...

      • Select External IP

        VM External IP

    6. Now you can log in to your GCP VM with:
      ssh -i ~/.ssh/KEY_FILENAME USERNAME@VM.EXTERNAL.IP
    7. Reference: GCP Docs
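
If you prefer to finish the SDK setup from the shell, here is a minimal sketch of authenticating it with the service-account key from step 3 (the key path below is a placeholder, adjust it to wherever you saved your JSON file):

# authenticate gcloud with the downloaded JSON key (placeholder path)
gcloud auth activate-service-account --key-file "$HOME/keys/admin-svc.json"
# point gcloud to your new project
gcloud config set project br-weather-your-name
# Terraform and other tools can pick the key up via Application Default Credentials
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/admin-svc.json"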

Git Clone Me

It's time to clone this repository to a folder on your PC. In your shell, run this:

cd ~
git clone https://github.com/romiof/brazil-weather.git
cd brazil-weather

How to use Terraform

We're going to use Terraform to build our cloud infrastructure. In your shell, run this to download all artifacts:

cd ~/brazil-weather/terraform
terraform init

My recipe will set up three GCP objects in "us-west1" / "us-west1-a", to take advantage of the GCP Free Tier.

  • A Google Cloud Storage Bucket
  • A BigQuery Dataset
  • An Ubuntu 22.04 VM of type e2-medium (which costs about $0.03 per hour)

Keep an eye here to see your free tier limits.

Also, on the VM a swap file will be created and all pip requirements will be installed. The Prefect agent will start as the VM's root user. I need to use this approach because all of us are past the GCP 90-day trial, and now we must pay for some resources 😉
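
After terraform apply (commands below) finishes, a rough way to check that the agent really came up is to SSH into the VM (see the optional SSH setup above) and look for its process:

ssh -i ~/.ssh/KEY_FILENAME USERNAME@VM.EXTERNAL.IP
# on the VM: the agent should be running as root
ps aux | grep "prefect agent"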

Here is how to plan / apply / destroy your cloud resources:

terraform plan \
    -var="project=your-gcp-project-id" \
    -var="PREFECT_API_KEY=your_prefect_cloud_token_api" \
    -var="PREFECT_WORKSPACE=prefect_cloud/workspace_string"

terraform apply \
    -var="project=your-gcp-project-id" \
    -var="PREFECT_API_KEY=your_prefect_cloud_token_api" \
    -var="PREFECT_WORKSPACE=prefect_cloud/workspace_string"

terraform destroy \
    -var="project=your-gcp-project-id" \
    -var="PREFECT_API_KEY=your_prefect_cloud_token_api" \
    -var="PREFECT_WORKSPACE=prefect_cloud/workspace_string"

Prefect Cloud

Use your Prefect Cloud API key / workspace to fill in the Terraform command variables above.

Now on Prefect Cloud, let's create all the needed blocks.
Under the Blocks menu, create four items:

  1. GCP Credentials / gcp-login
  2. GCS / gcs-prefect
  3. BigQuery Warehouse / gcp-bq
    • In the combo box, select gcp-login
  4. GCS Bucket / gcs-bucket
    • Put bucket-<your-GCP-bucket-name> in Name of Bucket
    • Put data-lake in Bucket Folder
    • In the combo box, select gcp-login

Prefect Blocks
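
The GCP Credentials, BigQuery Warehouse and GCS Bucket block types come from the prefect-gcp collection. If they don't show up in the Blocks catalog, you can register them from your local VENV (created in the next section), assuming prefect-gcp is installed there:

# publishes the prefect-gcp block types to your Prefect Cloud workspace
prefect block register -m prefect_gcp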

Python, VirtualEnv and Prefect Local

First of all, execute these commands on your PC to create a new VENV:

cd ~/brazil-weather
virtualenv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

A VirtualEnv was created in your project folder and all pip requirements have been installed. Now log your local Prefect profile in to Prefect Cloud.

cd ~/brazil-weather
prefect cloud login -k <your_prefect_cloud_token_api>
prefect cloud workspace set --workspace <prefect_cloud/workspace_string>
prefect config set PREFECT_API_ENABLE_HTTP2=false

Prefect agents / Cloud servers are having some trouble with HTTP2, so I suggest you disable it for a while. GitHub Issue

My .py files used for Prefect consist of one main file and an extra functions file. All CSV files used for ELT will be downloaded from https://portal.inmet.gov.br/uploads/dadoshistoricos/. This URL is passed in via the dict key BASE_URL.

Prefect Flow "DAG"

Now let's deploy our flow to the Cloud workspace. It will use a sub-folder /flows/ under the GCS bucket (from the Prefect block) to store our .py files. After that, our Prefect agent will download these files on each flow run.

cd ~/brazil-weather/prefect
prefect deployment build elt_flow.py:main_flow -n brazil-weather-flow -sb gcs/gcs-prefect -q default --cron "0 5 * * *" -o brazil-weather-flow.yaml

My default parameters are JSON/dict attributes, and I couldn't figure out how to include them at deployment build time. So please edit the file brazil-weather-flow.yaml, under the parameters key, to include this:

parameters:
  dict_param:
    BASE_URL: https://portal.inmet.gov.br/uploads/dadoshistoricos/
    DEST_DIR: ./dump_zips/
    FILE_EXT: .zip
    START_YEAR: 2013
    END_YEAR: 2023

Deployment Parameters
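
Alternatively, if your Prefect 2 version supports the --params flag (a JSON string of parameters), the same defaults could be baked in at build time instead of editing the YAML by hand; an untested sketch:

prefect deployment build elt_flow.py:main_flow -n brazil-weather-flow -sb gcs/gcs-prefect -q default --cron "0 5 * * *" -o brazil-weather-flow.yaml \
    --params '{"dict_param": {"BASE_URL": "https://portal.inmet.gov.br/uploads/dadoshistoricos/", "DEST_DIR": "./dump_zips/", "FILE_EXT": ".zip", "START_YEAR": 2013, "END_YEAR": 2023}}'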

Now let's apply it to Prefect Cloud, and you will see it in your environment:

prefect deployment apply brazil-weather-flow.yaml

Prefect Cloud Deployment

This deployment is scheduled to run once a day, at 05:00 AM UTC.

Prefect Cloud Flows
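
You don't have to wait for the 05:00 UTC schedule; a run can also be triggered manually from your shell (assuming the flow keeps its default name, main-flow):

prefect deployment run main-flow/brazil-weather-flow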

dbt Cloud

On the dbt website, you need to set up a connection to GCP BigQuery and possibly to your own GitHub repo. So you may need to fork my repo and start from it.

I'll describe my steps to use dbt Cloud:

  1. At Account Settings, click on Projects.
  2. I set the name of my project to Analytics.
  3. In Repository, I connected my GH repo to my dbt account.
    • Here you should use a fork, because I'm not sure if you can link to my repo.
  4. In Connection, I chose BigQuery.
  5. Project subdirectory
    • Enter dbt here (this points to the dbt folder in my repo)
  6. For development in dbt, use a git branch called dbt-cloud

Now you must create a dbt Environment and then a dbt Job.

  1. Go to Deploy >> Environments

    1. + Create Environment
    2. In Name put Prod
    3. In Environment Type choose Deployment
    4. In dbt Version choose 1.4 (latest)
    5. In Deployment Credentials >> Dataset put br_weather
  2. Go to Deploy >> Jobs

    1. + Create Job
    2. In Job Name put Job_Br_Weather
    3. In Environment choose Prod
    4. In Commands, clear everything and insert two commands:
      • dbt seed
      • dbt run
    5. In Triggers >> Schedule turn on Run on schedule
    6. In Schedule Days turn on only Monday
    7. In At exact intervals put 6

    My source is refreshed only once a month, so running dbt once a week is enough for the SLA.

    dbt schedule / dbt job

I started my project using dimensional modeling.

But I didn't have enough time to understand how to connect fact and dim tables in Looker Studio.
So I changed to OBT (One Big Table), using only one final table in my dashboard.

After running the dbt model, the final table is fact_weather_logs, with almost 30 million records.

Final Table
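
If you want to double-check the final table from your shell, here is a quick sketch using the bq CLI that ships with the Google Cloud SDK (it assumes your default project is already set):

# count the rows dbt loaded into the final table
bq query --use_legacy_sql=false 'SELECT COUNT(*) FROM br_weather.fact_weather_logs'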

Looker Studio

The final dashboard in Looker Studio can be reproduced with these steps:

  1. Setup a new dashboard
  2. Choose BigQuery as Data Source
  3. Select <your-gcp-project-id>
  4. Select dataset br_weather
  5. Select table fact_weather_logs
  6. Now, draw like an artist (not my case 😁)

Page 01

Epilogue

Dear colleague, I really hope you succeeded in reproducing my project.

It's not that fancy, but I've learned a lot since the beginning of the bootcamp.

All the best for you!