
New pipeline to load BlackCat API data #3129

Merged: 15 commits merged into main on Dec 7, 2023

Conversation

@kengie (Contributor) commented Nov 18, 2023

Description

This code creates a new data pipeline with two DAGs to ingest NTD-submitted forms from BlackCat, a third-party vendor, into the Cal-ITP data warehouse. The first DAG hits the API and saves all of its contents into one zipped JSONL file in Google Cloud Storage, using the new blackcat_to_gcs.py operator, which follows the same structure as the airtable_to_gcs.py operator. Hive partitions are created by date and timestamp, and the date and timestamp also make up part of the file path to each report's zipped JSONL file. Even though the data volume is small and the pipeline will usually not run more than once per day, we included Hive partitioning by timestamp so that file paths differ for each run if it is run more than once per day; otherwise, the Airflow job would fail because it cannot overwrite a JSONL file of the same name.
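For illustration, here is a minimal sketch of how such a date- and timestamp-partitioned path could be constructed; the bucket, prefix, and report name below are placeholders, not the actual values used by blackcat_to_gcs.py.

```python
from datetime import datetime, timezone


def build_partitioned_path(bucket: str, report_name: str, extract_ts: datetime) -> str:
    """Build a Hive-partitioned GCS path whose dt=/ts= partitions make each
    run's file path unique, so a second run on the same day does not try to
    overwrite an existing JSONL file of the same name."""
    return (
        f"gs://{bucket}/blackcat_api/"
        f"dt={extract_ts.date().isoformat()}/"
        f"ts={extract_ts.isoformat()}/"
        f"{report_name}.jsonl.gz"
    )


# Two runs on the same day land in different ts= partitions:
print(build_partitioned_path("example-bucket", "rr20_service", datetime.now(timezone.utc)))
```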

Second, we create an external table so the data can be viewed in BigQuery. This is done by adding a new folder in the create_external_tables DAG with a YAML for each table; these call the existing external_table.py operator.

Third, we have three levels of dbt models to transform this data and run validation checks on the submitted answers:

  • The staging/ntd_validation folder has SQL statements that take the API data from the external table and unnest the columns into individual tables (in the external table, each NTD table from the API is a nested column).
  • The intermediate/ntd_validation folder has dbt models (either SQL or Python) that further massage the data into the form needed for the final validation checks (a sketch of what such a Python model might look like follows this list).
  • The mart/ntd_validation folder has validation checks for the RR-20. These are all of the checks from the rr20_service_check.py and rr20_financials_check.py scripts, except for the check against the vehicle inventory, since the API doesn't have that information yet.
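As an illustration of the intermediate layer, here is a minimal sketch of what a dbt Python model computing a ratio could look like. The model names come from the warehouse report further down, but the lineage, column names, and ratio math are assumptions for illustration, not the project's actual logic.

```python
# Hypothetical sketch: models/intermediate/ntd_validation/int_ntd_rr20_service_ratios.py


def model(dbt, session):
    dbt.config(materialized="table")

    # On the BigQuery adapter, dbt.ref() typically hands a PySpark DataFrame
    # to Python models; convert to pandas for simple column math in this sketch.
    alldata = dbt.ref("int_ntd_rr20_service_alldata").toPandas()

    # Illustrative ratio column (column names are assumptions): cost per
    # vehicle revenue hour, masked to NULL where hours are zero or missing.
    hours = alldata["annual_vehicle_revenue_hours"]
    alldata["cost_per_hr"] = alldata["total_operating_expenses"].div(hours).where(hours > 0)

    return alldata
```

As the review discussion below notes, the same ratio could also be expressed as a SQL model.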

I also added columns for the date checked and the NTD error ID to the validation tables.


Resolves #[issue]

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

How has this been tested?

The new blackcat_to_gcs.py operator has been tested using local Airflow, successfully writing JSONL files to GCS. The API data saved during test runs can be viewed in the test GCS folder test-calitp-ntd-report-validation. The dbt models have been tested using local dbt, and they exist in BigQuery in the cal-itp-data-infra-staging project in the staging_staging database.

Post-merge follow-ups

  • No action required
  • Actions required (specified below)

This Airflow DAG will remain turned off. The NTD reporting season officially closed on October 31, so there is no reason to keep it running, since it only pulls NTD-related data from BlackCat. @csuyat-dot on the Cal-ITP team will own this pipeline in 2024 and turn it on when sub-recipients start submitting 2024 data, around September 2024.

@kengie requested a review from tiffanychu90 as a code owner on November 22, 2023 18:27
@lauriemerrell (Contributor) left a comment


Thanks Kim! Sorry for the delayed review since I was out of office. I agree with Evan's comments above and have a few more -- I mostly just want to clarify the intention behind using Python models. While we do technically support them, we don't actually use them anywhere in the project yet, and it seems like most of the operations here can be performed in SQL in ways that might be easier to maintain (since that would be more in keeping with the rest of the warehouse).

Edit: I do defer to the folks who will maintain this workflow long term on whether the Python vs. SQL consideration holds weight, it's just my instinct from a historical perspective to stick with SQL unless there's a specific reason otherwise.


return logger

def make_ratio_cols(df, numerator, denominator, col_name, logger, operation="sum"):
Contributor


I am a little confused why this is a Python model; it seems that these ratio operations can be performed relatively straightforwardly in SQL (generally a Python model is only used if the relevant operations cannot be performed in SQL or would be very verbose)

Member


Confirming that @csuyat-dot and I talked to Kim about translating the ratio calculation step to SQL next year; Kim will prioritize automating the other checks.

ignore_index=True).sort_values(by="Organization")

## Part 1: save Excel file to GCS
GCS_FILE_PATH_VALIDATED = f"gs://calitp-ntd-report-validation/validation_reports_{this_year}"
@lauriemerrell (Contributor) commented Nov 27, 2023

As noted in Slack, the intention of a Python model in dbt is not to output files to other locations but simply to perform operations on data models (totally akin to SQL models). Exports should be orchestrated using Airflow.

@kengie (Contributor, Author)

Moving this over to a different script; will try to deploy it with Airflow as suggested.
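For reference, a minimal sketch of what orchestrating such an export as an Airflow task could look like; the DAG id, schedule, bucket, and dataset names are hypothetical, and the only project name reused here is the fct_ntd_rr20_service_checks mart model listed in the warehouse report below.

```python
# Hypothetical sketch of an Airflow export task, separate from the dbt models.
import pendulum
from airflow.decorators import dag, task


@dag(schedule=None, start_date=pendulum.datetime(2023, 11, 1, tz="UTC"), catchup=False)
def export_ntd_validation_report():
    @task
    def export_report_to_gcs() -> str:
        import pandas_gbq
        from google.cloud import storage

        # Read the mart table that dbt already built (dataset name is assumed) ...
        df = pandas_gbq.read_gbq(
            "SELECT * FROM mart_ntd_validation.fct_ntd_rr20_service_checks"
        )

        # ... and write it out as a CSV report to a (placeholder) GCS bucket.
        blob = storage.Client().bucket("example-report-bucket").blob(
            "validation_reports/rr20_service_checks.csv"
        )
        blob.upload_from_string(df.to_csv(index=False), content_type="text/csv")
        return blob.name

    export_report_to_gcs()


export_ntd_validation_report()
```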



def check_rr20_ratios(df, variable, threshold, this_year, last_year, logger):
'''Validation checks where a ratio must be within a certain threshold limit
Contributor


Is the goal to create columns with these ratio checks, or to treat this as a validation that the rows are valid? (I.e., do you want to have a column with a boolean value for pass/fail, or do you want to just alert if some fail? If the latter, could leverage dbt tests.)
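For context, a minimal sketch of the boolean-column version of such a check; the column naming convention and the year-over-year threshold logic are assumptions, not the actual check_rr20_ratios implementation. The dbt-test alternative would instead assert that no rows violate the threshold.

```python
import pandas as pd


def check_rr20_ratio_threshold(df: pd.DataFrame, variable: str, threshold: float,
                               this_year: int, last_year: int) -> pd.DataFrame:
    """Add a pass/fail column flagging rows whose year-over-year change in a
    ratio exceeds the allowed threshold (hypothetical column layout)."""
    prior = df[f"{variable}_{last_year}"]
    change = (df[f"{variable}_{this_year}"] - prior).abs().div(prior)
    df[f"{variable}_within_threshold"] = change <= threshold
    return df
```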


github-actions bot commented Nov 28, 2023

Warehouse report 📦

Checks/potential follow-ups

Checks indicate the following action items may be necessary.

  • For new models, do they all have a surrogate primary key that is tested to be not-null and unique?

New models 🌱

calitp_warehouse.mart.ntd_validation.fct_ntd_rr20_service_checks

calitp_warehouse.intermediate.ntd_validation.int_ntd_rr20_service_alldata

calitp_warehouse.intermediate.ntd_validation.int_ntd_rr20_service_ratios

calitp_warehouse.staging.ntd_validation.stg_ntd_2022_rr20_exp_by_mode

calitp_warehouse.staging.ntd_validation.stg_ntd_2022_rr20_financial

calitp_warehouse.staging.ntd_validation.stg_ntd_2022_rr20_service

calitp_warehouse.staging.ntd_validation.stg_ntd_2023_a10

calitp_warehouse.staging.ntd_validation.stg_ntd_2023_rr20_rural

calitp_warehouse.staging.ntd_validation.stg_ntd_2023_rr20_urban_tribal

calitp_warehouse.staging.ntd_validation.stg_ntd_subrecipients

DAG

Legend (in order of precedence)

Resource type | Indicator | Resolution
Large table-materialized model | Orange | Make the model incremental
Large model without partitioning or clustering | Orange | Add partitioning and/or clustering
View with more than one child | Yellow | Materialize as a table or incremental
Incremental | Light green |
Table | Green |
View | White |

@lauriemerrell (Contributor)

One stray comment -- I see that the lint GitHub Action is failing; you may want to run pre-commit locally to resolve any linting issues.

@kengie (Contributor, Author) commented Nov 28, 2023

Thanks @lauriemerrell, @evansiroky! I updated the code to pull everything from the API (which is indeed just one URL) and then save it as one external table. In this table, each NTD report is a nested column in BigQuery. Then, dbt models parse the reports out into separate tables.

In terms of Python vs. SQL, at this point it's partly a pragmatic choice to start with, since the Python is already written and understood by @csuyat-dot, who will be taking over, and I have about one week of dev time left on this project. My intention is to transfer all of the checks into dbt first, then, as time allows, convert as much Python as possible into SQL. Imperfect, but it ensures coverage.

@evansiroky (Member)

It looks like there are other eyes on this. I like the refactor of downloading the raw data. Also, I generally agree with @lauriemerrell that dbt should be leveraged as much as possible to keep the coding style similar to the rest of the warehouse scripts.

@lauriemerrell force-pushed the kim_load_blackcat_reports_to_bigquery branch from 49702cd to 91e7fad on December 7, 2023 18:06
@lauriemerrell merged commit 451326a into main on Dec 7, 2023 (5 of 6 checks passed)
@lauriemerrell deleted the kim_load_blackcat_reports_to_bigquery branch on December 7, 2023 18:10