
New pipeline to load BlackCat API data #3129

Merged: 15 commits merged into main on Dec 7, 2023

Conversation

@kengie (Contributor) commented Nov 18, 2023

Description

This code creates a new data pipeline with two DAGs to ingest NTD-submitted forms from BlackCat, a third-party vendor, into the Cal-ITP data warehouse. The first DAG hits the API and saves all of its contents into one zipped JSONL file in Google Cloud Storage, using the new blackcat_to_gcs.py operator, which follows the same structure as the airtable_to_gcs.py operator. Hive partitions are created by date and timestamp, and the date and timestamp also make up part of the file path to each report's zipped JSONL file. Even though the data volume is small and the pipeline will usually not run more than once per day, we included Hive partitioning by timestamp so that file paths differ for each run if it is run more than once per day; otherwise, the Airflow job would fail because it cannot overwrite a JSONL file of the same name.
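For illustration, here is a minimal sketch of how such a date- and timestamp-partitioned path could be constructed; the bucket, prefix, and report name below are placeholders, not the actual values used by blackcat_to_gcs.py.

```python
from datetime import datetime, timezone


def build_partitioned_path(bucket: str, report_name: str, extract_ts: datetime) -> str:
    """Build a Hive-partitioned GCS path whose dt=/ts= partitions make each
    run's file path unique, so a second run on the same day does not try to
    overwrite an existing JSONL file of the same name."""
    return (
        f"gs://{bucket}/blackcat_api/"
        f"dt={extract_ts.date().isoformat()}/"
        f"ts={extract_ts.isoformat()}/"
        f"{report_name}.jsonl.gz"
    )


# Two runs on the same day land in different ts= partitions:
print(build_partitioned_path("example-bucket", "rr20_service", datetime.now(timezone.utc)))
```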

Second, we create an external table so the data can be viewed in BigQuery. This is done by adding a new folder in the create_external_tables DAG with a YAML for each table; these call the existing external_table.py operator.

Third, we have three levels of dbt models to transform this data and run validation checks on the submitted answers:

  • The staging/ntd_validation folder has SQL statements that take the API data from the external table and unnest the columns into individual tables (in the external table, each NTD table from the API is a nested column).
  • The intermediate/ntd_validation folder has dbt models (either SQL or Python) that further massage the data into the form needed for the final validation checks (a sketch of what such a Python model might look like follows this list).
  • The mart/ntd_validation folder has validation checks for the RR-20. These are all of the checks from the rr20_service_check.py and rr20_financials_check.py scripts, except for the check against the vehicle inventory, since the API doesn't have that information yet.
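As an illustration of the intermediate layer, here is a minimal sketch of what a dbt Python model computing a ratio could look like. The model names come from the warehouse report further down, but the lineage, column names, and ratio math are assumptions for illustration, not the project's actual logic.

```python
# Hypothetical sketch: models/intermediate/ntd_validation/int_ntd_rr20_service_ratios.py


def model(dbt, session):
    dbt.config(materialized="table")

    # On the BigQuery adapter, dbt.ref() typically hands a PySpark DataFrame
    # to Python models; convert to pandas for simple column math in this sketch.
    alldata = dbt.ref("int_ntd_rr20_service_alldata").toPandas()

    # Illustrative ratio column (column names are assumptions): cost per
    # vehicle revenue hour, masked to NULL where hours are zero or missing.
    hours = alldata["annual_vehicle_revenue_hours"]
    alldata["cost_per_hr"] = alldata["total_operating_expenses"].div(hours).where(hours > 0)

    return alldata
```

As the review discussion below notes, the same ratio could also be expressed as a SQL model.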

I also added columns for the date checked and the NTD error ID to the validation tables.


Resolves #[issue]

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

How has this been tested?

The new blackcat_to_gcs.py operator has been tested using local Airflow, successfully writing JSONL files to GCS. The API data saved during test runs can be viewed in the test GCS folder test-calitp-ntd-report-validation. The dbt models have been tested using local dbt, and they exist in BigQuery in the cal-itp-data-infra-staging project in the staging_staging database.

Post-merge follow-ups

  • No action required
  • Actions required (specified below)

This Airflow DAG will remain turned off. The NTD reporting season officially closed on October 31, so there is no reason to keep it running, since it only pulls NTD-related data from BlackCat. @csuyat-dot on the Cal-ITP team will own this pipeline in 2024 and turn it on when sub-recipients start submitting 2024 data, around September 2024.

@kengie requested a review from tiffanychu90 as a code owner on November 22, 2023 18:27
@lauriemerrell (Contributor) left a comment


Thanks Kim! Sorry for the delayed review since I was out of office. I agree with Evan's comments above and have a few more -- I mostly just want to clarify the intention behind using Python models. While we do technically support them, we don't actually use them anywhere in the project yet, and it seems like most of the operations here can be performed in SQL in ways that might be easier to maintain (since that would be more in keeping with the rest of the warehouse).

Edit: I do defer to the folks who will maintain this workflow long term on whether the Python vs. SQL consideration holds weight, it's just my instinct from a historical perspective to stick with SQL unless there's a specific reason otherwise.


return logger

def make_ratio_cols(df, numerator, denominator, col_name, logger, operation="sum"):
Contributor


I am a little confused why this is a Python model; it seems that these ratio operations can be performed relatively straightforwardly in SQL (generally a Python model is only used if the relevant operations cannot be performed in SQL or would be very verbose)

Member


Confirming that @csuyat-dot and I talked to Kim about translating the ratio calculation step to SQL next year; Kim will prioritize automating the other checks.

ignore_index=True).sort_values(by="Organization")

## Part 1: save Excel file to GCS
GCS_FILE_PATH_VALIDATED = f"gs://calitp-ntd-report-validation/validation_reports_{this_year}"
@lauriemerrell (Contributor) commented Nov 27, 2023

As noted in Slack, the intention of a Python model in dbt is not to output files to other locations but simply to perform operations on data models (totally akin to SQL models). Exports should be orchestrated using Airflow.

@kengie (Contributor, Author)

Moving this over to a different script; will try to deploy it with Airflow as suggested.
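For reference, a minimal sketch of what orchestrating such an export as an Airflow task could look like; the DAG id, schedule, bucket, and dataset names are hypothetical, and the only project name reused here is the fct_ntd_rr20_service_checks mart model listed in the warehouse report below.

```python
# Hypothetical sketch of an Airflow export task, separate from the dbt models.
import pendulum
from airflow.decorators import dag, task


@dag(schedule=None, start_date=pendulum.datetime(2023, 11, 1, tz="UTC"), catchup=False)
def export_ntd_validation_report():
    @task
    def export_report_to_gcs() -> str:
        import pandas_gbq
        from google.cloud import storage

        # Read the mart table that dbt already built (dataset name is assumed) ...
        df = pandas_gbq.read_gbq(
            "SELECT * FROM mart_ntd_validation.fct_ntd_rr20_service_checks"
        )

        # ... and write it out as a CSV report to a (placeholder) GCS bucket.
        blob = storage.Client().bucket("example-report-bucket").blob(
            "validation_reports/rr20_service_checks.csv"
        )
        blob.upload_from_string(df.to_csv(index=False), content_type="text/csv")
        return blob.name

    export_report_to_gcs()


export_ntd_validation_report()
```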



def check_rr20_ratios(df, variable, threshold, this_year, last_year, logger):
'''Validation checks where a ratio must be within a certain threshold limit
Contributor


Is the goal to create columns with these ratio checks, or to treat this as a validation that the rows are valid? (I.e., do you want to have a column with a boolean value for pass/fail, or do you want to just alert if some fail? If the latter, could leverage dbt tests.)
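For context, a minimal sketch of the boolean-column version of such a check; the column naming convention and the year-over-year threshold logic are assumptions, not the actual check_rr20_ratios implementation. The dbt-test alternative would instead assert that no rows violate the threshold.

```python
import pandas as pd


def check_rr20_ratio_threshold(df: pd.DataFrame, variable: str, threshold: float,
                               this_year: int, last_year: int) -> pd.DataFrame:
    """Add a pass/fail column flagging rows whose year-over-year change in a
    ratio exceeds the allowed threshold (hypothetical column layout)."""
    prior = df[f"{variable}_{last_year}"]
    change = (df[f"{variable}_{this_year}"] - prior).abs().div(prior)
    df[f"{variable}_within_threshold"] = change <= threshold
    return df
```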


github-actions bot commented Nov 28, 2023

Warehouse report 📦

Checks/potential follow-ups

Checks indicate the following action items may be necessary.

  • For new models, do they all have a surrogate primary key that is tested to be not-null and unique?

New models 🌱

calitp_warehouse.mart.ntd_validation.fct_ntd_rr20_service_checks

calitp_warehouse.intermediate.ntd_validation.int_ntd_rr20_service_alldata

calitp_warehouse.intermediate.ntd_validation.int_ntd_rr20_service_ratios

calitp_warehouse.staging.ntd_validation.stg_ntd_2022_rr20_exp_by_mode

calitp_warehouse.staging.ntd_validation.stg_ntd_2022_rr20_financial

calitp_warehouse.staging.ntd_validation.stg_ntd_2022_rr20_service

calitp_warehouse.staging.ntd_validation.stg_ntd_2023_a10

calitp_warehouse.staging.ntd_validation.stg_ntd_2023_rr20_rural

calitp_warehouse.staging.ntd_validation.stg_ntd_2023_rr20_urban_tribal

calitp_warehouse.staging.ntd_validation.stg_ntd_subrecipients

DAG

Legend (in order of precedence)

Resource type | Indicator | Resolution
Large table-materialized model | Orange | Make the model incremental
Large model without partitioning or clustering | Orange | Add partitioning and/or clustering
View with more than one child | Yellow | Materialize as a table or incremental
Incremental | Light green |
Table | Green |
View | White |

@lauriemerrell (Contributor)

One stray comment -- I see that the lint GitHub Action is failing; you may want to run pre-commit locally to resolve any linting issues.

@kengie (Contributor, Author) commented Nov 28, 2023

Thanks @lauriemerrell, @evansiroky! I updated the code to pull everything from the API (which is indeed just one URL) and then save it as one external table. In this table, each NTD report is a nested column in BigQuery. Then, dbt models parse the reports out into separate tables.

In terms of Python vs. SQL, at this point it's partly a pragmatic choice to start with, since the Python is already written and understood by @csuyat-dot, who will be taking over, and I have about one week of dev time left on this project. My intention is to transfer all of the checks into dbt first, then, as time allows, convert as much Python as possible into SQL. Imperfect, but it ensures coverage.

@evansiroky (Member)

It looks like there are other eyes on this. I like the refactor of downloading the raw data. Also, I generally agree with @lauriemerrell that dbt should be leveraged as much as possible to keep the coding style similar to the rest of the warehouse scripts.

@lauriemerrell force-pushed the kim_load_blackcat_reports_to_bigquery branch from 49702cd to 91e7fad on December 7, 2023 18:06
@lauriemerrell merged commit 451326a into main on Dec 7, 2023 (5 of 6 checks passed)
@lauriemerrell deleted the kim_load_blackcat_reports_to_bigquery branch on December 7, 2023 18:10