Data ingest workflow docs #3177

lauriemerrell · 2023-12-11T15:55:06Z

Description

Describe your changes and why you're making them. Please include the context, motivation, and relevant dependencies.

This PR adds documentation about the workflow for adding a new data source to the Cal-ITP data warehouse, plus contextual documentation about dbt, based on questions asked by recent new contributors.

Resolves #3173
Resolves #3166

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation

How has this been tested?

Include commands/logs/screenshots as relevant.

N/A

Post-merge follow-ups

Document any actions that must be taken post-merge to deploy or otherwise implement the changes in this PR (for example, running a full refresh of some incremental model in dbt). If these actions will take more than a few hours after the merge or if they will be completed by someone other than the PR author, please create a dedicated follow-up issue and link it here to track resolution.

No action required
Actions required (specified below)

github-actions · 2023-12-11T16:05:14Z

Preview url: https://docs-data-infra-3177--cal-itp-previews.netlify.app

evansiroky

Very helpful, thanks!

SorenSpicknall

No breaking issues identified, so approving this tentatively on the assumption that you can make changes you agree with from these suggestions and merge on your own time.

airflow/dags/create_external_tables/README.md

docs/architecture/data.md

SorenSpicknall · 2023-12-14T19:43:27Z

docs/architecture/data.md

+
+We often bring data into our environment in two steps, created as two separate Airflow [DAGs](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html):
+
+- **Sync the fully-raw data in its original format:** See for example the changes in the `airflow/dags/sync_elavon` directory in [data-infra PR #2376](https://github.com/cal-itp/data-infra/pull/2376/files). We do this to preserve the raw data in its original form. This data might be saved in a `calitp-<your-data-source>-raw` bucket.


Elavon may not be the best example here since we sync the world each day, which is generally not what we do for other pipelines. But if there isn't another clean-cut example that adheres to most of our other patterns, I'm more than willing to stick with this one.

After checking, there isn't actually an Airflow-managed pipeline that filters a request by date or similar -- most just scrape whatever they find at the time of search, so I actually think Elavon is not unusual in this regard. I am adding a little clarification here, though.

SorenSpicknall · 2023-12-14T19:44:05Z

docs/architecture/data.md

+- **Convert the saved raw data into a BigQuery-readable gzipped JSONL file:** See for example the changes in the `airflow/dags/parse_elavon` directory in [data-infra PR #2376](https://github.com/cal-itp/data-infra/pull/2376/files). This prepares the data to be read into BigQuery. **Conversion here should be limited to the bare minimum needed to make the data BigQuery-compatible, for example converting column names that would be invalid in BigQuery and changing the file type to gzipped JSONL.** This data might be saved in a `calitp-<your-data-source>-parsed` bucket.
+
+```{note}
+When you merge a pull request creating a new Airflow DAG, that DAG will be paused by default. To start the DAG, someone will need to log into the Airflow UI and unpause the DAG. 


Add a link to the Airflow UI where you say "the Airflow UI", maybe?

Added, not sure if link will change when you upgrade Composer -- if so, would be great if you can update at that time
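
For readers of this thread, the "bare minimum" conversion described in the parse step above could be sketched roughly as follows. This is only an illustration, not the code from data-infra PR #2376: the helper names `to_bq_column_name` and `rows_to_gzipped_jsonl` are hypothetical, and the renaming rule assumes the conventional BigQuery constraint that column names contain only letters, digits, and underscores and do not start with a digit.

```python
import gzip
import io
import json
import re


def to_bq_column_name(name: str) -> str:
    """Hypothetical helper: replace characters assumed invalid in BigQuery."""
    # Anything other than ASCII letters, digits, or underscores becomes "_".
    cleaned = re.sub(r"[^0-9a-zA-Z_]", "_", name.strip())
    # Prefix with "_" if the name is empty or starts with a digit.
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


def rows_to_gzipped_jsonl(rows):
    """Serialize an iterable of dicts as gzipped JSONL bytes.

    Only the keys are renamed; values pass through untouched, in keeping
    with the "bare minimum conversion" guidance quoted in the diff.
    """
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
        for row in rows:
            renamed = {to_bq_column_name(k): v for k, v in row.items()}
            gz.write((json.dumps(renamed) + "\n").encode("utf-8"))
    return buffer.getvalue()
```

The resulting bytes would then be uploaded to the parsed bucket by whatever storage client the DAG already uses. Note that this sketch does not handle two raw column names collapsing to the same cleaned name; a real parser would need to decide how to treat that case.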

docs/architecture/data.md

allejo

I'm only really qualified to review the warehouse docs but these are much welcomed improvements. Thank you for the improvements, Laurie!

lauriemerrell force-pushed the warehouse-workflow-docs branch 2 times, most recently from 1f420cf to a0363f8 Compare December 11, 2023 23:09

lauriemerrell marked this pull request as ready for review December 11, 2023 23:22

lauriemerrell requested review from evansiroky, SorenSpicknall and tiffanychu90 as code owners December 11, 2023 23:22

lauriemerrell marked this pull request as draft December 11, 2023 23:24

lauriemerrell force-pushed the warehouse-workflow-docs branch from 379d630 to 6c03e96 Compare December 12, 2023 22:48

lauriemerrell marked this pull request as ready for review December 12, 2023 22:48

evansiroky approved these changes Dec 14, 2023


lauriemerrell force-pushed the warehouse-workflow-docs branch from a46fa34 to 13faa7e Compare December 14, 2023 19:29

SorenSpicknall approved these changes Dec 14, 2023


lauriemerrell force-pushed the warehouse-workflow-docs branch from 13faa7e to 8f883d6 Compare December 14, 2023 23:04

lauriemerrell added 11 commits December 14, 2023 17:28

stub page

c5a8bf7

add what is dbt section

c92d5e6

write up data ingest steps

5cabcf1

fix ref

0376f6e

add example yaml to create external tables readme

f691d77

add testing information to external tables readme

3838da9

add more documentation of new data workflow and dbt context

59929f7

more dbt notes

65d6522

rearrange/more dbt context

9106af3

update name of google doc

2851cc6

address comments from PR review

8b95e36

lauriemerrell force-pushed the warehouse-workflow-docs branch from 58f3503 to 8b95e36 Compare December 14, 2023 23:28

allejo approved these changes Dec 15, 2023


lauriemerrell merged commit 29d614b into main Dec 15, 2023
2 checks passed

lauriemerrell deleted the warehouse-workflow-docs branch December 15, 2023 15:05



