Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data ingest workflow docs #3177

Merged
merged 11 commits into from
Dec 15, 2023
Merged

Data ingest workflow docs #3177

merged 11 commits into from
Dec 15, 2023

Conversation

lauriemerrell
Copy link
Contributor

@lauriemerrell lauriemerrell commented Dec 11, 2023

Description

Describe your changes and why you're making them. Please include the context, motivation, and relevant dependencies.

This PR adds some additional documentation about the workflow to add a new data source to the Cal-ITP data warehouse and some additional contextual documentation about dbt based on the questions asked by some recent new contributors.

Resolves #3173
Resolves #3166

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

How has this been tested?

Include commands/logs/screenshots as relevant.

N/A

Post-merge follow-ups

Document any actions that must be taken post-merge to deploy or otherwise implement the changes in this PR (for example, running a full refresh of some incremental model in dbt). If these actions will take more than a few hours after the merge or if they will be completed by someone other than the PR author, please create a dedicated follow-up issue and link it here to track resolution.

  • No action required
  • Actions required (specified below)

Copy link

@lauriemerrell lauriemerrell force-pushed the warehouse-workflow-docs branch 2 times, most recently from 1f420cf to a0363f8 Compare December 11, 2023 23:09
@lauriemerrell lauriemerrell marked this pull request as ready for review December 11, 2023 23:22
@lauriemerrell lauriemerrell marked this pull request as draft December 11, 2023 23:24
@lauriemerrell lauriemerrell force-pushed the warehouse-workflow-docs branch from 379d630 to 6c03e96 Compare December 12, 2023 22:48
@lauriemerrell lauriemerrell marked this pull request as ready for review December 12, 2023 22:48
Copy link
Member

@evansiroky evansiroky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very helpful, thanks!

@lauriemerrell lauriemerrell force-pushed the warehouse-workflow-docs branch from a46fa34 to 13faa7e Compare December 14, 2023 19:29
Copy link
Contributor

@SorenSpicknall SorenSpicknall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No breaking issues identified, so approving this tentatively on the assumption that you can make changes you agree with from these suggestions and merge on your own time.

airflow/dags/create_external_tables/README.md Outdated Show resolved Hide resolved
airflow/dags/create_external_tables/README.md Outdated Show resolved Hide resolved
docs/architecture/data.md Show resolved Hide resolved
docs/architecture/data.md Outdated Show resolved Hide resolved

We often bring data into our environment in two steps, created as two separate Airflow [DAGs](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html):

- **Sync the fully-raw data in its original format:** See for example the changes in the `airflow/dags/sync_elavon` directory in [data-infra PR #2376](https://github.com/cal-itp/data-infra/pull/2376/files). We do this to preserve the raw data in its original form. This data might be saved in a `calitp-<your-data-source>-raw` bucket.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elavon may not be the best example here since we sync the world each day, which is generally not what we do for other pipelines. But if there isn't another clean-cut example that adheres to most of our other patterns, I'm more than willing to stick with this one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After checking, there isn't actually an Airflow-managed pipeline that filters a request by date or similar -- most just scrape whatever they find at the time of search, so I actually think Elavon is not unusual in this regard. I am adding a little clarification here though

- **Convert the saved raw data into a BigQuery-readable gzipped JSONL file:** See for example the changes in the `airflow/dags/parse_elavon` directory in [data-infra PR #2376](https://github.com/cal-itp/data-infra/pull/2376/files). This prepares the data is to be read into BigQuery. **Conversion here should be limited to the bare minimum needed to make the data BigQuery-compatible, for example converting column names that would be invalid in BigQuery and changing the file type to gzipped JSONL.** This data might be saved in a `calitp-<your-data-source>-parsed` bucket.

```{note}
When you merge a pull request creating a new Airflow DAG, that DAG will be paused by default. To start the DAG, someone will need to log into the Airflow UI and unpause the DAG.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a link to the Airflow UI where you say "the Airflow UI", maybe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, not sure if link will change when you upgrade Composer -- if so, would be great if you can update at that time

docs/architecture/data.md Outdated Show resolved Hide resolved
@lauriemerrell lauriemerrell force-pushed the warehouse-workflow-docs branch from 13faa7e to 8f883d6 Compare December 14, 2023 23:04
@lauriemerrell lauriemerrell force-pushed the warehouse-workflow-docs branch from 58f3503 to 8b95e36 Compare December 14, 2023 23:28
Copy link
Contributor

@allejo allejo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm only really qualified to review the warehouse docs but these are much welcomed improvements. Thank you for the improvements, Laurie!

@lauriemerrell lauriemerrell merged commit 29d614b into main Dec 15, 2023
2 checks passed
@lauriemerrell lauriemerrell deleted the warehouse-workflow-docs branch December 15, 2023 15:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants