Data ingest workflow docs #3177
Conversation
Force-pushed from 1f420cf to a0363f8
Force-pushed from 379d630 to 6c03e96
Very helpful, thanks!
Force-pushed from a46fa34 to 13faa7e
No breaking issues identified, so I'm approving this tentatively on the assumption that you can apply whichever of these suggestions you agree with and merge on your own time.
docs/architecture/data.md (Outdated)
We often bring data into our environment in two steps, created as two separate Airflow [DAGs](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html):

- **Sync the fully-raw data in its original format:** See for example the changes in the `airflow/dags/sync_elavon` directory in [data-infra PR #2376](https://github.com/cal-itp/data-infra/pull/2376/files). We do this to preserve the raw data in its original form. This data might be saved in a `calitp-<your-data-source>-raw` bucket.
Elavon may not be the best example here since we sync the world each day, which is generally not what we do for other pipelines. But if there isn't another clean-cut example that adheres to most of our other patterns, I'm more than willing to stick with this one.
After checking, there isn't actually an Airflow-managed pipeline that filters a request by date or similar -- most just scrape whatever they find at the time of search, so I actually think Elavon is not unusual in this regard. I am adding a little clarification here though.
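To ground the sync step discussed above, here is a minimal sketch of what a raw-sync DAG can look like. It assumes Airflow 2.x with the TaskFlow API and the `google-cloud-storage` client; the DAG name, source URL, and bucket are hypothetical stand-ins, not the actual `sync_elavon` code.

```python
# Hypothetical sketch of a "sync raw data" DAG -- not the actual sync_elavon code.
# Assumes Airflow 2.x (TaskFlow API) and the google-cloud-storage client library.
import pendulum
import requests
from airflow.decorators import dag, task
from google.cloud import storage


@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def sync_my_data_source():
    @task
    def sync_raw(ds=None):  # Airflow injects the logical date string as `ds`
        # Fetch whatever the source serves right now, with no transformation.
        resp = requests.get("https://example.com/export.csv", timeout=60)
        resp.raise_for_status()
        # Preserve the raw bytes, keyed by date, in the raw bucket.
        bucket = storage.Client().bucket("calitp-my-data-source-raw")
        bucket.blob(f"{ds}/export.csv").upload_from_string(resp.content)

    sync_raw()


sync_my_data_source()
```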
docs/architecture/data.md (Outdated)
- **Convert the saved raw data into a BigQuery-readable gzipped JSONL file:** See for example the changes in the `airflow/dags/parse_elavon` directory in [data-infra PR #2376](https://github.com/cal-itp/data-infra/pull/2376/files). This prepares the data to be read into BigQuery. **Conversion here should be limited to the bare minimum needed to make the data BigQuery-compatible, for example converting column names that would be invalid in BigQuery and changing the file type to gzipped JSONL.** This data might be saved in a `calitp-<your-data-source>-parsed` bucket.
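As a rough illustration of how minimal that conversion step can stay, the sketch below only renames BigQuery-incompatible columns and rewrites a raw CSV as gzipped JSONL. The paths and the `bigquery_safe` helper are hypothetical, not the actual `parse_elavon` code, and it assumes the classic BigQuery column-name rules (letters, digits, and underscores, not starting with a digit).

```python
# Hypothetical sketch of the "parse" step, assuming the raw file is a CSV.
import csv
import gzip
import json
import re


def bigquery_safe(name: str) -> str:
    # Classic BigQuery rules: letters, digits, and underscores only,
    # and the name may not start with a digit.
    name = re.sub(r"[^a-zA-Z0-9_]", "_", name)
    return f"_{name}" if name[:1].isdigit() else name


def parse_to_jsonl_gz(raw_csv_path: str, out_path: str) -> None:
    with open(raw_csv_path, newline="") as f, gzip.open(out_path, "wt") as out:
        for row in csv.DictReader(f):
            # Rename columns only; leave the values untouched.
            out.write(json.dumps({bigquery_safe(k): v for k, v in row.items()}) + "\n")


parse_to_jsonl_gz("export.csv", "export.jsonl.gz")
```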
```{note}
When you merge a pull request creating a new Airflow DAG, that DAG will be paused by default. To start the DAG, someone will need to log into the Airflow UI and unpause the DAG.
```
Add a link to the Airflow UI where you say "the Airflow UI", maybe?
Added; not sure if the link will change when you upgrade Composer -- if so, it would be great if you can update it at that time.
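One hedged aside on this note: Airflow's `DAG` constructor accepts an `is_paused_upon_creation` flag, so a DAG can opt out of the paused-by-default behavior if unpausing through the UI ever becomes a bottleneck. A minimal sketch (the DAG name is hypothetical):

```python
# Sketch only: is_paused_upon_creation is a standard Airflow DAG argument that
# overrides the "paused on creation" default for a single DAG.
import pendulum
from airflow.decorators import dag


@dag(
    schedule="@daily",
    start_date=pendulum.datetime(2024, 1, 1),
    catchup=False,
    is_paused_upon_creation=False,  # start unpaused as soon as it is deployed
)
def parse_my_data_source():
    ...


parse_my_data_source()
```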
Force-pushed from 13faa7e to 8f883d6
Force-pushed from 58f3503 to 8b95e36
I'm only really qualified to review the warehouse docs, but these are much-welcomed improvements. Thank you, Laurie!
Description
This PR adds some additional documentation about the workflow to add a new data source to the Cal-ITP data warehouse and some additional contextual documentation about dbt based on the questions asked by some recent new contributors.
Resolves #3173
Resolves #3166
How has this been tested?
N/A