Refactor Airflow ETL Pipeline DAG to incorporate feedback changes #4576

btylerburton · 2024-01-04T21:05:33Z

User Story

To incorporate updates to the datagov-harvesting-logic API and feedback from the most recent design sessions, the Airflow ETL pipeline DAG needs to change so that a DCAT-US record can be fully tested end-to-end.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

GIVEN I have added the updated datagov-harvesting-logic module as a dependency to the datagov-harvester
AND I have supplied a test JSON harvest source of N DCAT-US records that has generated a dynamic etl_pipeline DAG
WHEN I trigger a run of that DAG
THEN I expect the new ETL pipeline to process the source through the pipeline tasks and at completion to compile metrics from the tasks.

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed]

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

- Refactor the Extract task to take the harvest source config as a parameter and to return a value (dict or list) identifying which datasets need to be created, updated, or destroyed.
- Refactor the tasks below to accept a harvest source config and return a success/fail metric on the operations:
  - Delete
  - Validate
  - Load
- Add a final rollup task to pull the values from the above tasks and print them to the console, along with the execution time for each step.
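
The shape described in the sketch above could look roughly like the following. This is a minimal illustration in plain Python, deliberately without Airflow imports, so it is not the actual DAG. All names here (`SOURCE_CONFIG`, `extract`, `delete`, `validate`, `load`, `timed`, `run_pipeline`, `rollup`) and the config fields are hypothetical, and the task bodies are placeholders that only show the expected inputs and return values.

```python
import time

# Hypothetical harvest source config; the field names are illustrative only.
SOURCE_CONFIG = {"title": "test-source", "url": "https://example.com/data.json"}


def extract(config):
    """Return which dataset identifiers need to be created, updated, or destroyed."""
    # Placeholder plan; a real task would compare the source with the catalog.
    return {"create": ["a", "b"], "update": ["c"], "delete": []}


def delete(config, plan):
    """Pretend every planned delete succeeds and report a success/fail metric."""
    return {"success": len(plan["delete"]), "fail": 0}


def validate(config, plan):
    """Pretend every record to be created or updated passes validation."""
    return {"success": len(plan["create"]) + len(plan["update"]), "fail": 0}


def load(config, plan):
    """Pretend every record to be created or updated loads successfully."""
    return {"success": len(plan["create"]) + len(plan["update"]), "fail": 0}


def timed(step, *args):
    """Run a step and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = step(*args)
    return result, time.perf_counter() - start


def run_pipeline(config):
    """Run extract, then delete/validate/load, collecting metrics per step."""
    plan, seconds = timed(extract, config)
    report = {"extract": {"seconds": seconds}}
    for step in (delete, validate, load):
        metrics, seconds = timed(step, config, plan)
        report[step.__name__] = {**metrics, "seconds": seconds}
    return report


def rollup(report):
    """Print each step's metrics and execution time to the console."""
    for name, values in report.items():
        print(name, values)


report = run_pipeline(SOURCE_CONFIG)
rollup(report)
```

In a real DAG each function would presumably become its own task, with the rollup task reading the upstream return values; how those values are passed between tasks is left open here.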

Reference

btylerburton · 2024-01-23T22:38:52Z

Airflow has no problem connecting to CKAN or posting datasets.

Airflow logs

CKAN Dev UI

btylerburton · 2024-02-13T17:26:21Z

Running load test from local Docker and seeing similar results to GH Action.

btylerburton · 2024-02-13T17:28:08Z

Next steps will be to replicate CG infrastructure in Staging.
This allows us to validate a clean setup for a new space and to move off the dependency on local.

btylerburton · 2024-02-13T21:43:35Z

Results of load test.

Date: 02.13.24
Time: 01h 15m 55s

Logs:
[2024-02-13, 18:37:49 UTC] {harvest.py:336} INFO - expected operations to be done
[2024-02-13, 18:37:49 UTC] {harvest.py:337} INFO - {'delete': 0, 'create': 982, 'update': 0}
[2024-02-13, 18:37:49 UTC] {harvest.py:355} INFO - actual operations completed
[2024-02-13, 18:37:49 UTC] {harvest.py:356} INFO - {'deleted': 0, 'updated': 0, 'created': 937, 'nothing': 45}
[2024-02-13, 18:37:49 UTC] {harvest.py:359} INFO - validity of the records
[2024-02-13, 18:37:49 UTC] {harvest.py:360} INFO - {'valid': 982, 'invalid': 0}
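
To make the gap between the expected and actual counts explicit, the two dictionaries from the log can be compared directly. This is a small sketch: the dictionaries are copied from the log lines above, while `KEY_MAP` is an assumed pairing of the two key spellings (`create` vs. `created`, and so on). The log does not say what `nothing` means, so the code only checks the arithmetic.

```python
# Values copied from the log output above.
expected = {"delete": 0, "create": 982, "update": 0}
actual = {"deleted": 0, "updated": 0, "created": 937, "nothing": 45}

# Assumed pairing between the expected and actual key names.
KEY_MAP = {"delete": "deleted", "create": "created", "update": "updated"}

# Per-operation difference between what was planned and what completed.
shortfall = {op: expected[op] - actual[KEY_MAP[op]] for op in expected}

# Records in the plan that do not appear under any key in the actual counts.
unaccounted = sum(expected.values()) - sum(actual.values())

print(shortfall)    # {'delete': 0, 'create': 45, 'update': 0}
print(unaccounted)  # 0
```

The create shortfall (982 - 937 = 45) equals the `nothing` count, so every expected record is accounted for in the totals.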

btylerburton · 2024-02-20T20:36:35Z

The load test was performed and Airflow handled the job as expected. As our conversation about our use of the tool has evolved, the team has decided to pivot away from using Airflow, at least in the interim, because the cost of supporting it (infrastructure spend plus the time needed to learn the platform) outweighs the advantages it was expected to bring. In short, our use case (high throughput, minimal analysis) does not overlap as nicely with Airflow's strengths as we'd expected.

btylerburton added this to data.gov team board Jan 4, 2024

btylerburton changed the title ~~Refactor Airflow ETL Pipeline DAG to account for changes to Harvesting Logic Repo~~ [Placeholder] Refactor Airflow ETL Pipeline DAG to account for changes to Harvesting Logic Repo Jan 4, 2024

btylerburton moved this to 📟 Sprint Backlog [7] in data.gov team board Jan 4, 2024

btylerburton self-assigned this Jan 4, 2024

btylerburton moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Jan 5, 2024

btylerburton mentioned this issue Jan 5, 2024

Load test H2.0 DCAT Pipeline in Staging Environment #4578

Closed

1 task

btylerburton changed the title ~~[Placeholder] Refactor Airflow ETL Pipeline DAG to account for changes to Harvesting Logic Repo~~ Refactor Airflow ETL Pipeline DAG to account for changes to Harvesting Logic Repo Jan 5, 2024

btylerburton changed the title ~~Refactor Airflow ETL Pipeline DAG to account for changes to Harvesting Logic Repo~~ Refactor Airflow ETL Pipeline DAG to incorporate feedback changes Jan 9, 2024

btylerburton added the H2.0/Airflow label Jan 10, 2024

btylerburton moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Feb 20, 2024

btylerburton moved this from ✔ Done to 🗄 Closed in data.gov team board Mar 4, 2024

btylerburton added H2.0/Harvest-General General Harvesting 2.0 Issues and removed dep_H2.0/Airflow labels May 13, 2024

btylerburton closed this as completed Sep 3, 2024

github-project-automation bot moved this from 🗄 Closed to ✔ Done in data.gov team board Sep 3, 2024

btylerburton moved this from ✔ Done to 🗄 Closed in data.gov team board Sep 3, 2024
