Refactor Airflow ETL Pipeline DAG to incorporate feedback changes #4576

Closed
7 tasks
btylerburton opened this issue Jan 4, 2024 · 5 comments
Assignees
Labels
H2.0/Harvest-General General Harvesting 2.0 Issues


btylerburton commented Jan 4, 2024

User Story

To incorporate updates to the datagov-harvesting-logic API and feedback from the most recent design sessions, the Airflow ETL pipeline DAG needs to change so that a DCAT-US record can be tested fully end-to-end.

Related:

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN I have added the updated datagov-harvesting-logic module as a dependency to the datagov-harvester
    AND I have supplied a test JSON harvest source of N DCAT-US records that has generated a dynamic etl_pipeline DAG
    WHEN I trigger a run of that DAG
    THEN I expect the new ETL pipeline to process the source through each pipeline task and, on completion, to compile metrics from those tasks.
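
A test harvest source of N records for the criterion above could be generated along these lines. This is only a sketch: the function name `make_test_source` and every field value are hypothetical placeholders, and the field set follows the DCAT-US v1.1 catalog layout as an assumption about what the harvester reads. The issue does not say which fields the pipeline actually requires.

```python
import json


def make_test_source(n: int) -> dict:
    """Build a synthetic DCAT-US style catalog with n placeholder datasets."""
    datasets = []
    for i in range(n):
        datasets.append({
            "@type": "dcat:Dataset",
            "identifier": f"test-dataset-{i}",  # unique per record
            "title": f"Test dataset {i}",
            "description": "Synthetic record for pipeline testing.",
            "keyword": ["test"],
            "modified": "2024-01-04",
            "accessLevel": "public",
            "bureauCode": ["000:00"],    # placeholder code
            "programCode": ["000:000"],  # placeholder code
            "publisher": {"@type": "org:Organization", "name": "Test Publisher"},
            "contactPoint": {
                "@type": "vcard:Contact",
                "fn": "Test Contact",
                "hasEmail": "mailto:test@example.com",
            },
        })
    return {
        "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
        "dataset": datasets,
    }


source = make_test_source(3)
source_json = json.dumps(source, indent=2)  # serialized form to serve as the test source
```

Keeping the identifiers unique makes it easy to check afterwards that each of the N records was processed exactly once.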

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed]

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

  • Refactor the Extract task to take the harvest source config as a parameter and to return a value (dict or list) indicating which datasets need to be created, updated, or destroyed
  • Refactor the tasks below to accept a harvest source config and return a success/fail metric on the operations:
    • Delete
    • Validate
    • Load
  • Add a final rollup task to pull the values from the above tasks and print them to the console, along with execution time for each step.
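
The task contract in the sketch above could look roughly like this in plain Python. It is a minimal illustration, not the pipeline's code: the Airflow wiring is left out, and `extract`, `run_step`, `rollup`, and the config keys `incoming_ids` and `existing_ids` are all hypothetical names chosen for the example.

```python
import time
from typing import Callable


def extract(source_config: dict) -> dict:
    """Hypothetical Extract: decide which dataset ids to create, update, or delete."""
    incoming = set(source_config["incoming_ids"])  # ids found in the harvest source
    existing = set(source_config["existing_ids"])  # ids already in the catalog
    return {
        "create": sorted(incoming - existing),
        "update": sorted(incoming & existing),
        "delete": sorted(existing - incoming),
    }


def run_step(name: str, operation: Callable[[str], None], ids: list) -> dict:
    """Apply operation to each id; return success/fail counts and elapsed seconds."""
    started = time.perf_counter()
    metric = {"step": name, "success": 0, "fail": 0}
    for dataset_id in ids:
        try:
            operation(dataset_id)
            metric["success"] += 1
        except Exception:
            metric["fail"] += 1
    metric["seconds"] = time.perf_counter() - started
    return metric


def rollup(metrics: list) -> dict:
    """Print each step's metric with its execution time and return the totals."""
    totals = {"success": 0, "fail": 0}
    for m in metrics:
        print(f"{m['step']}: success={m['success']} fail={m['fail']} time={m['seconds']:.3f}s")
        totals["success"] += m["success"]
        totals["fail"] += m["fail"]
    return totals


def noop(dataset_id: str) -> None:
    """Stand-in for a real delete or load call."""


def reject_empty(dataset_id: str) -> None:
    """Stand-in validator: an empty identifier counts as invalid."""
    if not dataset_id:
        raise ValueError("empty identifier")


config = {"incoming_ids": ["a", "b", ""], "existing_ids": ["b", "c"]}
plan = extract(config)
to_write = plan["create"] + plan["update"]
metrics = [
    run_step("delete", noop, plan["delete"]),
    run_step("validate", reject_empty, to_write),
    run_step("load", noop, to_write),
]
totals = rollup(metrics)
```

Returning a plain dict from every step keeps the rollup independent of how the steps are scheduled, so the same functions could be wrapped as DAG tasks or called directly.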

Reference

diagram

@btylerburton btylerburton changed the title Refactor Airflow ETL Pipeline DAG to account for changes to Harvesting Logic Repo [Placeholder] Refactor Airflow ETL Pipeline DAG to account for changes to Harvesting Logic Repo Jan 4, 2024
@btylerburton btylerburton moved this to 📟 Sprint Backlog [7] in data.gov team board Jan 4, 2024
@btylerburton btylerburton self-assigned this Jan 4, 2024
@btylerburton btylerburton moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Jan 5, 2024
@btylerburton btylerburton changed the title [Placeholder] Refactor Airflow ETL Pipeline DAG to account for changes to Harvesting Logic Repo Refactor Airflow ETL Pipeline DAG to account for changes to Harvesting Logic Repo Jan 5, 2024
@btylerburton btylerburton changed the title Refactor Airflow ETL Pipeline DAG to account for changes to Harvesting Logic Repo Refactor Airflow ETL Pipeline DAG to incorporate feedback changes Jan 9, 2024
@btylerburton
Contributor Author

Airflow has no problem connecting to CKAN or posting datasets.

Airflow logs
Image

CKAN Dev UI
Image

@btylerburton
Contributor Author

Running load test from local Docker and seeing similar results to GH Action.

Image

@btylerburton
Contributor Author

The next step is to replicate the CG infrastructure in Staging.
This lets us validate a clean setup for a new space and remove the dependency on local.

@btylerburton
Contributor Author

Results of load test.

Date: 02.13.24
Time: 01h 15m 55s

Logs:
[2024-02-13, 18:37:49 UTC] {harvest.py:336} INFO - expected operations to be done
[2024-02-13, 18:37:49 UTC] {harvest.py:337} INFO - {'delete': 0, 'create': 982, 'update': 0}
[2024-02-13, 18:37:49 UTC] {harvest.py:355} INFO - actual operations completed
[2024-02-13, 18:37:49 UTC] {harvest.py:356} INFO - {'deleted': 0, 'updated': 0, 'created': 937, 'nothing': 45}
[2024-02-13, 18:37:49 UTC] {harvest.py:359} INFO - validity of the records
[2024-02-13, 18:37:49 UTC] {harvest.py:360} INFO - {'valid': 982, 'invalid': 0}
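
The counts in the log can be reconciled with a few lines of Python. The three dicts are copied from the log lines above. The pairing of expected and actual keys (`create` with `created`, and so on) and the helper name `reconcile` are assumptions made for this sketch. The log does not say what `nothing` means, so the code only checks the arithmetic.

```python
expected = {"delete": 0, "create": 982, "update": 0}
actual = {"deleted": 0, "updated": 0, "created": 937, "nothing": 45}
validity = {"valid": 982, "invalid": 0}

# Assumed pairing between expected and actual operation names.
PAIRS = {"create": "created", "update": "updated", "delete": "deleted"}


def reconcile(expected: dict, actual: dict) -> dict:
    """Return, per operation, how many expected operations were not completed."""
    return {op: expected[op] - actual[done] for op, done in PAIRS.items()}


shortfall = reconcile(expected, actual)
print(shortfall)             # operations expected but not completed
print(sum(actual.values()))  # total records accounted for in the actual counts
```

The shortfall is 45 creates (982 expected, 937 completed), which equals the `nothing` count, and the actual counts sum to the 982 records reported as valid. Why those 45 records resulted in no operation is not recorded in the log.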

Image

@btylerburton
Contributor Author

The load test was performed, and Airflow handled the job as expected. As our conversation about how we use the tool has evolved, the team has decided to pivot away from Airflow, at least in the interim. The cost of supporting it, both in infrastructure and in time to learn the platform, outweighs the advantages it was expected to bring. In short, our use case (high throughput, minimal analysis) does not overlap with Airflow's strengths as well as we'd expected.

@btylerburton btylerburton moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Feb 20, 2024
@btylerburton btylerburton moved this from ✔ Done to 🗄 Closed in data.gov team board Mar 4, 2024
@btylerburton btylerburton added H2.0/Harvest-General General Harvesting 2.0 Issues and removed dep_H2.0/Airflow labels May 13, 2024
@github-project-automation github-project-automation bot moved this from 🗄 Closed to ✔ Done in data.gov team board Sep 3, 2024
@btylerburton btylerburton moved this from ✔ Done to 🗄 Closed in data.gov team board Sep 3, 2024