🏗️ Make ETL pipeline async #13

Draft: wants to merge 2 commits into base: main

acemasterjb (Owner)

Summary

This PR replaces the current batch ETL workflow for creating pluto reports with an asynchronous workflow.

Details

The current workflow runs every stage in batch: it extracts all data from the APIs, then cleans it and applies preliminary statistics and filtering, and finally loads everything into the *.gzip files.

Because each stage must finish for the whole batch before the next one starts, the workflow is fragile: unexpected API behaviour or a data inconsistency can fail the entire run.

[image: architecture diagram of the proposed asynchronous workflow]

To make the workflow more robust to these issues and to reduce memory usage during report generation, this PR changes the architecture to the one shown above.

Key

Gray lines are initiated by the user generating the pluto report.

Blue lines are initiated by the book report generator/refresher.
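
For illustration only, here is a minimal sketch of what moving from batch stages to an asynchronous, record-at-a-time flow could look like. It is not the code in this PR. It assumes an `asyncio` producer/consumer design with a bounded queue, and every name in it (`fetch_record`, `transform`, `run_pipeline`, the in-memory `sink`) is hypothetical, standing in for the real API calls, the statistics/filtering step, and the *.gzip writes.

```python
import asyncio


async def fetch_record(source_id: int) -> dict:
    """Hypothetical stand-in for one API call."""
    await asyncio.sleep(0)  # placeholder for network latency
    if source_id == 2:
        # Simulate the kind of unexpected API behaviour described above.
        raise RuntimeError("unexpected API behaviour")
    return {"id": source_id, "value": source_id * 10}


def transform(record: dict) -> dict:
    """Hypothetical stand-in for cleaning, statistics, and filtering."""
    return {**record, "value": record["value"] + 1}


async def extract(source_id: int, queue: asyncio.Queue) -> None:
    """Fetch one record and hand it on; a failure affects only this source."""
    try:
        record = await fetch_record(source_id)
    except RuntimeError:
        return  # skip the failing source, the rest of the run continues
    await queue.put(record)


async def load(queue: asyncio.Queue, sink: list) -> None:
    """Transform and store records one at a time as they arrive."""
    while True:
        record = await queue.get()
        try:
            sink.append(transform(record))  # the sink stands in for a *.gzip write
        finally:
            queue.task_done()


async def run_pipeline(source_ids: list) -> list:
    # The bounded queue caps how many extracted records wait in memory.
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)
    sink: list = []
    consumer = asyncio.create_task(load(queue, sink))
    await asyncio.gather(*(extract(s, queue) for s in source_ids))
    await queue.join()  # wait until every queued record has been loaded
    consumer.cancel()
    try:
        await consumer
    except asyncio.CancelledError:
        pass
    return sorted(sink, key=lambda r: r["id"])
```

Calling `asyncio.run(run_pipeline([1, 2, 3]))` loads the records for sources 1 and 3 even though source 2 fails, which is the robustness property the summary is after. The bounded queue is one possible way to keep memory usage flat. The design in the diagram may differ.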
