Periodically check for data updates and create automatic snapshots #3339
Comments
Seems very close to #3329, should we merge both issues?
Proposal A
Summary
We have an ETL script that is executed every night. It automatically creates snapshots
Unlike what #3016 suggests, this process does not update any step automatically. Only snapshots are updated, and it's up to us when and whether to update forward dependencies.
Changes and explanation
New: snapshot schedule
One option is to have a
An alternative would be that each snapshot
Schedule fields (rename as appropriate):
New: snapshot updater script
Process:
Changes in StepUpdater and ETL Dashboard:
Changes:
Changes in
Nice write-up, thanks! As a first step, we could create a staging server and just try to execute as many snapshots as possible (without putting them to a special place). That should let us know how feasible this approach is.
Thanks Mojmir! Good idea, I'll do that experiment. We could try that (with
I'd go even further, do it with
Hmm, in principle the idea (with the current proposal) is to update only snapshots, and not touch any data steps. I'm not proposing automatic updates of any public-facing data. To be able to use chart-diff, we'd need to update data steps and run indicator upgrader for all datasets, which we currently can't do programmatically (as far as I know).
Proposal B
Suggested by @Marigold (feel free to rephrase).
Summary
We have a periodic process that not only fetches snapshots, but does "everything", from fetching the snapshot to creating a PR with chart diff ready to be approved and merged. There is no need for indicator upgrader because we use
This would be the easiest way to automate regular updates, given the current technical limitations.
Changes in chart diff
Chart diff will need to show changes in the data of datasets that have been modified in the current PR. Currently, this is in principle possible, but I have low confidence that it does it well. It's certainly not the default workflow.
Caveats
I think
Proposal C
This would be a trade-off between A and B.
Summary
Instead of having a schedule of snapshots and a schedule of data updates, we have just one schedule of data updates, each of which does "everything": it fetches the snapshots, runs
This would be, after Proposal B, the easiest way to automate regular updates, with the benefit that we will keep versions and be able to use chart diff properly.
Changes in indicator upgrader
We will need to have a way to run it programmatically (which I guess should be easy).
New script to create updates
As mentioned above, this script will not only fetch snapshots, but also run
Caveats
In Proposal A we had two schedules: one for snapshots, and one for data updates. That gives us the most flexibility, and it lets us visualize in the ETL dashboard whether updates from the data provider are available (and it's up to us to take action or plan it for the future). With Proposal C, we don't have that flexibility: we simply attempt to do the update, all in one go. If it turns out that there is no available snapshot, the PR can be automatically closed. If there is an available snapshot, the PR will stay open (taking staging resources), possibly for weeks, without anyone taking care of it (because it would be unplanned work).
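To make the "close the PR if there is no available snapshot" step concrete, here is a minimal sketch of how that decision could be made. It assumes the checksum of the previously ingested snapshot file is recorded somewhere and can be compared with a freshly downloaded file. The function names, the `"close"` / `"keep_open"` labels and the use of MD5 are all hypothetical, not existing ETL code.

```python
import hashlib
from pathlib import Path
from typing import Optional


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def decide_pr_action(new_file: Path, recorded_md5: Optional[str]) -> str:
    """Hypothetical helper: decide what to do with an automatically opened PR.

    Returns "close" when the downloaded file is identical to the recorded
    snapshot (nothing new from the provider), and "keep_open" otherwise,
    including when no checksum has been recorded yet.
    """
    if recorded_md5 is not None and md5_of_file(new_file) == recorded_md5:
        return "close"
    return "keep_open"
```

A byte-level comparison like this only tells us that the file changed, not that the change is meaningful (a provider may regenerate a file with a new timestamp), so some PRs would still stay open without a real update.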
I'm indifferent between B and C, I'd try them both and see whichever is more convenient (checking the data & metadata diff, or comparing two different versions). It's also possible that small updates won't really need a new version, and big updates would. (I'm not saying we should change versions to latest, but that we could update existing files while keeping their version.)
Summary
We should automatically check for data provider updates, and possibly create snapshots, on a periodic basis (e.g. every day, week or month), and create a simple way to visualize whether an update is available (e.g. in the ETL dashboard).
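As an illustration of what a periodic check could look like, here is a minimal sketch that decides which snapshots are due for a check on a given day. Everything in it is an assumption made for illustration: the `SnapshotSchedule` fields, the frequency names, and the approximation of a month as 30 days are hypothetical and not part of the current ETL.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Hypothetical frequencies, mirroring "every day, week or month".
FREQUENCY_DAYS = {"daily": 1, "weekly": 7, "monthly": 30}


@dataclass
class SnapshotSchedule:
    snapshot: str  # hypothetical snapshot identifier
    frequency: str  # one of the keys of FREQUENCY_DAYS
    last_checked: Optional[date] = None  # None means "never checked"


def is_due(entry: SnapshotSchedule, today: date) -> bool:
    """A snapshot is due if it was never checked or its interval has elapsed."""
    if entry.last_checked is None:
        return True
    interval = timedelta(days=FREQUENCY_DAYS[entry.frequency])
    return today - entry.last_checked >= interval


def due_snapshots(schedule: List[SnapshotSchedule], today: date) -> List[str]:
    """Return the identifiers of all snapshots that should be checked today."""
    return [entry.snapshot for entry in schedule if is_due(entry, today)]
```

The list returned by `due_snapshots` is the kind of information that could feed both a nightly job and an "update available" indicator in the ETL dashboard.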
Problem
Ideally, we should update our most important data and charts as soon as there is a new release. However, in our current situation, the following issues can happen:
These issues can lead to the following undesired outcomes:
NOTE: Some of these issues cannot be totally fixed. But we can alleviate them as much as possible with any of the proposed solutions.
Related issues
#3016
#3329