Periodically check for data updates and create automatic snapshots #3339

pabloarosado opened this issue Sep 30, 2024 · 9 comments

@pabloarosado (Contributor) commented Sep 30, 2024

Summary

We should automatically check for data provider updates, and possibly create snapshots, on a periodic basis (e.g. every day, week or month), and have a simple way to visualize whether an update is available (e.g. in the ETL dashboard).

Problem

Ideally, we should update our most important data and charts as soon as there is a new release. However, in our current situation, the following issues can happen:

  • A planned update cannot be carried out because the data provider has not yet released the new data on their end. This has happened for GBD.
  • A planned update cannot be executed because the data provider's server happens to be down on that day. This happens relatively often for climate change datasets.
  • A planned update may take longer than expected because the data provider has significantly changed their data or architecture, and we have no way to foresee this until we are already working on the snapshot update.
  • Important updates by a data provider may take a long time to be reflected in our charts after they are released. This has happened for FAOSTAT QCL.

These issues can lead to the following undesired outcomes:

  • Important data takes longer to be visible on our site.
  • Time is wasted due to suboptimal planning, leading to delays in other important work.
  • Journalists decide not to cite us because we are not showing the latest data.
  • Data providers may be unhappy if we keep showing old, possibly misleading data that has changed significantly in a recent release.
  • In the long run, users may rely less on us, because we are "too slow".

NOTE: Some of these issues cannot be fully fixed, but we can alleviate them as much as possible with any of the proposed solutions.

Related issues

#3016
#3329

@lucasrodes (Member):

Seems very close to #3329, should we merge both issues?

@pabloarosado (Contributor, Author) commented Oct 1, 2024

Proposal A

Summary

We have an ETL script that is executed every night. It automatically creates snapshot .dvc and .py files and adds their URIs to special dag files (snapshots_available.yml, snapshots_failed.yml, snapshots_unchanged.yml). Successful snapshots are uploaded to R2. The changes are pushed to a PR and (possibly) automatically merged to master.

Unlike what #3016 suggests, this process does not update any step automatically. Only snapshots are updated, and it's up to us when and whether to update forward dependencies.

Changes and explanation

New: snapshot schedule

One option is to have a snapshot_schedule.yml file (as suggested in #3016) where each snapshot is identified by its unique identifier (the same as the snapshot URI, but without the version).

An alternative would be for each snapshot .dvc file to contain a schedule section with the relevant fields.

Schedule fields (rename as appropriate):

  • snapshot_period: Number of days between consecutive snapshots.
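
For illustration, here is a minimal sketch of what this could look like, assuming a snapshot_schedule.yml keyed by the snapshot identifier; the identifiers, the snapshot_period field and the snapshot_is_due helper are hypothetical placeholders, not existing ETL code:

```python
import datetime as dt

import yaml

# Hypothetical contents of snapshot_schedule.yml; identifiers and periods are made up.
SCHEDULE_EXAMPLE = """
climate/surface_temperature.csv:
  snapshot_period: 30  # days between consecutive snapshots
faostat/faostat_qcl.zip:
  snapshot_period: 90
"""


def snapshot_is_due(latest_version: str, snapshot_period: int) -> bool:
    """Return True if the latest snapshot version (a YYYY-MM-DD folder name) is older than the period."""
    latest = dt.date.fromisoformat(latest_version)
    return (dt.date.today() - latest).days >= snapshot_period


schedule = yaml.safe_load(SCHEDULE_EXAMPLE)
for identifier, config in schedule.items():
    print(identifier, snapshot_is_due("2024-01-01", config["snapshot_period"]))
```

A single schedule file makes it easy to list everything that is due on a given day; keeping the schedule inside each .dvc file would instead keep it next to the rest of the snapshot metadata.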

New: snapshot updater script

Process:

  • Load the active dag (which includes the snapshots from snapshots_available.yml).
  • Load two additional lists, containing the failed (snapshots_failed.yml) and unchanged (snapshots_unchanged.yml) snapshots.
  • Load the snapshot schedule.
  • For each snapshot in the schedule:
    • Check whether its latest version in the dag is more than snapshot_period days old; if not, skip it.
    • Create new snapshot .dvc and .py files (this can be achieved using StepUpdater).
    • Try to execute the snapshot.
      • If successful, compare the md5 of the new snapshot with that of the old one (see the sketch after this list), and:
        • If different:
          • Upload snapshot data file to R2.
          • Add snapshot URI to snapshots_available.yml.
        • If equal:
          • Add snapshot URI to snapshots_unchanged.yml.
      • If not successful, add snapshot URI to snapshots_failed.yml.
        NOTE: In all cases, the new .dvc and .py files are kept as part of the code. But we will have an easier way to prune such files and move them to the archive.
        UPDATE: After discussion, we think it may be more convenient to keep the .dvc and .py files of only successful snapshots. Otherwise the folder can get cluttered with non-working code. Moving it back and forth from the code archive does not sound ideal. But this is an implementation detail.
  • Create a PR. To begin with, send a Slack notification and have someone review it and merge it; eventually the PR could be merged automatically.
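
Below is a minimal, self-contained sketch of the decision step in the loop above: given the data file produced by the new snapshot script (or None if executing it failed) and the previous data file, decide which special dag file the snapshot URI should be appended to. The function names and the None convention are illustrative, not existing ETL code.

```python
import hashlib
from pathlib import Path
from typing import Optional


def md5_of(path: Path) -> str:
    # Same kind of checksum that is stored in the snapshot .dvc files.
    return hashlib.md5(path.read_bytes()).hexdigest()


def classify_snapshot_run(new_data: Optional[Path], old_data: Path) -> str:
    """Return the special dag file that the snapshot URI should be added to."""
    if new_data is None:
        # Executing the new snapshot script raised an error.
        return "snapshots_failed.yml"
    if md5_of(new_data) != md5_of(old_data):
        # Data changed: in the real process, this is also where the file would be uploaded to R2.
        return "snapshots_available.yml"
    return "snapshots_unchanged.yml"
```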

Changes in StepUpdater and ETL Dashboard:

Changes:

  • We would need to adjust the logic of the status of each step, to properly signal whether an update is available.
  • Snapshots could have labels Updateable and Failed.
  • We could add a date_last_checked column, recording the last time we checked for an update (see the sketch after this list).
  • We can currently archive steps in the dag, which means moving steps from the active to the archive dag. But we should also be able to move steps from the code to the archive.
  • Change StepUpdater so that it attempts to load already existing snapshots rather than creating new dvc and py files.
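
As a rough illustration only (the names, labels and signature below are assumptions, not the actual StepUpdater or dashboard code), the label and the age of the last check could be derived from the special dag files like this:

```python
import datetime as dt


def snapshot_status(uri: str, available: set, failed: set, date_last_checked: dt.date) -> dict:
    """Derive a dashboard label and the age of the last check for a snapshot URI."""
    if uri in failed:
        label = "Failed"
    elif uri in available:
        label = "Updateable"
    else:
        label = "Up to date"
    return {
        "label": label,
        "days_since_last_check": (dt.date.today() - date_last_checked).days,
    }
```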

Changes in etl.snapshot

We currently can't identify the script that generated a given Snapshot. We can only do that when the name of the .dvc file coincides with that of the .py file, which is often not the case. We can solve this by adding a new Snapshot field, script_file.

Alternatively, we could impose a strict naming convention for snapshot files, but this would also imply a big, possibly tricky refactor.

When possible, each individual snapshot script could also contain code to fetch additional fields from the provider, like date_published or version_producer. Additional assertions would also be good to have, so that we become aware of important changes in the data.
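
A hedged sketch of both ideas follows; the .dvc layout (meta.origin.date_published), the script_file field and the function name are assumptions used for illustration, not the actual etl.snapshot API:

```python
from pathlib import Path

import yaml


def check_snapshot_metadata(dvc_file: Path, script_file: Path, date_published_from_provider: str) -> dict:
    """Attach the generating script and sanity-check the provider's publication date."""
    meta = yaml.safe_load(dvc_file.read_text())["meta"]
    # Proposed new field: make the generating script explicit instead of relying on
    # the .dvc and .py file names coinciding.
    meta["script_file"] = str(script_file)
    # Fail loudly if the provider reports a different publication date than the one we recorded.
    recorded = meta["origin"]["date_published"]
    assert str(date_published_from_provider) == str(recorded), (
        f"Provider reports {date_published_from_provider}, but the .dvc file records {recorded}."
    )
    return meta
```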

Nice to have

  • We could have a schedule field, e.g. snapshot_period_min, which is the minimum number of days to wait between consecutive snapshots. We could also avoid retrying a snapshot that has failed more than a certain number of times.
  • Each snapshot in the schedule can have a needs_review field. Then, the snapshot updater process creates two PRs: One for snapshots that need review, and another for those that don't (which is automatically merged).
  • It would be good to have a way to check whether there are updates without downloading them, for example for big datasets like FAO (see the sketch after this list).
  • The current proposal automatically downloads available snapshots. But it should also be possible to simply signal that an update is available, without fetching it. There should be an easy way to achieve this (maybe we could have snapshots_ready and snapshots_available yaml files).
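
A minimal sketch of the "check without downloading" idea, assuming the provider serves files over HTTP and exposes ETag or Last-Modified headers (many providers do not, in which case we cannot tell without downloading); the function name and stored values are illustrative:

```python
from typing import Optional

import requests


def remote_has_changed(url: str, known_etag: Optional[str], known_last_modified: Optional[str]) -> bool:
    """Use a HEAD request to guess whether the remote file changed since the last snapshot."""
    response = requests.head(url, allow_redirects=True, timeout=30)
    response.raise_for_status()
    etag = response.headers.get("ETag")
    last_modified = response.headers.get("Last-Modified")
    if etag and known_etag:
        return etag != known_etag
    if last_modified and known_last_modified:
        return last_modified != known_last_modified
    # No usable headers: we cannot tell without downloading the file.
    return True
```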

Notes

  • This proposal is to only create snapshots automatically, not ETL steps. So, nothing that is public-facing will be automatically updated. We could have a separate schedule for common updates, which would create new data steps, update the dag so that new steps use the latest non-failing snapshots, run indicator upgrader, and create a PR. We would then simply go through chart diff, approve changes, and merge.
  • This proposal should in principle be compatible with using latest snapshots and data steps. But, in my view, the only reason to use latest is convenience, given our technical limitations. Ideally, if we manage to have a better update workflow, we should aim to keep all versions, to be able to review all changes in public data.

Caveats

One potential downside is that there will be many snapshot folders that are not used. This can be inconvenient when searching for files in VSCode. We could consider keeping them elsewhere (in the archive, or in a separate folder ignored by VSCode), and moving them to the active snapshots folder only when they are used.

@pabloarosado changed the title from "Periodically check for data updates and optionally create automatic snapshots" to "Periodically check for data updates and create automatic snapshots" on Oct 1, 2024
@Marigold (Collaborator) commented Oct 1, 2024

Nice write-up, thanks! As a first step, we could create a staging server and just try to execute as many snapshots as possible (without putting them in a special place). That should let us know how feasible this approach is.

@pabloarosado (Contributor, Author):

> Nice write-up, thanks! As a first step, we could create a staging server and just try to execute as many snapshots as possible (without putting them in a special place). That should let us know how feasible this approach is.

Thanks Mojmir! Good idea, I'll do that experiment. We could try that (with upload=False) to have an estimate of how many snapshots would be automatically re-runnable.
But I don't think that is necessarily a good proxy for the feasibility of this approach. In the end, even if many snapshots fail, it's better to know that snapshots are failing before you start to work on them (to be able to estimate how much time the update will take).
However, if we foresee that the majority of snapshots will fail, we may need to rethink how to handle failing snapshots.

@Marigold (Collaborator) commented Oct 2, 2024

> Thanks Mojmir! Good idea, I'll do that experiment. We could try that (with upload=False) to have an estimate of how many snapshots would be automatically re-runnable.

I'd go even further and do it with upload=True, then check data-diff and chart-diff to get a sense of how much can be updated. It's definitely not going to be the final solution, but it's something that could be done very quickly on a staging server.

@pabloarosado (Contributor, Author):

> Thanks Mojmir! Good idea, I'll do that experiment. We could try that (with upload=False) to have an estimate of how many snapshots would be automatically re-runnable.

> I'd go even further and do it with upload=True, then check data-diff and chart-diff to get a sense of how much can be updated. It's definitely not going to be the final solution, but it's something that could be done very quickly on a staging server.

Hmm, in principle the idea (with the current proposal) is to update only snapshots, and not touch any data steps. I'm not proposing automatic updates of any public-facing data. To be able to use chart-diff, we'd need to update data steps and use indicator upgrader for all datasets, which we currently can't do programmatically (as far as I know).
And does data-diff also compare snapshot data?

@pabloarosado (Contributor, Author):

Proposal B

Suggested by @Marigold (feel free to rephrase).

Summary

We have a periodic process that does "everything": from fetching the snapshots to creating a PR with chart diff ready to be approved and merged. There is no need for indicator upgrader because we use latest everywhere.

This would be the easiest way to automate regular updates, given the current technical limitations.

Changes in chart diff

Chart diff will need to show changes in the data of datasets that have been modified in the current PR. Currently, this is in principle possible, but I have low confidence that it does it well. It's certainly not the default workflow.

Caveats

I think latest is convenient, given our current limitations, but not ideal. Among other things, you often see changes in the data, unrelated to your work, every time you pull changes from master. This happens because latest steps have changed. And, while we could filter them out, I still think it's more valuable and robust to keep versions and make the updating process easier.

@pabloarosado (Contributor, Author) commented Oct 2, 2024

Proposal C

This would be a trade-off between A and B.

Summary

Instead of having a schedule of snapshots and a schedule of data updates, we have just one schedule of data updates, which does "everything": it fetches the snapshots, runs StepUpdater to create new versions of data steps, runs indicator upgrader programmatically, and then creates a PR with chart diff ready to be reviewed.

This would be, after Proposal B, the easiest way to automate regular updates, with the benefit that we would keep versions and be able to use chart diff properly.

Changes in indicator upgrader

We would need a way to run it programmatically (which I guess should be easy).

New script to create updates

As mentioned above, this script would not only fetch snapshots, but also run StepUpdater to create new data steps and update the dag, then run indicator upgrader and create a PR.

Caveats

In Proposal A we had two schedules: one for snapshots and one for data updates. That gives us the most flexibility, and it lets us visualize in the ETL dashboard whether updates from the data provider are available (and it's up to us to take action or plan it for the future). With Proposal C, we don't have that flexibility: we simply attempt to do the update, all in one go. If it turns out that there is no available snapshot, the PR can be automatically closed. If there is an available snapshot, the PR will stay open (consuming staging resources), possibly for weeks, without anyone taking care of it (because it would be unplanned work).

@Marigold (Collaborator) commented Oct 2, 2024

I'm indifferent between B and C; I'd try them both and see which is more convenient (whether checking data & metadata diff or two different versions). It's also possible that small updates won't really need a new version, while big updates would. (I'm not saying we should change versions to latest, but rather update existing files while keeping their version.)
