Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wizard: problems and improvements for etl dashboard and step updater #2365

Closed
10 of 16 tasks
pabloarosado opened this issue Mar 4, 2024 · 7 comments
Closed
10 of 16 tasks
Assignees
Labels
wizard Issues related to wizard tool wontfix This will not be worked on

Comments

@pabloarosado
Copy link
Contributor

pabloarosado commented Mar 4, 2024

One-liner

This tracking issue will list all problems and improvements for the ETL dashboard and StepUpdater.

Issues

  •  The button “Add all dependencies” in the Operations list becomes useless if one of the dependencies is population, because it suddenly adds lots of dependencies that do not need update. A possible solution would be to have a tick box to ignore population.
  • The "Clear Operations list" button needs to be pressed twice to work.
  • Bug: Column "step" in steps_df (from StepUpdater) contains only active steps (since they are the only needed ones). However, "direct_usages" (and probably other columns of dependencies) includes also archive steps. This means that, in the Operations list, when adding direct usages, archive steps suddenly appear (and also the dashboard raises an IndexError when trying to access archive steps in steps_df. Fixed by Let VersionTracker optionally ignore archive steps #2448
  • Bug: Steps:
    1. Add the latest garden step of long_term_crop_yields to the Operations list.
    2. In the Operations list, click on "Add direct dependencies" of long_term_crop_yields.
    3. Click on "Add direct dependencies" of long_term_wheat_yields.
    4. Click on "Add direct dependencies" of faostat_qcl.
  • Bug: Function that writes to dag removes the comment below a step that has been updated. For example, when updating the climate change impacts explorer latest step, it removes the comment of the next step. Fixed by 🐛 Improve function that writes to dag #2482
  • Loading and processing analytics makes VersionTracker and StepUpdater slower. And analytics are only needed for the ETL dashboard. A solution would be to add an optional flag argument to StepUpdater and VersionTracker so that analytics are loaded only optionally. Use it when using it for the dashboard, but not for updates. I tried doing this and the time difference was insignificant. Updating many steps (e.g. climate) is still very slow (even over 1 minute), but fetching analyltics or not doesn't make a significant difference.
  • Dependencies of the owid_co2 and owid_energy steps may appear as archivable, because those owid_* steps do not have any charts. Figure out a way to make them active (and possibly the same for other steps that are used by github repos).
  • When running step updater from the CLI in interactive mode, you can choose which .py file to update for each snapshot .dvc file. However, in wizard, that's currently not possible, and, if there are multiple .py files in the same folder of a .dvc file (and if none of the .py file names do not match the .dvc file name), then a file will be picked based on edit distance (which is prone to error). A warning is raised, but the unwanted files are duplicated anyway. We should have a way to select which .py files to update for each snapshot.

Improvements and ideas

  • It would be useful to be able to interact with the new steps after they have been created. One possibility would be to have a checkbox "Replace steps in operations list with updated ones". Alternatively, add another operations list below the update button for the new steps.
  • Add a button below the Operations list to remove non-updateable steps from the list.
  • Add a button below the Operations list to run ETL on all steps in the Operations list.
  • owid/owid-issues#1435
  • Add a button to each entry in the operations list to edit the metadata of the ETL step.
  • Instead of running updater and etl under the hood, it would be good to run it step by step, and have a progress bar (and log?) shown. If that's not possible, at least it would be good to show the logs in the terminal that runs the dashboard (otherwise, you may see something running for a long time, without knowing what's going on).
  • Pablo A suggested that updated "latest" steps should be moved to the bottom of the dag, instead of their dependencies being updated in place.
  • Add button to restart dashboard. Unselecting and selecting grapher and explorer steps is not enough. The button is useful, for example, after doing an update (without using the dashboard). Currently, the only way to update the data shown in the dashboard is to restart the wizard.
@paarriagadap
Copy link
Contributor

Hi! I am starting to update World Bank PIP. I selected all the grapher/explorer steps containing world_bank_pip, I added them to the Operations list, then I added all dependencies for the three steps I selected and then I updated the 7 resulting steps.

The files were generated, but the dag was updated not the way I expected:

  • The garden dependency from the explorer step was replaced, I assume because the platform always considers a date by default and not latest.
image
  • The updated steps were added at the bottom without the explorer step mentioned above:
image
  • The step updater deleted the comment to distinguish another set of steps from the World Inequality Database:
image

The step updater also adds the steps right next to the previous dependencies. It would be nice to have a line jump between them:
image

I think the new folders and files work fine, so I only need to correct the dag. Thank you very much!

@paarriagadap
Copy link
Contributor

The step updater gets rid of the wizard formatting, which might be clearer to read.
image
image

@pabloarosado
Copy link
Contributor Author

Hi @paarriagadap, thanks a lot for reporting these issues! The issue of removing comments is already known (listed above). I'll try to fix that soon.

If I understand correctly, the issues are related to the formatting of the written files (either the dag or the snapshot metadata files), but the step updater is behaving as expected. In other words, the code generated by the tool is correct, although not great in terms of style. Is that right?

Currently, the step updater either (1) writes new steps in the dag, or (2) overwrites dependencies of existing steps. Steps with "latest" version correspond to case (2) because the step already exists, and its dependencies need to be updated. So you would prefer those updated "latest" steps to be moved the bottom of the dag, as if they were new steps. Is that right?

Please let me know if I misunderstood your issues. Thanks!

@paarriagadap
Copy link
Contributor

Hi @pabloarosado, yes, it's mostly formatting and that is better that the latest steps go to the bottom (or it's replicated in both old and new steps). Thank you!

A tiny one I found now is that I had an additional script to extract the data from PIP in the snapshot folder (pip_api.py here), and it wasn't copied to the new step, while there are shared.py scripts in garden that were carried over to the new steps.

@pabloarosado
Copy link
Contributor Author

Thanks for the clarification, I added the suggestion about "latest" steps to the list of improvements.

Regarding pip_api.py file, I think that's expected. The expected scenario is that there is only one code file per snapshot. In this case, that file is actually not generating snapshot, so it's a bit of a special case. Maybe we can figure out a solution for it, but I guess it's very uncommon.

@lucasrodes
Copy link
Member

should we allocate time for this during this cycle, @pabloarosado ?

Copy link

stale bot commented Oct 29, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Oct 29, 2024
@stale stale bot closed this as completed Nov 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wizard Issues related to wizard tool wontfix This will not be worked on
Projects
None yet
Development

No branches or pull requests

5 participants