Improve the way we find and load a dataset/table/indicator #2648

Open
pabloarosado opened this issue May 16, 2024 · 5 comments

@pabloarosado
Contributor

pabloarosado commented May 16, 2024

Summary

We need a better way to quickly access an ETL dataset/table/indicator from within Python.

Goal: Improve our catalog.find* function(s) (or create another function).

  • Minimum requirements: The function makes it easier to find and load a specific dataset or table.
  • Better: The function also lets you search for indicators.
  • Optimal: The function lets you search for a dataset/table/indicator based on exact or fuzzy matches, or on semantic similarity of ETL paths as well as metadata.
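To make the "exact or fuzzy matches" requirement concrete, here is a minimal sketch using only the standard library. The function name `search_paths`, the `min_score` threshold, and the two demography paths are hypothetical and only serve as illustration; a real implementation would read paths from the DAG or the catalog index and would probably also search metadata.

```python
from difflib import SequenceMatcher

# Illustrative list of ETL paths. Only the first one appears in this issue;
# the demography paths are made up for the example.
PATHS = [
    "garden/energy/2023-12-12/owid_energy",
    "garden/demography/2023-03-31/population",
    "garden/demography/2023-03-31/population_density",
]


def search_paths(query, paths, min_score=0.5):
    """Return (score, path) pairs, best match first.

    An exact match on the short name scores 1.0; anything else is scored
    by character-level similarity between the query and the short name.
    """
    query = query.lower()
    results = []
    for path in paths:
        short_name = path.rsplit("/", 1)[-1]
        if query == short_name:
            score = 1.0
        else:
            score = SequenceMatcher(None, query, short_name).ratio()
        if score >= min_score:
            results.append((score, path))
    return sorted(results, key=lambda item: item[0], reverse=True)


if __name__ == "__main__":
    for score, path in search_paths("population", PATHS):
        print(f"{score:.2f}  {path}")
```

Ranking exact short-name matches above fuzzy ones keeps a query like "population" from being confused with "population_density", while a typo such as "owid_enrgy" still finds the energy dataset.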

Current workflow to search and load data

Currently, every time I need to access an ETL table (for example, to do a quick check) I need to go through one or more of the following steps:

  • Scrolling through the ETL dag files or snapshots/step files (which is quite inconvenient).
  • Going to the dataset admin.
  • Going to our home page to use the search bar. This is nice, but once you find the chart, then you have to click on "edit", then select the indicator, and copy the ETL path.
  • The indicator search is powerful, although in practice I have not found it particularly useful so far.

Then, once you know which dataset/table/indicator you need, you have to write quite a lot of code to load it, e.g.:

from etl.paths import DATA_DIR
from owid.catalog import Dataset
ds = Dataset(DATA_DIR / "garden/energy/2023-12-12/owid_energy")
tb = ds["owid_energy"].reset_index()

Note that several things have to be supplied manually here (the namespace, version, short name of the dataset, and short name of the table).
Then, if I need a specific indicator, I have to do all of this, then run something like [c for c in tb.columns if "solar" in c], and then figure out which indicator I need.

Overall, it takes a few minutes to "quickly" access some data, which is not ideal.
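As a rough illustration of what a helper could take off our hands, the sketch below splits an ETL path into the pieces that currently have to be typed by hand and wraps the column filter from above. `EtlPath` and `matching_columns` are hypothetical names, not part of owid.catalog, and the column names in the usage example are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EtlPath:
    """The four components of a path like garden/energy/2023-12-12/owid_energy."""

    channel: str
    namespace: str
    version: str
    short_name: str

    @classmethod
    def parse(cls, path: str) -> "EtlPath":
        parts = path.strip("/").split("/")
        if len(parts) != 4:
            raise ValueError(f"Expected channel/namespace/version/short_name, got: {path!r}")
        return cls(*parts)


def matching_columns(columns, keyword):
    """Return the column names that contain the keyword (same check as in the text above)."""
    return [c for c in columns if keyword in c]


if __name__ == "__main__":
    p = EtlPath.parse("garden/energy/2023-12-12/owid_energy")
    print(p.namespace, p.version, p.short_name)
    # Invented column names, for illustration only.
    print(matching_columns(["solar_share", "wind_share", "country"], "solar"))
```

A loader built on top of this could default the table short name to the dataset short name (as in the owid_energy example), but that default is an assumption and would not hold for datasets with several tables.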

Issues with catalog.find*

We currently have catalog.find (and other related functions), which is supposed to help with this, but it has issues. Off the top of my head:

  • It finds tables, not datasets.
  • It doesn't let you load snapshots.
  • It's not transparent about whether it loads from the remote catalog or the local ETL.
  • It has problems (I don't remember exactly what) when loading tables with similar short names (e.g. "population" and "population_density").
  • Also, a user gave us feedback on this function: Feedback for owid-catalog #2616

So, in practice, catalog.find is not useful. Therefore, we need a better function to quickly find and access our data. Superusers would also benefit from these improvements.

@pabloarosado pabloarosado changed the title Improve the way we load a catalog table Improve the way we load a dataset/table/indicator May 16, 2024
@pabloarosado pabloarosado changed the title Improve the way we load a dataset/table/indicator Improve the way we find and load a dataset/table/indicator May 16, 2024
@larsyencken
Collaborator

Some background on why find() is not as great as it could be:

  • find() is defined on the owid.catalog module, but is really RemoteCatalog.find()
  • The RemoteCatalog can have methods on it like __getitem__ that you can't define on the module

In principle, we can build whatever we like here, so we should just get the interface right and then write the (trivial) code to make it work smoothly.
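The module-versus-class point can be shown with plain Python: special methods such as `__getitem__` are looked up on the type of an object, so assigning one to a module object has no effect on subscripting, while a class can define it normally. `ToyCatalog` and `fake_catalog` below are stand-ins for illustration, not the real RemoteCatalog or owid.catalog.

```python
import types

# A stand-in module object (hypothetical, not the real owid.catalog module).
fake_module = types.ModuleType("fake_catalog")
fake_module.__getitem__ = lambda key: f"table:{key}"

try:
    fake_module["population"]
    module_subscript_works = True
except TypeError:
    # Python looks up __getitem__ on the module type, not on the instance,
    # so the attribute assigned above is ignored.
    module_subscript_works = False


class ToyCatalog:
    """Toy catalog object that supports both find() and catalog[name]."""

    def __init__(self, tables):
        self._tables = dict(tables)

    def find(self, query):
        # Substring search over table names.
        return [name for name in self._tables if query in name]

    def __getitem__(self, name):
        # Exact lookup by table name.
        return self._tables[name]


catalog = ToyCatalog({"population": [1, 2, 3], "population_density": [4, 5]})

if __name__ == "__main__":
    print(module_subscript_works)
    print(catalog.find("population"))
    print(catalog["population"])
```

This is one reason an interface built around a catalog object, rather than module-level functions, leaves more room for conveniences like exact lookup by name.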

stale bot commented Sep 11, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Sep 11, 2024
@Marigold
Collaborator

I'd keep this open for a bit longer, although it doesn't seem to be a priority at the moment.

@stale stale bot closed this as completed Sep 18, 2024
@Marigold Marigold removed the wontfix This will not be worked on label Sep 19, 2024
@Marigold
Collaborator

We still want this, though it might be a part of a larger API redesign.

@Marigold Marigold reopened this Sep 19, 2024

stale bot commented Nov 19, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Nov 19, 2024
@Marigold Marigold added pinned and removed wontfix This will not be worked on labels Nov 19, 2024