Hardcoded paths in MDIMs make future updates harder #3723

pabloarosado · 2024-12-16T16:25:05Z

Context

Currently, MDIM steps require a config YAML file, which includes the full paths of indicators in tables of grapher steps.

Potential problems

- You could reference data in the YAML that wasn't in the DAG dependencies.
  - This could lead to intermittent ETL failures, or to the ETL not rebuilding the MDIM even though its data was stale.
- The hard-coding of versions in the YAML (e.g. https://github.com/owid/etl/blob/master/etl/steps/export/multidim/energy/latest/energy.yml) means that updating MDIMs is more annoying.
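
To make the first problem concrete, here is a minimal sketch of a check that flags indicator paths whose dataset is not declared as a dependency. The function name `find_undeclared_paths`, the regular expression, and the assumed path layout (`channel/namespace/version/dataset/table#indicator`) are illustrative assumptions, not existing ETL code.

```python
import re

# Assumed layout of a full indicator path, e.g.
# "grapher/energy/2024-06-20/energy_mix/energy_mix#coal__twh".
CATALOG_PATH = re.compile(
    r"^(?P<dataset>[^/#]+/[^/#]+/[^/#]+/[^/#]+)/(?P<table>[^/#]+)#(?P<indicator>[^/#]+)$"
)


def find_undeclared_paths(catalog_paths, dependencies):
    """Return the paths whose dataset is not among the DAG dependencies.

    `dependencies` are step URIs such as
    "data://grapher/energy/2024-06-20/energy_mix" (hypothetical helper).
    """
    # Strip the "scheme://" prefix so dependencies and paths are comparable.
    declared = {dep.split("://", 1)[-1] for dep in dependencies}
    undeclared = []
    for path in catalog_paths:
        match = CATALOG_PATH.match(path)
        # Paths that do not parse are reported too, since they cannot be verified.
        if match is None or match.group("dataset") not in declared:
            undeclared.append(path)
    return undeclared
```

A check like this could run when the config is loaded, so that a path pointing at an undeclared (or outdated) dataset version fails loudly instead of producing a stale MDIM.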

Possible solution

Treat the MDIM yaml as a template, where we fill in some variables at the time we ship it.

Ideally, we would avoid hardcoding paths (in steps and yaml files), and all dependencies would be specified in the DAG.

After a discussion with @lucasrodes we thought of a possible solution. The config yaml file (for example, that of the covid mdim step) could have a special placeholder for a dataset path of the form {ds:short_name} (e.g. {ds:covid_cases}), where short_name is the short name of a dataset listed as a dependency of the mdim step in the DAG.
Then, the function paths.load_mdim_config would read the config yaml file and replace those placeholders with the full URIs of the corresponding datasets.
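
A rough sketch of the placeholder substitution described above, operating on the raw YAML text. The function `expand_dataset_placeholders` is hypothetical (it is not the actual `paths.load_mdim_config`), and stripping the `scheme://` prefix from the dependency URI is an assumption about what the expanded path should look like.

```python
import re

# Matches placeholders such as "{ds:covid_cases}".
PLACEHOLDER = re.compile(r"\{ds:(?P<short_name>[A-Za-z0-9_]+)\}")


def expand_dataset_placeholders(config_text, dependencies):
    """Replace each {ds:short_name} with the path of the matching dependency."""
    # Index dependencies by short name (the last path segment of the URI).
    by_short_name = {}
    for dep in dependencies:
        path = dep.split("://", 1)[-1]  # assumption: drop the "data://" prefix
        by_short_name[path.rsplit("/", 1)[-1]] = path

    def replace(match):
        short_name = match.group("short_name")
        if short_name not in by_short_name:
            raise KeyError(f"{{ds:{short_name}}} is not a dependency of this step")
        return by_short_name[short_name]

    return PLACEHOLDER.sub(replace, config_text)
```

Because unknown placeholders raise an error, a config can only reference datasets that are declared in the DAG.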

Possible rabbit holes or related issues

- But it is possible that multiple dependencies of an mdim step have the same short name. And we may also want to create an mdim that compares different versions of the same dataset. For such cases, we could define custom placeholders, e.g. {ds:custom_short_name}, and then pass a dictionary to paths.load_mdim_config mapping those custom short names to the corresponding dataset URIs.
  - As a reference, see a parallel implementation of safe short names in a related project: https://github.com/larsyencken/shelf/blob/main/src/shelf/tables.py#L200
- We also noticed that it is inconvenient that Table does not have a URI, and we rely on Table.metadata.dataset.uri. Maybe tables should also have a URI attribute.
- We may need an additional function of paths to get the URI of a table in a dataset. Currently, the way we'd do that is by, e.g. `paths.load_dataset("dataset_path...") + "/table_name..."`.
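
One way the custom-placeholder idea could work, as a sketch only: build the placeholder mapping from the dependencies, refuse ambiguous short names, and let the caller disambiguate with an explicit dictionary. The function `build_placeholder_mapping` and its `custom_names` argument are hypothetical names introduced for illustration.

```python
from collections import defaultdict


def build_placeholder_mapping(dependencies, custom_names=None):
    """Map placeholder names to dataset URIs, refusing ambiguous short names.

    `custom_names` maps a custom short name to a dataset URI (hypothetical API).
    """
    mapping = dict(custom_names or {})
    explicitly_named = set(mapping.values())

    # Group the remaining dependencies by their default short name.
    by_short_name = defaultdict(list)
    for dep in dependencies:
        if dep not in explicitly_named:
            by_short_name[dep.rsplit("/", 1)[-1]].append(dep)

    for short_name, uris in by_short_name.items():
        if len(uris) > 1:
            raise ValueError(f"Ambiguous short name '{short_name}': {sorted(uris)}")
        if short_name in mapping:
            raise ValueError(f"Custom name '{short_name}' clashes with a dependency")
        mapping[short_name] = uris[0]
    return mapping
```

With two versions of the same dataset among the dependencies, the call fails unless each version is given its own custom name, which covers the "compare different versions" case.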

Impact

We're not encountering this problem much yet, but we are currently setting precedents for how a large amount of work will be done, so we're interested in saving ourselves future work by getting this right.


larsyencken · 2025-01-31T10:29:34Z

We discussed this in triage today.

@Marigold thought this could be good to tackle at a time when we are fixing a related issue, to make MDIMs on the ETL side...

- build from data://grapher/... steps (filesystem -> DB)
- rather than grapher://grapher/... steps (filesystem -> DB -> filesystem -> DB)

as they do now.

pabloarosado · 2025-02-06T10:51:52Z

To clarify, this should not include refactoring the currently existing mdims (e.g. covid). That should happen as a separate issue.

Marigold · 2025-02-14T10:56:58Z

This was fixed by #3945

It's now sufficient to specify table#indicator instead of the full catalog path. So grapher/energy/2024-06-20/energy_mix/energy_mix#coal__twh becomes energy_mix#coal__twh. Those paths get dynamically expanded in the last step, in the upsert_multidim_data_page(...) function. You just have to pass the extra argument dependencies=paths.dependencies so that the function knows the dataset names.
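
For readers unfamiliar with the idea, the expansion could look roughly like the sketch below. This is not the code from #3945: the function `expand_catalog_path` and the `tables_by_dataset` lookup (dataset path to table names) are assumptions made for illustration, whereas the real function receives `dependencies=paths.dependencies`.

```python
def expand_catalog_path(short_path, tables_by_dataset):
    """Expand "table#indicator" into "<dataset path>/table#indicator".

    `tables_by_dataset` maps a dataset path to the names of its tables
    (hypothetical lookup, built from the step's dependencies).
    """
    if "/" in short_path:
        # Already a full catalog path; leave it untouched.
        return short_path

    table_name, _, indicator = short_path.partition("#")
    if not table_name or not indicator:
        raise ValueError(f"Expected 'table#indicator', got '{short_path}'")

    # The table must belong to exactly one dependency, otherwise it is ambiguous.
    candidates = [
        dataset for dataset, tables in tables_by_dataset.items() if table_name in tables
    ]
    if len(candidates) != 1:
        raise ValueError(
            f"Table '{table_name}' found in {len(candidates)} dependencies, expected 1"
        )
    return f"{candidates[0]}/{short_path}"
```

Using the example from this comment, `energy_mix#coal__twh` with a single dependency `grapher/energy/2024-06-20/energy_mix` containing the table `energy_mix` expands to the full path shown above.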

pabloarosado · 2025-02-14T11:30:27Z

That sounds like a great solution, thanks a lot @Marigold!

github-actions bot added the needs triage label Dec 16, 2024

larsyencken changed the title ~~Track dependencies in mdim steps~~ Some MDIM steps hardcode paths making Jan 31, 2025

larsyencken changed the title ~~Some MDIM steps hardcode paths making~~ Some MDIM steps hardcode paths Jan 31, 2025

larsyencken added the priority 3 - nice to have label Jan 31, 2025

larsyencken changed the title ~~Some MDIM steps hardcode paths~~ Hardcoded paths in MDIMs make future updates harder Jan 31, 2025

pabloarosado added the priority 2 - important and priority 1 - essential labels and removed the needs triage, priority 3 - nice to have, and priority 2 - important labels Feb 6, 2025

pabloarosado assigned Marigold Feb 6, 2025

Marigold closed this as completed Feb 14, 2025
