Track dependencies in mdim steps #3723

Open
pabloarosado opened this issue Dec 16, 2024 · 0 comments

Context

Currently, mdim steps require a config YAML file that includes the full paths of indicators in tables of grapher steps. There is no way to ensure that those dataset dependencies are always specified in the DAG. As a result, after an update we need to remember to manually replace versions in the config YAML files, which is extra work and error-prone.

Possible solution

Ideally, we would avoid hardcoding paths (in steps and yaml files), and all dependencies would be specified in the DAG.

After a discussion with @lucasrodes, we thought of a possible solution. The config YAML file (for example, that of the covid mdim step) could use a special placeholder for a dataset path, {ds:short_name} (e.g. {ds:covid_cases}), where short_name is the short name of a dataset listed as a dependency of the mdim step in the DAG.
Then, the function paths.load_mdim_config would read the config YAML file and replace those placeholders with the full URIs of the corresponding datasets.
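
To make the idea concrete, here is a minimal sketch of the substitution step. It is not the actual implementation of paths.load_mdim_config: the function name resolve_placeholders, the assumption that a dataset's short name is the last segment of its URI, and the example URIs are all hypothetical and only for illustration.

```python
import re

# Matches placeholders of the form {ds:short_name}.
PLACEHOLDER = re.compile(r"\{ds:([A-Za-z0-9_]+)\}")


def short_name(uri: str) -> str:
    # Assumption: the short name is the last path segment of the dataset URI.
    return uri.rstrip("/").split("/")[-1]


def resolve_placeholders(config_text: str, dependencies: list[str]) -> str:
    """Replace each {ds:short_name} with the URI of the matching dependency.

    `dependencies` stands in for the dataset URIs listed for the step in the DAG.
    """
    by_name: dict[str, list[str]] = {}
    for uri in dependencies:
        by_name.setdefault(short_name(uri), []).append(uri)

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        candidates = by_name.get(name, [])
        # Fail loudly if the dependency is missing from the DAG or ambiguous.
        if len(candidates) != 1:
            raise KeyError(
                f"placeholder {{ds:{name}}} matches {len(candidates)} dependencies"
            )
        return candidates[0]

    return PLACEHOLDER.sub(substitute, config_text)


# Illustrative dependency URI and config line (both made up).
deps = ["data://grapher/covid/latest/covid_cases"]
resolved = resolve_placeholders("indicator: {ds:covid_cases}/cases#weekly_cases", deps)
```

With this approach, a placeholder that does not correspond to exactly one dependency raises an error, so a missing DAG entry would be caught when the config is loaded rather than silently pointing at a stale version.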

Possible rabbit holes or related issues

  • Multiple dependencies of an mdim step may have the same short name, and we may also want to create an mdim that compares different versions of the same dataset. For such cases, we could define custom placeholders, e.g. {ds:custom_short_name}, and then pass a dictionary to paths.load_mdim_config mapping those custom short names to the corresponding dataset URIs.
  • We also noticed that it is inconvenient that Table does not have a URI, so we have to rely on Table.metadata.dataset.uri. Maybe tables should also have a URI attribute.
  • We may need an additional function in paths to get the URI of a table in a dataset. Currently, we would do that with something like `paths.load_dataset("dataset_path...") + "/table_name..."`.
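
The points above could be sketched roughly as follows. This is only an illustration under stated assumptions: resolve_with_overrides and table_uri are hypothetical helpers (not existing functions of paths), the short name is assumed to be the last segment of a dataset URI, and the URIs and versions are made up. The sketch also assumes that custom mappings must point to URIs that are themselves dependencies, so the DAG remains the single source of truth.

```python
import re
from typing import Optional

# Matches placeholders of the form {ds:name}.
PLACEHOLDER = re.compile(r"\{ds:([A-Za-z0-9_]+)\}")


def table_uri(dataset_uri: str, table_name: str) -> str:
    """Hypothetical helper: build a table URI from a dataset URI and a table name."""
    return f"{dataset_uri.rstrip('/')}/{table_name}"


def resolve_with_overrides(
    config_text: str,
    dependencies: list[str],
    custom: Optional[dict[str, str]] = None,
) -> str:
    """Resolve {ds:name} placeholders, letting `custom` disambiguate short names."""
    custom = custom or {}
    for name, uri in custom.items():
        # Assumption: custom names may only point to URIs declared as dependencies.
        if uri not in dependencies:
            raise ValueError(f"custom name {name!r} maps to a non-dependency: {uri}")

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name in custom:
            return custom[name]
        matches = [u for u in dependencies if u.rstrip("/").split("/")[-1] == name]
        if len(matches) != 1:
            raise KeyError(
                f"{{ds:{name}}} matches {len(matches)} dependencies; "
                "pass a custom mapping to disambiguate"
            )
        return matches[0]

    return PLACEHOLDER.sub(substitute, config_text)


# Two versions of the same dataset share a short name (illustrative URIs).
deps = [
    "data://grapher/covid/2023-01-01/covid_cases",
    "data://grapher/covid/2024-12-16/covid_cases",
]
custom = {"cases_old": deps[0], "cases_new": deps[1]}
resolved = resolve_with_overrides("{ds:cases_old} vs {ds:cases_new}", deps, custom)
```

Here, using {ds:covid_cases} directly would raise an error because two dependencies share that short name, whereas the custom names resolve unambiguously.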