Enable variable interpolation for the catalog with OmegaConfigLoader #2621

Merged: 10 commits, Jun 2, 2023
12 changes: 11 additions & 1 deletion RELEASE.md
@@ -8,6 +8,17 @@

## Migration guide from Kedro 0.18.* to 0.19.*

# Upcoming Release 0.18.10

## Major features and improvements
* Added support for variable interpolation in the catalog with the `OmegaConfigLoader`.

## Bug fixes and other changes

## Breaking changes to the API

## Upcoming deprecations for Kedro 0.19.0

# Release 0.18.9

## Major features and improvements
@@ -35,7 +46,6 @@ Many thanks to the following Kedroids for contributing PRs to this release:

## Upcoming deprecations for Kedro 0.19.0


# Release 0.18.8

## Major features and improvements
26 changes: 24 additions & 2 deletions docs/source/configuration/advanced_configuration.md
@@ -218,6 +218,7 @@ Although Jinja2 is a very powerful and extremely flexible template engine, which


### How to do templating with the `OmegaConfigLoader`
#### Parameters
Templating or [variable interpolation](https://omegaconf.readthedocs.io/en/2.3_branch/usage.html#variable-interpolation), as it's called in `OmegaConf`, for parameters works out of the box if the template values are within the parameter files or the name of the file that contains the template values follows the same config pattern specified for parameters.
By default, the config pattern for parameters is: `["parameters*", "parameters*/**", "**/parameters*"]`.
Suppose you have one parameters file called `parameters.yml` containing parameters with `omegaconf` placeholders like this:
@@ -236,10 +237,31 @@ data:

Since both of the file names (`parameters.yml` and `parameters_globals.yml`) match the config pattern for parameters, the `OmegaConfigLoader` will load the files and resolve the placeholders correctly.

```{note}
Templating currently only works for parameter files, but not for catalog files.
#### Catalog
From Kedro `0.18.10`, templating also works for catalog files. To enable templating in the catalog, make sure that the template values are either within the catalog files themselves or in a file whose name matches the config pattern specified for catalogs.
By default, the config pattern for catalogs is: `["catalog*", "catalog*/**", "**/catalog*"]`.

Additionally, any template values in the catalog need to start with an underscore `_`. This is because of how catalog entries are validated. Templated values will neither trigger a key duplication error nor appear in the resulting configuration dictionary.

Suppose you have one catalog file called `catalog.yml` containing entries with `omegaconf` placeholders like this:

```yaml
companies:
  type: ${_pandas.type}
  filepath: data/01_raw/companies.csv
```

and a file containing the template values called `catalog_globals.yml`:
```yaml
_pandas:
  type: pandas.CSVDataSet
```

Since both of the file names (`catalog.yml` and `catalog_globals.yml`) match the config pattern for catalogs, the `OmegaConfigLoader` will load the files and resolve the placeholders correctly.
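
As a rough illustration of what happens to the two files above, the sketch below walks through the same three steps in plain Python: merge the files, resolve the `${...}` references, then drop the underscore-prefixed keys. It is only a simplified stand-in. The helpers `lookup` and `resolve` are hypothetical and handle whole-string references only; the `OmegaConfigLoader` itself delegates merging and resolution to `OmegaConf`.

```python
import re
from typing import Any

# Matches a whole-string reference such as "${_pandas.type}".
_REFERENCE = re.compile(r"^\$\{([^}]+)\}$")


def lookup(root: dict, dotted_key: str) -> Any:
    """Follow a dotted key such as "_pandas.type" through nested dicts."""
    node: Any = root
    for part in dotted_key.split("."):
        node = node[part]
    return node


def resolve(node: Any, root: dict) -> Any:
    """Recursively replace whole-string "${...}" references with their targets."""
    if isinstance(node, dict):
        return {key: resolve(value, root) for key, value in node.items()}
    if isinstance(node, str):
        match = _REFERENCE.match(node)
        if match:
            return resolve(lookup(root, match.group(1)), root)
    return node


# Contents of catalog.yml and catalog_globals.yml, written as plain dicts.
catalog_yml = {
    "companies": {
        "type": "${_pandas.type}",
        "filepath": "data/01_raw/companies.csv",
    }
}
catalog_globals_yml = {"_pandas": {"type": "pandas.CSVDataSet"}}

# 1. Merge both files, 2. resolve references, 3. drop underscore-prefixed keys.
merged = {**catalog_yml, **catalog_globals_yml}
resolved = resolve(merged, merged)
catalog = {key: value for key, value in resolved.items() if not key.startswith("_")}

print(catalog)
# {'companies': {'type': 'pandas.CSVDataSet', 'filepath': 'data/01_raw/companies.csv'}}
```

The `_pandas` entry is needed while references are resolved but is absent from the final dictionary, which is why it never reaches catalog validation.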

#### Other configuration files
It's also possible to use variable interpolation in configuration files other than parameters and catalog, such as custom spark or mlflow configuration. This works in the same way as variable interpolation in parameter files. You can still use the underscore for the templated values if you want, but it's not mandatory like it is for catalog files.

### How to use custom resolvers in the `OmegaConfigLoader`
`OmegaConf` provides functionality to [register custom resolvers](https://omegaconf.readthedocs.io/en/2.3_branch/usage.html#resolvers) for templated values. You can use these custom resolvers within Kedro by extending the [`OmegaConfigLoader`](/kedro.config.OmegaConfigLoader) class.
The example below illustrates this:
3 changes: 2 additions & 1 deletion docs/source/configuration/configuration_basics.md
@@ -45,7 +45,8 @@ Kedro merges configuration information and returns a configuration dictionary ac
* If any two configuration files located inside the **same** environment path (such as `conf/base/`) contain the same top-level key, the configuration loader raises a `ValueError` indicating that duplicates are not allowed.
* If two configuration files contain the same top-level key but are in **different** environment paths (for example, one in `conf/base/`, another in `conf/local/`) then the last loaded path (`conf/local/`) takes precedence as the key value. `ConfigLoader.get` does not raise any errors but a `DEBUG` level log message is emitted with information on the overridden keys.
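
The precedence rule for different environment paths can be sketched in a few lines of plain Python. This is an illustration rather than Kedro's implementation: `merge_environments` is a hypothetical helper, the sample file paths are made up, and the sketch assumes that the later environment replaces the whole value of a shared top-level key.

```python
import logging

logger = logging.getLogger(__name__)


def merge_environments(base: dict, override: dict) -> dict:
    """Merge two environment configs; top-level keys in `override` win."""
    overridden = sorted(base.keys() & override.keys())
    if overridden:
        # No error is raised for cross-environment overlaps, only a DEBUG message.
        logger.debug("Keys overridden by the later environment: %s", ", ".join(overridden))
    return {**base, **override}


# Hypothetical contents of conf/base/catalog.yml and conf/local/catalog.yml.
base_env = {
    "companies": {"filepath": "data/01_raw/companies.csv"},
    "reviews": {"filepath": "data/01_raw/reviews.csv"},
}
local_env = {"companies": {"filepath": "/tmp/companies_sample.csv"}}

config = merge_environments(base_env, local_env)
```

Here `config["companies"]` comes from the `local_env` dictionary, while `config["reviews"]` is untouched because only the base environment defines it.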

When using the default `ConfigLoader` or the `TemplatedConfigLoader`, any top-level keys that start with `_` are considered hidden (or reserved) and are ignored. Those keys will neither trigger a key duplication error nor appear in the resulting configuration dictionary. However, you can still use such keys, for example, as [YAML anchors and aliases](https://www.educative.io/blog/advanced-yaml-syntax-cheatsheet#anchors).
When using any of the configuration loaders, any top-level keys that start with `_` are considered hidden (or reserved) and are ignored. Those keys will neither trigger a key duplication error nor appear in the resulting configuration dictionary. However, you can still use such keys, for example, as [YAML anchors and aliases](https://www.educative.io/blog/advanced-yaml-syntax-cheatsheet#anchors)
or [to enable templating in the catalog when using the `OmegaConfigLoader`](advanced_configuration.md#how-to-do-templating-with-the-omegaconfigloader).

### Configuration file names
Configuration files will be matched according to file name and type rules. Suppose the config loader needs to fetch the catalog configuration, it will search according to the following rules:
13 changes: 11 additions & 2 deletions kedro/config/omegaconf_config.py
@@ -286,7 +286,13 @@ def load_and_merge_dir_config(  # pylint: disable=too-many-arguments
            return OmegaConf.to_container(
                OmegaConf.merge(*aggregate_config, self.runtime_params), resolve=True
            )
        return OmegaConf.to_container(OmegaConf.merge(*aggregate_config), resolve=True)
        return {
            k: v
            for k, v in OmegaConf.to_container(
                OmegaConf.merge(*aggregate_config), resolve=True
            ).items()
            if not k.startswith("_")
        }
Comment on lines +289 to +295
Contributor

Just want to point out that this doesn't affect the catalog only; it applies more generally to anything other than parameters, for example Spark configuration.

Member Author

Very good point!! I'll clarify that in the docs.

Member Author

I guess it's a bit more nuanced. Templating in other files doesn't need the `_`; that's specifically needed for catalog validation. But it's true that `_` values in any file will be ignored by the duplication checks etc.
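
To make the nuance in this thread concrete, here is a small sketch of the filtering step applied to a non-catalog config. The function name `drop_hidden_keys` and the Spark values are hypothetical; the dictionary stands for a config that has already been merged and resolved.

```python
def drop_hidden_keys(config: dict) -> dict:
    """Return a copy of `config` without top-level keys that start with "_"."""
    return {key: value for key, value in config.items() if not key.startswith("_")}


# Hypothetical, already-resolved spark config. One template key is hidden with
# an underscore, the other is not, so only the latter stays in the result.
resolved_spark = {
    "_memory": "4g",
    "default_cores": 2,
    "spark.executor.memory": "4g",
    "spark.executor.cores": 2,
}

visible = drop_hidden_keys(resolved_spark)
```

Both template keys can feed interpolation, but only `default_cores` remains visible afterwards. For the catalog that leftover key would be treated as a dataset entry, which is why the underscore is mandatory there and optional elsewhere.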


    def _is_valid_config_path(self, path):
        """Check if given path is a file path and file type is yaml or json."""
@@ -307,7 +313,10 @@ def _check_duplicates(seen_files_to_keys: dict[Path, set[Any]]):
            for filepath2 in filepaths[i:]:
                config2 = seen_files_to_keys[filepath2]

                overlapping_keys = config1 & config2
                combined_keys = config1 & config2
                overlapping_keys = {
                    key for key in combined_keys if not key.startswith("_")
                }

                if overlapping_keys:
                    sorted_keys = ", ".join(sorted(overlapping_keys))
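
The effect of the changed duplicate check can be reproduced outside Kedro with a minimal sketch. `find_duplicates` is a hypothetical stand-in for `_check_duplicates` that returns the offending pairs instead of raising a `ValueError`; the file names mirror the ones used in the new test.

```python
from itertools import combinations
from pathlib import Path


def find_duplicates(seen_files_to_keys: dict) -> list:
    """Return (file1, file2, keys) for every pair sharing a non-hidden key."""
    duplicates = []
    for path1, path2 in combinations(sorted(seen_files_to_keys), 2):
        combined_keys = seen_files_to_keys[path1] & seen_files_to_keys[path2]
        # Keys starting with "_" are allowed to repeat across files.
        overlapping_keys = {key for key in combined_keys if not key.startswith("_")}
        if overlapping_keys:
            duplicates.append((path1.name, path2.name, sorted(overlapping_keys)))
    return duplicates


seen = {
    Path("conf/base/catalog1.yml"): {"k1", "_k2"},
    Path("conf/base/catalog2.yml"): {"k3", "_k2"},
    Path("conf/base/catalog3.yml"): {"k1", "_k2"},
}

duplicates = find_duplicates(seen)
print(duplicates)
# [('catalog1.yml', 'catalog3.yml', ['k1'])]
```

All three files define `_k2`, yet only the shared `k1` is reported.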
55 changes: 54 additions & 1 deletion tests/config/test_omegaconf_config.py
@@ -73,7 +73,6 @@ def create_config_dir(tmp_path, base_config, local_config):
    base_catalog = tmp_path / _BASE_ENV / "catalog.yml"
    base_logging = tmp_path / _BASE_ENV / "logging.yml"
    base_spark = tmp_path / _BASE_ENV / "spark.yml"
    base_catalog = tmp_path / _BASE_ENV / "catalog.yml"

    local_catalog = tmp_path / _DEFAULT_RUN_ENV / "catalog.yml"

@@ -596,3 +595,57 @@ def test_runtime_params_not_propogate_non_parameters_config(self, tmp_path):
        assert key not in credentials
        assert key not in logging
        assert key not in spark

    def test_ignore_hidden_keys(self, tmp_path):
        """Check that the config key starting with `_` are ignored and also
        don't cause a config merge error"""
        _write_yaml(tmp_path / _BASE_ENV / "catalog1.yml", {"k1": "v1", "_k2": "v2"})
        _write_yaml(tmp_path / _BASE_ENV / "catalog2.yml", {"k3": "v3", "_k2": "v4"})

        conf = OmegaConfigLoader(str(tmp_path))
        conf.default_run_env = ""
        catalog = conf["catalog"]
        assert catalog.keys() == {"k1", "k3"}

        _write_yaml(tmp_path / _BASE_ENV / "catalog3.yml", {"k1": "dup", "_k2": "v5"})
        pattern = (
            r"Duplicate keys found in "
            r"(.*catalog1\.yml and .*catalog3\.yml|.*catalog3\.yml and .*catalog1\.yml)"
            r"\: k1"
        )
        with pytest.raises(ValueError, match=pattern):
            conf["catalog"]

    def test_variable_interpolation_in_catalog_with_templates(self, tmp_path):
        base_catalog = tmp_path / _BASE_ENV / "catalog.yml"
        catalog_config = {
            "companies": {
                "type": "${_pandas.type}",
                "filepath": "data/01_raw/companies.csv",
            },
            "_pandas": {"type": "pandas.CSVDataSet"},
        }
        _write_yaml(base_catalog, catalog_config)

        conf = OmegaConfigLoader(str(tmp_path))
        conf.default_run_env = ""
        assert conf["catalog"]["companies"]["type"] == "pandas.CSVDataSet"

    def test_variable_interpolation_in_catalog_with_separate_templates_file(
        self, tmp_path
    ):
        base_catalog = tmp_path / _BASE_ENV / "catalog.yml"
        catalog_config = {
            "companies": {
                "type": "${_pandas.type}",
                "filepath": "data/01_raw/companies.csv",
            }
        }
        tmp_catalog = tmp_path / _BASE_ENV / "catalog_temp.yml"
        template = {"_pandas": {"type": "pandas.CSVDataSet"}}
        _write_yaml(base_catalog, catalog_config)
        _write_yaml(tmp_catalog, template)

        conf = OmegaConfigLoader(str(tmp_path))
        conf.default_run_env = ""
        assert conf["catalog"]["companies"]["type"] == "pandas.CSVDataSet"