Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicated keys in parameters or catalog should raise an error or a warning #825

Closed
Galileo-Galilei opened this issue Jul 8, 2021 · 4 comments
Labels
Issue: Bug Report 🐞 Bug that needs to be fixed

Comments

@Galileo-Galilei
Copy link
Member

Description

When a configuration file (e.g. parameters.yml) has duplicated entries within the same environment (e.g base/), Kedro keeps only the last entry it encounters without raising any warning.

Context

I was struggling to debug a big kedro project written by someone else, with several big parameters files (the projects had hundreds of parameters). Regardless of whether this fits or not into "best practices", I noticed the following (weird, and likely incorrect) behaviour of kedro: some parameters had duplicated entries (mainly because the number was so high and the file was poorly organized, so these duplicated entries were a mistake of the original developer), and only the last entry was loaded.

I had a hard time debugging this, because when I was changing some parameters in the config file, there was no change when running the pipeline (and no warning at all). My error was that I was changing the first key, but a duplicated key elsewhere in the file (or even in another parameters_second_file.yml) was overriding it.

Note that will happen with all files loaded with the ConfigLoader, and I faced the same issue with credentials : the project had two entries with the same credentials name, kedro loaded the 2nd one which was unfortunately incorrect, and I struggled a lot to figure out why I could not connect to the database because Kedro kept complaining about wrong credentials, but the file I was looking at had the right ones.

Steps to Reproduce

  1. Create a conda environment
conda create -n bug configloader_dupkeys python=3.7
conda activate configloader_dupkeys
pip install kedro==0.17.4
  1. Create a kedro project
kedro new

Enter "configloader dupkeys" and type enter 3 times.

  1. In conf/base/parameters, create a duplicated entry:
# conf/base/parameters

dupkey: 1
dupkey: 2
  1. In a script (or a notebook, or in cli):
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

PROJECT_PATH=r"/path/to/project/configloader-dupkeys"
bootstrap_project(PROJECT_PATH)
session=KedroSession.create(project_path=PROJECT_PATH)

catalog=session.load_context().catalog

catalog.list()
print(catalog.load("params:dupkey")) # -> return 2 without warning

Expected Result

When the ConfigLoader loads a file withith duplicated keys in the same environment, it should either:

  • raise an error (preferred, since we know the first one won't be used)
  • raise a warning

Actual Result

The first key is ignored without any information. This is not harmful in the toy example above, but becomes a debugging nightmare in projects with a lot of config files and a lot of keys in these config files.

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): tried with 0.16.5 and 0.17.4
  • Python version used (python -V): 3.6.8
  • Operating system and version: tried on Windows 10 and CentOS Linux 7 (Core)
@Galileo-Galilei Galileo-Galilei added the Issue: Bug Report 🐞 Bug that needs to be fixed label Jul 8, 2021
@datajoely
Copy link
Contributor

Hi @Galileo-Galilei thanks for raising this issue - it's a bit of a funny one and I agree something that it would be a pain to debug.

Currently YAML parsing is delegated to the anyconfig library. We actually check if duplicate keys exist across multiple files, but the contents of a single file is resolved by anyconfig as is.

https://github.com/quantumblacklabs/kedro/blob/6a6e33dd4e99b12a3cffa04b2778301ca3b92ec6/kedro/config/config.py#L156

Looking at the YAML spec there has been some evolution on the topic between 1.1 and 1.2 and the semantics of 'should' versus 'must' be unique. In PyCharm the syntax highlighter raises this as an issue, but it will still be marked as valid on this online YAML validator.

We'd welcome a pull request improving the UX of this - we use dynaconf elsewhere in Kedro and that potentially has a better mechanism for dealing with this situation.

@Galileo-Galilei
Copy link
Member Author

Thank you for the quick reply @datajoely. I quickly browse the web and surprisingly it seems that anyconfig cannot handle this use case. The reference you provide shows that it should be the right way to do this though. I may open an issue in anyconfig's repo.

That said, I don't have time right now to dive into dynaconf and check if it would be a better fit, and it is likely that modifying the internals of a core component would bring me into unexpected troubles so i won't open a PR for now.

Should we keep the issue opened for reference or do you prefer I close it since it will likely not be fixed?

@datajoely
Copy link
Contributor

I think it's something that we will likely not fix on Kedro core side - so I think it's best to close and perhaps raise an issue on the anyconfig side. This is perhaps something that a custom CI step could check for you?

@Galileo-Galilei
Copy link
Member Author

I understand you will not support any specific trick in the core library to fix it soon, but it may be something to have in mind when you will be working on configuration refactoring as described here because it will affect some of the features described in this thread (like enabling a non destructive merge of parameters to supercharge nested structure in the parameters.yml file), don't you think so?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Bug Report 🐞 Bug that needs to be fixed
Projects
None yet
Development

No branches or pull requests

2 participants