Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Kedro instantiate datasets from kedro_datasetwith higher priority than kedro.extras.datasets #1734

Merged
merged 16 commits into from
Nov 2, 2022

Conversation

noklam
Copy link
Contributor

@noklam noklam commented Jul 28, 2022

Signed-off-by: Nok Chan nok.lam.chan@quantumblack.com

Description

Close #1494

We are going to introduce kedro-datasets as a new Kedro's plugins for datasets. It will be initiated with a higher priority in case of conflicting datasets. The kedro.extras.datasets will be removed in 0.19.0 completely, but for 0.18.x we will keep it compatible thus we need to make sure the kedro-datasets package has a higher priority.

Development notes

Tested manually. Adding an automatic test is a tricky since we don't want to introduce kedro-datasets into the test dependency.

from kedro.io import DataCatalog


config = {
        "catalog": {
            "boats": {"type": "pandas.CSVDataSet", "filepath": "123"},
        },
    }

catalog = DataCatalog.from_config(**config)


print(catalog._data_sets)
# {'boats': <kedro_datasets.pandas.csv_dataset.CSVDataSet at 0x7f96b1493d60>}

More notes on test

I try to avoid touching the sys.modules as it seems to be problematic and creates some unknown side effects in our tests. We did it for a few tests to force reload. I can't find a better way to test a non-existing module, mocking a module is not trivial.

In the end, I go with spying, which basically swaps the method with a custom one (It's a bit annoying that it has to be a class, you can't spy a function), but it works. I would be very pleased if there is a better way to do this.

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

noklam added 2 commits July 28, 2022 14:02
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
@noklam noklam marked this pull request as ready for review July 29, 2022 12:53
@noklam noklam requested a review from idanov as a code owner July 29, 2022 12:53
@noklam noklam removed the request for review from AhdraMeraliQB July 29, 2022 12:58
@noklam noklam marked this pull request as draft July 29, 2022 12:58
@noklam
Copy link
Contributor Author

noklam commented Aug 25, 2022

This was tested manually with local testing, maybe that's enough already. Optionally we could do add a test to make sure the kedro-dataset takes higher priority when it get instantiated, but that need to be done when the kedro-dataset package is published properly.

…o_datasets_higher_priority-#1494

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
@noklam noklam changed the title Make Kedro instantiate datasets from kedro.datasets with higher priority than kedro.extras.datasets Make Kedro instantiate datasets from kedro_datasetwith higher priority than kedro.extras.datasets Oct 25, 2022
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
@noklam noklam requested review from merelcht and jmholzer and removed request for idanov October 26, 2022 15:53
@noklam noklam marked this pull request as ready for review October 26, 2022 15:53
RELEASE.md Outdated
@@ -11,8 +11,9 @@
# Upcoming Release 0.18.4

## Major features and improvements
* The config loader objects now implement `UserDict` and the configuration is accessed through `conf_loader['catalog']`
* You can configure config file patterns through `settings.py` without creating a custom config loader
* Make Kedro instantiate datasets from `kedro.datasets` with higher priority than `kedro.extras.datasets`. `kedro.datasets` is the namespace for the new `kedro-datasets` python package.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be kedro_datasets instead of kedro.datasets right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks!

noklam and others added 4 commits October 27, 2022 16:26
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
…ub.com:kedro-org/kedro into feat/init_kedro_datasets_higher_priority-#1494

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that you managed to create a test for this!! 👏 🏆

Copy link
Contributor

@jmholzer jmholzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

Really clever test, but I wonder if it could be a bit clearer. I left two comments but otherwise happy to approve.

def test_config_import_kedro_datasets(self, sane_config, mocker):
"""Test kedro.extras.datasets default path to the dataset class"""
# Spy _load_obj because kedro_datasets is installed
class DummyMock:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think DummyMock and dummy_mock could have more meaningful names, would DummyLoader work here?

Copy link
Contributor Author

@noklam noklam Nov 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmholzer I manage to remove the dummy object, turns out I can spy on the kedro.io.core module object directly, so the patch is not needed.

def test_config_import_kedro_datasets(self, sane_config, mocker):
"""Test kedro.extras.datasets default path to the dataset class"""
# Spy _load_obj because kedro_datasets is installed
class DummyMock:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be a function instead of a class? That may also make the test clearer. I checked the pytest-mock docs, it's not obvious if it can be in this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good question. This isn't possible, the spy need to work with a class and method. I will add comments on it.

noklam and others added 3 commits November 1, 2022 11:52
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
…ub.com:kedro-org/kedro into feat/init_kedro_datasets_higher_priority-#1494

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
@noklam noklam requested a review from jmholzer November 1, 2022 15:14
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
noklam and others added 3 commits November 2, 2022 10:59
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Copy link
Contributor

@jmholzer jmholzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work 🎉.

I like the test, the comments and structure make things clear.

@noklam noklam merged commit 6343ee4 into main Nov 2, 2022
@noklam noklam deleted the feat/init_kedro_datasets_higher_priority-#1494 branch November 2, 2022 15:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make Kedro instantiate datasets from kedro-datasets with higher priority than kedro.extras.datasets
3 participants