Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Package kedro.extras.datasets into its own kedro-datasets package #1457

Closed
18 tasks done
idanov opened this issue Apr 20, 2022 · 6 comments
Closed
18 tasks done

Package kedro.extras.datasets into its own kedro-datasets package #1457

idanov opened this issue Apr 20, 2022 · 6 comments

Comments

@idanov
Copy link
Member

idanov commented Apr 20, 2022

Problem

Currently Kedro as a framework and kedro.extras.datasets are one package and this has caused a number of issues:

  • Kedro Framework was held back adding support for Python 3.9/3.10 by breaking changes in the datasets dependencies
  • Adding support for newer versions of the datasets was held back by the need for more conservative versioning of the framework
  • The difference of breaking change velocity between the Framework (more stable) and the datasets (less stable) meant a painful accounting of what was new in 0.18 and what was merely a change in a dataset

Staged approach

Ripping kedro.extras.datasets from the framework is itself a breaking change, but we can phase this out in two stages:

  1. Make a separate package and alias everything in kedro.extras.datasets to that new package, including installing kedro[xxx]
  2. When we ship 0.19, we can officially remove kedro.extras.datasets and now both kedro and kedro-datasets will be installed separately

Tasks for first stage (0.18.X)

Tasks for second stage (preparing release 0.19)

@antonymilne
Copy link
Contributor

antonymilne commented Apr 21, 2022

This is great! 🚀 Just to wanted another motivation for doing this, and also some more tasks that will be needed. Datasets and their dependencies are the number 1 cause of random CI failures which have slowed us down quite a lot. I presume that your intention was that all the dataset tests should move to kedro-datasets also? If so then I'd add to "tasks for first stage":

  • Move all dataset tests to kedro-datasets
  • Try and figure out a nicer way of installing all the dependencies listed in the setup.py file rather than needing to maintain a separate test_requirements.txt file there
  • Clean up kedro's test_requirements.txt to remove requirements that aren't needed any more
  • Clean up kedro's CI config and Makefile. e.g. no need to conda install pytables any more or have the separate make test-no-spark. There's quite a few dataset-specific workarounds that we could remove

Also I think we should have:

  • Add deprecation warnings to all the kedro.extras.datasets (or the AbstractDataset classes? Not sure whether they're also moving) to explain they're moving

One small question: were you thinking of kedro-datasets living in kedro-plugins or a whole new repo? It's not a plugin in the strict sense of the word since it doesn't use entrypoints, but maybe it's easier to maintain if it's in the monorepo.

@idanov idanov moved this to Todo in Kedro Framework Apr 21, 2022
@merelcht merelcht added this to the kedro-datasets milestone Apr 25, 2022
@merelcht
Copy link
Member

merelcht commented Apr 26, 2022

One small question: were you thinking of kedro-datasets living in kedro-plugins or a whole new repo? It's not a plugin in the strict sense of the word since it doesn't use entrypoints, but maybe it's easier to maintain if it's in the monorepo.

I like this suggestion! True kedro-datasets isn't technically a plugin, but if we add it to kedro-plugins we can make use of the already existing CI setup and add in automatic push to pypi for all plugins in one go.

Some more things we should do as part of never work in kedro.extras.datasets is make sure this is clear to open source contributors by adding it to the contribution guide, and at the top of every dataset class and perhaps also in the PR template.

@merelcht merelcht removed the status in Kedro Framework May 24, 2022
@merelcht merelcht changed the title Package kedro.extras.datasets into its own kedro-datasets package [Parent] Package kedro.extras.datasets into its own kedro-datasets package May 24, 2022
@yetudada yetudada added this to Roadmap Jun 20, 2022
@yetudada yetudada moved this to Delivery in Roadmap Jun 20, 2022
@yetudada yetudada changed the title [Parent] Package kedro.extras.datasets into its own kedro-datasets package Package kedro.extras.datasets into its own kedro-datasets package Jun 23, 2022
@noklam
Copy link
Contributor

noklam commented Jun 27, 2022

Follow up on some unresolved issues discussed in kedro-org/kedro-plugins#38

  1. How will docs work for kedro-datasets? We will need to generate documentation (at least the dataset part) from kedro-datasets repo.
  2. Deepyaman has an idea with namespace package (i.e. we keep the namespace kedro.extra.datasets in the kedro-datasets repo.

@idanov
Copy link
Member Author

idanov commented Jan 3, 2023

Here's just a wild idea, I've played a bit with Python's meta_path and came up with an interesting PoC about aliasing imports globally as shown here:

import sys
import importlib
from importlib.abc import MetaPathFinder

class Custom(MetaPathFinder):
    def find_spec(fullname, path, target=None):
        if not 'kedro.datasets' in fullname:
            return None
       return importlib.util.find_spec('kedro_datasets')

sys.meta_path.append(Custom)

import kedro.datasets as kd

The code here allows you to import anything from kedro_datasets as if it were kedro.datasets. I did not explore it enough for potential problems, but maybe we can investigate whether we can somehow create the alias kedro.datasets work out of the box for any Kedro project? WDYT?

The modification of the meta_path can happen in configure_project so it works in all entrypoints. I am not sure how it will work with language servers or IDE support though...

@deepyaman
Copy link
Member

deepyaman commented Jan 3, 2023

@idanov That's very interesting, and I didn't know about this functionality. That being said, if we're going to investigate this, I think we should prioritize spending some time to make it work using namespace packages, because that would ideally be the "standard" solution to the problem (that doesn't involve path hacking). I just saw some of the discussion around why namespace packages may not work in #1758, and even though I don't fully understand all of the reasons having not spent time on it (except for some of the challenges with pip install -e), this could be a good alternative.

@yetudada yetudada moved this from Delivery - Now ⌛ to Shipped 🚀 in Roadmap Feb 23, 2023
@merelcht
Copy link
Member

Closing this because all sub-tasks are done and Kedro 0.19.0 and kedro-datasets 2.0.0 will be released soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Shipped 🚀
Development

No branches or pull requests

6 participants