Proposal for Adding Contributions Space for Experimental Datasets #517

Closed
merelcht opened this issue Jan 16, 2024 · 10 comments

@merelcht
Member

merelcht commented Jan 16, 2024

Description

Introduce a contrib folder within the Kedro datasets repository to accommodate contributions that are more experimental and may not fully adhere to the usual standards, such as being fully tested. This space will allow for the inclusion of datasets that are in the early stages of development or might not meet the criteria for being part of the core Kedro datasets.

Examples of such datasets are the langchain-based datasets, for which we have an open draft PR: #434

Key Points:

  1. The contrib folder is designated for experimental contributions and should not be held to the same maintenance standards as the core Kedro datasets.
  2. Contributions within the contrib folder are owned by their primary authors, and the core Kedro team is not responsible for their active maintenance.
  3. Datasets within the contrib folder may evolve and improve over time. Successful and well-maintained contributions can graduate from the contrib folder and move to the regular kedro_datasets space.
  4. Establish clear criteria and guidelines for determining when a contribution is considered a regular (non-experimental) contribution versus an experimental one. This will help contributors and maintainers understand the expectations and classification of datasets.

Considerations:

  • Define specific criteria for moving a dataset from the contrib folder to the regular kedro_datasets space.
  • Ensure that the contrib folder is clearly communicated as an experimental space, encouraging users to be cautious when relying on datasets from this folder.
  • Document the rules and guidelines for contributing to both the contrib and regular kedro_datasets spaces in the project's documentation.

Next steps

  • Seek feedback from the Kedro TSC on the proposed structure and guidelines.
  • Finalise and document the criteria for contributions to the contrib and regular kedro_datasets spaces.
  • Implement the contrib folder and update the documentation accordingly.

Note:
This issue serves as a proposal and discussion point. Further details and decisions will be made in collaboration with the Kedro community and maintainers.

Examples

Projects that have a similar contribution space:

@noklam
Contributor

noklam commented Jan 23, 2024

Adding TensorFlow as an example: https://www.tensorflow.org/guide/migrate/migrate_tf2#:~:text=Remove%20old%20tf.contrib%20symbols%20(check%20TF%20Addons%20and%20TF%2DSlim).

Remove old tf.contrib symbols (check TF Addons and TF-Slim).

The route they took is different: TensorFlow 1.0 was growing fast, and tf.contrib was partially merged into TensorFlow 2.0, while the rest became two other sub-modules:

@astrojuanlu
Member

More questions:

  • Does contrib get installed? In other words, can users do from kedro_datasets.contrib import xyz after pip install kedro-datasets?
  • Should we enact a process to "graduate" datasets from contrib?
  • Aside from experimental datasets, and talking only about contributed datasets that are in good shape, should we accept all of them? Otherwise, should we limit the scope of kedro-datasets, and if so, to what extent?
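The first question can be checked empirically once a layout is chosen. A minimal sketch, assuming a hypothetical `kedro_datasets.contrib` subpackage (the name is taken from the question above and does not exist yet), that reports whether a dotted module path resolves in the current environment:

```python
from importlib.util import find_spec


def is_importable(module_path: str) -> bool:
    """Return True if the dotted module path can be located in this environment."""
    try:
        return find_spec(module_path) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. kedro_datasets itself) is missing.
        return False


if __name__ == "__main__":
    # "kedro_datasets.contrib" is a hypothetical name used for illustration only.
    print(is_importable("kedro_datasets.contrib"))
```

Running this after `pip install kedro-datasets` would show whether the subpackage ships with the distribution or needs a separate install.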

@datajoely
Contributor

Fun fact: our 'extras' folder was called 'contrib' in v0.0.1.

Two ideas:

  • Can we make the custom dataset definition simpler and less verbose? Copying fsspec each time feels super redundant.
  • Can we highlight and showcase our community contributions on kedro.org?

@datajoely
Contributor

Also we have some great community contributions in the form of the following:

We don't do a great job on their discoverability, and they aren't really 'blessed' as Kedro-approved.

@datajoely
Contributor

datajoely commented Jan 24, 2024

What about leaning into the community side via the website? We could do something like this where the user just needs to provide a class, a sample file and we do the rest?
https://huggingface.co/docs/datasets/upload_dataset

@merelcht
Member Author

We discussed this issue in technical design and some of the points that came up were the following:

  • The "graduation" and "demotion" process of contributions:

    • What is required for a contribution to be considered ready to go into the regular kedro_datasets space
    • Would we ever demote contributions and when? Is that time bound?
    • Can we gather telemetry to inform ourselves about which dataset contributions should be graduated/demoted?
  • The experimental/contrib model can be implemented in different ways:

    1. A contrib/experimental directory inside the kedro-datasets repo (as proposed in this issue)
    2. Annotate datasets with "experimental"
    3. Create and publish a separate package for experimental contributions
  • We voted on whether we even want to consider implementing a model for experimental contributions and 100% of the attendees voted "yes".

  • As a follow-up action there was a push to get pros/cons for the various contribution models. We did a preliminary vote on the three models above, which resulted in a tie between options 1 and 2. Only one attendee voted for creating a separate package.

@merelcht
Member Author

merelcht commented Jan 26, 2024

Pros/cons of proposed experimental contribution model:

1. A contrib/experimental directory inside the kedro-datasets repo

Pros

One-off

  • No need to create a new repo, CI/CD setup, etc.

Continuous

  • Easy to discover the experimental datasets, because they are part of the existing package.
  • Easy to tell from the import path whether a dataset is experimental or not (e.g. experimental.ExperimentalDataset vs kedro_datasets.pandas.CSVDataset)
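As a rough illustration of the import-path point, a small helper could classify a dataset from its dotted path alone. The segment names `contrib` and `experimental` are assumptions borrowed from this proposal, not an existing Kedro convention:

```python
# Hypothetical package segments that would mark a dataset as experimental.
EXPERIMENTAL_SEGMENTS = {"contrib", "experimental"}


def is_experimental_path(dotted_path: str) -> bool:
    """Return True if any module segment (not the class name) is an experimental marker."""
    module_parts = dotted_path.split(".")[:-1]  # drop the trailing class name
    return any(part in EXPERIMENTAL_SEGMENTS for part in module_parts)
```

A check like this only works for option 1, where the directory name is part of the import path.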

Cons

One-off

  • Need to find a way to separate experimental dependencies from the dependencies required to run the tests etc.

Continuous

  • Could cause some friction with PRs being created and needing to ask authors to move it to the contrib/experimental folder.

2. Annotate datasets with "experimental"

Pros

One-off

  • No need to create a new repo

Continuous

  • Easy to discover the experimental datasets, because they are part of the existing package
  • @noklam "when something is not "experimental" anymore, there will be no breaking change because import will stay the same." + same goes for "downgrading" a dataset.

Cons

One-off

  • Need to find a way to separate experimental dependencies from the dependencies required to run the tests etc.

Continuous

  • Need to adjust CI/CD and test-coverage to skip any datasets annotated with experimental (not super hard)
  • Won't be immediately clear on import that a dataset is "experimental". A warning could be logged on kedro run. And users don't usually read warnings 🥲
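A minimal sketch of what option 2 might look like, assuming a hypothetical `experimental` class decorator and `ExperimentalWarning` category (neither is an existing Kedro API). It tags the class and emits a warning whenever the dataset is instantiated:

```python
import functools
import warnings


class ExperimentalWarning(UserWarning):
    """Hypothetical warning category for experimental datasets."""


def experimental(cls):
    """Mark a dataset class as experimental and warn on instantiation."""
    original_init = cls.__init__

    @functools.wraps(original_init)
    def wrapped_init(self, *args, **kwargs):
        warnings.warn(
            f"{cls.__name__} is experimental and may change or be removed.",
            ExperimentalWarning,
            stacklevel=2,
        )
        original_init(self, *args, **kwargs)

    cls.__init__ = wrapped_init
    cls.__experimental__ = True  # flag that tooling (e.g. coverage config) could inspect
    return cls


@experimental
class DemoDataset:
    """Stand-in class for illustration; not a real Kedro dataset."""

    def __init__(self, filepath: str):
        self.filepath = filepath
```

The import path of `DemoDataset` stays the same whether or not the decorator is applied, which is the "no breaking change" pro noted above. The `__experimental__` flag is one possible hook for skipping such classes in coverage checks.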

3. Create and publish a separate package for experimental contributions

Pros

One-off
Continuous

  • Very clear that these datasets are separate from the core supported kedro-datasets
  • Experimental dependencies won't clutter the kedro-datasets repo.

Cons

One-off

  • Requires a completely new repo, CI/CD, and PyPI setup

Continuous

  • Another repo to maintain and watch for PR contributions.
  • Harder to discover the experimental datasets

@astrojuanlu
Member

I would separate one-off costs (like creating new repos) from ongoing costs when weighing pros and cons. Creating a new repo is trivial (even with linting, CI/CD, PyPI publishing etc). "Need to adjust CI/CD and test-coverage to skip any datasets annotated with experimental", on the other hand, sounds like an ongoing pain. And "Harder to discover the experimental datasets" could be an ongoing problem too.

@noklam
Contributor

noklam commented Feb 2, 2024

I am happy with either 1 or 2. I fear yet another package is going to divert users away from the main one.

For 3., we may need to add another RTD project, which feels confusing. Actually, I am not sure how RTD would work, because it requires the package to be installed, so would the prerequisite be installing all the packages?

For 2., I would add one pro compared to 1: when something is no longer "experimental", there will be no breaking change, because the import will stay the same.

@merelcht
Member Author

merelcht commented Feb 7, 2024

This topic was discussed again in technical design on 7/02/2024. We went through the pros and cons of the various contribution models and then voted again on which model to adopt. The outcome was as follows:

  1. A contrib/experimental directory inside the kedro-datasets repo: 6/14 votes
  2. Annotate datasets with "experimental": 5/14 votes
  3. Create and publish a separate package for experimental contributions: 3/14 votes

Option 1 received the most votes (6 of 14), so that is the model we'll implement. Further decisions will need to be made about the graduation/demotion process of experimental datasets. New issues will be opened to address that.

4 participants