Proposal for Adding Contributions Space for Experimental Datasets #517

merelcht · 2024-01-16T16:58:00Z

Description

Introduce a contrib folder within the Kedro datasets repository to accommodate contributions that are more experimental and may not fully adhere to the usual standards, such as being fully tested. This space will allow for the inclusion of datasets that are in the early stages of development or might not meet the criteria for being part of the core Kedro datasets.

An example of such datasets is the langchain-based datasets, for which we have an open draft PR: #434

Key Points:

The contrib folder is designated for experimental contributions and should not be held to the same maintenance standards as the core Kedro datasets.
Contributions within the contrib folder are owned by their primary authors, and the core Kedro team is not responsible for their active maintenance.
Datasets within the contrib folder may evolve and improve over time. Successful and well-maintained contributions can graduate from the contrib folder and move to the regular kedro_datasets space.
Establish clear criteria and guidelines for determining when a contribution is considered a regular (non-experimental) contribution versus an experimental one. This will help contributors and maintainers understand the expectations and classification of datasets.

Considerations:

Define specific criteria for moving a dataset from the contrib folder to the regular kedro_datasets space.
Ensure that the contrib folder is clearly communicated as an experimental space, encouraging users to be cautious when relying on datasets from this folder.
Document the rules and guidelines for contributing to both the contrib and regular kedro_datasets spaces in the project's documentation.

Next steps

Seek feedback from the Kedro TSC on the proposed structure and guidelines.
Finalise and document the criteria for contributions to the contrib and regular kedro_datasets spaces.
Implement the contrib folder and update the documentation accordingly.

Note:
This issue serves as a proposal and discussion point. Further details and decisions will be made in collaboration with the Kedro community and maintainers.

Examples

Projects that have a similar contribution space:


noklam · 2024-01-23T23:05:14Z

Adding TensorFlow as an example: https://www.tensorflow.org/guide/migrate/migrate_tf2#:~:text=Remove%20old%20tf.contrib%20symbols%20(check%20TF%20Addons%20and%20TF%2DSlim).

Remove old tf.contrib symbols (check TF Addons and TF-Slim).

The route they took is different: TensorFlow 1.0 was growing fast, tf.contrib was partially merged into TensorFlow 2.0, and the rest became two other sub-modules:

TF Addons (to be deprecated in May 2024) - https://github.com/tensorflow/addons
TF-Slim - https://github.com/google-research/tf-slim, which sees very little development

astrojuanlu · 2024-01-24T13:28:18Z

More questions:

Does contrib get installed? In other words, can users do `from kedro_datasets.contrib import xyz` after `pip install kedro-datasets`?
Should we enact a process to "graduate" datasets from contrib?
Aside from experimental datasets, and talking only about contributed datasets that are in good shape, should we accept all of them? Otherwise, should we limit the scope of kedro-datasets, and if so, to what extent?
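
To make the first question concrete, here is a minimal sketch of how a user could check whether a sub-package is importable in their environment. The helper name `is_importable` is hypothetical, and `kedro_datasets.contrib` does not exist yet; it is only the path suggested in this thread.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if the dotted module name can be found in this environment."""
    try:
        # find_spec imports parent packages, so a missing parent raises
        # ModuleNotFoundError rather than returning None.
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        return False


if __name__ == "__main__":
    # Hypothetical path taken from the question above; the result depends on
    # what is installed and on how the proposal ends up being implemented.
    print(is_importable("kedro_datasets.contrib"))
```

If contrib ships inside the same wheel, this check would pass right after `pip install kedro-datasets`; if it were published separately, it would fail until the extra package is installed.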

datajoely · 2024-01-24T15:06:03Z

Fun fact: our 'extras' folder was called 'contrib' in v0.0.1

Two ideas:

Can we make the custom dataset definition simpler and less verbose? Copying fsspec each time feels super redundant.
Can we highlight and showcase our community contributions on kedro.org?

datajoely · 2024-01-24T15:09:17Z

Also we have some great community contributions in the form of the following:

We don't do a great job of making these discoverable, and they aren't really 'blessed' as Kedro-approved.

datajoely · 2024-01-24T15:56:05Z

What about leaning into the community side via the website? We could do something like this where the user just needs to provide a class, a sample file and we do the rest?
https://huggingface.co/docs/datasets/upload_dataset

merelcht · 2024-01-26T12:20:18Z

We discussed this issue in technical design and some of the points that came up were the following:

The "graduation" and "demotion" process of contributions:
- What is required for a contribution to be considered ready to go into the regular kedro_datasets space
- Would we ever demote contributions and when? Is that time bound?
- Can we gather telemetry to inform ourselves about which dataset contributions should be graduated/demoted?
The experimental/contrib model can be implemented in different ways:
1. A contrib/experimental directory inside the kedro-datasets repo (as proposed in this issue)
2. Annotate datasets with "experimental"
3. Create and publish a separate package for experimental contributions
We voted on whether we even want to consider implementing a model for experimental contributions and 100% of the attendees voted "yes".
As a follow-up action there was a push to get pros/cons for the various contribution models. We did a preliminary vote on the three models above and it resulted in a tie between options 1 and 2. Only one attendee voted for creating a separate package.

merelcht · 2024-01-26T12:25:23Z

Pros/cons of proposed experimental contribution model:

1. A `contrib`/`experimental` directory inside the `kedro-datasets` repo

Pros

One-off

No need to create a new repo, CI/CD setup, etc.

Continuous

Easy to discover the experimental datasets, because they are part of the existing package.
Easy to tell from the import path whether a dataset is experimental or not (e.g. experimental.ExperimentalDataset vs kedro_datasets.pandas.CSVDataset)

Cons

One-off

Need to find a way to separate experimental dependencies from the dependencies required to run the tests etc.

Continuous

Could cause some friction with PRs being created and needing to ask authors to move it to the contrib/experimental folder.
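
As a rough illustration of the "easy to tell from the import path" point for option 1, the sketch below classifies a dataset by its dotted path. The function name and the set of marker segments (`contrib`, `experimental`) are assumptions for illustration only; the final folder name has not been decided in this thread.

```python
# Hypothetical folder names that would mark a module as experimental.
EXPERIMENTAL_SEGMENTS = {"contrib", "experimental"}


def is_experimental_path(dotted_path: str) -> bool:
    """Return True if any module segment of the path marks it as experimental."""
    # Split off the class name; only the module part of the path is inspected.
    *module_parts, _class_name = dotted_path.split(".")
    return any(part in EXPERIMENTAL_SEGMENTS for part in module_parts)


if __name__ == "__main__":
    for path in ("experimental.ExperimentalDataset", "kedro_datasets.pandas.CSVDataset"):
        print(path, is_experimental_path(path))
```

The flip side, as noted for option 2, is that graduating a dataset out of such a folder changes its import path.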

2. Annotate datasets with "experimental"

Pros

One-off

No need to create a new repo

Continuous

Easy to discover the experimental datasets, because they are part of the existing package
@noklam "when something is not "experimental" anymore, there will be no breaking change because import will stay the same." + same goes for "downgrading" a dataset.

Cons

One-off

Need to find a way to separate experimental dependencies from the dependencies required to run the tests etc.

Continuous

Need to adjust CI/CD and test-coverage to skip any datasets annotated with experimental (not super hard)
Won't be immediately clear on import that a dataset is "experimental". A warning could be logged on kedro run, but users don't usually read warnings 🥲
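
For reference, option 2 could look something like the sketch below: a class decorator that flags a dataset and warns when it is instantiated. This is not an existing Kedro API; `experimental`, `ExperimentalWarning`, the `__experimental__` attribute, and `DemoDataset` are all hypothetical names.

```python
import functools
import warnings


class ExperimentalWarning(UserWarning):
    """Hypothetical warning category for experimental datasets."""


def experimental(cls):
    """Mark a class as experimental and warn each time it is instantiated."""
    original_init = cls.__init__

    @functools.wraps(original_init)
    def wrapped_init(self, *args, **kwargs):
        warnings.warn(
            f"{cls.__name__} is experimental and may change or be removed.",
            ExperimentalWarning,
            stacklevel=2,
        )
        original_init(self, *args, **kwargs)

    cls.__init__ = wrapped_init
    # Flag that CI or coverage tooling could inspect to skip the class.
    cls.__experimental__ = True
    return cls


@experimental
class DemoDataset:
    """Stand-in class used only to demonstrate the decorator."""

    def __init__(self, filepath: str):
        self.filepath = filepath
```

The import path stays the same when the decorator is later removed, which is the "no breaking change" pro above. The warning only helps if users actually see it.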

3. Create and publish a separate package for experimental contributions

Pros

One-off
Continuous

Very clear that these datasets are separate from the core supported kedro-datasets
Experimental dependencies won't clutter the kedro-datasets repo.

Cons

One-off

Requires a complete new repo, CI/CD, pypi setup

Continuous

Another repo to maintain and watch for PR contributions.
Harder to discover the experimental datasets

astrojuanlu · 2024-01-30T09:45:33Z

I would separate one-off costs (like creating new repos) from ongoing costs when weighing pros and cons. Creating a new repo is trivial (even with linting, CI/CD, PyPI publishing etc). "Need to adjust CI/CD and test-coverage to skip any datasets annotated with experimental", on the other hand, sounds like an ongoing pain. And "Harder to discover the experimental datasets" could be an ongoing problem too.

noklam · 2024-02-02T13:31:39Z

I am happy with either 1 or 2. I fear yet another package is going to divert users away from the main one.

For 3, we may need to add another RTD project, which feels confusing. Actually, I am not sure how RTD would work, because it requires the package to be installed, so would the prerequisite be installing all the packages?

For 2, I will add one pro compared to 1: when something is not "experimental" anymore, there will be no breaking change because the import will stay the same.

merelcht · 2024-02-07T17:29:01Z

This topic was discussed again in technical design on 7/02/2024. The pros and cons of the various contribution models were discussed and then we voted again on the model. The outcome was as follows:

A contrib/experimental directory inside the kedro-datasets repo: 6/14 votes
Annotate datasets with "experimental": 5/14 votes
Create and publish a separate package for experimental contributions: 3/14 votes

Option 1 received the most votes (6 of 14), so that is the model we'll implement. Further decisions will need to be made about the graduation/demotion process of experimental datasets. New issues will be opened to address that.

merelcht self-assigned this Jan 16, 2024

merelcht added datasets Stage: Technical Design 🎨 labels Jan 16, 2024

merelcht added this to the Improvements to datasets as a whole milestone Jan 16, 2024

noklam mentioned this issue Jan 29, 2024

How to maintain external datasets contributions #535

Open

merelcht closed this as completed Feb 7, 2024

merelcht modified the milestones: Improvements to datasets as a whole, Experimental dataset contribution model Feb 7, 2024
