How to maintain external datasets contributions #535

noklam · 2023-06-23T14:16:41Z

Description

Why is this raised?

With more incoming dataset PRs, it becomes harder to maintain all the datasets. Particularly for the exotic datasets, we don't have the setup for every possible environment (e.g. Snowflake/Databricks). This creates a challenge for maintaining all the datasets since we don't have the re

This also leads to the question: "Does every dataset belong in kedro-datasets?"

The answer is no, since a few popular datasets are already maintained separately, in kedro-mlflow for example.

Possible Action

CSVDataSet is more robust than, say, ManagedTableDataSet. Can we signal this better through our docs? We did something similar for the Deployment docs.

More Discussion

How do we want to maintain the contributions? How do we draw the line between something that should be a separate plugin and something that goes into kedro-datasets? Cc @astrojuanlu

Idea raised during retro:

Datasets could be maintained as separate plugins, e.g. kedro-mlflow has its own datasets.

noklam · 2024-01-29T13:59:10Z

Link: #517 (comment)

Maybe we can close this ticket?

astrojuanlu · 2024-01-30T09:49:47Z

kedro-org/kedro#517 was a different (although related) discussion. In the middle of it though, I raised the question "Should we accept every dataset that is in good shape in kedro-datasets?" and the answer seemed to be yes. However, this was at the very end of our meeting and there was not nearly enough time to weigh the pros and cons.

So I'd say we keep it open.

Having said that though, there are a number of pull requests open already, and I think it's unfair that we hold them up because of a lack of firm consensus on this topic.

astrojuanlu · 2024-01-30T09:53:07Z

For example, consider discoverability. The current monorepo approach already hinders the visibility of the individual plugins, as described in #401.

For datasets inside kedro-datasets, the effect is even larger. On top of that, the actual business logic of custom datasets is hidden behind private methods that don't get documented by default; see kedro-org/kedro#1936 (comment).

astrojuanlu · 2024-01-30T09:53:29Z

(And this is aside from the maintenance issues @noklam mentioned)

astrojuanlu · 2024-02-01T12:14:22Z

I think we are underestimating the maintenance burden of the current approach.

Lots of people in the team have trouble building the docs locally, because one has to install all the dependencies of all datasets for that to work. @rashidakanchwala can attest - she struggled a lot, and now I'm unable to do it myself (troubleshooting some weird conflicts raised by pip).

On the other hand, there have been users in the past who were confused and couldn't even run the test suite. It happened for #360 and also for #435.

I think it's time to seriously consider breaking kedro-datasets apart.

datajoely · 2024-02-02T11:10:29Z

I do keep wondering if we could have a low-code dataset contribution workflow on the website that would allow us to accept contributions and manage the test suite for users.

astrojuanlu · 2024-03-06T16:39:26Z

A user literally ran out of disk space when trying to install kedro-datasets test dependencies while troubleshooting a pip conflict: #597 (comment)

lrcouto · 2024-04-12T21:57:26Z

> A user literally ran out of disk space when trying to install kedro-datasets test dependencies while troubleshooting a pip conflict #597 (comment)

This happened to me this week while running tests to figure out the issues with the kedro-datasets dependencies 😬

noklam mentioned this issue Jun 27, 2023

Curate plugins by maintenance activity in Kedro's documentation kedro-org/kedro#2291

Closed

astrojuanlu mentioned this issue Feb 1, 2024

Allow for dynamic SQL filtering of datasets through lazy loading kedro-org/kedro#2374

Closed

merelcht transferred this issue from kedro-org/kedro Feb 2, 2024

merelcht added this to the Improvements to datasets as a whole milestone Feb 2, 2024

astrojuanlu mentioned this issue Mar 5, 2024

feat(datasets): add dataset to load/save with Ibis #560

Merged

astrojuanlu added the datasets label Mar 6, 2024

astrojuanlu mentioned this issue Mar 12, 2024

Decide on definitions of regular and experimental contributions #583

Closed

astrojuanlu mentioned this issue Apr 11, 2024

Users cannot install specific components of Kedro separately kedro-org/kedro#3659

Closed
