Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataCatalog2.0]: Extend KedroDataCatalog with dict interface #4175

Closed
ElenaKhaustova opened this issue Sep 18, 2024 · 12 comments
Closed

[DataCatalog2.0]: Extend KedroDataCatalog with dict interface #4175

ElenaKhaustova opened this issue Sep 18, 2024 · 12 comments
Assignees
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@ElenaKhaustova
Copy link
Contributor

ElenaKhaustova commented Sep 18, 2024

Description

As a continuation of a work on DataCatalog redesign (#3995 (comment)) we suggest simplifying the interface for KedroDataCatalog and make it similar to UserDict

The reasoning: this way we want to make the catalog functioning as a collection of datasets with an interface familiar to users.

Related tickets that will be addressed:

Context

Possible Implementation

Change the KedroDataCatalog API to support __iter__, __getitem__, __setitem__, keys(), items() similar to the dict.

Replace old API such as save(), load(), list(), etc.

The first iteration can be done without breaking changes if keep old interface and reuse newly added methods.

@ElenaKhaustova ElenaKhaustova added the Issue: Feature Request New feature or improvement to existing feature label Sep 18, 2024
@datajoely
Copy link
Contributor

datajoely commented Sep 18, 2024

I'm actually less excited about this than a lot of the other proposed ideas for this redesign.

What about putting it inside a property or something like a .dict_like() method?

@ElenaKhaustova
Copy link
Contributor Author

I'm actually less excited about this than a lot of the other proposed ideas for this redesign.

What about putting it inside a property or something like a .dict_like() method?

That's an attempt to align on the interface changes before introducing new features, so it doesn't cancel the rest of the redesign idea. But since this is a proposal of a possible interface, it will be helpful at this stage if you can expand your thoughts about why it feels not good.

@noklam
Copy link
Contributor

noklam commented Sep 23, 2024

Related idea discussed in the past: #3914 (comment)

@ElenaKhaustova ElenaKhaustova self-assigned this Sep 23, 2024
@astrojuanlu
Copy link
Member

astrojuanlu commented Sep 23, 2024

We discussed this on backlog grooming. To clarify, the user impact of this one is that rather than doing

>>> catalog.list()
>>> catalog.load("my_dataset")
>>> catalog.save("my_dataset", data)
>>> for dataset in catalog.datasets: pass

the user would do

>>> list(catalog)  # __iter__
>>> catalog["my_dataset"]  # __getitem__
>>> catalog["my_dataset"] = data  # __setitem__
>>> for dataset in catalog: pass  # __iter__

While I'm not opposed to have the dict-like interface, I think it's worth discussing whether we should keep the method-based interface. We're not forced to keep it, because it's a new class and we allow ourselves to do breaking changes between minor releases. But if we do include it, I expect migration between 0.19.* and 0.20.* will definitely be easier.

@ElenaKhaustova
Copy link
Contributor Author

ElenaKhaustova commented Oct 8, 2024

With this update, we aimed to rework the existing interface to improve naming, remove repeating methods, clarify the differences between methods working with datasets and data, and suggest functionality to iterate through the catalog and access datasets.

Suggested changes to the interface

  1. We provide dict-like interface to work with Datasets - keys(), values(), items(), __iter__(), __getitem__(), __setitem__(), __len__(), get()
  2. We keep some method-based interface to work with Datasets - from_config(), release(), confirm(), exists() - the rest three will also be reworked in future.
  3. We keep method-based interface to work with data - load(), save().
  4. We added filter() method to filter dataset names registered in the catalog.

Implementation

We intentionally do not inherit from dict or UserDict but rather add the implementation of some methods as we do not expect to maintain all dict-specific functionality like pop(), popitem(), del by key, etc

Examples

Initialization

catalog = KedroDataCatalog(datasets={"test": dataset})

catalog = KedroDataCatalog.from_config(**correct_config)

Properties

config_resolver = catalog.config_resolver
ds_config = catalog.config_resolver.resolve_pattern(ds_name)  # resolve dataset patterns
patterns = catalog.config_resolver.list_patterns() # list al patterns available in the catalog

logger = catalog._logger

Interface to work with datasets

print(catalog)  # __repr__

ds_count = len(catalog)  # __len__

if ds_name in catalog:  # __contains__
    pass

if catalog_first == catalog_second:  # __eq__
    pass

for ds_name in catalog:  # __iter__
    pass

for ds_name in catalog.keys():  # keys()
    pass

for ds in catalog.values():  # values()
    pass

for ds_name, ds in catalog.items():  # items()
    pass

ds = catalog["reviews"]  # __getitem__
ds = catalog.get("reviews", default=default_ds)  # get()

catalog["reviews"] =  ds  #__setitem__
catalog["reviews"] = 123

ds_names = catalog.list(regex_search, regex_flags) - # filter datasets' names

# release(), confirm(), exists() remained for now

Interface to work with data

df = catalog.load("cars")  # load_data
catalog.save("cars", df)  # save_data

@datajoely
Copy link
Contributor

for ds_name, ds in catalog.items(regex_search="reviews"): # items()

Is it keys, values or both that are subject to the regex filter?

@ElenaKhaustova
Copy link
Contributor Author

for ds_name, ds in catalog.items(regex_search="reviews"): # items()

Is it keys, values or both that are subject to the regex filter?

Only keys so far, but subject to change: #3917

@rashidakanchwala
Copy link
Contributor

Thanks, @ElenaKhaustova. This works for us on Kedro-Viz and will help us access the catalog information without needing to resort to private methods

@marrrcin
Copy link
Contributor

Thanks @ElenaKhaustova for the summary. In general, the interface looks nice.
A few questions, concerns:

  • __setitem__ - do you plan to perform type validation at runtime to prevent setting non-dataset objects? Same for keys - enforce str type.
  • what about "old" interfaces such as shallow_copy / add_feed_dict etc?

@ElenaKhaustova
Copy link
Contributor Author

Thanks @ElenaKhaustova for the summary. In general, the interface looks nice. A few questions, concerns:

  • __setitem__ - do you plan to perform type validation at runtime to prevent setting non-dataset objects? Same for keys - enforce str type.
  • what about "old" interfaces such as shallow_copy / add_feed_dict etc?

Thanks @marrrcin,

  • __setitem__ will handle setting both datasets and non-datasets as it was for add_feed_dict
  • add_feed_dict and shallow_copy will be removed

@marrrcin
Copy link
Contributor

What will be an alternative to shallow_copy though? Kedro-Boot uses it for example (https://github.com/takikadiri/kedro-boot/blob/b9cd2d0c4bb9dbe4f65b1c9b37a3af5cf33dd77b/kedro_boot/framework/adapter.py#L58)

@ElenaKhaustova
Copy link
Contributor Author

What will be an alternative to shallow_copy though? Kedro-Boot uses it for example (https://github.com/takikadiri/kedro-boot/blob/b9cd2d0c4bb9dbe4f65b1c9b37a3af5cf33dd77b/kedro_boot/framework/adapter.py#L58)

shallow copy was mainly used to add runtime patterns but an actual copy is not needed to run:

catalog = catalog.shallow_copy(

We added a dedicated method to add runtime patterns to catalog and it will be used instead when introducing breaking changes:

self._config_resolver.add_runtime_patterns(extra_dataset_patterns)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature
Projects
Archived in project
Development

No branches or pull requests

6 participants