Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataCatalog]: Lazy dataset loading #4270

Open
wants to merge 45 commits into
base: main
Choose a base branch
from

Conversation

ElenaKhaustova
Copy link
Contributor

@ElenaKhaustova ElenaKhaustova commented Oct 30, 2024

Description

Implementation of #3935 (comment)

The PR is done on top of #4262

Development notes

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
@ElenaKhaustova ElenaKhaustova changed the title [DataCatalog]: Lazy loading draft solution [DataCatalog]: Lazy dataset loading Nov 1, 2024
Copy link
Contributor

@noklam noklam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR itself looks clean to me and I don't have more to add.

I am still not completely clear what problem this is solving. As I understand there are always 2 conflicting requirements:

  1. Eagerly failure is good
  2. Lazy loading (load as little dependencies as possible).

At which point datasets are getting materialized? Is it at Runner?

@ElenaKhaustova
Copy link
Contributor Author

The PR itself looks clean to me and I don't have more to add.

I am still not completely clear what problem this is solving. As I understand there are always 2 conflicting requirements:

  1. Eagerly failure is good
  2. Lazy loading (load as little dependencies as possible).

At which point datasets are getting materialized? Is it at Runner?

Just in case it was missed:

Long story short: Now, we initialize the datasets before the pipeline run (or anytime we get the dataset from the catalog) and only require the datasets used in the run; before, we initialized all the datasets in the catalog when creating the catalog object, no matter whether we used them in the run or not.

Copy link
Contributor

@noklam noklam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining, I think the solution make sense to only materialise the needed datasets.

ds_config = self._config_resolver.resolve_pattern(key)
if ds_config:
self._add_from_config(key, ds_config)

non_initialized_dataset = self._lazy_datasets.pop(key, None)
Copy link
Member

@Galileo-Galilei Galileo-Galilei Nov 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I correct that we also want this PR to pave the way for #3932 by storing the configuration in the catalog for lazy dataset? If so, should we really remove the dataset from _lazy_datasets once it is materialized? If we do this it prevents further access to the original raw configuration (this can still be reassessed further, and I am not sure we want to tackle both problems in this PR, but it's worth thinking it it will fits well in the broader roadmap)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We think that proper solution for #3932 will require storing configuration at the dataset level and dataset.to_yaml(), dataset.from_yaml() for the following reasons:

  • We allow to initializing catalog from datasets or their configurations. If it's the first case currently we are not able to retrieve configuration from the dataset object and thus cannot serialize it.
  • And the same problem appears if someone adds dataset to the catalog which is probably the main use case for catalog dump and load.

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very clean solution! 🌟 The explanations also make perfect sense to me, thanks for adding that as context.

Copy link
Contributor

@rashidakanchwala rashidakanchwala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! I’m hopeful that lazy loading datasets will also improve the initial load time in Kedro-Viz. Looking forward to seeing this!

RELEASE.md Outdated Show resolved Hide resolved
self, ds_name: str, dataset: AbstractDataset, replace: bool = False
self,
ds_name: str,
dataset: AbstractDataset | _LazyDataset,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure we'd like to expose _LazyDataset to the users? If yes, then maybe we need to make it a proper LazyDataset that can be instantiated even outside of the catalog. Otherwise, I'd prefer for us to hide it from this method signature.

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants