
[DataCatalog]: Spike - Lazy dataset loading #3935

Open
ElenaKhaustova opened this issue Jun 6, 2024 · 8 comments
@ElenaKhaustova
Contributor

ElenaKhaustova commented Jun 6, 2024

Description

Users are required to install all dependencies even for unused datasets, leading to unnecessary complexity and confusion.

We propose implementing a lazy dataset loading feature to allow users to load only the datasets they need without causing pipeline failures.

Relates to #2829

Context

  • "You need to install all dependencies even for unused datasets (in case you want to run pipeline partially or do not load some dataset when standalone catalog usage)."
  • "We have a lot of data entries and different dependencies and when we just want to rerun an anaysis partially, we are frustrated because we need to install all the packages to just load one data source. Why would I need to install excel dependencies to instantiate the DataCatalog to load a csv which does not need Excel?"
  • The error users get now in case of missing dependencies is unclear [DataCatalog]: Error message is confusing when kedro-dataset is not installed #3911
DatasetError: An exception occurred when parsing config for dataset 'companies':
No module named 'pandas'. Please see the documentation on how to install relevant dependencies for kedro_datasets.pandas.CSVDataset:
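
For illustration, a minimal standalone-catalog example (the entry below is hypothetical and mirrors the error above): simply instantiating the catalog from config is enough to trigger the import of kedro_datasets.pandas and pandas, even if the 'companies' dataset is never loaded.

    from kedro.io import DataCatalog

    catalog_config = {
        "companies": {
            "type": "pandas.CSVDataset",  # resolved to kedro_datasets.pandas.CSVDataset
            "filepath": "data/01_raw/companies.csv",
        },
    }

    # This line already imports pandas; with pandas missing it raises the
    # DatasetError shown above, before any data is actually loaded.
    catalog = DataCatalog.from_config(catalog_config)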

Spike task

Investigate how to actually do the lazy loading: 1) lazy loading of the datasets themselves, importing a dataset only at the time its data is loaded; 2) understanding which part of the pipeline needs to be run, and importing only what is required for that run.

@ElenaKhaustova ElenaKhaustova added the Issue: Feature Request New feature or improvement to existing feature label Jun 6, 2024
@merelcht merelcht changed the title [DataCatalog]: Lazy dataset loading [DataCatalog]: Spike - Lazy dataset loading Oct 21, 2024
@ElenaKhaustova
Contributor Author

Kedro Viz workflow

kedro_viz -> integrations -> kedro -> data_loader -> _load_data_helper (https://github.com/kedro-org/kedro-viz/blob/main/package/kedro_viz/integrations/kedro/data_loader.py#L58) - this creates all the required Kedro objects such as Session, Context, DataCatalog and Pipelines.

Specifically for datasets, we need the DataCatalog object.

Thank you, @ravi-kumar-pilla
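
For reference, a simplified sketch (not the actual kedro-viz code) of what a data-loading helper like this does using public Kedro APIs:

    from kedro.framework.project import pipelines
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    def load_kedro_objects(project_path, env=None):
        """Simplified _load_data_helper-style function: builds the session,
        context, catalog and pipelines for a project."""
        bootstrap_project(project_path)
        with KedroSession.create(project_path=project_path, env=env) as session:
            context = session.load_context()
            catalog = context.catalog  # catalog init currently also initializes the datasets
            return catalog, dict(pipelines)  # registered pipelines, imported on access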

@ElenaKhaustova
Contributor Author

Kedro Viz workflow - lite mode

When running Kedro-Viz in lite mode, AbstractDatasetLite - a custom implementation of Kedro's AbstractDataset - is used. It provides an UnavailableDataset instance by overriding from_config of AbstractDataset, which allows initializing the catalog without the required dataset dependencies installed.

https://github.com/kedro-org/kedro-viz/blob/9996c9950f60810cdaeb7c439614597572354a71/package/kedro_viz/integrations/kedro/data_loader.py#L97

https://github.com/kedro-org/kedro-viz/blob/9996c9950f60810cdaeb7c439614597572354a71/package/kedro_viz/integrations/kedro/abstract_dataset_lite.py#L15
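
A simplified re-creation of this idea (a sketch, not the actual kedro-viz implementation): the from_config override falls back to a placeholder dataset whenever the real dataset class or its dependencies cannot be imported.

    from typing import Any

    from kedro.io import AbstractDataset, DatasetError

    class UnavailableDataset(AbstractDataset):
        """Placeholder returned when the real dataset cannot be constructed."""

        def _load(self) -> Any:
            raise DatasetError("Dataset is unavailable: its dependencies are not installed.")

        def _save(self, data: Any) -> None:
            raise DatasetError("Dataset is unavailable: its dependencies are not installed.")

        def _describe(self) -> dict:
            return {"type": "UnavailableDataset"}

    class AbstractDatasetLite(AbstractDataset):
        """Sketch of the lite-mode behaviour: never fail on missing imports."""

        @classmethod
        def from_config(cls, name, config, load_version=None, save_version=None):
            try:
                return AbstractDataset.from_config(name, config, load_version, save_version)
            except DatasetError:
                # Missing dataset class or missing dependency: fall back to a stub
                # so the catalog can still be built for the graph preview.
                return UnavailableDataset()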

@ElenaKhaustova
Contributor Author

Summary on Kedro Viz workflow

  1. Both Kedro Viz modes (default and lite) use lazy data loading;
  2. The default mode requires all the datasets to be installed, as it creates a session with the catalog inside (currently catalog init also initializes the datasets), even though the dataset object is not needed for the pipeline preview - it is only needed when the user clicks on a node and the actual dataset.load() happens;
  3. The lite mode doesn't require all the datasets to be installed; it handles missing-import errors with AbstractDatasetLite;
  4. They are considering making lite mode the default behaviour in the future.

@ElenaKhaustova
Contributor Author

ElenaKhaustova commented Oct 23, 2024

Problem

Based on the context above, we can conclude that data loading is already done in a lazy manner and that the scope of the problem is bounded by lazy dependency resolution:

  • Users struggle to run partial pipelines (several nodes, slices) without installing all datasets required for the full pipeline run
  • Kedro Viz doesn't need datasets installed for the graph preview until the user expands the node.

Solution proposed

  1. Introduce NonInitializedDataset storing the configuration needed to initialize the actual dataset (not inherited from AbstractDataset); a sketch follows this list.
  2. In the catalog constructor, initialize only NonInitializedDataset's instead of the actual datasets and store them in a separate dictionary.
  3. Materialize actual datasets when someone gets the dataset from the catalog (get(), __iter__(), keys(), values(), items()) and add them to the _datasets dictionary.
  4. Catalog and dataset printing should not materialize datasets and should print them as they are, so __repr__ should be implemented for NonInitializedDataset.
  5. In order to avoid cases when pipeline execution breaks because of missing dependencies, we can do a warm-up in the runner, particularly in the AbstractRunner.run() method before calling _run(). For that, we need the filtered pipeline and materialize its datasets, so the warm-up covers only the datasets required for the run:
    self._run(pipeline, catalog, hook_or_null_manager, session_id) # type: ignore[arg-type]
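
A minimal sketch of steps 1-4, assuming the eventual implementation may differ (class and attribute names here are illustrative, not the final API):

    from dataclasses import dataclass, field
    from typing import Any

    from kedro.io import AbstractDataset

    @dataclass
    class NonInitializedDataset:
        """Holds the config needed to build the real dataset without importing it."""

        name: str
        config: dict[str, Any] = field(default_factory=dict)

        def __repr__(self) -> str:
            # Printing the catalog must not materialize the dataset (step 4).
            return f"NonInitializedDataset(name={self.name!r}, type={self.config.get('type')!r})"

    class LazyDataCatalog:  # illustrative stand-in for the proposed DataCatalog changes
        def __init__(self, config: dict[str, dict[str, Any]]):
            # Step 2: only store lazy placeholders at construction time.
            self._lazy_datasets = {
                name: NonInitializedDataset(name, cfg) for name, cfg in config.items()
            }
            self._datasets: dict[str, AbstractDataset] = {}

        def get(self, name: str) -> AbstractDataset:
            # Step 3: materialize on first access and cache the result in _datasets.
            if name not in self._datasets:
                lazy = self._lazy_datasets[name]
                self._datasets[name] = AbstractDataset.from_config(lazy.name, lazy.config)
            return self._datasets[name]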

The proposed solution avoids dataset initializations that are not part of the run and ensures that execution will not fail because of missing imports. On the Viz side, we can materialize a dataset only when the user expands the node, so dataset installation is not required for the graph preview.

The proposed solution will also solve the ThreadRunner problem (#4250), as the warm-up will be done for all runners in the common AbstractRunner.run(). But first, we suggest solving #4250 by moving the existing warm-up to AbstractRunner.run().
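
A sketch of the warm-up from step 5, assuming the lazy catalog exposes a get()-style accessor that materializes a dataset (method names are illustrative):

    # Inside AbstractRunner.run(), before the self._run(...) call shown above:
    def _warm_up(pipeline, catalog):
        """Materialize only the datasets the filtered pipeline actually uses,
        so missing imports surface before any node is executed."""
        for dataset_name in pipeline.datasets():
            if dataset_name in catalog.list():
                catalog.get(dataset_name)  # forces import/initialization of the real dataset

Datasets that appear in the pipeline but not in the catalog (intermediate outputs backed by MemoryDataset) are skipped, so the warm-up touches only datasets that actually need their dependencies imported.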


@ElenaKhaustova
Contributor Author

Draft implementation and issues identified

We implemented a draft for the solution proposed above: #4270

When testing the implementation, we found that there are different aspects of the lazy loading problem related to dependencies:

  1. dependencies used in the pipeline itself - they are loaded by the find_pipelines function;
  2. dependencies required for dataset init - they are loaded when a dataset is initialized and are defined in its implementation;
  3. dependencies required for dataset save/load - they are loaded when calling the load/save methods and are defined in the dataset's requirements (a sketch contrasting cases 2 and 3 follows this list).
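
To illustrate the difference between cases 2 and 3 with a made-up dataset (not a real kedro-datasets class): one dependency is imported when the dataset object is created, another only when load()/save() is called, so a warm-up that merely initializes datasets cannot catch the second kind.

    from typing import Any

    from kedro.io import AbstractDataset

    class ExampleDataset(AbstractDataset):  # hypothetical, for illustration only
        def __init__(self, filepath: str):
            import fsspec  # case 2: needed just to construct the dataset object

            self._filepath = filepath
            self._fs = fsspec.filesystem("file")

        def _load(self) -> Any:
            import pandas as pd  # case 3: only needed when the data is actually read

            return pd.read_csv(self._filepath)

        def _save(self, data: Any) -> None:
            data.to_csv(self._filepath, index=False)

        def _describe(self) -> dict:
            return {"filepath": self._filepath}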

The proposed solution addresses problem 2 above, with several caveats:

  1. We cannot solve problem 1 without changing the implementation of the find_pipelines function (a simplified sketch of its traversal follows this list).
    Under the hood, the find_pipelines() function traverses the src/<package_name>/pipelines/ directory and returns a mapping from pipeline directory name to Pipeline object by:
  • Importing the <package_name>.pipelines.<pipeline_name> module;
  • Calling the create_pipeline() function exposed by the <package_name>.pipelines.<pipeline_name> module;
  • Validating that the constructed object is a Pipeline;
  • By default, if any of these steps fail, find_pipelines() raises an appropriate warning and skips the current pipeline but continues traversal. During development, this enables you to run your project with some pipelines, even if other pipelines are broken.
  2. When we install any of the kedro-datasets packages, we always install all the dataset classes (as they are part of the package) together with the dependencies for the specific dataset requested. To initialize a dataset, we need the dataset class to be available and all the dependencies defined in its implementation. So there are cases where one is able to initialize a dataset without actually installing it, because some other dataset, along with its dependencies, was installed previously. Thus we cannot guarantee that the warm-up solves the missing-dependencies problem: some of the dependencies will only be imported at save/load time, meaning a pipeline can still fail during the run.
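
For caveat 1, a simplified sketch of the traversal described above (a paraphrase of the documented behaviour, not Kedro's actual source): pipeline modules are imported eagerly, so a missing dependency imported at module level causes the whole pipeline to be skipped with a warning.

    import importlib
    import warnings
    from pathlib import Path

    from kedro.pipeline import Pipeline

    def find_pipelines_sketch(package_name: str) -> dict:
        pipelines = {}
        pipelines_dir = Path("src") / package_name / "pipelines"
        for pipeline_dir in sorted(p for p in pipelines_dir.iterdir() if p.is_dir()):
            name = pipeline_dir.name
            try:
                module = importlib.import_module(f"{package_name}.pipelines.{name}")
                pipeline = module.create_pipeline()
                if not isinstance(pipeline, Pipeline):
                    raise TypeError(f"create_pipeline() in {name!r} did not return a Pipeline")
            except Exception as exc:  # e.g. ImportError for a missing dependency
                warnings.warn(f"Skipping pipeline {name!r}: {exc}")
                continue
            pipelines[name] = pipeline
        return pipelines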

Alignment with issues reported by users

Referring back to the user-reported issues, they can all be summarised as: "You need to install all dependencies even for unused datasets (in case you want to run the pipeline partially or do not load some dataset during standalone catalog usage)".

The proposed solution solves this, but problem 1 still remains. So there might be a case where one wants to run the pipeline partially and datasets are loaded lazily, but the dependencies are still required for the pipeline discovery step, before we even instantiate a catalog. In that case the entire pipeline is excluded from the run.

Next steps

We need to define the desired behaviour for problems 1, 2, and 3 and agree on how/whether we want to cover all these cases and whether we're happy with the original solution proposed and the draft implementation.

@ElenaKhaustova
Contributor Author

Next steps
We need to define the desired behaviour for problems 1, 2, and 3 and agree on how/whether we want to cover all these cases and whether we're happy with the original solution proposed and the draft implementation.

After discussing this with @idanov, we decided to address problems 1 and 3 separately and proceed with the suggested solution.
