Check whether `DataCatalog` changes are reflected in the session #2728

astrojuanlu · 2023-06-26T18:11:35Z

Today I wanted to apply an "advanced" hook use case: storing the catalog and then injecting datasets on the fly. However, it doesn't work:

import logging

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog, DataSetError

logger = logging.getLogger(__name__)


class MissingDatasetHooks:
    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog):
        # Keep a reference to the catalog for use in later hooks
        self._catalog = catalog

    @hook_impl
    def before_dataset_loaded(self, dataset_name):
        dataset = self._catalog._get_dataset(dataset_name)
        try:
            dataset.load()
        except DataSetError:
            # Inject a placeholder on the fly (MissingDataSet is user-defined, not shown)
            logger.warning("Attempted to load dataset %s which doesn't exist yet, injecting it", dataset_name)
            missing_dataset = MissingDataSet(dataset=dataset)
            self._catalog.add(dataset_name, missing_dataset, replace=True)

The self._catalog that gets saved receives the .add(..., replace=True) correctly, but the catalog.load call that comes immediately after the before_dataset_loaded hook still sees the old dataset:

kedro/kedro/runner/runner.py

Lines 403 to 404 in fd8162d

    hook_manager.hook.before_dataset_loaded(dataset_name=name, node=node)
    inputs[name] = catalog.load(name)

Using after_context_created gets the same result.

Context: I was trying to give a workaround for https://stackoverflow.com/q/76557758/554319.

Is this behavior expected?

Originally posted by @astrojuanlu in #2690 (comment)


gitgud5000 · 2023-07-17T04:38:43Z

I've been trying to achieve this as well. It would be a great feature to add.

astrojuanlu · 2023-07-17T13:43:53Z

@gitgud5000 Can you explain in a bit more detail what you are trying to achieve and why? We are trying to come up with use cases for this change.

gitgud5000 · 2023-07-23T22:29:54Z

@astrojuanlu I'm trying to pass additional save_args based on the output of the node that creates a pandas.SQLTableDataSet as output.

My plan was to modify the catalog in a before_dataset_saved hook.

Similar to this: #910
This is also related: #898

gitgud5000 · 2023-07-30T00:02:43Z

What I ended up doing was implementing a subclass of pandas.SQLTableDataSet

noklam · 2023-07-31T14:07:01Z

I don't think mutating DataCatalog is a good thing to do. I believe the hook system was designed precisely to avoid this (correct me if I am wrong).

I am unsure whether this example is correct, though: dataset.load() just loads the dataset but nothing is returned, so how would it work? We may need an example to play with to investigate this.

There are, however, cases where this is very useful. For example, if you are developing in a notebook environment, it is very convenient to be able to inject or change data to test your pipeline without re-running the whole pipeline.

astrojuanlu · 2023-07-31T14:31:45Z

> I don't think mutating DataCatalog is a good thing to do.

That's fair enough, but not the impression that I got when I saw a DataCatalog.add method existed 🤔 How could we reduce confusion here?

noklam · 2023-07-31T16:59:13Z

> That's fair enough, but not the impression that I got when I saw a DataCatalog.add method existed 🤔 How could we reduce confusion here?

I think we need to clarify what's not working here. DataCatalog.add should work in standalone mode? It doesn't work here because the catalog is not stored in KedroContext but rather hot-reloaded every time it is accessed.

Removing KedroContext params and catalog hot-reload #1460
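The hot-reload behaviour described above can be sketched in plain Python. This is only an illustration: ToyContext is a hypothetical stand-in, not Kedro's KedroContext, a plain dict plays the role of the catalog, and the one assumption taken from the comment is that the catalog is rebuilt on every access instead of being stored.

```python
class ToyContext:
    """Hypothetical stand-in for a context; not Kedro's KedroContext."""

    def __init__(self, config):
        self._config = config

    @property
    def catalog(self):
        # Rebuilt from configuration on every access ("hot reload"),
        # so no single catalog object is kept on the context.
        return dict(self._config)


context = ToyContext({"companies": "dataset from catalog.yml"})

first = context.catalog
first["companies"] = "dataset injected at runtime"  # mutate one instance

second = context.catalog  # a fresh catalog; the injection is not in it
same_object = first is second

print(same_object)
print(second["companies"])
```

Under that assumption, anything added to one catalog instance is invisible to whoever asks the context for the catalog again.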

I think there are two conflicting features here and we need to think about this carefully.

- Hot-reload feature, which is useful in notebooks because it saves you from repeatedly creating a new catalog while you are editing catalog.yml on the side. This became less important with %reload_kedro, hence Removing KedroContext params and catalog hot-reload #1460. However, if we are now steering towards using components separately, we may need to rethink what should be done here.
  - Make it easier to use the Config Loader #2819
- Immutable KedroContext: Revisit: Make KedroContext a dataclass and add config_loader as a property #1459, which I reopened recently.
  - This was merged already, but as we are approaching 0.19, I want to revisit whether it's a good idea to freeze KedroContext. We should also consider how easy it is for plugin developers to use KedroContext. In the past we were more negative about storing context and state in hooks for use by subsequent hooks. However, in Document more advanced hook use cases #2690, we may want to document this. If this is not a bad idea, do we need to revisit the decision we made?

In any case, we should list out why we want to make this a singleton. I don't have the full context on why it was designed that way; it may be good to dig out the old issues and PRs, but I don't have them at hand right now.

astrojuanlu · 2023-11-15T07:36:04Z

Another user was confused by this but found a workaround that worked for their use case: using runner.run(..., catalog=new_catalog) https://linen-slack.kedro.org/t/16062852/hi-all-how-can-i-update-replace-catalog-entries-from-an-exis#e163f246-de25-405d-9462-7fbc757bf927

noklam · 2024-03-19T11:46:01Z

Maybe the action item for this ticket is:

Create documentation to explain DataCatalog immutability

Is there a good reason Kedro should start supporting this?

astrojuanlu · 2024-11-04T22:59:41Z

@ElenaKhaustova to check again if this happens with DataCatalog and/or with KedroDataCatalog, should be quick

ElenaKhaustova · 2024-11-05T11:25:29Z

I double-checked: this behaviour is only relevant for DataCatalog. For KedroDataCatalog, modifications of the catalog object are reflected in the subsequent hooks.

The difference is explained by the removal of the shallow_copy() call for KedroDataCatalog and by the execution order:

First, we create the catalog in the session:

kedro/kedro/framework/session/session.py

Line 382 in a1fae50

    catalog = context._get_catalog(

Then the after_catalog_created hook is called, where we save the catalog object for further use:

kedro/kedro/framework/context/context.py

Line 244 in a1fae50

    self._hook_manager.hook.after_catalog_created(

Then AbstractRunner.run is called, where we make a shallow copy of the catalog (for DataCatalog):

kedro/kedro/runner/runner.py

Line 93 in a1fae50

    catalog = catalog.shallow_copy(

Then AbstractRunner._run is called, depending on the runner that was set:

kedro/kedro/runner/runner.py

Line 107 in a1fae50

    self._run(pipeline, catalog, hook_or_null_manager, session_id) # type: ignore[arg-type]

Then the before_dataset_loaded hook is called, so we modify the catalog that we saved but not the catalog used in the run:

kedro/kedro/runner/task.py

Line 151 in a1fae50

    hook_manager.hook.before_dataset_loaded(dataset_name=name, node=node)

Note: For DataCatalog, we used the shallow_copy method to add runtime patterns to the catalog before the run. KedroDataCatalog now has a dedicated method for adding just the patterns, so the shallow copy is no longer made.

kedro/kedro/io/kedro_data_catalog.py

Line 577 in a1fae50

def shallow_copy(
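The effect of the execution order above can be reproduced with a small plain-Python sketch. ToyCatalog is a hypothetical stand-in, not Kedro's DataCatalog, and the strings stand in for dataset objects. The sketch assumes only that the shallow copy is a new catalog object with its own name-to-dataset mapping, so entries added to the saved catalog after the copy do not reach the copy.

```python
class ToyCatalog:
    """Hypothetical stand-in for a catalog; not Kedro's DataCatalog."""

    def __init__(self, datasets):
        # Each catalog keeps its own mapping of name -> dataset.
        self._datasets = dict(datasets)

    def add(self, name, dataset, replace=False):
        if name in self._datasets and not replace:
            raise ValueError(f"Dataset {name!r} is already registered")
        self._datasets[name] = dataset

    def load(self, name):
        return self._datasets[name]

    def shallow_copy(self):
        # New catalog object; the dataset objects themselves are shared.
        return ToyCatalog(self._datasets)


# after_catalog_created: the hook keeps a reference to this object.
saved_catalog = ToyCatalog({"companies": "original dataset"})

# The runner makes a shallow copy before running the pipeline.
run_catalog = saved_catalog.shallow_copy()

# before_dataset_loaded: the hook mutates the catalog it saved earlier...
saved_catalog.add("companies", "injected dataset", replace=True)

# ...but the runner loads from its own copy, which still has the old entry.
hook_view = saved_catalog.load("companies")
runner_view = run_catalog.load("companies")
print(hook_view)
print(runner_view)
```

If the copy step is skipped and the runner uses the same object that the hook saved, both views agree, which matches the behaviour reported for KedroDataCatalog.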

@astrojuanlu, @merelcht, @ankatiyar, @noklam, @lrcouto, @DimedS, based on the above, I suggest closing the ticket. It works as expected for the new catalog, which will soon replace the old one.

astrojuanlu · 2024-11-05T15:19:43Z

Thanks @ElenaKhaustova ! Closing

astrojuanlu added this to Kedro Framework Jun 26, 2023

merelcht added the Community Issue/PR opened by the open-source community label Jul 17, 2023

astrojuanlu changed the title ~~Make DataCatalog a context-wide singleton?~~ DataCatalog can be mutated but changes are not reflected in the session Nov 15, 2023

astrojuanlu mentioned this issue Nov 29, 2023

Add docs on difference between OmegaConf and OmegaConfigLoader #3352

Merged

7 tasks

merelcht removed the Community Issue/PR opened by the open-source community label May 24, 2024

merelcht added this to the Redesign the API for IO (catalog) milestone May 24, 2024

ElenaKhaustova mentioned this issue Jun 5, 2024

[DataCatalog]: Provide public methods to modify catalog #3930

Open

astrojuanlu mentioned this issue Jun 6, 2024

[DataCatalog]: add_feed_dict() performance bottleneck #3912

Closed

astrojuanlu mentioned this issue Sep 3, 2024

Design DataCatalog2.0 #3995

Open

3 tasks

astrojuanlu changed the title ~~DataCatalog can be mutated but changes are not reflected in the session~~ Check whether DataCatalog changes are reflected in the session Nov 4, 2024

ElenaKhaustova self-assigned this Nov 5, 2024

ElenaKhaustova moved this to To Do in Kedro Framework Nov 5, 2024

astrojuanlu closed this as completed Nov 5, 2024

github-project-automation bot moved this from In Review to Done in Kedro Framework Nov 5, 2024
