Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check whether DataCatalog changes are reflected in the session #2728

Closed
astrojuanlu opened this issue Jun 26, 2023 · 12 comments
Closed

Check whether DataCatalog changes are reflected in the session #2728

astrojuanlu opened this issue Jun 26, 2023 · 12 comments
Assignees

Comments

@astrojuanlu
Copy link
Member

Today I wanted to apply an "advanced" hook use case: storing the catalog and then injecting datasets on the fly. However, it doesn't work:

class MissingDatasetHooks:
    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog):
        self._catalog = catalog

    @hook_impl
    def before_dataset_loaded(self, dataset_name):
        dataset = self._catalog._get_dataset(dataset_name)
        try:
            dataset.load()
        except DataSetError:
            # Create EmptyDataset on the fly
            logger.warning("Attempted to load dataset %s which doesn't exist yet, injecting it", dataset_name)
            missing_dataset = MissingDataSet(dataset=dataset)
            self._catalog.add(dataset_name, missing_dataset, replace=True)

the self._catalog that gets saved receives the .add(..., replace=True) correctly, but the catalog.load that comes immediately after the before_dataset_loaded hook still has the old dataset:

hook_manager.hook.before_dataset_loaded(dataset_name=name, node=node)
inputs[name] = catalog.load(name)

Using after_context_created gets the same result.

Context: I was trying to give a workaround for https://stackoverflow.com/q/76557758/554319.

Is this behavior expected?

Originally posted by @astrojuanlu in #2690 (comment)

@gitgud5000
Copy link

I've been trying to achieve this aswell. It would be a great feature to add.

@astrojuanlu
Copy link
Member Author

@gitgud5000 Can you detail a bit more what are you trying to achieve and why? We are trying to come up with use cases for this change.

@merelcht merelcht added the Community Issue/PR opened by the open-source community label Jul 17, 2023
@gitgud5000
Copy link

gitgud5000 commented Jul 23, 2023

@astrojuanlu I'm trying to pass additional save_args based on the output of the node that creates a pandas.SQLTableDataSet as output.

My plan was to modify the catalog in a before_dataset_saved hook.

Similar to this: #910
This is also related: #898

@gitgud5000
Copy link

What I ended up doing was implementing a subclass of pandas.SQLTableDataSet

@noklam
Copy link
Contributor

noklam commented Jul 31, 2023

I don't think mutating DataCatalog is a good thing to do. I believe the hook system was designed in a way to avoid this exactly (correct me if I am wrong).

I am unsure if this example correct though because dataset.load() just load the dataset but not returning anything, how would it work? Maybe need an example to play with to investigate this.

There are however use cases that is very useful.
i.e. If you developing in a notebook environment, it is very convenient if you can inject/change data to test your pipeline without re-running the whole pipeline

@astrojuanlu
Copy link
Member Author

I don't think mutating DataCatalog is a good thing to do.

That's fair enough, but not the impression that I got when I saw a DataCatalog.add method existed 🤔 How could we reduce confusion here?

@noklam
Copy link
Contributor

noklam commented Jul 31, 2023

That's fair enough, but not the impression that I got when I saw a DataCatalog.add method existed 🤔 How could we reduce confusion here?

I think we need to clarify what's not working here. DataCatalog.add should work in a standalone mode? What doesn't work here is because catalog is not stored in KedroContext but rather hot-reload everytime it gets called.

I think there are two conflicting features here and we need to think of it carefully.

In any case, we should list out why we want to make this a singleton. I don't have the full context why it was designed that way, may be good to dig out the old issues and PR, but I don't have it now.

@astrojuanlu
Copy link
Member Author

Another user was confused by this but found a workaround that worked for their use case: using runner.run(..., catalog=new_catalog) https://linen-slack.kedro.org/t/16062852/hi-all-how-can-i-update-replace-catalog-entries-from-an-exis#e163f246-de25-405d-9462-7fbc757bf927

@astrojuanlu astrojuanlu changed the title Make DataCatalog a context-wide singleton? DataCatalog can be mutated but changes are not reflected in the session Nov 15, 2023
@noklam
Copy link
Contributor

noklam commented Mar 19, 2024

Maybe the action item for this ticket is:

  • Create documentation to explain DataCatalog immutability

Is there a good reason Kedro should start supporting this?

@merelcht merelcht removed the Community Issue/PR opened by the open-source community label May 24, 2024
@astrojuanlu astrojuanlu changed the title DataCatalog can be mutated but changes are not reflected in the session Check whether DataCatalog changes are reflected in the session Nov 4, 2024
@astrojuanlu
Copy link
Member Author

@ElenaKhaustova to check again if this happens with DataCatalog and/or with KedroDataCatalog, should be quick

@ElenaKhaustova ElenaKhaustova self-assigned this Nov 5, 2024
@ElenaKhaustova ElenaKhaustova moved this to To Do in Kedro Framework Nov 5, 2024
@ElenaKhaustova
Copy link
Contributor

I double-checked that this behaviour is only relevant for DataCatalog and for KedroDataCatalog, modification of catalog object will be reflected in the further hooks.

This change is explained by the removing shallow_copy() method for KedroDataCatalog and the execution order:

  1. First, we create catalog in the session
    catalog = context._get_catalog(
  2. Then after_catalog_created hook is called where we save catalog object for further use
    self._hook_manager.hook.after_catalog_created(
  3. Then AbstractRunner.run is called where we make a shallow copy of catalog (for DataCatalog)
    catalog = catalog.shallow_copy(
  4. Then AbstractRunner._run is called based on the runner set
    self._run(pipeline, catalog, hook_or_null_manager, session_id) # type: ignore[arg-type]
  5. Then before_dataset_loaded hook is called, so we modify the catalog that we saved but not the catalog used in the run
    hook_manager.hook.before_dataset_loaded(dataset_name=name, node=node)

Note: For DataCatalog, we used the shallow copy method to add runtime patterns to the catalog before the run. Now, we have a dedicated method to add just patterns for KedroDataCatalog, so shallow copy is not done anymore

def shallow_copy(

@astrojuanlu, @merelcht, @ankatiyar, @noklam, @lrcouto, @DimedS, based on the above, I suggest closing the ticket. It works as expected for the new catalog, which will soon replace the old one.

@astrojuanlu
Copy link
Member Author

Thanks @ElenaKhaustova ! Closing

@github-project-automation github-project-automation bot moved this from In Review to Done in Kedro Framework Nov 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

No branches or pull requests

5 participants