
Lazy Loading of Catalog Items #2829

Closed
m-gris opened this issue Jul 22, 2023 · 30 comments · Fixed by kedro-org/kedro-plugins#281
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@m-gris

m-gris commented Jul 22, 2023

Hi everyone,

Currently, running modular pipelines requires loading all datasets before pipeline execution, which, to me, seems far from ideal...

Let's assume a simple project, with 2 modular pipelines, A and B.

If B requires datasets created via API calls or interactions with remote databases, then, when working offline, one cannot do a simple kedro run --pipeline A, since Kedro will fail at loading B's datasets... which are not needed to run A!
A bit of a pity...

A first, impulsive reaction might be to comment out those unneeded datasets in the local conf, but since comments are not evaluated, this would simply 'do nothing'...

Granted, one could simply comment out those datasets in the base config...
But "messing around" with base to handle a temporary / local need seems, to me at least, to deviate from the "software engineering best practices" that Kedro aims to foster.

To avoid this, one would therefore have to create some sort of dummy dataset in the local conf... Totally doable, but not super convenient and definitely not frictionless...

I guess that such "troubles" could be avoided if datasets were loaded lazily, only when needed, or by building the DAG first and then filtering datasets that are not actually needed...

I hope that this suggestion both sounds relevant and doesn't represent much of a technical challenge.

Best Regards
Marc

@m-gris m-gris added the Issue: Feature Request New feature or improvement to existing feature label Jul 22, 2023
@noklam
Contributor

noklam commented Jul 24, 2023

Aleksander Jaworski
[Kedro version: 0.18.6 currently]
Hi, I am working on a sort of 'pipeline monorepo' where I have dozens of pipelines.
I have a question: would some sort of lazy configuration validation be a useful feature for Kedro?
I have 2 reasons for asking:
1. It feels a bit cumbersome that even a simple hello_world.py takes several seconds to run when the configuration is large enough: all the logging and all the data catalog setup happens first, none of which actually ends up being used by hello_world.py.
2. When setting up the project for someone, it is impossible to provide a credentials file with just the required credentials. In Kedro, all of them need to be filled in right now, as everything is validated at once. In a lazy version, only the dependencies that follow from the selected pipeline would need to be evaluated.
Are there any solutions or modifications I could use to improve my approach here? Thanks in advance! :)

I asked the user to comment out any SQLDataSet to see how it impacts performance. It shows that these datasets contribute a significant amount of the time:

~21 secs for kedro catalog list
~6 secs after commenting out the SQL and GBQ datasets

@noklam
Contributor

noklam commented Jul 24, 2023

To scope the discussion better.

@deepyaman deepyaman self-assigned this Jul 25, 2023
@deepyaman
Member

Hey @m-gris! Which dataset were you facing an issue with on #2829? I think I've "fixed" it in kedro-org/kedro-plugins#281 (just for pandas.SQLTableDataSet; I haven't applied it to other instances yet). If it's that one, maybe you can try it out and see if it fixes your issue. 🙂 In the meantime, I'll work on adding it across other datasets.

To test, I modified Spaceflights to make companies a pandas.SQLTableDataSet:

# conf/base/catalog.yml
companies:
  type: pandas.SQLTableDataSet
  credentials: db_credentials
  table_name: shuttles
  load_args:
    schema: dwschema
  save_args:
    schema: dwschema
    if_exists: replace

# conf/local/credentials.yml
db_credentials:
  con: postgresql://scott:tiger@localhost/test

This dataset doesn't exist, so kedro run fails, but kedro run --pipeline data_science still works (if you've previously generated model_input_table). (You still do need to have the dependencies for the dataset installed; e.g. psycopg2).
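
For illustration only, here is a rough sketch of the lazy-connection idea (this is not the actual code in kedro-org/kedro-plugins#281, and the class name is made up): the connection string is kept at construction time, and the SQLAlchemy engine is only created the first time data is actually loaded.

import pandas as pd
from sqlalchemy import create_engine


class LazySQLTableDataset:
    # Illustrative sketch: no DB connection is opened at catalog-parsing time.
    def __init__(self, table_name: str, credentials: dict):
        self._table_name = table_name
        self._connection_str = credentials["con"]
        self._engine = None  # nothing is connected yet

    @property
    def engine(self):
        # Build the engine lazily, on first use, instead of in __init__.
        if self._engine is None:
            self._engine = create_engine(self._connection_str)
        return self._engine

    def load(self) -> pd.DataFrame:
        return pd.read_sql_table(self._table_name, con=self.engine)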

@deepyaman
Member

Confirmed offline with @m-gris that kedro-org/kedro-plugins#281 works for him, so I'll get it ready for review when I get a chance! :)

@m-gris
Author

m-gris commented Aug 12, 2023

Hi everyone,

Apologies for my silence / lack of responsiveness, and thanks to Deepyaman for reaching out and coming up with a fix for this issue.

I'm happy to confirm that it does work, in the sense that I can run pipeline A even if some SQL datasets needed for pipeline B can't be loaded because I'm offline / without access to the DB.

However... With further tests I noticed something that, to me at least, seems not ideal:
The con is still needed even if not used.

I know… I’m being a bit picky.
But here is the case I have in mind:
A team member working on a new modular pipeline C created a new SQL dataset in the catalog.
However, since this is a pipeline I do not work on, I do not have a valid con to provide.
Granted, I could hack my way around it by passing a dummy URI... But this feels a bit odd / not ideal...

Instead of putting garbage in the conf, wouldn't it be better to "simply" have the ability to leave con empty until I really do need it?

Thanks in advance,
Regards
M

@deepyaman
Member

However... With further tests I noticed something that, to me at least, seems not ideal: The con is still needed even if not used.

I know… I’m being a bit picky. But here is the case I have in mind: A team member working on a new modular pipeline C created a new SQL dataset in the catalog. However, since this is a pipeline I do not work on, I do not have a valid con to provide. Granted, I could hack my way around it by passing a dummy URI... But this feels a bit odd / not ideal...

Hi again @m-gris! Based on my evaluation, I think what you're requesting is feasible and looks reasonably straightforward. When constructing DataCatalog.from_config(), the parsed dataset configuration is materialized (see https://github.com/kedro-org/kedro/blob/0.18.12/kedro/io/data_catalog.py#L299-L301). Instead of this, it should be possible to just store the dataset configuration and construct the dataset instance the first time it's referenced (using a lazy-loading approach, like you suggest).
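
As a purely hypothetical illustration of that idea (made-up names, not Kedro's actual internals), a catalog that keeps the raw config and materializes datasets on first access could look roughly like this:

class LazyCatalog:
    # Sketch only: keep the parsed YAML entries and build a dataset object the
    # first time it is requested, instead of materializing everything up front.
    def __init__(self, dataset_configs: dict, dataset_factory):
        self._configs = dataset_configs   # raw per-dataset config, as parsed
        self._factory = dataset_factory   # e.g. AbstractDataSet.from_config
        self._instances = {}              # materialized datasets, keyed by name

    def _get_dataset(self, name: str):
        if name not in self._instances:
            # Materialize only when the dataset is actually referenced.
            self._instances[name] = self._factory(name, self._configs[name])
        return self._instances[name]

    def load(self, name: str):
        return self._get_dataset(name).load()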

However... I do think the change could have a significant effect, potentially breaking things (e.g. plugins) that parse the data catalog. I'm almost certain this would be a breaking change that would have to go in 0.19 or a future minor release. Additionally, users may benefit from the eager error reporting, but perhaps the lazy loading could be enabled as an option (reminiscent of the discussion about find_pipelines() in #2910).

Let me request some others' views on this, especially given that a lot of people have been looking at the data catalog lately.

P.S. Re "I could hack my way around by passing a dummy URI... But this feels a bit odd / not ideal...", my first impression was that this is what we would do back when I was primarily a Kedro user myself. We'd usually define a dummy URI with the template in the base environment (e.g. postgresql://<username>:<password>@localhost/test), and team members would put the actual credentials following the template in their local environment (e.g. postgresql://deepyaman:5hhhhhh!@localhost/test). So, I don't 100% know if I find it that odd myself. That being said, that's also a specific example, and I don't think it more broadly addresses the lazy-loading request, hence I'm just including it as an aside. :)

@m-gris
Author

m-gris commented Aug 15, 2023

Thanks for your answer. Good point regarding the potential for breaking changes.
Looking forward to what others will have to say :)

@noklam
Contributor

noklam commented Aug 15, 2023

As far as I understand, the whole problem boils down to the connection being constructed (in some SQL- or DB-related datasets) when the DataCatalog is materialised.

This looks like a dataset issue to me, and we should fix the datasets instead of making the DataCatalog load lazily. Regarding the problem of con, isn't this simply because the dataset assumes credentials always come with a con key?

Would this be enough?
self._connection_str = credentials.get("con", None)

@deepyaman
Member

As far as I understand, the whole problem boils down to the connection being constructed (in some SQL- or DB-related datasets) when the DataCatalog is materialised.

This would be addressed by my open PR. I think there are definite benefits to doing this (e.g. not unnecessarily creating connections well before they need to be used, if at all).

This looks like a dataset issue to me, and we should fix the datasets instead of making the DataCatalog load lazily. Regarding the problem of con, isn't this simply because the dataset assumes credentials always come with a con key?

Would this be enough? self._connection_str = credentials.get("con", None)

Nit: There's also a block further up, to validate the con argument, that would need to be moved or turned off.

But, more broadly, I don't know how I feel about this. I feel like providing con is required (it's in the dataset credentials, but it's also a positional argument for pd.read_sql_table()). Why would con be a special case where we delay validating that it's provided? What if I don't know where my colleague put some local CSV file for a pandas.CSVDataSet; why does my catalog validation fail if I don't have a filepath to provide?

@rishabhmulani-mck

rishabhmulani-mck commented Aug 16, 2023

We are trying to isolate catalogs within Kedro based on pipelines. What this means is: pipeline A should use catalog A, and pipeline B should use catalog B. I have used Kedro with multiple catalogs, but it seems like all catalog files (with the prefix catalog) get read in regardless of which pipeline is actually run.

More context/background: the DE pipeline is set up to read from an Oracle database and create data extracts. The only way to go ahead is with data extracts, as we have READ-ONLY access on the database and the DE pipeline needs to do some transformations before the data is ready for DS.
Once these transformations are done, the extracts are ready for use by DS. This means that they don't really need to use the DE pipeline again, and they run only their specific pipeline. However, even while doing this, the entire catalog ends up being read. One of the catalog entries is the Oracle database, and it requires setting up credentials in the local environment and installing the required dependencies for cx_Oracle/oracledb. DS doesn't need these and they do not want to invest time or effort in having to set this up, so they want to isolate their catalog.

Posting at the request of @astrojuanlu for feature prioritization (if actually determined to be a needed feature)

@astrojuanlu
Member

Thanks a lot @m-gris for opening the issue and @rishabhmulani-mck for sharing your use case.

One quick comment: would using environments be a decent workaround? https://docs.kedro.org/en/stable/configuration/configuration_basics.html#how-to-specify-additional-configuration-environments One environment could contain the "problematic"/"heavy" datasets, and another one could be isolated from them. Then one could do kedro run --env=desired-env and avoid accidentally loading the database connection entirely.

@deepyaman
Member

However, even while doing this, the entire catalog ends up being read. One of the catalog entries is the Oracle database and it requires setting up credentials in the local environment, and setting up the required dependencies for cx_Oracle/oracledb. DS doesn't need these and they do not want to invest time or effort in having to set this up, so they want to isolate their catalog.

@rishabhmulani-mck I think we all agree that it's a problem to connect to all of the databases upon parsing the catalog, which would be resolved by something like kedro-org/kedro-plugins#281. I can finish this PR.

There is a second question of con being a required field. Would your DS care if a dummy con was in the base environment (thereby satisfying the constraint of having a valid config, since it won't actually be loaded once kedro-org/kedro-plugins#281 is implemented)?

@astrojuanlu @noklam my opinion is to split this up, and finish fleshing out kedro-org/kedro-plugins#281 (I believe it's an improvement regardless, can be added for all the datasets with a DB connection), and decide in the meantime whether to support anything further. Unless it's just me, I think making con essentially optional is more contentious.

@noklam
Contributor

noklam commented Aug 21, 2023

That's fine with me, and it is an improvement regardless. We can keep this ticket open but merge the PR you made.

@astrojuanlu
Member

astrojuanlu commented Aug 22, 2023

I don't think we have spent enough time thinking about how to solve this in a generic way for all datasets beyond database connections. The reason I'm saying this is because if we come up with a generic solution (like "let's not instantiate the dataset classes until they're needed"?) then maybe kedro-org/kedro-plugins#281 is not even needed.

I'd like to hear @merelcht's perspective when she's back in a few days.

@AlekJaworski

Since my Slack message was quoted above, I'll add my 2 cents for what it's worth.

I have in the end solved it using a templated URI string as well, but was left slightly unsatisfied, as there is just "no need" to construct everything (or there doesn't seem to be), and it impacts both performance and ease of use.

Having the whole configuration valid is perhaps desirable, but it is kind of annoying not to be able to run an unrelated pipeline during development. "Why would X break Y???" was my personal reaction. So in this case I would also love to see lazy loading of only the necessary config components, but I am not that knowledgeable about Kedro and what sort of things it'd break.

@deepyaman deepyaman removed their assignment Aug 22, 2023
@deepyaman
Member

That's fine with me, and it is an improvement regardless. We can keep this ticket open but merge the PR you made.

Great! I will unassign myself from this issue (and its larger scope), but finish the other PR when I get a chance. :)

@noklam noklam linked a pull request Aug 29, 2023 that will close this issue
@noklam
Contributor

noklam commented Aug 29, 2023

Linking kedro-org/kedro-plugins#281, which is a partial solution to this issue.

@idanov
Member

idanov commented Sep 14, 2023

Currently the catalog is lazily instantiated on a session run:

catalog = context._get_catalog(  # noqa: protected-access
    save_version=save_version,
    load_versions=load_versions,
)

(not sure why this is not using context.catalog and reaching for an internal method, but this is a side question)

It should be possible to add an argument to context._get_catalog(..., needed=pipeline.data_sets()) (if it won't create any weird implications) that will filter the conf_catalog before it creates the DataCatalog object with the list of datasets from the pipeline.

conf_catalog = self.config_loader["catalog"]

Something like this:

conf_catalog = {k: conf_catalog[k] for k in conf_catalog if k in needed}

Parameter names and insertion points should be examined more carefully though, since this is just a sketch of a solution. The good news is that it's a non-breaking one, if we implement it.
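
For illustration, the filtering idea above could look roughly like this (a hypothetical helper, with needed coming from pipeline.data_sets() as suggested; this is not actual Kedro internals):

from kedro.io import DataCatalog


def build_filtered_catalog(conf_catalog: dict, conf_creds: dict, needed: set) -> DataCatalog:
    # Drop catalog entries that the selected pipeline does not reference
    # before any dataset objects are constructed.
    filtered = {name: cfg for name, cfg in conf_catalog.items() if name in needed}
    return DataCatalog.from_config(filtered, conf_creds)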

@noklam
Contributor

noklam commented Sep 18, 2023

(not sure why this is not using context.catalog and reaching for an internal method, but this is a side question)

@idanov I think that's because save_version and load_version are only available in the session but not the context? Sure, we can pass all this information down to KedroContext, and I don't see a big problem with that. The weird bit may be that save_version is also the session_id, and it makes more sense to have it in KedroSession.

Note that @ankatiyar added save_version and load_version to DataCatalog for dataset factories (this is relatively new), so the catalog actually has sufficient information to infer the version now and doesn't need to rely on the context/session anymore. See #2635.

@takikadiri

takikadiri commented Oct 2, 2023

This is really something that is needed when the codebase grows, or when we have a monorepo with many pipelines that need to / could be orchestrated independently with different scheduling contexts (for example: training, evaluation, and inference pipelines).

Currently, when our codebase grows, we create multiple packages inside the same Kedro project (src/package1, src/package2, ...), which means multiple Kedro contexts, one context per "deployable" pipeline, so that we have a "space" to manage configs that belong to the same execution context. However, this increases our cognitive context switching and duplicates boilerplate code.

I wonder if the proposed solution solves the problem entirely, as it's not just about initializing datasets lazily, but also about materializing the datasets lazily. The catalog will still be entirely materialized from YAML to Python objects by the config_loader. Kedro will still ask for some globals that are not filled in but are not really needed by the selected pipeline (from the user's perspective).

Maybe pushing the namespace concept a little further could solve this. But for now, the proposed solution below (lazily instantiating datasets) is a big step toward pipeline modularity and monorepos, and it has the merit of being non-breaking. I'm looking forward to it.

It should be possible to add an argument to context._get_catalog(..., needed=pipeline.data_sets()) (if it won't create any weird implications) that will filter the conf_catalog before it creates the DataCatalog object with the list of datasets from the pipeline.

conf_catalog = self.config_loader["catalog"]

Something like this:

conf_catalog = {k: conf_catalog[k] for k in conf_catalog if k in needed}

@astrojuanlu
Member

kedro-org/kedro-plugins#281 addressed lazy loading of database connections. Should we keep this issue open for future work on lazy loading of catalog items in general? @merelcht

@merelcht merelcht reopened this Oct 19, 2023
@github-project-automation github-project-automation bot moved this from Done to In Progress in Kedro Framework Oct 19, 2023
@deepyaman
Member

kedro-org/kedro-plugins#281 addressed lazy loading of database connections. Should we keep this issue open for future work on lazy loading of catalog items in general? @merelcht

Yeah, we should; I'm not really sure why this got closed in #281, as that was linked to #366 and not this...

@takikadiri

We're hitting this issue again, as we tried to deploy two pipelines that have slightly different configs. When we run pipeline A, Kedro keeps asking for a credential that is only used in pipeline B.

Currently we're giving fake credentials to the target orchestration platform so the pipeline can run. But this introduces some debt in our credentials management on the orchestration platform.

@astrojuanlu
Member

Adding this to the "Redesign the API for IO (catalog)" for visibility.

@astrojuanlu
Member

Noting that this was also reported as #3804.

@merelcht merelcht removed the Community Issue/PR opened by the open-source community label Jun 14, 2024
@noklam
Contributor

noklam commented Jul 23, 2024

I believe the original issue was already resolved by making the connection lazy in the SQL-related datasets. There seem to be some other issues raised by @takikadiri.

@astrojuanlu
Member

It wasn't; generic datasets are still eagerly loaded.

@jround1

jround1 commented Aug 14, 2024

We're hitting this issue again, as we tried to deploy two pipelines that have slightly different configs. When we run pipeline A, Kedro keeps asking for a credential that is only used in pipeline B.

Currently we're giving fake credentials to the target orchestration platform so the pipeline can run. But this introduces some debt in our credentials management on the orchestration platform.

@takikadiri We also run a monorepo, with several overlapping projects/pipelines. We segment the projects by subfolder, both in the source code and in the parameters, pipelines, and data catalog. We control this via environment variables that are read in settings.py (for catalog and parameters) and in pipeline_registry.py (for pipelines). There are also universal/general catalog and parameters that are loaded for all projects/pipelines. This way, a session/context is created with only the catalog and parameters that are needed for the pipeline we want to run, and a developer working within one project cannot break another project.
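
As a rough sketch of that kind of setup (the environment variable name, file layout, and patterns below are assumptions for illustration, not the exact setup described above):

# settings.py -- select per-project config via an environment variable
import os

PROJECT = os.environ.get("KEDRO_PROJECT", "project_a")  # hypothetical variable name

CONFIG_LOADER_ARGS = {
    "config_patterns": {
        # Always load the shared config, plus only the selected project's config.
        "catalog": ["catalog_common*", f"{PROJECT}/catalog*"],
        "parameters": ["parameters_common*", f"{PROJECT}/parameters*"],
    }
}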

However, we have the opposite problem when it comes to testing: we cannot iterate through different sessions to test the makeup of the catalogs, because subsequent sessions created in code (confirmed different session IDs, etc.) for some reason still contain the first context. See #4087.

astrojuanlu added a commit to astrojuanlu/kedro-catalog-experiment that referenced this issue Sep 3, 2024
@astrojuanlu
Member

#3811 (comment)

AWS Secrets Manager bills on READ ⚠️ 💵, making it crucial that only credentials that are actually needed are loaded.

Our problem is compounded by the fact that, in the after_context_created hook, there is unfortunately no access to tags, pipeline name, namespace, etc., i.e. data that would help in filtering things out and minimising AWS billing.

@merelcht
Member

Closing this in favour of #3935, which will address the issues raised here.
