
Universal Kedro deployment (Part 1) - Separate external and applicative configuration to make Kedro cloud native #770

Galileo-Galilei opened this issue May 19, 2021 · 13 comments


@Galileo-Galilei
Member

Galileo-Galilei commented May 19, 2021

Preamble

Dear Kedro team,

I've been using Kedro since July 2019 (kedro==0.14.3, which is quite different from what kedro is now) and over the past 2 years my team has deployed a few dozen machine learning pipelines in production with Kedro. I want to give you some feedback on my Kedro experience along this journey, and on the advantages and drawbacks of Kedro from my team's point of view with the current versions (0.16.x and 0.17.x):

Advantages:

  • the level of standardisation across projects has tremendously increased. As a consequence:
    • Maintenance is much easier, because in the worst case, if you open an undocumented project you can easily find out how to launch and modify it, because each object has a single, well-identified place in the template
    • Collaboration and reuse are much better because anyone can read/modify any other project even if it is poorly documented
  • the team's software engineering skills have improved a lot and data scientists have much better habits: even if they have to work within a non-kedro project, they are more inclined to spontaneously separate configuration from code execution.

Drawbacks:

  • Kedro is still early stage and the frequent changes (especially in the template) make it hard to upgrade between versions.
  • Kedro deployment is not standardised and projects are still hard to deploy. You have made a bunch of either plugins (kedro-airflow, kedro-docker) or tutorials (argo, prefect, kubeflow...) to explain how to deploy a kedro project. However, these examples suffer from several drawbacks:
    • Project orchestration: you assume we will map kedro nodes to the orchestrator nodes. This is not realistic, and in a discussion with @limdauto here we agreed on the fact that the conversion to the pipeline's nodes is complicated and must be thought through by the person in charge of the deployment (a sketch of the tag-based slicing discussed below follows this list)

      Galileo-Galilei In a nutshell, I don't think there is an exact mapping between kedro nodes (designed by a data scientist for code readability, easy debugging and partial execution...) and orchestrator nodes (designed for system robustness, ease of asynchronous execution, retry strategies, efficient compute...). kedro nodes are much more low-level in my opinion than orchestrator nodes.

      limdauto The reason why we don't make all of them into plugins was precisely because of what you say here: how you slice your pipeline and map them to the orchestrator's nodes is up to you. A good pattern is using tags. [... or] you can map them based on namespace if you construct your pipeline out of modular pipelines.

    • Configuration management: All deployment tutorials assume that configuration will be changed directly inside the kedro project (e.g., modify the catalog to persist some objects, change paths to make them relative...). This makes the very strong assumption (which often does not hold in my personal experience) that the person who will deploy the project (the ops) has access to the underlying application (i.e. the code folder). This is the issue addressed in this design document.
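To make the tag-based slicing @limdauto mentions concrete, here is a minimal sketch. Pipeline.only_nodes_with_tags is an existing Kedro method, but the tag names and the task mapping are purely illustrative, and the exact CLI flag for running a tagged subset has varied across kedro versions:

# Illustrative only: slice a kedro pipeline into coarser orchestrator tasks
# by tag, so that several low-level kedro nodes collapse into one task.
from kedro.pipeline import Pipeline, node

def identity(x):
    return x

pipeline = Pipeline(
    [
        node(identity, "raw", "clean", tags=["preprocessing"]),
        node(identity, "clean", "features", tags=["preprocessing"]),
        node(identity, "features", "model", tags=["training"]),
    ]
)

# Each tag becomes one orchestrator task running a sub-pipeline, e.g. a task
# whose command is the kedro CLI restricted to that tag.
orchestrator_tasks = {
    tag: pipeline.only_nodes_with_tags(tag) for tag in ("preprocessing", "training")
}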

This issue is likely the first of a series, and I will focus specifically on Kedro's configuration management system. To give credit where it is due, the suggestions hereafter come in vast majority from discussions, trials and errors with @takikadiri when trying to deploy our Kedro projects.

Disclaimer: I may use the words "should" or "must" in the following design document, and use very assertive sentences which reflect my personal opinion. These terms must be understood with regard to the underlying software engineering principles I describe explicitly when needed. My sincere apologies if it offends you; it is by no means an order to do a specific action, I know you have your own clear vision of what Kedro should bend towards.

Context

Deploying a kedro application

A brief description of the workflow

A common workflow (at least for me, as a dev) is to expose some functionalities to an external person (an ops) who will be in charge of creating the orchestration pipeline. A sketch of the workflow is the following:

  • the dev creates several kedro pipelines (say pipeline_1, pipeline_2). These pipelines can be run as usual with kedro run --pipeline=<pipeline_name>
  • the dev describes to the ops the orchestration logic (e.g. "run pipeline_1, then run pipeline_2 if it succeeds"). Notice that it does not make sense to create a single pipeline_3=pipeline_1+pipeline_2, because we often do not want to execute them at the same time, and we want to have retry strategies because the logic is much more complex than this example
  • the ops creates the orchestration in a dedicated tool (e.g. Oozie, Control-M, Gitlab are the most used in my firm, but the same would apply with Airflow or any orchestration tool).

Deployment constraints to deal with

  • The dev and the ops do not use the same tools
  • The dev and the ops do not use the same programming languages
  • Each orchestration job created by the ops is launched with a CLI command.

Note that changing the workflow or asking the ops to modify the kedro project are out of the list of possible solutions, since I work in a huge organisation with strictly standardized processes that cannot be modified only for my team.

Challenges created by Kedro's configuration management implementation

Identifying the missing functionality: overriding configuration at runtime

With regard to the previously described workflow, it should be clear that the ops must be able to inject some configuration at runtime, e.g. some credentials (password for a database connection, for mlflow), some paths to the data, possibly some parameters... This should be done without modifying the yaml config files: the project folder is not even visible to the ops, and we want to avoid operational risk if they were to modify the configuration of a project they know nothing about.

Overview of potential solutions and their associated issues as of kedro==0.17.3

With the current version of kedro, we have two possibilities when packaging our project to make it "executable":

  1. either we do not package the configuration with the code, and we expect our users to recreate the entire configuration and the folder structure by themselves. In my experience, this is really something users struggle with because:
    • the configuration files are often complex (dozens or even hundreds of rows in a catalog.yml seem common)
    • you need to have some knowledge about Kedro and about the business logic of the underlying pipelines to recreate these files, and this is hardly possible without having the code
    • even worse, this is not even something acceptable in the workflow described above.
  2. or we do package the entire configuration with the code (e.g. by moving the conf folder to src/, or by packaging the entire folder, say with a run.sh file at the root to make it "executable-like"). This is roughly what is suggested by @WaylonWalker in Package conf with the project package #704 and while it is in my opinion better than the previous bullet point, it is not acceptable as is for the following reasons:
    • The major issue with this solution is that you need to redeploy the entire app each time you want to change the configuration (even if the modification is not related to the "business logic" of your pipeline, e.g. the path to your output file). This completely breaks the build once, deploy everywhere principle.
    • This also tightly couples the business logic of your pipelines to the environment where the code is deployed, which goes against software engineering best practices IMHO.
    • You will likely need to package some sensitive configuration to make the package work (e.g. credentials like passwords for production database connections), which is definitely a no-go.

As a conclusion, both solutions have critical flaws and cannot be considered as the correct way to handle configuration management when deploying a kedro project as a standalone application.

Thoughts and design suggestions for refactoring configuration management

Underlying software engineering principles: decoupling the applicative configuration from the external configuration

All the problems come from the fact that Kedro currently considers all configuration files as identical while they have different roles:

  • on the one hand, the catalog.yml and the parameters.yml are project specific (they contain the business logic) and we do not expect our users to modify them, except maybe some very small and specific parts that the dev must choose and control. It is not reasonable to assume that the person who will deploy the app knows Kedro's specificities and the underlying business logic. These files are the applicative configuration and must be packaged with the project. We should likely package the logging.yml file too, because it is very likely that only advanced users will need to modify it.
  • on the other hand, the credentials.yml (and the globals.yml if one uses the TemplatedConfigLoader as suggested in your documentation) are exposed to our users and must be modified/injected at runtime. They are the external configuration. They depend on the IT environment they are executed in, and they should NOT be packaged with the code, in accordance with the build once, deploy everywhere principle.

Refactoring the configuration management

Part 1: Refactor the template to make a clear separation between external and applicative configuration

I suggest refactoring the project template to this:

.
|-- .ipython/profile_default/startup
|-- external-conf
|   |-- local                    # contains only credentials and globals
|   |   |-- credentials.yml
|   |   |-- globals.yml          # the ``globals.yml`` file of the TemplatedConfigLoader
|   |-- another-env              # optional, the user can add as many envs as they want
|   |   |-- credentials.yml
|   |   |-- globals.yml
|-- data
|-- docs
|-- logs
|-- notebooks
|-- src
|   |-- test
|   |-- <python_package>
|   |   |-- pipelines
|   |   |-- hooks.py             # modify it to make the TemplatedConfigLoader the default
|   |-- applicative-conf         # the current "base" environment, renamed and moved to src/
|   |   |-- catalog.yml
|   |   |-- parameters.yml
|   |   |-- logging.yml
|   |   |-- globals_default.yml  # defaults for the globals, packaged with the app so the user is not forced to specify them
...

With such a setup, the applicative configuration should be packaged with the project, which will make the pipelines much more portable. Two key components should be updated to match all the constraints: the TemplatedConfigLoader and the run CLI command.

Part 2: Update the ConfigLoader

  • We should use the TemplatedConfigLoader by default to prepare future configuration injection at runtime
  • The TemplatedConfigLoader must not raise an error if there is no "globals.yml", so we can keep it simple for users who do not plan to use this advanced behaviour. The only change for them would be the move of the conf/base folder to src/.
  • The TemplatedConfigLoader should retrieve global configuration from different locations with a priority (sketched below):
    • look for configuration in src/applicative-conf
    • override with configuration in external-conf/ if it exists
    • override with environment variables if they exist
    • override with CLI arguments if they were given at runtime
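A minimal sketch of this lookup order, assuming the template above; resolve_globals is a hypothetical helper, not an existing Kedro API:

# Hypothetical sketch of the proposed lookup order; folder names follow the
# template above and nothing here is an existing Kedro API.
import os
from pathlib import Path

import yaml

def _load_yaml(path: Path) -> dict:
    # Missing files contribute nothing, so globals.yml stays optional.
    return (yaml.safe_load(path.read_text()) or {}) if path.exists() else {}

def resolve_globals(project_root: Path, cli_args: dict) -> dict:
    globals_dict: dict = {}
    # 1. packaged defaults shipped with the application
    globals_dict.update(_load_yaml(project_root / "src/applicative-conf/globals_default.yml"))
    # 2. external configuration, if the folder exists
    globals_dict.update(_load_yaml(project_root / "external-conf/local/globals.yml"))
    # 3. environment variables override file-based values
    globals_dict.update({k: os.environ[k] for k in globals_dict if k in os.environ})
    # 4. CLI arguments have the highest priority
    globals_dict.update(cli_args)
    return globals_dict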

With this system, the dev would choose and define explicitly what is exposed to the end users thanks to the TemplatedConfigLoader system, e.g.:

# src/applicative-conf/parameters.yml
number_of_trees: ${NUMBER_OF_TREES} # exposed
alpha_penalty: 0.1 # not exposed

# src/applicative-conf/catalog.yml
my_input_dataset:
  type: pandas.CSVDataSet
  filepath: ${INPUT_PATH}
  credentials: ${INPUT_CREDENTIALS}

# external-conf/local/credentials.yml
INPUT_CREDENTIALS: <MY_VERY_SECURED_PASSWORD>

# src/applicative-conf/globals_default.yml
NUMBER_OF_TREES: 100
INPUT_PATH: data/01_raw/my_dataset.csv

Part 3 (optional): Update the run command

If possible, the run command should explicitly enable dynamically exposing only the variables defined in the globals. Once packaged, the end user would be able to run the project with either:

  • kedro run (use default values) -> they will need to add INPUT_CREDENTIALS as an environment variable since there is no default for it
  • kedro run --INPUT_CREDENTIALS=<MY_VERY_SECURED_PASSWORD> (use default values + inject the password at runtime; not secure at all, it will end up in the logs!)
  • kedro run --NUMBER_OF_TREES=200 (still with INPUT_CREDENTIALS as an environment variable)

The end user cannot modify what is not exposed by the developer through the CLI or environment variables (e.g. save_args for the CSVDataSet), except if it is exposed in the globals_default.yml file and made dynamic by the developer.
Obviously, the user can still recreate a conf/<env-folder>/catalog.yml file to override the configuration, but they should not be forced (nor even encouraged) to do this.
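A rough sketch of what such a run command could look like, using click to forward unknown --KEY=value options as runtime globals; this is not the actual Kedro CLI, only an illustration of the proposed contract:

# Hypothetical sketch: unknown --KEY=value options become runtime globals.
import click

@click.command(context_settings={"ignore_unknown_options": True, "allow_extra_args": True})
@click.pass_context
def run(ctx: click.Context) -> None:
    cli_globals = {}
    for arg in ctx.args:
        key, _, value = arg.lstrip("-").partition("=")
        cli_globals[key] = value
    # cli_globals (e.g. {"NUMBER_OF_TREES": "200"}) would then be handed to
    # the TemplatedConfigLoader before the pipeline runs.
    click.echo(f"runtime globals: {cli_globals}")

if __name__ == "__main__":
    run()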

Alternative considered

I could create a plugin to implement such changes by creating a custom ProjectContext class, but the suggested template changes, albeit easy to implement, would make it hard to follow the numerous evolutions of your template. It would make much more sense to implement at least these template changes in the core library.

@yetudada, sorry to ping directly, but you told me you were working on configuration refactoring. Do such changes make sense in the global picture you have in mind?

@mzjp2
Contributor

mzjp2 commented May 20, 2021

(Ignore me, just butting in here to say that this is an amazingly well written issue - one of the best and most thorough I've seen in a long time.)

@WaylonWalker
Contributor

Still need to digest this completely. One thing I give props to the kedro team for, regarding templates, is the move from 0.16.x to 0.17.x. It was very, very hard to work outside of the standard template in 0.16.x. It would flat-out error and not let you do things a "different" way in some cases.

Composability

0.17.x is MUCH more modular. You can compose your own template quite easily by composing the components of kedro you wish to use, to the point where you can easily create a pipeline, catalog, runner, and CLI with very little code in a single script. In fact, I've done it. After working with DAGs for the past few years it feels very slow to work without one now. In some cases where there is a significant project already complete, it may not make sense to completely port it to kedro, but rather to bring in a bit of kedro as you maintain it.

I treat Everything as a Package

I generally think of everything as a package: something that I can pip install and run from the command line, or in the case of production put into a docker image. I think this workflow/deployment is what has led me to put everything into the package. I think it would be completely logical to find a balance of letting the user override parameters while providing good defaults for all of them inside your package. Again, this is probably my small view into how I work.

@merelcht merelcht added the pinned Issue shouldn't be closed by stale bot label May 21, 2021
@idanov idanov self-assigned this May 21, 2021
@idanov
Member

idanov commented Jun 1, 2021

@Galileo-Galilei First, really thank you for this well-written issue and great analysis of some of the main challenges we currently face in Kedro. The things you have pointed out are a real problem we are trying to address, and we certainly are aware of those challenges. Your thoughts on that are really helpful since we mainly have access to the perspective of McKinsey and QuantumBlack users and hearing the viewpoint of someone not affiliated with our organisations is super valuable.

I would like to add a few comments and maybe some clarifications on our thinking (or at times mostly my thoughts as Kedro's Tech Lead, since some of those might not have crystallised completely yet to be adopted as the official view of the team).

Deployment / orchestration

A lot of Kedro is inspired by the relevant bits of The Twelve-Factor App methodology in order to aid deployment. Initially Kedro was often mistaken for an orchestrator, but the goal of Kedro has always been to be a framework helping the creation of data science apps which can then be deployable to different orchestrators. However this view might not have been perfectly reflected in the architecture due to lack of experience on our side and user side alike. Most recent changes in the architecture though have moved towards that direction as @WaylonWalker pointed out.

In the future we’ll double down on the package deployment mode, e.g. you should be able to run your project as a Kedro package and the only necessary bit would be providing the configuration (currently under conf/). The latest architectural changes from 0.17 and the upcoming 0.18 should allow us to significantly decrease the number of breaking changes for project upgrades. A lot of our work is ensuring that we are backwards compatible, which makes it harder for us to experiment, thus reducing our speed of delivering on the mentioned challenges.

Now for the deployment model, we see a future where our users will structure their pipelines using namespaces (aka modular pipelines). Thus they will form hierarchies of nodes, where the grouping is semantically significant for them. The top-level pipelines will consist of multiple modular pipelines, joined together into the overall DAG. This way modular pipelines can be analogous to folders and nodes to files, e.g.

[image: a hierarchy of namespaced modular pipelines, with nodes nested inside them like files inside folders]

After having your pipeline structured like that, we can provide a uniform deployment plugin where users can decide the level at which their nodes will be run in the orchestrator, e.g. imagine something like kedro deploy airflow --level 2 which will make sure that the output configuration runs each node separately, but collapses the nodes at level 3 into singular tasks in the orchestrator.

There are some additional subtleties we need to take care of, e.g. running different namespaces on different types of machines (GPU instances, Spark clusters, etc.). But I guess the general idea is clear - the pipeline developer will have much better control over how things get deployed without actually needing to learn another concept or make big nodes. They will just need to make sure that their pipeline is structured in a way that is semantically meaningful for them and the orchestration, which is already an implicit requirement anyway and people tend to do that as per your example, but not in a standard way.
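A sketch of the --level idea: collapse nodes into orchestrator tasks keyed by their namespace truncated at a given depth. Node.namespace and Pipeline.nodes exist in kedro, but group_nodes_by_level is a hypothetical helper:

# Illustrative sketch: one orchestrator task per namespace prefix.
from collections import defaultdict

from kedro.pipeline import Pipeline

def group_nodes_by_level(pipeline: Pipeline, level: int) -> dict:
    groups = defaultdict(list)
    for node in pipeline.nodes:
        # truncate "a.b.c" to depth `level`, e.g. level=2 -> "a.b"
        prefix = ".".join((node.namespace or "").split(".")[:level])
        groups[prefix].append(node)
    # every group would become a single task in the target orchestrator
    return dict(groups)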

Configuration

Logging

This one is supposed not to be needed, since Kedro ships exactly the same defaults. So teams can directly get rid of it, unless they would like to change the logging pattern for different platforms, e.g. if you would like to redirect all your logs towards an ElasticSearch cluster, Sumologic or any other log collecting service out there. This configuration is environment specific (locally you might want colourful logging, but on your orchestrator that will be undesirable) and that's why it's not a good idea to package it with your code.

Credentials

This one is obviously environment specific, but what we should consider doing is adding environment variable support. Unfortunately this has been on the backlog for a while, but it doesn't seem to be such an important issue that it cannot be solved by DevOps, so we never got to implementing environment variables for credentials.
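One possible DevOps-side pattern, sketched below: mark credential values with an "ENV:" prefix in credentials.yml and resolve them from environment variables before they reach the datasets. The prefix convention and the helper are hypothetical, not a Kedro feature:

# Hypothetical: resolve "ENV:"-prefixed credential values from the environment.
import os

def resolve_env_credentials(credentials: dict) -> dict:
    resolved = {}
    for name, value in credentials.items():
        if isinstance(value, str) and value.startswith("ENV:"):
            # e.g. "ENV:DB_PASSWORD" -> os.environ["DB_PASSWORD"]
            resolved[name] = os.environ[value[len("ENV:"):]]
        else:
            resolved[name] = value
    return resolved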

Catalog

This is a way bigger topic and it is much less clear how to solve it in a clean way, but it is something we have had on our radar for quite some time. We want to come up with a neat solution for this one by the end of 2021, but obviously there are many factors that will come into play and I cannot guarantee we can get it done by then.

History of the problem

In my opinion, this challenge came from the fact that we treat each dataset as a unique type of data, and this comes from the fact that we did not foresee that Kedro would enable the creation of huge pipelines on the order of hundreds of nodes with hundreds of datasets. However, now most of our users internally have very big pipelines and a lot of intermediary datasets, which need to be defined in the catalog and not just passed in memory. That created huge configuration files, which a lot of people wanted to simplify. That's why the TemplatedConfigLoader was born, out of user demand and not without some hesitation from our side.

Why the current model is failing

The problem with the TemplatedConfigLoader is that it solves the symptom, but not the real problem. The symptom is the burdensome creation of many catalog entries. The problem is the need for those entries to exist at all. Maybe to clarify here, I will refer to web frameworks like Django or Rails - in all web frameworks, you define only one database connection and then the ORM implicitly maps the objects to that database. In Kedro, each object (i.e. dataset) needs to be configured on its own. Kedro's model is good if you have a lot of heterogeneous datasources (like the case of pipelines fetching data from multiple independent sources). But it quickly dissolves into chaos as you add multiple layers of intermediary datasets, which are, if not always, then for the most part, pointing to the same location and can be entirely derived from the name of the dataset. So the challenge here is that we need to support both per-dataset catalog entries and one configuration entry for hundreds of datasets. Whatever solution we come up with needs to work for both cases and be declarative at the same time.
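To illustrate the "entirely derived from the name" point, a sketch under the assumption that all intermediary datasets share one location and format; the naming convention and the helper are hypothetical:

# Hypothetical: synthesise a catalog entry from a dataset's name instead of
# declaring hundreds of near-identical YAML entries by hand.
def default_entry(dataset_name: str) -> dict:
    return {
        "type": "pandas.ParquetDataSet",  # one storage format for all intermediaries
        "filepath": f"data/02_intermediate/{dataset_name}.parquet",
    }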

Why the catalog is configuration and not code

As we are trying to emulate the one-build-multiple-deployments model, it becomes very clear that all catalog entries are entirely environment specific (e.g. with one build you might deploy once to S3 and then the second time to ABS or GCS). So this is definitely configuration that needs to live outside your codebase. However, the current mode of defining every single dataset separately makes this process completely unmaintainable, so people came up with the templated config solution with the globals.yml to factor out only the useful configuration. Some of our internal users went even further: they have all the catalog entries as part of their codebase and only the globals.yml is treated as real config.

Parameters

The parameters configuration is an odd one because everyone uses it for different things. E.g. we see many users using it as a way to document their default values of all of their parameters, even when they don’t need to change that parameter. That made the parameter files huge and now they are very hard to understand without some domain knowledge. Some teams use these files as a way for non-technical users to do experiments on their own. Some teams would love to package their parameters in their code, since they treat it as a single place for all their global variables that they can use across their pipeline.

The main challenge I see for the parameters files is that the way we merge those from base/ to the other environments is by means of a destructive merge. The result is that if you have a highly-nested parameter structure and you want to change only one parameter from the default values, you need to define the full tree on the way to the parameter.

One can argue that there should be a way to have a place in your src/ folder where you can define default parameters, so that users need to provide parameters configuration only when something deviates from the defaults. When we revisit our configuration management, we'd look into solutions for this, as well as a non-destructive parameter overriding (which also has drawbacks).
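To make the difference concrete, a self-contained sketch of the two merge strategies on plain dicts:

# Destructive (shallow) merge vs recursive merge of parameter dicts.
def destructive_merge(base: dict, override: dict) -> dict:
    # top-level keys in `override` replace the whole subtree in `base`
    return {**base, **override}

def recursive_merge(base: dict, override: dict) -> dict:
    # only the overridden leaves change; the other defaults are preserved
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = recursive_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"hyperparams": {"alpha": 0.1, "n_trees": 100}}
override = {"hyperparams": {"n_trees": 200}}
assert destructive_merge(base, override) == {"hyperparams": {"n_trees": 200}}
assert recursive_merge(base, override) == {"hyperparams": {"alpha": 0.1, "n_trees": 200}}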

Summary

I might not have answered any questions here or even given very specific directions on how Kedro will develop in the future, but the reason for that is that we don’t have very clear direction set yet on solving those problems. I hope that I have provided some insight into our understanding of the same problems and potentially clarifications why we haven’t solved them yet. One thing is sure though, we have this on our roadmap already and its turn is coming soon, e.g. there’s only 2 other things in front of it 🙂 Thanks for sharing your view on how we could tackle that and while we might not implement it as you have suggested, we'll definitely consider drawing some inspiration from it when we design the new solution. One particular detail that I like is getting rid of the base/ environment and making it part of the source code defaults.

@datajoely
Contributor

datajoely commented Jun 14, 2021

Hi @Galileo-Galilei - I just wanted to say this is a high priority for us and point you towards our community update later this week, sign up here. The event starts at 6:30 PM here in London - see how that works for your timezone here.

@noklam
Contributor

noklam commented Jun 22, 2021

Wow, I just discovered this thread after I started this thread in GitHub Discussions. This issue is a much more in-depth one and I agree with most of it.

I have been wanting to upgrade kedro but it is not easy, and it seems that 0.18.x will break something, so I am still waiting for it. @WaylonWalker could you give an example of how 0.17.x makes it easier?

@Galileo-Galilei
Member Author

Galileo-Galilei commented Jul 13, 2021

Hi,

thank you very much to all who went on to discuss the issue at stake here, and especially to @idanov for sharing your vision of kedro's future. This is extremely valuable to @takikadiri and me for increasing Kedro usage inside our organisation.

First of all, apologies to @datajoely: I was aware of this retrospective, but I was (un?)fortunately on vacation this week with almost no internet connection and I couldn't join it. I had a look at the slides, which are very interesting!

Here are some thoughts / answers / new questions which arise from the above conversation, in no specific order:

On 0.17.x increased modularity and flexibility

Disclaimer: I have not used the 0.17.x versions intensively, apart from a few tests. I compare the features to the 0.16.x ones hereafter.

From my personal experience, here is my list of pros and cons about the 0.17.x features:

My team does not plan to migrate its existing projects because it generates a lot of migration costs (we have dozens of legacy projects + an internal CI/CD to update) and the advantages are not yet sufficient to justify such costs.

@WaylonWalker, you claim that "0.17.x is MUCH more modular". Do you have any real-world example of something which was not straightforward with 0.16.X versions and which is now much easier?

On treating everything as a package

I perfectly agree on this point (and we do the same), but it raises two different points:

  • from a reuse perspective, I like to organize the functions of a given project so that they can be used in another one (with from project_1.nodes import function_1). The guideline in my team is also to install every kedro project as a package in editable mode (with pip install -e src) rather than using the kedro install command.
  • from a deployment perspective, we would like to deploy the project as a package to use it from the CLI (e.g. project_1 run --pipeline=training), but it is currently hard because we need to inject configuration at runtime (and this is the exact point of this issue)

On deployment/orchestration

I have seen your progress on the topic, and I acknowledge that only needing a conf/ folder instead of the entire template would be a first step in making deployment easier. I know that I've been complaining about backward compatibility a few lines above, but I follow the development very closely and I see how much effort you're putting in. Once again, thank you very much for the high quality library you're open sourcing! I understand that small changes help to ensure overall quality, and I personally feel the delivery frequency is already quite fast.

Regarding the deployment model, you are stealing my thunder: in the "Universal Kedro deployment (Part 2)", I plan to address the transition between different pipeline levels in a very similar way :) Kedro definitely needs a way to "factor and expand" the pipelines to have different view levels. This would be beneficial for a transition to another DAG tool, but also for the frontend (kedro-viz visualisation) which becomes overcrowded very quickly. That said, I would not rely on the template's structure for several reasons:

  • it would make the code hard to refactor
  • it would make it hard to reuse the same function in different nodes.

I guess a declarative API (e.g. letting Pipelines be composed of Pipelines and nodes instead of nodes only) would make it easier to use, but I have not thought enough about it. Obviously, all the implementation details you raise show that it needs to be detailed carefully and that a lot of difficulties will arise.

On configuration (back to the original topic :))

Logging

Logging is obviously environment specific, I apologize if you thought I implied the opposite. I just meant we need a default behaviour, but if I understand what you are saying, it is already the case.

Credentials

I do not understand what you mean by "[it] doesn’t seem to be such an important issue that cannot be solved by DevOps". My point is precisely that many CI/CD tools expect to communicate with the underlying application through environment variables (to my knowledge: I must confess that I am far from being a devops expert), and it is really weird to me that this is not "native" in kedro. I must switch to the TemplatedConfigLoader in deployment mode even if I use a credentials.yml file while developing, and it feels uncomfortable to have to change something for deployment (even if it is very easy to change).

Whatever the problem is, at a minimum it should be better documented than it is now, given that some beginners ask this question on various threads, with a few ugly solutions (e.g. https://discourse.kedro.community/t/load-credentials-in-docker-image-using-env-vars/480, #49). The best reference I can find is in issue #403.

Catalog

First, I agree that it is a big topic, and unlike for most others I do not have a clear vision (yet?) of how it should be refactored. Some unsorted thoughts:

  • Many users use the persistence of intermediary datasets for debugging / restarting the pipeline from an intermediary node, and they forget to clean these unnecessary entries afterwards. I understand this is convenient, but it tends to increase a lot the number of entries in the catalog. I wonder whether a command to check for unused entries in the catalog / parameters would be beneficial from a code review perspective, to avoid having a lot of "dead configuration" remaining in the projects. I may create a plugin for this one day (a rough sketch follows this list).

  • I understand that the TemplatedConfigLoader increases this complexity as you stated

That’s why the TemplatedConfigLoader was born out of user demand and not without some hesitation from our side

but in my opinion the root of all evil comes from this commit c466c8a, when the catalog.yml became "code" and no longer configuration, with the ability to dynamically create entries. I strongly advocated against it in my team, even if I understood why some users needed it.

  • In a bunch of projects I've seen, the catalog contained a lot of the code logic, especially for projects which had a lot of complex SQL queries as inputs of their pipelines. It would feel much more kedroish if we only instantiated the connection to the database at the beginning of the pipeline (say with an "engine" object) and were able to use this connection in the nodes (see discussion Persisting a database session? #813). The business logic would belong to the nodes (where it fits!), and we would avoid instantiating several connections to the same database (which could hit a rate limit and create random connection errors, e.g. if you are calling an API multiple times very fast). If I understand correctly, it is what you describe when you say

In all web frameworks, you define only one database connection and then the ORM implicitly maps the objects to that database

and I cannot agree more. However, given the "debugging" use of the catalog, I totally agree that you should support both ways (per-dataset configuration and one configuration for several datasets) of defining catalog entries.

  • Glad to see that some teams came up with the same solution as mine (e.g. factoring useful configuration in globals.yml). Defining the "right level" of factorization is indeed complicated, and it would be nice if Kedro had a native/preferred way to do this.
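A rough sketch of the "dead configuration" check mentioned above; Pipeline.data_sets() and DataCatalog.list() exist at the time of writing, but method names have shifted across kedro versions, so treat this as illustrative:

# List catalog entries that no pipeline uses as input or output.
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline

def unused_catalog_entries(catalog: DataCatalog, pipelines: dict) -> set:
    used = set()
    for pipeline in pipelines.values():
        used |= pipeline.data_sets()  # every input/output referenced by a node
    # whatever is declared in the catalog but never referenced is "dead config"
    return set(catalog.list()) - used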

Parameters

We encountered almost all the use cases described here (overriding only a nested key, providing a way to experiment for a non-technical user, packaging the parameters) in different projects. The size of the parameters.yml has become a huge problem for us: like the catalog, it often contains "dead config" people do not dare to remove and it quickly becomes overcrowded; it is hard to maintain and to read.

Being able to override a nested parameter structure with a syntax like hyperparams.params_a would indeed be user friendly.

As you suggest (and as I describe in my original post), my team uses this file to define default values, and the only really "moving" parameters are injected via the globals.yml through the TemplatedConfigLoader.

On your summary

I might not have answered any questions here or even given very specific directions on how Kedro will develop in the future, but the reason for that is that we don’t have very clear direction set yet on solving those problems. I hope that I have provided some insight into our understanding of the same problems and potentially clarifications why we haven’t solved them yet.

Sharing your vision on this is definitely valuable. I guess it will take a bunch of iterations to tackle the problem completely and reach an entirely satisfying configuration management system, but some of the ideas discussed in this thread (moving conf/base to src/, enabling non-destructive merge...) will likely improve the user experience greatly and quite easily.

One thing is sure though, we have this on our roadmap already and its turn is coming soon, e.g. there’s only 2 other things in front of it 🙂

I am aware of experiment tracking, I wonder what the other one is ;)

Thanks for sharing your view on how we could tackle that and while we might not implement it as you have suggested, we'll definitely consider drawing some inspiration from it when we design the new solution.

I only care about the implemented features, not the implementation details. The goal of this thread is more to see whether the problem was shared by other teams, and to discuss the pros and cons of the different suggestions.

One particular detail that I like is getting rid of the base/ environment and making it part of the source code defaults.

There seems to be quite a consensus in this thread that, if we want to reduce the feature request to its core component, this would be the very one thing to implement.

@astrojuanlu
Member

It's almost the 3 year anniversary of this issue 🎂🎈

I'm watching @ankatiyar's Tech Design session on kedro-airflow (kedro-org/kedro-plugins#25 (comment)), which points to kedro-org/kedro-plugins#672 as one of the issues, and it got me thinking about this.

I'd like to know what folks think about it in the current state of things. I don't want to drop a wall of text so here's my best attempt at summarising my thoughts:

  • Applicative vs External configuration: There's a common misconception (another user mentioned this today) that YAML files should all go to conf/. I think it's perfectly fine to ship YAML files that are tied to the business logic with the code itself. conf/ should be for, well, configuration.
  • The Catalog: The interesting thing about the Catalog is that, for every dataset, there's a part that's intimately tied to the code, and another one that is purely parametrized:
ds:
    type: spark.SparkDataset  # Your code will break if you change this to pandas.CSVDataset
    filepath: ...  # Your code is completely independent from where the data lives, the dataset takes care of it

and in fact @Galileo-Galilei hinted at that when he wrote this proposal:

# src/applicative-conf/catalog.yml

my_input_dataset:
  type: pandas.CSVDataSet
  filepath: ${INPUT_PATH}
  • Parameters: I'm not so sure it's something that the users shouldn't touch:

on the one hand, the catalog.yml and the parameters.yml are project specific (they contain the business logic) and we do not expect our users to modify them, except maybe some very small and specific parts that the dev must choose and control.

It's fuzzy because during development users should be able to freely explore with different configurations for these (see also #1606) but then during production these parameters become "fossilized" and tied to the business logic.


With the experience we've gained in the past 3 years, the improvements in Kedro (namespace pipelines became a reality, OmegaConfigLoader replaced the old TemplatedConfigLoader) and the direction we have (credentials as a resolver, less coupling with logging), what's your fresh view on what you're missing from Kedro in this area? What prevents users from shipping YAML files with their code in the way they see fit?

Tagging @lrodriguezlujan and @inigohidalgo because we've spoken about these recently as well.

@Galileo-Galilei
Member Author

Galileo-Galilei commented Jun 15, 2024

Hi, this is a very valid question that needs to be answered. We've accomplished a lot, and this needs to be reassessed. I have created a demo repository to implement what is suggested above and to evaluate how easy it is to configure with recent versions, and what still needs to be improved. I'll report my conclusions here when I am ready.

@astrojuanlu
Member

Hi @Galileo-Galilei, I notice that you added some notes in your demo repository.

We are trying to use Discussions for feature requests & enhancement proposals #3767 and doing an issue cleanup in the meantime. Since this issue is long and complex, and is the first in your 4-part series, would you be okay writing here your thoughts on where we currently stand, so that we can either move this whole issue to a Discussion or just close it and open follow-up, more focused Discussions?

@Galileo-Galilei
Member Author

Galileo-Galilei commented Nov 13, 2024

The above issue suggests a specific workflow and a lot of modifications to the framework. I'll try to sum up the state of the different features requested above, and start by putting some of them aside because they will be tackled in other issues. I will then focus on the core request above, that is, integrating part of the configuration into the src folder of the template.

Exposing credentials through the CLI

This will likely be addressed with #4320 and has been (and will be) largely discussed there.

The only question left that I personally don't understand is the design choice of not registering oc.env by default (while it is registered for credentials). This seems a bit contradictory with the twelve-factor app, which suggests all the conf should be in env variables. Since the syntax is very specific, there is no way someone uses it "unwillingly", and I think we could save ourselves some code and documentation maintenance spent explaining this specificity, which has no added value for users. I'd love us to be perfectly consistent with OmegaConf. I think we should decide whether we want to do it or not before 0.20. One can refer to #1909 #2407 #2623 for the rationale, but I cannot find an open issue to rediscuss this.

Exposing configuration with runtime_params

Situation in kedro 0.19

The above syntax works almost "out of the box", since OmegaConfigLoader is the default and runtime_params is a native resolver. You can do kedro run --params=param_key1=value1,param_key2=2.0 if you have specified ${runtime_params:param_key1} in your configuration file.
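For reference, a minimal self-contained sketch of the mechanism; the key and conf layout are illustrative:

# conf/base/parameters.yml would contain, e.g.:
#   model_options:
#     n_trees: ${runtime_params:n_trees}
# and `kedro run --params=n_trees=200` fills it at runtime. Programmatically:
from kedro.config import OmegaConfigLoader

loader = OmegaConfigLoader(
    conf_source="conf",
    base_env="base",
    default_run_env="local",
    runtime_params={"n_trees": 200},  # what the CLI collects from --params
)
print(loader["parameters"]["model_options"]["n_trees"])  # -> 200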

Unsolved issues

However, I still find the dev experience not as good as it could be on this topic and I have a bunch of other feature requests specifically around it, but it is worth splitting them into a separate discussion and addressing each of them specifically:

  • I am frustrated that it does not work with the globals.yml file. I think we've been overengineering it a bit by trying to prevent some specific behaviour which does not look very problematic in the first place, and that a bunch of users have asked for. I think we may even merge the globals and runtime_params resolvers, but this is another discussion.
  • as a consequence of the previous point, documentation is hard to find, and may be confusing between the 2 resolvers (e.g. one question we had today on slack: https://kedro.hall.community/using-a-parameter-in-the-catalogyml-with-omegaconfigloader-L2DR47nZMxsV)
  • There is no "consistency" check that the params passed at runtime are properly registered in a resolver for the specific pipeline we are trying to run. This leads to several issues:
    • there is no logging warning if one passes an invalid param (which does not exist in the configuration for the pipeline we are running)
    • there is no logging info to indicate that the param is overridden when it works
    • no CLI command to see which runtime_params are available for the pipeline we are running. I'd love to be able to do kedro config list_runtime_params --pipeline <pipeline>, similar to what is suggested in the original issue (with the modern syntax)
    • It is not really possible to create this CLI command locally or in a plugin, because it is actually very hard to find the runtime params exposed in the configuration of a specific pipeline, see Spike: Enable access to OmegaConfigLoader DictConfig instead of dict only #2973

Modifying the template to separate applicative and external configuration

Situation in kedro 0.19

The "official" way does not work...

According to the official documentation from the configuration page (but it requires a good understanding of kedro because all these settings are scattered across the page), we can manually update the template as follows:

  1. Create src/conf_app folder
  2. Copy the content of conf/base in src/conf_app
  3. Delete conf/base folder
  4. Update settings.py as follows:
# default conf is at the root
CONF_SOURCE = "."

# Keyword arguments to pass to the `CONFIG_LOADER_CLASS` constructor.
CONFIG_LOADER_ARGS = {
    "base_env": "src/conf_app",
    "default_run_env": "conf/local",
}

✅ This kind of works: when you execute kedro run -e conf/my_env locally, the configuration is properly retrieved from src/conf_app first, and then overridden by the one in conf/my_env
❌ Bad news are:

  • (nitpick) You need to type kedro run -e conf/my_env instead of kedro run -e my_env, which is a bit confusing with regard to the doc
  • (major) The main flaw is that we cannot merge 2 configurations which are not inside the same folder (here we assume that they are subfolders of the root folder). This prevents deploying configuration by passing conf_source through the CLI with kedro run --conf-source=<path-to-new-conf-folder>, because the python code (hence src/conf_app) and the other conf folder live in two different directories. This would really be a blocker for deployment, except if the only changes you make come from runtime_params.

...but there is an unexpected workaround

If you change settings.py in step 4 above to:

# settings.py
from pathlib import Path

CONFIG_LOADER_ARGS = {
    "base_env": (Path(__file__).parents[1] / "conf_app").as_posix(),
    "default_run_env": "local",
}

✅ It does work as expected: locally, conf_source is looking in the conf folder, so kedro run -e my_env works. More importantly, if we pass an absolute path through the CLI for --conf-source, it will continue working.

❌ Bad news is that

Path("conf") / r"C:\Users\...\spaceflights-pandas\src\conf_app"

returns

WindowsPath('C:/Users/.../spaceflights-pandas/src/conf_app')

when base_env is absolute, which is very confusing to me in terms of how pathlib resolves a "relative / absolute" path 🤯
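This is documented pathlib behaviour rather than a kedro quirk: joining with an absolute path discards the left operand entirely, so an absolute base_env silently wins over the conf_source prefix. A small reproducible check (the user name is a placeholder):

# PureWindowsPath makes the Windows behaviour reproducible on any OS.
from pathlib import PureWindowsPath

joined = PureWindowsPath("conf") / r"C:\Users\me\spaceflights-pandas\src\conf_app"
assert joined == PureWindowsPath(r"C:\Users\me\spaceflights-pandas\src\conf_app")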

Important decisions to make before we can consider this feature request properly addressed

  • Decision 1: Since the workaround is likely a happy accident, it is not tested and can break in the future. We should cover it with proper tests if the functionality is officially supported.
  • Decision 2: the original core proposal is to make this behaviour the default one, by changing the template. AFAIK, this has not been discussed officially, but team members likely have opinions about it. Even if we decide not to enforce this by making it the default, I think we should make sure that it will always be possible, and document it accordingly.

@datajoely
Contributor

There are some absolute bits of 🏅 gold dust in this write up @Galileo-Galilei 💪

The bits that stick out to me:

I think we may even merge globals and runtime_params params resolver, but this is another discussion.

This is so on the money and I've not really thought about it before. Globals has always been a compromise, we've resisted doing it at every stage so we're left with a situation that evolved and was never holistically designed.

no CLI command to see which runtime_params are available for the pipeline we are running. I'd love to be able to do kedro config list_runtime_params --pipeline <pipeline>, similar to what is suggested in the original issue (with the modern syntax)

This is also very important, and we can do some small tweaks to massively improve the developer / consumer experience. This point also touches on the wider question of what data contracts kedro implicitly expects, given that we don't have any proper validation (pydantic, type hints etc).

@astrojuanlu
Member

An app's config is everything that is likely to vary between deploys (staging, production, developer environments, etc).

https://github.com/twelve-factor/twelve-factor/blob/a167d735acc3837d53b70626ba00d476d4ebe555/content/config.md?plain=1#L1-L4

@takikadiri

takikadiri commented Nov 15, 2024

Kedro and the twelve-factor methodology have different interpretations of configuration semantics. I believe this disparity in meaning is the primary obstacle to implementing this feature.

I think that only the user can define which parts of the project vary between deploys, depending on their context and needs. Those varying parts could be declared as runtime_params (actual runtime_params + globals); everything else belongs to the source code. Labeling all confs as environment/external confs is too opinionated. The only part of the project that Kedro could safely label as environment/external conf is the credentials, as we are sure that they are not part of the source code.

Here are some benefits that could be enabled by this feature:

  • Enhance the runnability of the kedro project/package, as kedro run would either just run without tedious setup (setting up a conf folder) or would ask for some credentials if needed. This would also enable simpler getting started documentation for kedro package deployment

  • Reduce context switching between the src folder and the conf folder, as users would be mostly focused on the src folder

  • Eliminate the confusion regarding the level of hierarchy between the base env and the other envs that live in the same conf folder. Users would grasp this hierarchy in a more straightforward way if it were between conf/ and src/xxx/conf

  • Guarantee that the code needed to run the project is in the git repo: some users change some conf inside their conf/local, which gets ignored by git and subsequently not pushed to the git repo, leading to some friction in collaboration

  • Have a clean interface between data scientists and devops/mlops: data scientists care about the business logic (src/package) and devops care about the app contract/interface (conf/)

  • Allow fine-grained control over release management: in some contexts a change in a param could mean a bump to the code version, and in other contexts a param is meant to be exposed as external conf to the env that runs the app
