YAML-based configuration pipeline - Good or Bad #1963
Labels: Stage: Technical Design 🎨 (needs technical design before implementation), Type: Technical DR 💾 (decision record)
This will be a step closer to answering the question. We probably don't have a clear answer yet, but let's try our best to address all these questions and document the current state of the issue.
Goal
To understand the "why" and "what" of what users are doing with YAML-based pipelines. Ultimately, does the Kedro team have a clear stance on this? What are the workarounds and suggestions?
Context
This is a long-debated topic, inspired by a recent internal discussion. It is also related to dynamic pipelines and more advanced `ConfigLoader` usage.

In native Kedro, we mainly have two YAML files, `parameters.yml` and `catalog.yml`; some users create an extra `pipelines.yml` to create pipelines at run time. (We also have a `config.yml` if you want to override parameters at run time with `kedro run --config`, but it's not very common.) It's also important to think about the issue along different dimensions; the conclusion can be different depending on these factors.
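For orientation, a minimal illustration of what these two native files typically contain (entries are illustrative, not taken from the discussion):

```yaml
# conf/base/catalog.yml (illustrative)
companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv

# conf/base/parameters.yml (illustrative)
model_options:
  test_size: 0.2
  random_state: 3
```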
Some quotes from the discussion (slightly modified)
Use Cases
Advanced Config Loader - Class injection from YAML
Without Kedro's parameters, code may look like this.
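A minimal sketch of the pattern (hypothetical node, configuration hard-coded in Python):

```python
# Hypothetical node with its configuration hard-coded in Python.
from sklearn.ensemble import RandomForestClassifier


def train_model(X, y):
    # Changing the model or its settings means editing the code itself.
    model = RandomForestClassifier(n_estimators=100, max_depth=5)
    return model.fit(X, y)
```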
With `parameters.yml`, code is more readable and configurable: parameters are now an argument of a node/function. However, it's not perfect; some boilerplate code is still needed for dependency injection, as sketched below.
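For example, a sketch of the same node driven by `parameters.yml`, using `load_obj` for the dependency-injection boilerplate (parameter names are illustrative):

```python
# parameters.yml (illustrative):
# model:
#   class: sklearn.ensemble.RandomForestClassifier
#   kwargs:
#     n_estimators: 100
#     max_depth: 5
from kedro.utils import load_obj


def train_model(X, y, model_params: dict):
    # Boilerplate dependency injection: resolve the class path declared in YAML.
    model_class = load_obj(model_params["class"])
    model = model_class(**model_params.get("kwargs", {}))
    return model.fit(X, y)
```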
With libraries like `hydra`, some may define the class directly in YAML, so `load_obj` is no longer needed. This is also a cleaner function, and easier to reuse across different projects (see the sketch below).
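A sketch of the hydra-style version, where the object is declared in YAML and instantiated before it reaches the node (assuming hydra's `instantiate`; names are illustrative):

```python
# parameters.yml (illustrative, hydra-style):
# model:
#   _target_: sklearn.ensemble.RandomForestClassifier
#   n_estimators: 100
#   max_depth: 5
from hydra.utils import instantiate


def train_model(X, y, model):
    # The node receives a ready-made object and does no injection itself.
    return model.fit(X, y)


def inject_model(params: dict):
    # Wiring step (e.g. inside a custom ConfigLoader or a hook): turn the YAML
    # declaration into a real Python object before it reaches the node.
    return instantiate(params["model"])
```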
The YAML is now more than constants; it now contains actual Python logic. There are two sides to this.

Support:
- `parameters.yml` - more reusable code, simplified nodes (requires a specific `ConfigLoader` class and a modification to `hooks.py` at the moment)
- The `OmegaConf` change should make dependency injection easier

Against:
- `parameters.yml` - over-parameterization -> trying to write Python code in YAML

It seems to be very polarized: some users really like to use YAML for more advanced logic, but user research from about a year ago suggests something different.
Dynamic pipeline - a for loop overriding a subset of parameters
In the `0.16.x` series, it's possible to read parameters to create the pipeline. These are essentially "pipeline parameters", which are different from the parameters that get passed into a `node`. The architecture diagram clearly shows that the pipeline is created before the session gets created, which means reading from the config loader is not possible. A sketch of the pattern is below.
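A minimal sketch of the for-loop pattern, assuming a hypothetical `pipelines.yml` holding a `dates` list that is read directly (outside the `ConfigLoader`) when the pipeline is created:

```python
# pipeline_registry.py / pipelines.py (sketch): one node per date read from YAML.
# The YAML is read at pipeline-creation time, i.e. before the session and
# ConfigLoader exist, which is exactly the limitation described above.
import yaml

from kedro.pipeline import Pipeline, node


def process_partition(raw_data, partition_params):
    ...  # placeholder node function


def create_pipeline(**kwargs) -> Pipeline:
    with open("conf/base/pipelines.yml") as f:  # hypothetical file with a `dates` list
        dates = yaml.safe_load(f)["dates"]

    # One node per date, each overriding a subset of parameters.
    nodes = [
        node(
            process_partition,
            inputs=["raw_data", f"params:partition_{date}"],
            outputs=f"features_{date}",
            name=f"process_{date}",
        )
        for date in dates
    ]
    return Pipeline(nodes)
```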
Alternatives could be: `kedro run --params=specific_date`, and just make it a bash script or something to run it N times.

Relevant issues:
Experimentation turning nodes on & off - dynamic number of features
Turn certain nodes on & off for feature experimentation.
This is essentially a parallel pipeline. Imagine a pipeline with three feature-generation nodes "A", "B", "C" and an aggregation node "D". A user may want to skip one of the nodes, say "A", but it's not possible to do that from `parameters.yml`, and the user will also have to change the definition of "D". As a workaround, users create a dynamic pipeline via YAML. Basically, in your `pipelines.py` you need to make changes like the sketch below.
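A minimal sketch of such a change, assuming a hypothetical `enabled_features` list in `parameters.yml` that is read when the pipeline is built:

```python
# pipelines.py (sketch): build the feature nodes "A", "B", "C" and the
# aggregation node "D" dynamically from a YAML list of enabled features.
import yaml

from kedro.pipeline import Pipeline, node


def make_feature(raw_data):
    ...  # placeholder feature-generation function


def aggregate(*feature_tables):
    ...  # placeholder aggregation function ("D")


def create_pipeline(**kwargs) -> Pipeline:
    with open("conf/base/parameters.yml") as f:  # hypothetical `enabled_features` key
        enabled = yaml.safe_load(f)["enabled_features"]  # e.g. ["A", "C"]

    feature_nodes = [
        node(
            make_feature,
            inputs="raw_data",
            outputs=f"feature_{name}",
            name=f"make_{name}",
        )
        for name in enabled
    ]
    # "D" must be rebuilt too, so that its inputs match whichever features are enabled.
    agg_node = node(
        aggregate,
        inputs=[f"feature_{name}" for name in enabled],
        outputs="all_features",
        name="aggregate_features",
    )
    return Pipeline(feature_nodes + [agg_node])
```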
Merging dynamic numbers of datasets at run-time