
Parameter variations during single kedro execution instance (multiple Pipeline executions with different parameters) #282

Closed
Mar1cX opened this issue Mar 11, 2020 · 4 comments

Comments


Mar1cX commented Mar 11, 2020

What are you trying to do?

Hello. I'll start by saying that I'm still new to Kedro and haven't explored every part of it yet, but I have a grasp of how things broadly work. While developing my pipeline architecture, it seems there is no clear way of defining parameter variations. What I mean by that is that in the parameters.yml file I would want to define:

parameter1: [1, 2]
parameter2: [['a', 'aa'], ['b', 'bb']]

After that, when executing kedro run, the pipeline would be executed four times, with these parameter combinations:

parameter1: 1
parameter2: ['a', 'aa']

parameter1: 1
parameter2: ['b', 'bb']

parameter1: 2
parameter2: ['a', 'aa']

parameter1: 2
parameter2: ['b', 'bb']

This kind of requirement comes up when you try to simulate data manipulation with different parameters across multiple pipeline executions, similar to the hyperparameter search that tools like scikit-learn's GridSearchCV perform. I can also see small limitations with using GridSearchCV for this, for example. I might have missed a possible solution while exploring the documentation.
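To make the request concrete, here is a minimal sketch in plain Python (not a Kedro feature) of the cartesian-product expansion described above, using itertools.product:

```python
import itertools

# The parameters.yml values from the question, as plain Python objects.
parameter1 = [1, 2]
parameter2 = [['a', 'aa'], ['b', 'bb']]

# Each element of `runs` is one (parameter1, parameter2) combination;
# the request is for Kedro to trigger one full pipeline run per element.
runs = list(itertools.product(parameter1, parameter2))
for p1, p2 in runs:
    print(f"parameter1: {p1}, parameter2: {p2}")
```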

Thanks in advance for an answer.

@Mar1cX Mar1cX changed the title <Question> Parameter variations during single kedro execution instance (multiple Pipeline executions with different parameters) Mar 11, 2020

lorenabalan commented Mar 12, 2020

Hi @Mar1cX, you're right that one set of static configuration corresponds to a single run. If there were magic happening, like 4 different runs from one config, it would be hard to trace back a failure: which exact combination of parameters failed?
Instead you can define multiple environments, in which you overwrite the values of parameter1 and parameter2 (as per https://kedro.readthedocs.io/en/latest/04_user_guide/03_configuration.html#additional-configuration-environments).

combo1.yml

parameter1: 1
parameter2: ['a', 'aa']

etc.
and run kedro run --env combo1, kedro run --env combo2, and so on, to trigger the 4 runs.

I'm assuming you'd like to run the entire pipeline that many times, not just the node that uses those parameters. If it's the latter, then the problem becomes simpler.
There's also this issue with an example of hyperparameter tuning if that's at all helpful.
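The per-environment approach above could also be scripted; a hypothetical sketch, assuming the combo1 to combo4 environments from this thread exist (the `runner` parameter is only there to make the helper easy to exercise without an actual project):

```python
import subprocess

ENVS = ["combo1", "combo2", "combo3", "combo4"]

def run_all(envs, runner=subprocess.run):
    """Trigger one full `kedro run` per configuration environment."""
    for env in envs:
        # Each call is an independent Kedro run, so a failure is
        # traceable to exactly one parameter combination.
        runner(["kedro", "run", "--env", env], check=True)
```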


idanov commented Mar 12, 2020

Hi @Mar1cX, welcome to Kedro's community and thank you for your suggestion!
Kedro's main purpose is to help data scientists and data engineers define their high-level pipeline, regardless of what underlying libraries they use. For the particular use case you mention, we'd recommend using scikit-learn to do the hyperparameter search, or adding a for loop in your nodes that iterates through all the combinations you need. Users can still provide all possible configurations in parameters.yml, but they would need to set up their node to accept lists of options, rather than single instances of the parameters.

To extend your example, a node in your case needs to be defined as follows:

node(func1, ['parameter1', 'parameter2'], ...)

And you would prefer Kedro to automatically detect that the parameters are lists, so that your function could look like this:

def func1(parameter1, parameter2):
    print(parameter1, parameter2)  # do whatever you need with the parameters here
    # this function would be called 4 times, once per element of the
    # cartesian product of the parameters

Currently in Kedro you can achieve what you need by adding two lines to your function:

import itertools

def func1(parameter1, parameter2):
    for p1, p2 in itertools.product(parameter1, parameter2):
        print(p1, p2)  # do whatever you need with the parameters here

Kedro has no semantic understanding of the meaning of your parameters, and there are benefits in keeping it that way: it allows you to have nodes of all kinds, including nodes which expect parameters that are lists, dictionaries, or just numbers. Doing otherwise might lead to totally unexpected behaviour. For example, how would Kedro distinguish the use case illustrated above from the following one:

def func1(parameter1, parameter2):
    # use parameter1 for something here
    print(parameter1)
    # use parameter2 for something else here
    print(parameter2)

Both functions take lists as parameters, but one of them needs to iterate through the cartesian product of the parameters, while the other wants to use each parameter as a list separately. Kedro should not make any assumptions about how users need to use the parameters, and that's why we prefer to keep feeding parameters separately from the control flow of the pipeline.


Mar1cX commented Mar 20, 2020

@lorenabalan and @idanov Thank you both for those suggestions. They work in different scenarios, and both are worth keeping in mind for my future use of Kedro.

@Mar1cX Mar1cX closed this as completed Mar 20, 2020
nblumoe commented Jan 27, 2021

@idanov @lorenabalan thanks for your explanations above and the linked example.
With that, it is clear to me how to do hyperparameter tuning with scikit-learn within a single Kedro node.

Do you have an idea for how to do hyperparameter tuning across nodes? In scikit-learn you can build pipelines that cover not just the model training, but also data preparation, scaling, etc. One might want to include those steps in a hyperparameter search.
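For context, a sketch of the scikit-learn pattern being referred to, assuming scikit-learn is available: scaling and model training live in one sklearn Pipeline, so GridSearchCV can tune parameters of both steps at once. The node name and parameter grid here are hypothetical; note this approach collapses the steps into a single Kedro node, which is exactly the limitation being raised.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def train_node(X, y, param_grid):
    """Hypothetical Kedro node wrapping the whole sklearn pipeline."""
    pipe = Pipeline([("scale", StandardScaler()),
                     ("clf", LogisticRegression())])
    # GridSearchCV searches over parameters of *both* pipeline steps,
    # addressed via the step-name prefix (e.g. "clf__C").
    search = GridSearchCV(pipe, param_grid, cv=3)
    search.fit(X, y)
    return search.best_params_

X, y = make_classification(n_samples=60, random_state=0)
best = train_node(X, y, {"scale__with_mean": [True, False],
                         "clf__C": [0.1, 1.0]})
```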

Naturally, I would express the different steps (e.g. scaling and model training) as dedicated Kedro nodes. But this means I can no longer use scikit-learn's hyperparameter search.

Any suggestions for this scenario? It seems to me that adding a way to do hyperparameter search across multiple Kedro nodes could be very valuable.
