Parameter variations during single kedro execution instance (multiple Pipeline executions with different parameters) #282
Hi @Mar1cX, you're right that one set of static configuration corresponds to a single run. If there's magic happening, like 4 different runs with one config, it'd be hard to trace back in case of failure: which exact combination of parameters failed? Keeping one config per combination (combo1.yml etc.) keeps that traceable. I'm assuming you'd like to run the entire pipeline that many times, not just the node that uses those parameters. If it's the latter, the problem becomes simpler.
Hi @Mar1cX, welcome to Kedro's community and thank you for your suggestion! To extend your example, a node in your case needs to be defined as follows:

```python
node(func1, ['parameter1', 'parameter2'], ...)
```

And you would prefer for Kedro to automatically detect that the parameters are lists, so that your function could look like this:

```python
def func1(parameter1, parameter2):
    print(parameter1, parameter2)  # do whatever you need with the parameters here
    # this function gets called 4 times, once per element of the
    # cartesian product of the two parameter lists
```

Currently in Kedro you can achieve what you need by adding two lines to your function:

```python
import itertools

def func1(parameter1, parameter2):
    for p1, p2 in itertools.product(parameter1, parameter2):
        print(p1, p2)  # do whatever you need with the parameters here
```

Kedro has no semantic understanding of the meaning of your parameters, and there are more benefits in keeping it that way: it allows you to have nodes of all kinds, including nodes that expect parameters which are lists, dictionaries or just numbers. Doing otherwise might lead to totally unexpected behaviour. For example, how would Kedro distinguish the illustrated use case from the following one:

```python
def func1(parameter1, parameter2):
    # use parameter1 for something here
    print(parameter1)
    # use parameter2 for something else here
    print(parameter2)
```

Both functions take lists as parameters, but one of them needs to iterate through the cartesian product of the parameters, whereas the other one uses each parameter as a list separately. Kedro should not make any assumptions about how users need to use the parameters, which is why we prefer to keep feeding parameters separately from the control flow of the pipeline.
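To make the suggestion above concrete, here is a self-contained version of the itertools approach (the parameter values are made up for illustration); the loop body runs once per combination, so two two-value lists yield four iterations in a single `kedro run`:

```python
import itertools

def func1(parameter1, parameter2):
    combos = []
    for p1, p2 in itertools.product(parameter1, parameter2):
        combos.append((p1, p2))  # do whatever you need with each combination
    return combos

# Two values per parameter -> 2 x 2 = 4 combinations, in a single run.
result = func1([1, 2], ["a", "b"])
print(result)  # [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]
```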
@lorenabalan and @idanov Thank you both for those suggestions. They address different scenarios, and both are worth keeping in mind; I will be able to use them in my further work with Kedro.
@idanov @lorenabalan thanks for your explanations above and the linked example. Do you have an idea for how to do hyperparameter tuning across nodes? In scikit-learn you can build pipelines that cover not just the model training, but also data preparation, scaling etc. One might want to include those steps in a hyperparameter search. Naturally, I would express the different steps (e.g. scaling and model training) as dedicated Kedro nodes, but this means I cannot use scikit-learn's hyperparameter search anymore. Any suggestions for this scenario? It seems to me that adding a way to do hyperparameter search across multiple Kedro nodes could be very valuable.
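The thread leaves this question open, but one pragmatic workaround is to drive the search from a single node that composes the steps itself and iterates over the cartesian product of hyperparameters. The sketch below makes several assumptions: the step functions (`scale`, `train`), the toy scoring, and the parameter names are all made up, and this is not a Kedro API.

```python
import itertools

# Hypothetical stand-ins for two pipeline steps (e.g. scaling, then training).
def scale(data, factor):
    return [x * factor for x in data]

def train(scaled, offset):
    # Toy "loss": lower is better. A real node would fit a model here.
    return sum(abs(x - offset) for x in scaled)

def grid_search(data, factors, offsets):
    """Run the two-step pipeline once per hyperparameter combination."""
    best_score, best_params = None, None
    for factor, offset in itertools.product(factors, offsets):
        score = train(scale(data, factor), offset)
        if best_score is None or score < best_score:
            best_score, best_params = score, {"factor": factor, "offset": offset}
    return best_score, best_params

score, params = grid_search([1.0, 2.0, 3.0], factors=[0.5, 1.0], offsets=[0.0, 1.0])
print(score, params)  # 1.0 {'factor': 0.5, 'offset': 1.0}
```

The trade-off is that the search loop lives inside one node, so Kedro sees a single opaque step rather than the individual stages; the benefit is that any scikit-learn search utility could be dropped in at that point.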
What are you trying to do?
Hello. I will start by saying that I'm still new to Kedro and I haven't explored every part of it yet, but I have a grasp of how things generally work. While developing my pipeline architecture, it looks like there is no clear way to define parameter variations. What I mean is that, in the parameters.yml file, I would want to define:
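The original example block did not survive here; a hypothetical parameters.yml illustrating the idea (the names and values below are made up, matching the four-combination scenario discussed in the comments) could be:

```yaml
# parameters.yml - each parameter lists the values to try
parameter1: [1, 2]
parameter2: [10, 20]
```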
After that, when executing
kedro run
I would have the pipeline executed four times, once per parameter variation. This kind of requirement comes up when you want to simulate data manipulation with different parameters across multiple pipeline executions, similar to the hyperparameter search that tools like sklearn's GridSearch perform. I can also see some limitations with using GridSearch directly, for example. I might have missed possible solutions while exploring the documentation.
Thanks in advance for an answer.