Pipeline Complexity over time - How to approach a solution #783
Comments
I think one issue with solution 1 is that any time you look for nonlinearity using [...] I think one potentially expensive solution would be to look at the stability of the covariance of the predictors, assuming solution #3 is implemented, but I am not sure of the best way to fully tease this out. Or looking at rank differences of importance metrics, again to build off of #3? I'm going to spend some more time thinking on this; I have run into similar problems with TPOT building very long, complicated pipelines. These were just two potentially terrible ideas that came to mind.
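As a rough sketch of the covariance-stability idea above, one could compare the covariance matrices of the transformed features across CV folds. The choice of FastICA and the Frobenius-norm comparison below are illustrative assumptions, not anything TPOT does today:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import FastICA
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=300, n_features=20, random_state=0)

cov_matrices = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Fit the decomposition on each fold's training portion only.
    transformer = FastICA(n_components=5, random_state=0)
    Z = transformer.fit_transform(X[train_idx])
    cov_matrices.append(np.cov(Z, rowvar=False))

# Pairwise Frobenius distances between fold covariance matrices:
# large values would suggest the decomposition is unstable across folds.
dists = [np.linalg.norm(a - b) for i, a in enumerate(cov_matrices)
         for b in cov_matrices[i + 1:]]
print("mean pairwise covariance distance:", np.mean(dists))
```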
Good point on number 1. On number 2, can you please elaborate? On number 3, are you talking about the CV fit on the training set and test set, and then the second idea you mentioned? Interested in what you are thinking.
Pushing this further. Regarding complexity scoring, how would a complexity score be used for future generations' development? Currently there is the operator count, which comes out through the Pareto scores. Would something akin to an additional penalty on the loss score be a feasible way to limit the generational development of overly complex pipelines, or would this need to be approached another way? Curious to hear any thoughts.
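For reference, a minimal sketch of the "additional penalty on the loss score" idea. This is a made-up single-objective alternative (TPOT actually treats score and operator count as two separate objectives), and the `alpha` value is purely illustrative:

```python
def penalized_score(cv_score, operator_count, alpha=0.01):
    """Toy penalty: subtract a complexity term from the cross-validated
    score so larger pipelines need a bigger accuracy gain to be kept."""
    return cv_score - alpha * operator_count

# A 3-operator pipeline must beat a 1-operator pipeline by > 0.02 to win.
print(penalized_score(0.85, 1), penalized_score(0.86, 3))
```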
Sorry, I should have elaborated more from the beginning. My idea of looking at the covariance matrices would best apply to decomposition techniques that may have more than one "best" solution. Something like ICA may yield multiple best solutions. Picking this up by looking at the variance across covariance matrices would be a simple way to see the stability of the decomposition techniques. However, this would probably be better caught by the CV accuracy scores, so after spending more time thinking about it, it may not be a very good solution.

For my third idea, I thought looking at the stability of importance metrics, within the CV framework TPOT already uses for its model building, could yield some framework for assessing the stability of the model predictors on top of the prediction accuracy. I wouldn't expect many large jumps across internal CV importance metrics, but big jumps could point to some potential problems.

I think a complexity score could be used as a probability to weight a specific pipeline's offspring from spawning within the GP framework. As described in the TPOT paper, the top 10% of pipelines from generation 1 are placed into generation 2 in the presented example. We could compute a weighted average, or something of the like, between our "complexity score" and the accuracy score, thus allowing simpler pipelines, which may not perform as well, an opportunity to propagate. I am no expert in genetic programming, just a massive fan, so I am not sure if an expert like @rhiever [...] LMK what is not clear.
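A sketch of the importance-stability idea, assuming a tree-based model so `feature_importances_` is available; the Spearman rank correlation between folds is one possible way to quantify the "big jumps" mentioned above:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

importances = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    importances.append(model.feature_importances_)

# Rank correlation of importances between consecutive folds;
# values far below 1 would flag unstable predictors.
for a, b in zip(importances, importances[1:]):
    rho, _ = spearmanr(a, b)
    print(round(rho, 3))
```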
@GinoWoz1 @adrose thank you both for posting your ideas here. For the 1st idea that @GinoWoz1 mentioned, I understand that [...] But if there is one feature selection step to reduce the feature number before [...] So we are thinking about using the [...] Please let me know if you have more thoughts about this idea or other ones. Any thoughts will be helpful. Thanks again!
Thanks Weixuan. So will we need to loop through each of the elements in the stacked pipeline, or just calculate it at the end? One issue I ran into recently is that feature selection in my pipeline overfit my training set (it was SelectVariance with a threshold of 0.005 or something). How do we handle pipeline complexity where we don't necessarily have a proliferation of features, but specific parameters are memorizing our training data set?
The best way is to weight each of the elements in the stacked pipeline without refitting partial pipelines.
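One possible reading of "weight each element without refitting" is to walk the steps of an already-constructed sklearn `Pipeline` and sum per-operator weights; the weight table below is invented for illustration and is not something TPOT defines:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

# Hypothetical per-operator weights, chosen only for this example.
OPERATOR_WEIGHTS = {
    "PolynomialFeatures": 3.0,   # feature-expanding steps cost more
    "VarianceThreshold": 0.5,    # feature-reducing steps cost less
    "Ridge": 1.0,
}

pipeline = make_pipeline(VarianceThreshold(0.005), PolynomialFeatures(2), Ridge())

# No fitting needed: the score depends only on the operators present.
complexity = sum(OPERATOR_WEIGHTS.get(type(step).__name__, 1.0)
                 for _, step in pipeline.steps)
print("pipeline complexity:", complexity)
```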
Thanks. Can you help illustrate what you are suggesting with the pipeline below? This is just for my understanding. In the example below I have 178 features feeding into this pipeline raw.

exported_pipeline = make_pipeline(

How are you thinking we should calculate this? Also, what are your ideas on weighting the pipelines (or penalizing them) for the next generation? Taking the loss score and multiplying it by the "# features output from each operator / # features in input data" weight? Example: Operator 1: 179/178; 722/718, or 102% of the original features.
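A sketch of the "# features out of each operator / # features in the input data" calculation for an exported pipeline. The pipeline here is a stand-in, since the one originally pasted in this comment was truncated:

```python
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=178, random_state=0)

exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    SelectPercentile(score_func=f_regression, percentile=5),
    Ridge(),
)
exported_pipeline.fit(X, y)

# Push the data through each fitted transformer and record the output width.
n_input = X.shape[1]
Xt = X
for name, step in exported_pipeline.steps[:-1]:   # skip the final estimator
    Xt = step.transform(Xt)
    print(f"{name}: {Xt.shape[1]} features out, "
          f"{Xt.shape[1] / n_input:.0%} of the original {n_input}")
```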
I think the complexity is [...]
Awesome, thanks for the explanation. In terms of the top 10% of pipelines being passed on to the next generation, how would this penalty apply? Would you apply a multiplier to the loss score in this case, or weight it by the complexity so as to weed out those pipelines? For the number of features, how would this method maintain the depth of the search if we penalize the number of variables? Some of my best pipelines have had a Pareto length of 4 or 5. If we weight the number of predictors too heavily, then TPOT may be stuck exploring only Pareto lengths of 1 and 2. For the template idea you talked about, this would be a non-issue, since you can set the template to do 1) feature transformation, 2) feature selection, then 3) modelling, and you should, in theory, get the best models with the fewest predictors. Interested in hearing your thoughts.
TPOT will still use NSGA-II selection for 2-objective optimization (score and complexity) to sort pipelines and then pass the top 10% of pipelines to the next generation. I am not worried that the Pareto length will be too short if we penalize the # of variables. On the contrary, if a simple pipeline with fewer intermediate features after the feature selection step can get a similar accuracy score to a pipeline using all features, we are more interested in studying those intermediate features.
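For context, a toy sketch of NSGA-II selection on (score, complexity) pairs using DEAP, which TPOT builds on. The candidate values are made up, and this is not TPOT's internal code, only an illustration of two-objective selection:

```python
import random
from deap import base, creator, tools

# Maximize score, minimize complexity (negative weight).
creator.create("FitnessMulti", base.Fitness, weights=(1.0, -1.0))
creator.create("Individual", list, fitness=creator.FitnessMulti)

random.seed(0)
population = []
for _ in range(20):
    ind = creator.Individual([])
    score = random.uniform(0.7, 0.95)
    complexity = random.randint(1, 10)
    ind.fitness.values = (score, complexity)
    population.append(ind)

# Non-dominated sorting plus crowding distance, as in NSGA-II.
survivors = tools.selNSGA2(population, k=5)
for ind in survivors:
    print(ind.fitness.values)
```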
Since TPOT allows lots of freedom for the solution spaces it searches, sometimes TPOT can explore unnecessarily complex pipelines.
Possible fix
Last week I met with @weixuanfu and Trang to discuss how we could solve this problem by looking at pipeline complexity. Based on our conversation we had the ideas below; I am interested in others' ideas.
Ideas