TPOTEnsemble idea #479
I made a hacky demo of the TPOTEnsemble idea in this commit. It seemed to work fine in my tests, although it gets much, much slower as the generations pass because, e.g., by generation 100 every pipeline is being evaluated in a VotingClassifier with 99 other pipelines. The only reasonable solution seems to be to store the predictions of each "best" pipeline from every generation and manually ensemble those predictions with the new predictions from the pipelines in the current generation. Of course, there will be no way around storing the entire pipeline list in a VotingClassifier for making new predictions with the fitted TPOT object.
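A minimal sketch of the caching idea described above, assuming a held-out validation split and soft voting by averaging predicted class probabilities. The names (`cached_probas`, `ensemble_fitness`, `commit_best`) are hypothetical, not TPOT internals:

```python
# Sketch (assumptions noted above): cache each past generation's best-pipeline
# probabilities and ensemble them with a candidate's probabilities by averaging,
# instead of refitting an ever-growing VotingClassifier every evaluation.
import numpy as np
from sklearn.metrics import accuracy_score

cached_probas = []  # one (n_samples, n_classes) array per past generation's best pipeline

def ensemble_fitness(candidate_pipeline, X_train, y_train, X_val, y_val):
    """Score a candidate as if it were soft-voted with all cached best pipelines."""
    candidate_pipeline.fit(X_train, y_train)
    candidate_proba = candidate_pipeline.predict_proba(X_val)
    avg_proba = np.mean(cached_probas + [candidate_proba], axis=0)  # soft voting
    return accuracy_score(y_val, avg_proba.argmax(axis=1))          # assumes integer labels 0..k-1

def commit_best(best_pipeline, X_val):
    """After a generation, cache the winning pipeline's validation probabilities."""
    cached_probas.append(best_pipeline.predict_proba(X_val))
```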
Check this out: scikit-learn/scikit-learn#8960. In the next release, scikit-learn is probably going to get an implementation of a stacking classifier, so TPOT might be able to search stacked ensembles the same way it searches pipelines.
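For reference, scikit-learn later shipped `StackingClassifier` (in version 0.22). A minimal sketch of stacking two illustrative pipelines under a logistic-regression meta-learner; the pipeline contents are assumptions, not anything TPOT would necessarily generate:

```python
# Illustrative only: two hand-written pipelines stacked with cross-validated
# out-of-fold predictions feeding a logistic-regression meta-learner.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

stack = StackingClassifier(
    estimators=[
        ('pipe_a', make_pipeline(StandardScaler(), SVC(probability=True))),
        ('pipe_b', make_pipeline(RandomForestClassifier(n_estimators=100))),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # out-of-fold predictions train the meta-learner
)
# stack.fit(X_train, y_train); stack.predict(X_test)
```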
Awesome. I look forward to the next release, then!
An ensemble of pipelines would be a great improvement for TPOT!
@simonzcaiman, this is certainly something we should discuss now, before we move forward with an actual implementation of TPOTEnsemble. It seems like a good idea to allow different ensemble methods, but I only know of the ones in sklearn's VotingClassifier. Are there other ensemble methods (preferably with an sklearn-like interface) that we should be aware of?
Not sure if you should, but Sebastian has his own stacker here: https://rasbt.github.io/mlxtend/user_guide/regressor/StackingRegressor/
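A minimal usage sketch of the mlxtend stacker linked above; the base and meta estimators here are arbitrary choices for illustration:

```python
# Illustrative mlxtend StackingRegressor usage (estimator choices are arbitrary).
from mlxtend.regressor import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import SVR

stack = StackingRegressor(
    regressors=[LinearRegression(), Ridge(alpha=1.0)],
    meta_regressor=SVR(kernel='rbf'),
)
# stack.fit(X_train, y_train); stack.predict(X_test)
```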
Dropping an idea here while it's on my mind: Maybe the original approach to TPOTEnsemble is not good because it requires too many expensive evaluations every generation. Perhaps a better approach would be similar to what @lacava does in FEW: at generation 0, evaluate every pipeline in the population, stack their outputs, fit a regularized linear model on the stacked outputs, and use each pipeline's coefficient as its fitness.
After the first generation, all pipelines with a coefficient of 0 will be removed from the TPOT ensemble. At generation 1 (and beyond), all pipelines in the new population will be added to the TPOT ensemble along with the surviving pipelines currently in the TPOT ensemble. Stack all of the outputs, fit a regularized linear model, and again use the coefficients as the fitness. Maybe something we can collaborate on, @lacava?
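A rough sketch of this FEW-style fitness idea, framed for regression for simplicity: stack each pipeline's validation predictions as columns, fit an L1-regularized linear model on top, and use coefficient magnitudes as per-pipeline fitness. The function name, the use of `Lasso`, and the train/validation split are assumptions:

```python
# Sketch (assumptions noted above): zero-coefficient pipelines would be culled
# from the ensemble after fitness assignment.
import numpy as np
from sklearn.linear_model import Lasso

def coefficient_fitness(pipelines, X_train, y_train, X_val, y_val, alpha=0.01):
    """Return one fitness value per pipeline: |coef| of a lasso fit on stacked outputs."""
    preds = []
    for pipe in pipelines:
        pipe.fit(X_train, y_train)
        preds.append(pipe.predict(X_val))
    Z = np.column_stack(preds)                # (n_samples, n_pipelines)
    meta = Lasso(alpha=alpha).fit(Z, y_val)   # regularized linear model on the stack
    return np.abs(meta.coef_)                 # fitness per pipeline; zeros get removed
```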
@rhiever Sounds like a good idea. You could use it with any method that admits some kind of feature score, e.g., lasso, random forests, etc., and perhaps even with stacking, if stacking can be made to score the models it uses in its ensemble.
Another strategy would be to use a random forest and use its importance weights as the fitness.
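A variant of the sketch above along those lines: score the stacked prediction matrix with a forest and use its feature importances as the per-pipeline fitness. Again, the framing and names are assumptions:

```python
# Illustrative variant: `Z` is the (n_samples, n_pipelines) stacked-prediction
# matrix built as in the previous sketch.
from sklearn.ensemble import RandomForestRegressor

def importance_fitness(Z, y_val, n_estimators=200):
    """Return one importance value per pipeline column of Z."""
    forest = RandomForestRegressor(n_estimators=n_estimators).fit(Z, y_val)
    return forest.feature_importances_
```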
Many people have been asking for a version of TPOT that creates ensembles of pipelines, as that's what often wins Kaggle competitions etc. We've created prototypes of TPOT that ensemble the Pareto front or final population, but those prototypes didn't work so well because TPOT pipelines are optimized to perform well on a dataset by themselves. In other words, there is no pressure from TPOT to create pipelines that work well with other pipelines.
Here's my proposal for allowing TPOT to create ensembles of pipelines: What if we treated the TPOT optimization procedure as a sort of boosting procedure? It could work as follows: in the first generation, TPOT evaluates pipelines as usual and the best pipeline is added to the ensemble. In every generation after that, each candidate pipeline is evaluated inside a VotingClassifier together with the best pipelines from the previous generations, and that ensemble's score becomes the candidate's fitness. At the end of each generation, the best candidate is added to the ensemble.
That way, TPOT is directly optimizing for pipelines that ensemble well with the previously-best pipelines, and the final ensemble is composed of one pipeline from each generation. Is this idea crazy enough to work?
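A hedged sketch of how this boosting-style loop could look, with candidates scored by soft-voting their validation probabilities together with the pipelines already committed to the ensemble; all names here are illustrative, not TPOT internals:

```python
# Sketch (assumptions noted above): one pipeline is committed per generation,
# and a candidate's fitness is the score of the committed ensemble plus itself.
import numpy as np
from sklearn.metrics import accuracy_score

def run_boosted_search(generations_of_candidates, X_train, y_train, X_val, y_val):
    committed = []         # one pipeline committed per generation
    committed_probas = []  # their cached validation probabilities

    for candidates in generations_of_candidates:
        scores = []
        for pipe in candidates:
            pipe.fit(X_train, y_train)
            proba = pipe.predict_proba(X_val)
            avg = np.mean(committed_probas + [proba], axis=0)  # soft vote with the ensemble
            scores.append(accuracy_score(y_val, avg.argmax(axis=1)))
        best = int(np.argmax(scores))
        committed.append(candidates[best])
        committed_probas.append(candidates[best].predict_proba(X_val))
    return committed  # the final ensemble: one pipeline from each generation
```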