Smart seeding of TPOT populations? #59
Hey rhiever, I'm not very familiar with TPOT yet, but I really like the idea! Sorry if my suggestion isn't much help, but here it goes:
I think this idea is similar to what you were explaining. There are "obvious" ingredients to solving a classification task, such as a classifier, and TPOT should only search through solutions that actually include one. My idea is to go a bit further and provide pre-trained populations on several data sets that the user can choose from, mixing populations trained on Iris, MNIST, Olivetti, etc. to initialize the population for their own task. What do you think?
Good idea! I think that's something we could explore along with #49. We've been collecting numerous supervised classification data sets to explore TPOT performance on, and it could be interesting to see what pipeline components are shared amongst the best-performing pipelines on these data sets.
As far as I understand, this work also uses the kind of warm start I described: Their method is implemented in this library:
One way to find good pipelines faster at the beginning may be to sample sane default parameter settings with a higher probability. http://scikit-learn.org/stable/modules/grid_search.html#randomized-parameter-optimization gives a good intro here. One could use a Gaussian distribution centered on the default parameter settings.
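The idea above can be sketched as follows. This is a minimal illustration, not TPOT's actual configuration: the parameter defaults and scales below are assumptions chosen for the example.

```python
# Sketch: bias random hyperparameter sampling toward library defaults by
# drawing from a Gaussian centered on the default value.
import numpy as np

rng = np.random.default_rng(42)

def sample_near_default(default, scale, low=None, high=None):
    """Draw a value from a Gaussian centered on `default`, clipped to bounds."""
    value = rng.normal(loc=default, scale=scale)
    if low is not None:
        value = max(low, value)
    if high is not None:
        value = min(high, value)
    return value

# e.g. sample a tree-ensemble size near an assumed default of 100
n_estimators = int(round(sample_near_default(100, scale=30, low=10)))

# e.g. sample a regularization strength on a log scale around C = 1.0,
# so values like 0.5 and 2.0 are about equally likely
C = float(np.exp(sample_near_default(np.log(1.0), scale=0.5)))
```

Sampling on a log scale (as for `C` above) is a common choice for parameters whose sensible values span orders of magnitude.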
Sorry if the text below sounds like rambling -- I was using this issue to brainstorm.
I've been thinking about possible ways to make TPOT perform better right out of the box, without having to run it for several generations to finally discover the better pipelines. One of the ideas I've had is to seed the TPOT population with a smarter group of solutions.
For example, we know that a TPOT pipeline will need at least one model, so we can seed it with each of the 6 current models over a small range of parameters:
That gives us 59 "classifier-only" TPOT pipelines to start with.
We also have 4 feature selectors:
And 4 feature preprocessors:
Thus, if we wanted to provide at least one feature preprocessor or selector in the pipeline before passing the data to the model, that would result in:
feature selection combinations = 12 * 59 + 6 * 59 + 6 * 59 + 5 * 59 = 1,711
feature preprocessor combinations = 5 * 59 + 1 * 59 + 1 * 59 + 1 * 59 = 472
Giving us a total = 59 + 1,711 + 472 = 2,242 pipeline combinations to start out with.
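The arithmetic above can be reproduced directly. The per-operator parameter-setting counts (12, 6, 6, 5 for the selectors; 5, 1, 1, 1 for the preprocessors) are taken from the figures in the text:

```python
# Reproduce the seed-pipeline combination counts described above.
classifier_pipelines = 59               # "classifier-only" pipelines

selector_settings = [12, 6, 6, 5]       # parameter settings per feature selector
preprocessor_settings = [5, 1, 1, 1]    # parameter settings per feature preprocessor

selector_combos = sum(s * classifier_pipelines for s in selector_settings)
preprocessor_combos = sum(p * classifier_pipelines for p in preprocessor_settings)
total = classifier_pipelines + selector_combos + preprocessor_combos

print(selector_combos, preprocessor_combos, total)  # 1711 472 2242
```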
We'd evaluate all 2,242 of these pipelines then use the top 100 to seed the TPOT population. From there, the GP algorithm is allowed to tinker with the pipeline, fine-tune the parameters, and possibly discover better combinations of pipeline operators.
That's obviously a lot of pipelines to try out at the beginning -- about 23 generations worth of pipelines, which will be quite slow on any decently sized data set. It may be necessary to cut down on the parameters that we try out at the beginning.
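The proposed seeding step could look something like the sketch below: score every candidate pipeline with cross-validation, then keep the best 100 as the initial GP population. The candidate list here is a tiny hypothetical stand-in for the full 2,242-pipeline grid, and none of this reflects TPOT's internal API.

```python
# Sketch: evaluate candidate pipelines, keep the top scorers as GP seeds.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A small stand-in for the full pipeline grid described above:
# classifier-only pipelines plus selector/preprocessor + classifier ones.
candidates = [
    make_pipeline(DecisionTreeClassifier(max_depth=d)) for d in (2, 4, 8)
] + [
    make_pipeline(SelectKBest(k=2), LogisticRegression(max_iter=1000)),
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
]

# Rank all candidates by mean cross-validated accuracy, best first.
scored = sorted(
    candidates,
    key=lambda p: cross_val_score(p, X, y, cv=3).mean(),
    reverse=True,
)

seed_population = scored[:100]  # the top pipelines seed the GP population
```

From there, crossover and mutation would tinker with these seeds exactly as described above.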