Smart seeding of TPOT populations? #59

Open
rhiever opened this issue Dec 15, 2015 · 4 comments

@rhiever
Contributor

rhiever commented Dec 15, 2015

Sorry if the text below sounds like rambling -- I was using this issue to brainstorm.

I've been thinking about possible ways to make TPOT perform better right out of the box, without having to run it for several generations to finally discover the better pipelines. One of the ideas I've had is to seed the TPOT population with a smarter group of solutions.

For example, we know that a TPOT pipeline will need at least one model, so we can seed it with each of the 6 current models over a small range of parameters:

  • decision tree: all combinations of
    • max_features = [0 (--> auto), 1 (--> None)]
    • max_depth = [0 (--> None), 1, 5, 10, 20, 50]
    • = 12 total combinations
  • random forest: all combinations of
    • n_estimators = [100, 500]
    • max_features = [0 (--> auto), 1]
    • = 4 total combinations
  • logistic regression:
    • C = [0.01, 0.1, 0.5, 1.0, 10.0, 50.0, 100.0]
    • = 7 total combinations
  • svc:
    • C = [0.01, 0.1, 0.5, 1.0, 10.0, 50.0, 100.0]
    • = 7 total combinations
  • knnc:
    • n_neighbors = [2, 5, 10, 20, 50]
    • = 5 total combinations
  • gradient boosting: all combinations of
    • learning_rate = [0.01, 0.1, 0.5, 1.0]
    • n_estimators = [100, 500]
    • max_depth = [0 (--> None), 5, 10]
    • = 24 total combinations

That gives us 59 "classifier-only" TPOT pipelines to start with.
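As a sanity check on that count, here is a minimal sketch that expands the grids listed above with `itertools.product`. The dictionary keys and the `expand` helper are illustrative names, not TPOT identifiers, and the `0`/`1` values are kept in the encoded form used in the list rather than translated to scikit-learn arguments.

```python
from itertools import product

# Seed grids copied from the list above. Values stay in the encoded form
# used in the issue (e.g. max_depth 0 stands for None).
CLASSIFIER_GRIDS = {
    "decision_tree": {"max_features": [0, 1], "max_depth": [0, 1, 5, 10, 20, 50]},
    "random_forest": {"n_estimators": [100, 500], "max_features": [0, 1]},
    "logistic_regression": {"C": [0.01, 0.1, 0.5, 1.0, 10.0, 50.0, 100.0]},
    "svc": {"C": [0.01, 0.1, 0.5, 1.0, 10.0, 50.0, 100.0]},
    "knnc": {"n_neighbors": [2, 5, 10, 20, 50]},
    "gradient_boosting": {
        "learning_rate": [0.01, 0.1, 0.5, 1.0],
        "n_estimators": [100, 500],
        "max_depth": [0, 5, 10],
    },
}


def expand(grid):
    """Yield one {parameter: value} dict per combination in a grid."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


# One (model name, parameter dict) pair per classifier-only seed pipeline.
seed_classifiers = [
    (model, params)
    for model, grid in CLASSIFIER_GRIDS.items()
    for params in expand(grid)
]
counts = {model: sum(1 for _ in expand(grid)) for model, grid in CLASSIFIER_GRIDS.items()}

print(counts)
print(len(seed_classifiers))
```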

We also have 4 feature selectors:

  • RFE: all combinations of
    • num_features = [1, 5, 10, 50]
    • step = [0.1, 0.25, 0.5]
    • = 12 total combinations
  • select percentile:
    • percentile = [1, 5, 10, 25, 50, 75]
    • = 6 total combinations
  • select k best:
    • k = [1, 2, 5, 10, 20, 50]
    • = 6 total combinations
  • variance threshold:
    • threshold = [0.1, 0.2, 0.3, 0.4, 0.5]
    • = 5 total combinations

And 4 feature preprocessors:

  • standard scaler (no parameters)
    • = 1 total combination
  • robust scaler (no parameters)
    • = 1 total combination
  • polynomial features (no parameters)
    • = 1 total combination
  • PCA:
    • n_components = [1, 2, 4, 10, 20]
    • = 5 total combinations

Thus, if we wanted to provide at least one feature preprocessor or selector in the pipeline before passing the data to the model, that would result in:

feature selection combinations = 12 * 59 + 6 * 59 + 6 * 59 + 5 * 59 = 1,711

feature preprocessor combinations = 5 * 59 + 1 * 59 + 1 * 59 + 1 * 59 = 472

Giving us a total = 59 + 1,711 + 472 = 2,242 pipeline combinations to start out with.
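The arithmetic above can be reproduced with a short sketch. The grid dictionaries below restate the selector and preprocessor lists; their key names and the `grid_size` helper are made up for illustration. An operator with no parameters has an empty grid, which counts as exactly one configuration.

```python
from math import prod

N_CLASSIFIERS = 59  # classifier-only pipelines counted earlier

SELECTOR_GRIDS = {
    "rfe": {"num_features": [1, 5, 10, 50], "step": [0.1, 0.25, 0.5]},
    "select_percentile": {"percentile": [1, 5, 10, 25, 50, 75]},
    "select_k_best": {"k": [1, 2, 5, 10, 20, 50]},
    "variance_threshold": {"threshold": [0.1, 0.2, 0.3, 0.4, 0.5]},
}
PREPROCESSOR_GRIDS = {
    "standard_scaler": {},
    "robust_scaler": {},
    "polynomial_features": {},
    "pca": {"n_components": [1, 2, 4, 10, 20]},
}


def grid_size(grid):
    """Number of parameter combinations; an empty grid yields 1."""
    return prod(len(values) for values in grid.values())


# Each selector or preprocessor configuration is paired with every classifier.
selector_pipelines = sum(grid_size(g) for g in SELECTOR_GRIDS.values()) * N_CLASSIFIERS
preprocessor_pipelines = sum(grid_size(g) for g in PREPROCESSOR_GRIDS.values()) * N_CLASSIFIERS
total_pipelines = N_CLASSIFIERS + selector_pipelines + preprocessor_pipelines

print(selector_pipelines, preprocessor_pipelines, total_pipelines)
```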

We'd evaluate all 2,242 of these pipelines then use the top 100 to seed the TPOT population. From there, the GP algorithm is allowed to tinker with the pipeline, fine-tune the parameters, and possibly discover better combinations of pipeline operators.
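A rough sketch of that seeding step, assuming each candidate pipeline can be scored by some function. Here `score_fn` and `select_seed_population` are hypothetical names, and the toy score only stands in for a real cross-validated accuracy; none of this is TPOT's API.

```python
import heapq


def select_seed_population(candidates, score_fn, population_size=100):
    """Score every candidate once and keep the best `population_size`."""
    # The index breaks ties so the candidates themselves are never compared.
    scored = ((score_fn(c), i, c) for i, c in enumerate(candidates))
    best = heapq.nlargest(population_size, scored)
    return [candidate for _, _, candidate in best]


# Toy stand-in: 2,242 integer "pipelines" scored by closeness to 1000.
candidates = list(range(2242))
seeds = select_seed_population(candidates, lambda c: -((c - 1000) ** 2))
print(len(seeds), seeds[0])
```

The returned list is ordered best first, so it could be handed to the GP algorithm as the initial population without further sorting.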

That's obviously a lot of pipelines to try out at the beginning -- roughly 22 to 23 generations' worth of evaluations for a population of 100 (2,242 / 100 ≈ 22.4), which will be quite slow on any decently sized data set. It may be necessary to cut down on the parameters that we try out at the beginning.

@kadarakos
Contributor

Hey rhiever,

I'm not very familiar with TPOT so far, but I really like the idea! Sorry if my suggestion doesn't help much, but here it goes:

  • The TPOT package would ship with the winning populations for, e.g., the Iris and MNIST data sets.
  • A random selection from these populations with random mutations could be the starting population for the new task given by the user.

I think this would achieve something similar to what you were describing. There are "obvious" ingredients to solving a classification task, such as a classifier, and TPOT should only search through solutions that actually include one. My idea goes a bit further: provide pre-trained populations for several data sets that the user can choose from, mixing populations for Iris, MNIST, Olivetti, etc. to initialize the population for their own task. What do you think?
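A minimal sketch of that warm start, under several assumptions: pipelines are represented as plain dicts, the stored populations are made-up placeholders rather than real winning pipelines, and `warm_start_population` and `jitter_one_param` are hypothetical helpers, not part of TPOT.

```python
import copy
import random


def warm_start_population(stored_populations, size, mutate, rng):
    """Draw `size` pipelines from the pooled stored populations and mutate each."""
    pool = [p for population in stored_populations.values() for p in population]
    # Deep-copy before mutating so the stored populations stay untouched.
    return [mutate(copy.deepcopy(rng.choice(pool)), rng) for _ in range(size)]


def jitter_one_param(pipeline, rng):
    """Hypothetical mutation: rescale one numeric parameter by 0.5x, 1x, or 2x."""
    name = rng.choice(sorted(pipeline["params"]))
    pipeline["params"][name] *= rng.choice([0.5, 1.0, 2.0])
    return pipeline


# Placeholder populations for illustration only.
stored = {
    "iris": [
        {"model": "svc", "params": {"C": 1.0}},
        {"model": "logistic_regression", "params": {"C": 10.0}},
    ],
    "mnist": [
        {"model": "gradient_boosting", "params": {"learning_rate": 0.1}},
    ],
}

population = warm_start_population(stored, size=20, mutate=jitter_one_param, rng=random.Random(0))
print(len(population))
```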

@rhiever
Contributor Author

rhiever commented Dec 26, 2015

Good idea! I think that's something we could explore along with #49. We've been collecting numerous supervised classification data sets to explore TPOT performance on, and it could be interesting to see what pipeline components are shared amongst the best-performing pipelines on these data sets.

@kadarakos
Contributor

As far as I understand, this work also uses the kind of warm start I described:
https://drive.google.com/file/d/0BzRGLkqgrI-qSWJ0MXBJbmpSYmpSQlJySkt2UHQ4allueThr/view

Their method is implemented in this library:
https://github.com/automl/auto-sklearn

@saddy001

saddy001 commented Oct 14, 2017

One way to find good pipelines faster at the beginning may be to sample sane default parameter settings with a higher probability. http://scikit-learn.org/stable/modules/grid_search.html#randomized-parameter-optimization gives a good intro here. One could use a Gaussian distribution centered on the default parameter setting.
For example, if the default of the SVC parameter "degree" is 3, the TPOT default config could look like this:

'sklearn.svm.SVC': {
    # Gaussian centered on the default degree of 3 (scale defaults to 1)
    'degree': scipy.stats.norm(loc=3),
},
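To illustrate the sampling idea without depending on TPOT accepting distributions in its config (that is an assumption of the proposal, not a confirmed feature), here is a small standard-library sketch. Because `degree` must be a positive integer, the draw is rounded and clipped; `sample_int_near_default` is a hypothetical helper.

```python
import random


def sample_int_near_default(default, spread, low, rng):
    """Draw from a Gaussian centered on `default`, round to an int, clip at `low`."""
    return max(low, round(rng.gauss(default, spread)))


rng = random.Random(0)
degrees = [sample_int_near_default(default=3, spread=1.0, low=1, rng=rng) for _ in range(1000)]

# Most draws land on or next to the default; distant values remain possible but rare.
print(sorted(set(degrees)))
```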
