Smart seeding of TPOT populations? #59
Hey rhiever, I'm not very familiar with TPOT yet, but I really like the idea! Sorry if my suggestion isn't much help, but here it goes:
I think this idea is similar to what you were explaining. There are "obvious" ingredients to solving a classification task, such as a classifier, and TPOT should only search through solutions that actually include one. My idea is to go a bit further and provide pre-trained populations on several data sets that the user can choose from, mixing populations trained on Iris, MNIST, Olivetti, etc. to initialize the population for their own task. What do you think?
Good idea! I think that's something we could explore along with #49. We've been collecting numerous supervised classification data sets to explore TPOT performance on, and it could be interesting to see what pipeline components are shared amongst the best-performing pipelines on these data sets.
As far as I understand, this work also uses the kind of warm start I described: Their method is implemented in this library:
One way to find good pipelines faster at the beginning may be to sample sane default parameter settings with a higher probability. http://scikit-learn.org/stable/modules/grid_search.html#randomized-parameter-optimization gives a good intro here. One could use a Gaussian distribution centered on the default parameter settings.
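The idea above can be sketched as follows. This is a minimal illustration, not TPOT's actual configuration: the parameter defaults and scales below are assumptions chosen for the example.

```python
# Sketch: bias random hyperparameter sampling toward library defaults by
# drawing from a Gaussian centered on the default value.
import numpy as np

rng = np.random.default_rng(42)

def sample_near_default(default, scale, low=None, high=None):
    """Draw a value from a Gaussian centered on `default`, clipped to bounds."""
    value = rng.normal(loc=default, scale=scale)
    if low is not None:
        value = max(low, value)
    if high is not None:
        value = min(high, value)
    return value

# e.g. sample a tree-ensemble size near an assumed default of 100
n_estimators = int(round(sample_near_default(100, scale=30, low=10)))

# e.g. sample a regularization strength on a log scale around C = 1.0,
# so values like 0.5 and 2.0 are about equally likely
C = float(np.exp(sample_near_default(np.log(1.0), scale=0.5)))
```

Sampling on a log scale (as for `C` above) is a common choice for parameters whose sensible values span orders of magnitude.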
Sorry if the text below sounds like rambling -- I was using this issue to brainstorm.
I've been thinking about possible ways to make TPOT perform better right out of the box, without having to run it for several generations to finally discover the better pipelines. One of the ideas I've had is to seed the TPOT population with a smarter group of solutions.
For example, we know that a TPOT pipeline will need at least one model, so we can seed it with each of the 6 current models over a small range of parameters:
That gives us 59 "classifier-only" TPOT pipelines to start with.
We also have 4 feature selectors:
And 4 feature preprocessors:
Thus, if we wanted to provide at least one feature preprocessor or selector in the pipeline before passing the data to the model, that would result in:
feature selection combinations = 12 * 59 + 6 * 59 + 6 * 59 + 5 * 59 = 1,711
feature preprocessor combinations = 5 * 59 + 1 * 59 + 1 * 59 + 1 * 59 = 472
Giving us a total = 59 + 1,711 + 472 = 2,242 pipeline combinations to start out with.
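The arithmetic above can be reproduced directly. The per-operator parameter-setting counts (12, 6, 6, 5 for the selectors; 5, 1, 1, 1 for the preprocessors) are taken from the figures in the text:

```python
# Reproduce the seed-pipeline combination counts described above.
classifier_pipelines = 59               # "classifier-only" pipelines

selector_settings = [12, 6, 6, 5]       # parameter settings per feature selector
preprocessor_settings = [5, 1, 1, 1]    # parameter settings per feature preprocessor

selector_combos = sum(s * classifier_pipelines for s in selector_settings)
preprocessor_combos = sum(p * classifier_pipelines for p in preprocessor_settings)
total = classifier_pipelines + selector_combos + preprocessor_combos

print(selector_combos, preprocessor_combos, total)  # 1711 472 2242
```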
We'd evaluate all 2,242 of these pipelines then use the top 100 to seed the TPOT population. From there, the GP algorithm is allowed to tinker with the pipeline, fine-tune the parameters, and possibly discover better combinations of pipeline operators.
That's obviously a lot of pipelines to try out at the beginning -- about 23 generations worth of pipelines, which will be quite slow on any decently sized data set. It may be necessary to cut down on the parameters that we try out at the beginning.
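The proposed seeding step could look something like the sketch below: score every candidate pipeline with cross-validation, then keep the best 100 as the initial GP population. The candidate list here is a tiny hypothetical stand-in for the full 2,242-pipeline grid, and none of this reflects TPOT's internal API.

```python
# Sketch: evaluate candidate pipelines, keep the top scorers as GP seeds.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A small stand-in for the full pipeline grid described above:
# classifier-only pipelines plus selector/preprocessor + classifier ones.
candidates = [
    make_pipeline(DecisionTreeClassifier(max_depth=d)) for d in (2, 4, 8)
] + [
    make_pipeline(SelectKBest(k=2), LogisticRegression(max_iter=1000)),
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
]

# Rank all candidates by mean cross-validated accuracy, best first.
scored = sorted(
    candidates,
    key=lambda p: cross_val_score(p, X, y, cv=3).mean(),
    reverse=True,
)

seed_population = scored[:100]  # the top pipelines seed the GP population
```

From there, crossover and mutation would tinker with these seeds exactly as described above.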