
Regularization in TPOT #207

Open
rhiever opened this issue Aug 13, 2016 · 11 comments

@rhiever
Contributor

rhiever commented Aug 13, 2016

Some months ago, we added Pareto optimization to TPOT, where TPOT now maximizes the pipeline classification accuracy while minimizing the number of operators in the pipeline. We found that such an addition provided a form of regularization for TPOT: the pipelines that TPOT produced were less likely to overfit on the data set.

As I've read more about regularization, I'm starting to wonder if we should refine what we mean by "pipeline complexity" in TPOT. Although "number of operators in the pipeline" is a decent metric for pipeline complexity, maybe we should consider more traditional measures of model complexity.

The first idea that comes to mind is the number of features going into the final classifier. Such a regularization metric could encourage TPOT to compress the feature space (e.g. via PCA or feature construction), perform feature selection in the second-to-last step, and thus build less-complex models that are less prone to overfitting.
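For concreteness, here is a minimal sketch (a hypothetical helper, not TPOT's actual implementation) of how that metric could be computed for a fitted scikit-learn pipeline: transform the data through every step except the final estimator and count the resulting columns.

```python
# Hypothetical sketch (not TPOT's implementation): measure pipeline complexity
# as the number of features that reach the final classifier.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def n_features_into_final_step(pipeline, X):
    """Count the features produced by everything before the final estimator."""
    Xt = X
    for _, step in pipeline.steps[:-1]:   # all transformers, skip the classifier
        Xt = step.transform(Xt)
    return np.asarray(Xt).shape[1]


X, y = load_iris(return_X_y=True)
pipe = Pipeline([
    ("pca", PCA(n_components=3)),
    ("select", SelectKBest(f_classif, k=2)),
    ("clf", LogisticRegression()),
]).fit(X, y)

print(n_features_into_final_step(pipe, X))  # -> 2 features reach the classifier
```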

Please add additional TPOT regularization ideas to this issue.

@bartleyn
Contributor

I've been trying to read up on the topic, but I wonder if there's a good reason to explore early stopping for these GP problems. Should we always let TPOT run the specified number of generations, or is there some reliable criterion we can use to stop the EA early? Sure, we do internal cross-validation, but can we assume that it is always going to be reliable (e.g., when the number of training samples is small, or the distribution of training labels/responses doesn't reflect the general population)?
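For the sake of discussion, here is a minimal sketch of the kind of stopping rule I have in mind (purely hypothetical, not tied to TPOT's internals): stop when the best internal CV score has not improved for a fixed number of generations.

```python
# Hypothetical "no improvement for `patience` generations" stopping rule.
def should_stop_early(cv_history, patience=10, tol=1e-4):
    """cv_history: best internal CV score per generation (higher is better)."""
    if len(cv_history) <= patience:
        return False
    recent_best = max(cv_history[-patience:])
    earlier_best = max(cv_history[:-patience])
    return recent_best - earlier_best < tol


print(should_stop_early([0.90, 0.92, 0.93] + [0.93] * 12, patience=10))  # True
```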

@rhiever
Contributor Author

rhiever commented Aug 23, 2016

It's still not clear to me if it's ever a good idea to stop TPOT early. I don't think I've seen a case where TPOT's generalization accuracy went down from optimizing for another 100 generations or so.

@rhiever rhiever modified the milestone: TPOT v0.7 Aug 31, 2016
@arita37

arita37 commented Oct 16, 2016

One possibility is to investigate a large number of varied datasets, say 10,000, in a batch
and do machine learning on the results. That would give some concrete insights by dataset type.

What about using the Kaggle datasets? It may require a lot of resources for a limited amount of time,
but the training could be split among the members.

@rhiever
Contributor Author

rhiever commented Oct 20, 2016

@arita37, that sounds related to what we've been thinking about in #59. Maybe we can use a metalearning method like that for seeding the TPOT population.

@arita37

arita37 commented Oct 20, 2016

Yes, maybe. One could put the datasets into a repository along with meta-information about each dataset.
Participants could download them, run them on their own machines, and push the results back to the repository database.
That way you get distributed computation across the participants while improving the database of datasets.

After gathering enough results, one could analyze them as pairs of
(dataset characteristics, best set of algorithms (top 5)).

It would accelerate TPOT by restricting the genetic search to a smaller search space
whenever a dataset belongs to a pre-analyzed cluster.

I think datasets are fairly homogeneous when they belong to certain categories (images, web data, ...).

What do you think about having a Gitter chat for TPOT?
(This project is really interesting to me; glad to contribute.)

@MaxPowerWasTaken

Regularization parameters are already included in most scikit-learn models. So when TPOT is exploring the hyperparameter space of a classifier, it is already using regularization.

The IRIS example in the TPOT docs (http://rhiever.github.io/tpot/examples/IRIS_Example/) shows an exported pipeline using logistic regression with the parameters C=0.09 and penalty='l2'. So TPOT selected a regularized logistic regression classifier in that example.
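Roughly, the exported pipeline in that example boils down to something like the following (a simplified sketch, not the exact exported file), where C and penalty are the regularization hyperparameters TPOT tuned:

```python
# Simplified sketch of a TPOT-exported pipeline along the lines of the IRIS
# example above: a logistic regression whose C and penalty hyperparameters
# provide per-model regularization.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

exported_pipeline = LogisticRegression(C=0.09, penalty="l2")
exported_pipeline.fit(X_train, y_train)
print(exported_pipeline.score(X_test, y_test))
```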

@bartleyn
Contributor

@MaxPowerWasTaken You bring up a good point about per-model regularization, and maybe for the sake of discussion it makes sense to spell out the different kinds of regularization that we might use in TPOT (please correct me if I'm misguided here):

  • Per-model regularization. Most of the estimator operators we have in TPOT have some sort of regularization parameter that ties into the underlying scikit-learn model. These are controlled by TPOT and are not treated very differently from the other model parameters (a config sketch follows at the end of this comment).
  • Per-pipeline regularization. The creation of TPOT pipelines can be thought of as having regularization parameters as well: how complex are the pipelines, and does having simpler pipelines result in better generalization performance?
  • Genetic algorithm regularization. The DEAP GP algorithms that we're using for TPOT can probably be regularized too, to give us better-performing pipelines. This might manifest as something like early stopping, or perhaps changing the number of individuals per generation (I don't think controlling the number of individuals per generation would really help, but it's a useful example).

But to get back to it: maybe we should be thinking more about how the per-model regularization affects the other levels of complexity. Are the less complex pipelines always using well-regularized models underneath?
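To make the per-model point concrete, here is a sketch of how those regularization hyperparameters show up in TPOT's search space, assuming a TPOT version that accepts a custom config_dict (roughly 0.7 and later); the parameter lists here are illustrative, not TPOT's defaults:

```python
# Sketch of per-model regularization in TPOT's search space via a custom
# config_dict (assumed available in TPOT ~0.7+). TPOT samples the
# regularization hyperparameters (C, penalty, percentile, ...) just like any
# other hyperparameter when assembling pipelines.
from tpot import TPOTClassifier

tpot_config = {
    "sklearn.linear_model.LogisticRegression": {
        "penalty": ["l1", "l2"],
        "C": [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1.0, 5.0, 10.0, 25.0],
    },
    "sklearn.feature_selection.SelectPercentile": {
        "percentile": range(1, 100),
        "score_func": {"sklearn.feature_selection.f_classif": None},
    },
}

tpot = TPOTClassifier(generations=5, population_size=20,
                      config_dict=tpot_config, random_state=42, verbosity=2)
# tpot.fit(X_train, y_train) would then search over these regularized operators.
```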

@bartdp1
Contributor

bartdp1 commented Nov 21, 2017

A little bump here, since I want to do some research into this.
Namely, I have the idea that TPOT tends to select pipelines for which the CV error is a rather optimistic estimate of the true test error, and that this effect becomes greater as more generations of optimization are performed.

How does the Pareto optimization actually work? If I understand correctly, in the end TPOT always provides the pipeline with the best internal CV score that has been evaluated during the optimization, right? The user could choose to pick a pipeline with fewer components from the Pareto front, but they would have to do so manually, if I am correct.

@rhiever
Contributor Author

rhiever commented Nov 21, 2017

How does the Pareto optimization actually work? If I understand correctly, in the end TPOT always provides the pipeline with the best internal CV score that has been evaluated during the optimization, right? The user could choose to pick a pipeline with fewer components from the Pareto front, but they would have to do so manually, if I am correct.

That's correct. The Pareto front concept is also used to eliminate poor-performing pipelines at the end of every GP generation: all pipelines in the current generation are ranked by their "dominance," i.e., top-ranking pipelines must outperform all other pipelines on at least one of the multi-objective criteria (in our case, predictive performance or complexity). Second-rank pipelines must outperform all other pipelines except the top-ranking ones on at least one of the multi-objective criteria, and so on. You can look up the NSGA-II algorithm if you'd like to dig into the specifics.
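As a toy illustration of that dominance ranking (not TPOT's internals, just DEAP's non-dominated sort, which TPOT builds on), with made-up (operator count, CV accuracy) pairs:

```python
# Toy sketch of the dominance ranking described above, using DEAP's
# non-dominated sort. Each individual carries a two-part fitness:
# minimize operator count, maximize CV accuracy.
from deap import base, creator, tools

creator.create("FitnessMulti", base.Fitness, weights=(-1.0, 1.0))
creator.create("Individual", list, fitness=creator.FitnessMulti)

# (operator_count, cv_accuracy) for a handful of hypothetical pipelines
scores = [(1, 0.90), (2, 0.93), (3, 0.93), (2, 0.88), (5, 0.95)]
population = []
for s in scores:
    ind = creator.Individual([s])
    ind.fitness.values = s
    population.append(ind)

fronts = tools.sortNondominated(population, k=len(population))
for rank, front in enumerate(fronts, start=1):
    print(f"front {rank}:", [ind.fitness.values for ind in front])
# Front 1 holds the non-dominated pipelines; e.g. (3, 0.93) lands in front 2
# because it is dominated by (2, 0.93).
```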

We recently added an early stopping feature to TPOT, and I think that fits into GA regularization as described above. As others have pointed out, many algorithms in TPOT already have per-model regularization. I think the key is to find out what kind of per-pipeline regularization is needed to prevent TPOT from selecting pipelines with an overly optimistic estimate of generalization performance. I still think that encouraging TPOT to perform feature construction and selection is the most promising way to accomplish that goal, as one of the biggest causes of overfitting is providing too many features to ML algorithms.
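For reference, a sketch of how that early-stopping feature can be used, assuming TPOT's early_stop parameter (my reading: the number of generations allowed without improvement before the optimization halts):

```python
# Sketch of the early-stopping feature mentioned above, assuming the
# early_stop parameter added around TPOT v0.9: the run ends if the Pareto
# front shows no improvement for the given number of generations.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tpot = TPOTClassifier(generations=100, population_size=50,
                      early_stop=5,   # stop after 5 generations with no improvement
                      random_state=42, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
```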

@bartdp1
Contributor

bartdp1 commented Nov 21, 2017

Just to be sure we are discussing the same things, I think that it is good to first distinguish two types of overfitting. Apologies if the discussion is a bit extensive.

The first type is overfitting at the model level. A model which is too flexible/complex will 'attempt' to explain not only the signal in the training data but also the noise. Think of a decision tree which grows a terminal leaf for each observation: a perfect fit on the training set, but very poor generalization performance.
This type of overfitting is addressed by fitting models of different levels of complexity (often controlled by regularization parameters) and, for each of these models, calculating an estimate of the generalization performance, often by cross-validation. The model with the lowest cross-validation error estimate is then picked.

The problem here is that the cross-validation error is only an estimate of the actual generalization error of the model. By picking the model that minimizes the cross-validation error, we can no longer trust the cross-validation estimate as a good estimate of the model's actual generalization performance, since that model could just as well be the 'winner' simply because its cross-validation estimate is more optimistic than that of its competitors. This fact has been mentioned numerous times in prominent literature on the topic. This is why the final model quality should always be assessed on an external validation set (or using nested CV).
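For example, a minimal nested-CV sketch with scikit-learn (not TPOT-specific): the inner loop selects hyperparameters by CV, while the outer loop estimates the generalization performance of the whole selection procedure.

```python
# Nested cross-validation sketch: the inner GridSearchCV picks the model,
# the outer cross_val_score gives an unbiased estimate of how well that
# model-picking procedure generalizes.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

inner_search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
outer_scores = cross_val_score(inner_search, X, y, cv=5)

print("outer-loop estimate of generalization performance:", outer_scores.mean())
```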

Algorithms like TPOT, which do an intelligent search over the model space, in some sense perform a directed search towards the best models, based on CV. However, given the above discussion, the estimates for these models could just as well be biased, and a model with a lower CV error does not necessarily generalize better. By taking this 'direction' in the search, we could just as well be looking for models that have an even more optimistic CV error, without improving the actual generalization performance. In some sense, we are searching for models that perform well on the out-of-sample CV folds, and thus are overfitting the CV error.

TPOT might suffer from this last type of overfitting. I agree that the Pareto optimization helps to alleviate this problem. However, I do not see why encouraging the choice of feature selectors/constructors does: such an operator could just as well be picked because some of the selected/constructed features help a lot in predicting the out-of-sample folds used for the cross-validation estimate. I did some runs of TPOT on simulated data, after which I refitted all the evaluated pipelines on the entire training set and evaluated them on a large test set. For most of these simulations it was indeed the case that most improvements of the best model made by TPOT were only an improvement in CV score and not in score on the test set. The setup was 1000 observations and 35 variables generated from a fairly difficult process, to keep the simulations workable.
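A rough sketch of that kind of check (not my exact simulation setup) on a synthetic problem of roughly that size: with verbosity=2, TPOT prints the best internal CV score each generation, which can then be compared against the held-out test score. A consistent gap would point to the optimistic bias discussed above.

```python
# Sketch of comparing TPOT's internal CV score with a held-out test score
# on a synthetic problem (illustrative sizes, not the original simulation).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = make_classification(n_samples=1000, n_features=35, n_informative=10,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tpot = TPOTClassifier(generations=10, population_size=30,
                      random_state=42, verbosity=2)
tpot.fit(X_train, y_train)   # verbosity=2 prints the best internal CV score per generation
print("held-out test score:", tpot.score(X_test, y_test))
```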

Personally, I was thinking about penalizing the CV error of the top-performing models in each generation, since we expect the estimate for these models to be overly optimistic. This would give well-performing models with a less optimistic CV estimate a better chance to evolve (a rough sketch of this idea follows at the end of this comment). As I am writing this, I am even considering preventing the top models from passing on to the next generation at all.
It would be great to have some discussion on this. FYI, I am currently writing my Master’s thesis on a particular prediction problem. Since I like TPOT a lot, it would be very nice to dedicate a part of the thesis to (possible) improvements :).
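A very rough sketch of the penalty idea, outside TPOT itself (hypothetical helper and numbers): shrink the CV scores of the top-ranked individuals in a generation before selection, so that well-performing models with less optimistic estimates get a better chance.

```python
# Purely hypothetical sketch of the proposal above: discount the CV scores of
# the top-ranked individuals in a generation before selection.
def penalize_top_scores(cv_scores, top_k=3, penalty=0.01):
    """Return CV scores with the top_k highest values shrunk by `penalty`."""
    ranked = sorted(range(len(cv_scores)), key=lambda i: cv_scores[i], reverse=True)
    adjusted = list(cv_scores)
    for i in ranked[:top_k]:
        adjusted[i] -= penalty
    return adjusted


print(penalize_top_scores([0.91, 0.95, 0.88, 0.97, 0.93], top_k=2))
# the two highest scores (0.97 and 0.95) are each reduced by 0.01
```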

@arita37

arita37 commented Nov 21, 2017 via email
