Regularization in TPOT #207
I've been trying to read up on the topic, but I wonder if there's a good reason to explore early stopping for these GP problems. Should we always let TPOT run the specified number of generations, or is there some reliable criterion we can use to stop the EA early? Sure, we do internal cross-validation, but can we assume that it is always going to be reliable (e.g., when the number of training samples is small, or when the distribution of training labels/responses doesn't reflect the general population)?
It's still not clear to me whether it's ever a good idea to stop TPOT early. I don't think I've seen a case where TPOT's generalization accuracy went down from optimizing for, say, another 100 generations.
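For concreteness, the simplest criterion being discussed would be a patience rule: stop once the best internal CV score has not improved for some number of generations. A minimal sketch of what that could look like, where `evolve_one_generation` is a hypothetical callback standing in for one GP generation:

```python
def run_with_early_stopping(evolve_one_generation, max_generations=100,
                            patience=10, tol=1e-6):
    """Run the GP loop, stopping once the best internal CV score has not
    improved by at least `tol` for `patience` consecutive generations."""
    best_score = float("-inf")
    stale = 0
    for gen in range(max_generations):
        score = evolve_one_generation()  # best CV score after this generation
        if score > best_score + tol:
            best_score, stale = score, 0
        else:
            stale += 1
        if stale >= patience:
            print(f"Stopping at generation {gen}: "
                  f"no improvement for {patience} generations")
            break
    return best_score

# Toy stand-in for the GP loop: scores improve, then plateau
toy_scores = iter([0.80, 0.85, 0.86] + [0.86] * 97)
print(run_with_early_stopping(lambda: next(toy_scores), patience=5))
```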
One possibility is to investigate a large number of varied datasets, say 10,000, in a batch. What about using the Kaggle ones? It may require many resources for a limited amount of time.
Yes. Maybe one could put the datasets into a repository, along with meta-information on each dataset. After collecting enough results, one could analyze them: it would accelerate TPOT by restricting the genetic search to a smaller search space. I think datasets are fairly homogeneous when they belong to certain categories (images, web data, ...). What do you think about having a Gitter chat for TPOT?
Regularization parameters are already included in most scikit-learn models, so when TPOT explores the hyperparameter space of a classifier, it is already using regularization. The IRIS example in the TPOT docs (http://rhiever.github.io/tpot/examples/IRIS_Example/) shows an exported pipeline using logistic regression with the parameters C=0.09 and penalty='l2'. So TPOT selected a regularized logistic regression classifier in that example.
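For reference, the exported pipeline from that example boils down to something like the following (a sketch in the spirit of the docs example, not the exact exported file):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# TPOT chose a regularized classifier here: C=0.09 with an L2 penalty
exported_pipeline = LogisticRegression(C=0.09, penalty='l2')
exported_pipeline.fit(X_train, y_train)
print(exported_pipeline.score(X_test, y_test))
```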
@MaxPowerWasTaken You bring up a good point about the per-model regularization, and maybe for the sake of discussion it makes sense to spell out the different kinds of regularization we might use in TPOT (please correct me if I'm misguided here):
- Model-level regularization: the hyperparameters of the individual estimators (e.g., the penalty and C of a logistic regression), which scikit-learn already exposes and TPOT already searches over.
- Pipeline-level regularization: penalizing the complexity of the pipeline itself, e.g., the number of operators, as the Pareto optimization does.
- GA-level regularization: constraining the evolutionary search itself, e.g., by stopping it early.
But to get back to the question: maybe we should think more about how the per-model regularization interacts with the other levels of complexity. Are the less complex pipelines always using well-regularized models underneath?
Little bump here, since I want to do some research into this. How does the Pareto optimization actually work? If I understand correctly, in the end TPOT always provides the pipeline with the best internal CV score that has been evaluated during the optimization, right? The user could choose to pick a pipeline with fewer components from the Pareto front, but they would have to do so manually, if I'm correct.
That's correct. The Pareto front concept is also used to eliminate poor-performing pipelines at the end of every GP generation: all pipelines in the current generation are ranked by their dominance, i.e., top-ranking pipelines must outperform all other pipelines on at least one of the multi-objective criteria (in our case, predictive performance or complexity). Second-rank pipelines must outperform all other pipelines except the top-ranking ones on at least one of the criteria, and so on. You can look up the NSGA-II algorithm if you'd like to dig into the specifics. We recently added an early stopping feature to TPOT, and I think that fits into GA-level regularization as described above. As others have pointed out, many algorithms in TPOT already have per-model regularization. I think the key is to find out what kind of per-pipeline regularization is needed to prevent TPOT from selecting pipelines with an overly optimistic estimate of generalization performance. I still think that encouraging TPOT to perform feature construction and selection is the most promising way to accomplish that goal, since one of the biggest causes of overfitting is providing too many features to ML algorithms.
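To make the dominance ranking concrete, here is a minimal, illustrative sketch of Pareto-front assignment over (accuracy, operator count) pairs. TPOT itself delegates this to DEAP's NSGA-II implementation, so this shows the concept rather than TPOT's actual code:

```python
def dominates(a, b):
    """Pipeline a dominates b if it is at least as good on both objectives
    (higher accuracy, lower complexity) and strictly better on at least one."""
    acc_a, comp_a = a
    acc_b, comp_b = b
    return (acc_a >= acc_b and comp_a <= comp_b) and \
           (acc_a > acc_b or comp_a < comp_b)

def pareto_fronts(population):
    """Assign each (accuracy, complexity) point to a front:
    front 0 = non-dominated, front 1 = dominated only by front 0, etc."""
    remaining = list(population)
    fronts = []
    while remaining:
        front = [p for p in remaining
                 if not any(dominates(q, p) for q in remaining if q is not p)]
        fronts.append(front)
        remaining = [p for p in remaining if p not in front]
    return fronts

# Example: (internal CV accuracy, number of operators)
pop = [(0.95, 4), (0.95, 2), (0.90, 1), (0.85, 1), (0.96, 6)]
print(pareto_fronts(pop))
```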
Just to be sure we are discussing the same things, I think it is good to first distinguish two types of overfitting. Apologies if the discussion is a bit extensive.

The first type is overfitting at the model level. A model that is too flexible/complex will 'attempt' to explain not only the signal in the training data but also the randomness. Think of a decision tree that grows a terminal leaf for each observation: a perfect fit on the training set, but very poor generalization performance. This type of overfitting is mitigated by fitting models of different levels of complexity (often controlled by regularization parameters), calculating an estimate of the generalization performance for each, often by cross-validation, and picking the model with the lowest cross-validation error.

The problem here is that the cross-validation error is only an estimate of the actual generalization error of the model. By picking the model that minimizes the cross-validation error, we can no longer trust that estimate as a good measure of the model's actual generalization performance: the model could just as well have 'won' because its cross-validation estimate is more optimistic than that of its competitors. This fact has been mentioned numerous times in prominent literature on the topic, and it is why final model quality should always be assessed on an external validation set (or using nested CV).

Algorithms like TPOT, which do an intelligent search over the model space, in some sense perform a directed search towards the best models based on CV. Given the above discussion, however, the estimates for these models could just as well be biased, and models with a higher CV error could just as well have better generalization performance. By following this 'direction' in the search, we may simply be finding models with ever more optimistic CV errors, without improving actual generalization performance. In some sense, we are searching for models that perform well on the out-of-sample CV folds, and thus are overfitting the CV error.

TPOT might suffer from this second type of overfitting. I agree that the Pareto optimization helps to alleviate the problem, but I do not see why encouraging the choice of feature selectors/constructors does: they could just as well be picked because some of the selected/constructed features help a lot in predicting the out-of-sample folds used for the cross-validation estimate. I did some runs of TPOT on simulated data, after which I refitted all the evaluated pipelines on the entire training set and evaluated them on a large test set. For most of these simulations, most improvements of the best model found by TPOT were improvements in CV score only, not in score on the test set. The setup was 1,000 observations and 35 variables generated from some difficult process, to keep the simulations workable.

Personally, I was thinking about penalizing the CV scores of the top-performing models in each generation, since we expect their estimates to be overly optimistic. This would give well-performing models with less optimistic CV estimates a better chance to evolve. As I write this, I am even considering preventing the top models from passing to the next generation at all.

It would be great to have some discussion on this. FYI, I am currently writing my Master's thesis on a particular prediction problem. Since I like TPOT a lot, it would be very nice to dedicate a part of the thesis to (possible) improvements :).
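For discussion's sake, here is one way the proposed penalty could look: shrink the CV scores of the top-scoring pipelines toward the generation mean before selection. All names and constants here are illustrative, not an existing TPOT feature:

```python
import numpy as np

def penalize_top_scores(cv_scores, top_fraction=0.1, shrink=0.5):
    """Shrink the CV scores of the top-performing pipelines toward the
    generation mean before selection, on the assumption that the best
    observed scores are the most optimistically biased estimates."""
    scores = np.asarray(cv_scores, dtype=float)
    mean = scores.mean()
    n_top = max(1, int(len(scores) * top_fraction))
    top_idx = np.argsort(scores)[-n_top:]          # indices of the best scores
    adjusted = scores.copy()
    adjusted[top_idx] = mean + shrink * (scores[top_idx] - mean)
    return adjusted

# Example: the two best pipelines are pulled toward the generation mean
print(penalize_top_scores([0.70, 0.72, 0.75, 0.90, 0.93], top_fraction=0.4))
```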
Can we provide a pre-defined list of estimators to be used during fitting? Thanks.
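If I recall correctly, TPOT already supports this through its config_dict parameter, which restricts the search to a user-supplied dictionary of estimators and hyperparameter ranges. A minimal sketch (the particular operators and values are just examples):

```python
from tpot import TPOTClassifier

# Restrict the search to two estimators with small hyperparameter grids
tpot_config = {
    'sklearn.linear_model.LogisticRegression': {
        'C': [0.01, 0.1, 1.0, 10.0],
        'penalty': ['l1', 'l2'],
    },
    'sklearn.ensemble.RandomForestClassifier': {
        'n_estimators': [100],
        'max_depth': [3, 5, None],
    },
}

tpot = TPOTClassifier(generations=5, population_size=20,
                      config_dict=tpot_config, verbosity=2)
```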
Some months ago, we added Pareto optimization to TPOT, so that TPOT now maximizes pipeline classification accuracy while minimizing the number of operators in the pipeline. We found that this addition provided a form of regularization: the pipelines TPOT produced were less likely to overfit the data set.
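For readers curious about the mechanics, the two-objective setup looks roughly like the following DEAP sketch; this is an illustration of the idea, not TPOT's actual code:

```python
from deap import base, creator, tools

# Two objectives: minimize operator count (weight -1.0),
# maximize internal CV accuracy (weight +1.0)
creator.create("FitnessMulti", base.Fitness, weights=(-1.0, 1.0))
creator.create("Individual", list, fitness=creator.FitnessMulti)

toolbox = base.Toolbox()
toolbox.register("select", tools.selNSGA2)  # NSGA-II keeps the Pareto-best

# Example: four individuals with (operator count, CV accuracy) fitnesses
pop = [creator.Individual([i]) for i in range(4)]
for ind, fit in zip(pop, [(4, 0.95), (2, 0.95), (1, 0.90), (6, 0.96)]):
    ind.fitness.values = fit
survivors = toolbox.select(pop, k=2)
```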
As I've read more about regularization, I'm starting to wonder if we should refine what we mean by "pipeline complexity" in TPOT. Although "number of operators in the pipeline" is a decent metric for pipeline complexity, maybe we should consider more traditional measures of model complexity.
The first idea that comes to mind is the number of features going into the final classifier. Such a regularization metric could encourage TPOT to compress the feature space (e.g. via PCA or feature construction), perform feature selection in the second-to-last step, and thus build less-complex models that are less prone to overfitting.
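As a sketch of what such a metric could look like for a fitted scikit-learn pipeline, here is a hypothetical helper (not part of TPOT) that counts the features actually reaching the final estimator:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def feature_complexity(pipeline, X):
    """Hypothetical complexity metric: the number of features that
    reach the final estimator after all transform steps."""
    Xt = X
    for name, step in pipeline.steps[:-1]:  # every step except the classifier
        Xt = step.transform(Xt)
    return Xt.shape[1]

# Example: PCA compresses 4 features down to 2 before the classifier
X, y = load_iris(return_X_y=True)
pipe = Pipeline([('pca', PCA(n_components=2)),
                 ('clf', LogisticRegression())]).fit(X, y)
print(feature_complexity(pipe, X))  # -> 2
```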
Please add additional TPOT regularization ideas to this issue.