[CORE] The update process for a tree model, and its application to feature importance #1670
Conversation
I think being able to "update" is pretty cool. We need to be careful about the prediction cache in the existing GBM, since it could no longer be valid when we refresh a leaf.
(force-pushed from 68ccfe9 to 20b6f55)
@tqchen: good point. I've moved the initialization of the trees for the update into Configure, which I think is a better place for it. And I've added some tests and documentation.
@tqchen: What is needed to wrap this one up? Do you want me to change the Python interface as well? Do you want some unrelated documentation changes to be in a separate PR?
The current logic looks good to me, though it does not resemble the general case of refreshing, which could cycle through the trees. Nevertheless, we could merge this in first as a first step. Here are a few items to check before merging:
enum TreeProcessType {
  kDefault,  // the normal boosting process that creates new trees
  kUpdate    // the process that updates existing trees
};
(force-pushed from 5331976 to 8f466c3)
Thanks, I've added a TreeProcessType enum. There are many imaginable ways one can update/refresh/modify/torture the trees, also using various samples of data. It could be interesting to design a robust set of essential building blocks for such exercises (as in your "unix philosophy" approach); the existing modular plugin system for updaters is already a good start. My addition, however, so far mostly addresses certain practical needs.
Please fix the lint error: https://travis-ci.org/dmlc/xgboost/jobs/180001971
(force-pushed from 8f466c3 to bec476c)
- …ame default process_type to 'default'; fix the trees and trees_to_update sizes comparison check
- …ater, Gamma and Tweedie; added some parameter aliases; metrics indentation and some were non-documented
(force-pushed from bec476c to 1c9a174)
@tqchen: The Travis checks have finally finished and passed (why is it taking over a day now?). And I've rebased this PR one more time.
What is the date or R package version for the planned release that will contain this commit? In other words, when can I expect this change to be available in the R CRAN package?
@AbhishekSinghVerma The current CRAN release does contain this PR. |
The main part of this PR adds the infrastructure to "update" an existing tree model. It is a simple mod of the GBTree booster, introducing a process_type parameter which allows switching between the default full boosting process (which grows new trees and updates them) and the update process for an existing model (which works by passing the model through a desired set of updater modules using some specific data). I've made this parameter an enum instead of a bool, so as to keep the possibility of other process types open. Overall, it is still the same booster that basically allows for a different starting point, so I don't think it would make sense to use inheritance to create a separate booster. And the process switch should work seamlessly for DART as well.

There could be various applications. E.g., it could be useful to adapt an existing model to a dataset that is somewhat different from the original training data, while still keeping the tree structure mostly the same. This saves time by not having to rebuild all the trees, and it offers some elements of transfer learning.
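As a rough illustration, here is a minimal Python sketch of how this could be driven through the training API (the process_type and updater parameters are the ones introduced by this PR; the synthetic data and round counts are just placeholders):

```python
import numpy as np
import xgboost as xgb

# Some synthetic data, just for illustration.
rng = np.random.RandomState(1)
X, y = rng.randn(1000, 10), rng.randn(1000)
dtrain = xgb.DMatrix(X, label=y)

# Train an initial model as usual (the default boosting process).
params = {'objective': 'reg:linear', 'max_depth': 4, 'eta': 0.1}
model = xgb.train(params, dtrain, num_boost_round=50)

# Update the existing trees on different data: no new trees are grown; each
# round passes one of the 50 existing trees through the 'refresh' updater.
X2, y2 = rng.randn(500, 10), rng.randn(500)
dnew = xgb.DMatrix(X2, label=y2)
upd_params = dict(params, process_type='update', updater='refresh')
updated = xgb.train(upd_params, dnew, num_boost_round=50, xgb_model=model)
```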
Another useful application is understanding the out-of-sample feature importance and the local feature importance of a model. So far, the feature importance ranking has been calculated from the loss gains learned within the training data, which could carry significant overfitting and may unfairly inflate some importances. Updating the model trees' stats in a hold-out sample would allow obtaining a fairer importance ranking. Also, the current feature importance is a global importance, over the whole training sample. But after building a model on heterogeneous data, I frequently want to see which sets of features are the most important in certain subsets of the data for this specific model (i.e., without creating a new model in each subset from scratch). E.g., what drives this model's predictions at the upper end of the regression outcome? Or what factors are most influential on predictions within some cluster? A quick update of the model's stats, by passing the data from a specific subset down the trees, allows estimating the local importance in this data using the re-calculated gains. An example is given below.
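Continuing the sketch above, a hedged example of what such a local-importance estimation could look like, using the refresh_leaf=0 option described in the next paragraph, and assuming a gain-based importance getter along the lines of Booster.get_score(importance_type='gain'):

```python
# Pick a subset of interest, e.g., the upper end of the outcome.
mask = y2 > np.percentile(y2, 75)
dsubset = xgb.DMatrix(X2[mask], label=y2[mask])

# Refresh only the tree stats within the subset; refresh_leaf=0 keeps the
# splits and leaf values (and hence the predictions) intact.
loc_params = dict(params, process_type='update',
                  updater='refresh', refresh_leaf=0)
local = xgb.train(loc_params, dsubset, num_boost_round=50, xgb_model=model)

# Gain-based importance now reflects gains re-calculated in the subset.
local_importance = local.get_score(importance_type='gain')
```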
I have also modified the refresh updater by adding an option to not update the leaf values. This way we can update only the tree stats (for importance estimation and other sorts of tree-introspection analysis) but keep the splits and leaf values intact (resulting in the same predictions as from the original model). One current limitation (or feature, as there are pros and cons to it) to keep in mind is that the refresh updater does not support random instance subsampling, so a model that initially used subsampling for its training would not get updated in a similarly random manner. One example of when no-subsampling could be beneficial is when doing a stats update within the same training sample: the gains of each split would be updated using all the data rather than the subsamples that were used during training, thus resulting in less "overfitted" importances.

Some more work would be needed to fully complete this functionality, but I'm putting it up hoping to get some feedback.
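As a quick sanity check of the refresh_leaf=0 behavior in the sketches above, the stats-refreshed model should reproduce the original model's predictions exactly:

```python
# With refresh_leaf=0 only the internal node stats changed, so the
# refreshed model's predictions should match the original model's.
assert np.allclose(model.predict(dsubset), local.predict(dsubset))
```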