
[CORE] The update process for a tree model, and its application to feature importance #1670

Merged (11 commits) on Dec 4, 2016

Conversation

khotilov (Member)

The main part of this PR adds the infrastructure to "update" an existing tree model. It is a simple modification of the GBTree booster, introducing a process_type parameter that switches between the default full boosting process (which grows new trees and updates them) and the update process for an existing model (which passes a model through a desired set of updater modules using some specific data). I made this parameter an enum instead of a bool in order to keep the possibility of other process types open. Overall, it is still the same booster, just allowed a different starting point, so I don't think it would make sense to use inheritance to create a separate booster. And the process switch should work seamlessly for DART as well.

There could be various applications. E.g., it could be useful for adapting an existing model to a dataset that is somewhat different from the original training data, while still keeping the tree structure mostly the same. This saves time by not needing to rebuild all the trees, and it offers some elements of transfer learning.

Another useful application is understanding the out-of-sample feature importance and the local feature importance of a model. So far, the feature importance ranking has been calculated from the loss gains learned within the training data, which can carry significant overfitting and may unfairly inflate some importances. Updating the model trees' stats on a hold-out sample makes it possible to obtain a fairer importance ranking. Also, the current feature importance is a global importance, over the whole training sample. But after building a model on heterogeneous data, I frequently want to see which sets of features are the most important in certain subsets of the data for this specific model (i.e., without creating a new model in each subset from scratch). E.g., what drives this model's predictions at the upper end of the regression outcome? Or which factors are the most influential on predictions within some cluster? A quick update of the model's stats, obtained by passing the data from a specific subset down the trees, allows one to estimate the local importance in that data using re-calculated gains. An example is given below.

I have also modified the refresh updater by adding an option to not update the leaf values. This way we can update only the tree stats (for importance estimation and other sorts of tree-introspection analysis) while keeping the splits and leaf values intact (resulting in the same predictions as from the original model). One current limitation (or feature, as there are pros and cons) to keep in mind is that the refresh updater does not support random instance subsampling, so a model that initially used subsampling for its training would not get updated in a similarly random manner. One example of when no subsampling could be beneficial: when doing a stats update within the same training sample, the gains of each split would be updated using all the data rather than the subsamples used during training, thus resulting in less "overfitted" importances.
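To summarize the parameters described above in one place (using the names introduced in this PR: process_type, updater, refresh_leaf), here is a small sketch of the two refresh variants; this is an illustration of the parameter combinations, not code from the PR itself:

```r
# Sketch: two kinds of "update" runs enabled by this PR.
# process_type = "update" passes an existing model through the chosen
# updaters instead of growing new trees.
upd_full  <- list(process_type = "update", updater = "refresh",
                  refresh_leaf = TRUE)   # re-fit node stats AND leaf values
upd_stats <- list(process_type = "update", updater = "refresh",
                  refresh_leaf = FALSE)  # re-fit gain/cover only;
                                         # splits, leaves, and predictions unchanged
```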

Some more work would be needed to fully complete this functionality, but I'm putting it up, hoping to get some feedback.

library(xgboost)
library(data.table)
library(mlbench)

# predicting the outcome of a diabetes test
data(PimaIndiansDiabetes2)
dt <- PimaIndiansDiabetes2
str(dt)
setDT(dt)
fnames <- colnames(dt)[-9]

set.seed(1)
tr <- sample.int(nrow(dt), 0.7*nrow(dt))
dtrain <- xgb.DMatrix(as.matrix(dt[tr, -9, with=FALSE]), label = as.numeric(dt$diabetes[tr])-1)
dtest <- xgb.DMatrix(as.matrix(dt[-tr, -9, with=FALSE]), label = as.numeric(dt$diabetes[-tr])-1)
wl <- list(train = dtrain, test = dtest)

param <- list(max_depth = 2, eta = 0.05, nthread = 2, subsample = 0.5, min_child_weight = 5, 
              objective = "binary:logistic", eval_metric = "auc",
              base_score = mean(getinfo(dtrain,"label")))

bst <- xgb.train(param, dtrain, 50, wl)
# some significant overfitting is happening...

# Refresh the model within the same training data (without pruning)
rparam <- modifyList(param, list(process_type='update', updater='refresh', refresh_leaf=FALSE))
rbst <- xgb.train(rparam, dtrain, nrounds = bst$niter, watchlist = wl, xgb_model = bst)
# Note how the AUCs are still the same.
# The feature importances are now less affected by the overfitted gains during subsampling:
xgb.importance(fnames, rbst)
# compare to the original model:
xgb.importance(fnames, bst)
# The splits and leaf values remain the same, only the split gains and cover values have changed:
xgb.plot.tree(fnames, rbst, n_first_tree = 5)
xgb.plot.tree(fnames, bst, n_first_tree = 5)

# Also, can do the same but against the test data:
tbst <- xgb.train(rparam, dtest, nrounds = bst$niter, watchlist = wl, xgb_model = bst)
xgb.importance(fnames, tbst)

# And, say, we want to see how the feature importances change within the BMI>30 cohort:
dtrain <- xgb.DMatrix(as.matrix(dt[tr, -9, with=FALSE][mass>30]), 
                      label = as.numeric(dt[tr][mass>30]$diabetes)-1)
dtest <- xgb.DMatrix(as.matrix(dt[-tr, -9, with=FALSE][mass>30]), 
                      label = as.numeric(dt[-tr][mass>30]$diabetes)-1)
wl <- list(train = dtrain, test = dtest)
mbst <- xgb.train(rparam, dtrain, nrounds = bst$niter, watchlist = wl, xgb_model = bst)
# The role of glucose test in this cohort is significantly higher:
xgb.importance(fnames, mbst)
# We can also observe how the 'mass' splits became non-splits in many trees (with Gain==0):
xgb.plot.tree(fnames, mbst, n_first_tree = 5)

tqchen (Member) commented Oct 17, 2016

I think being able to "update" is pretty cool. We need to be careful about the prediction cache in the existing GBM, since it may no longer be valid when we refresh a leaf.

khotilov (Member, Author)

@tqchen : good point. I've moved the trees initialization for the update into Configure, which should be a better place for it, I think.

And I've added some tests and documentation.

khotilov (Member, Author)

@tqchen: what is needed to wrap this one up? Do you want me to change the Python interface as well? Do you want the unrelated documentation changes to be in a separate PR?

tqchen (Member) commented Nov 29, 2016

The current logic looks good to me, though it does not yet address the general case of refreshing, which could cycle through the trees. Nevertheless, we could merge this in first, as a first step.

Here is a checklist before merging:

  • Let us first make sure Travis passes; I think rebasing against the latest master will do.
  • Add an enum type for the process type instead of using 1/0 to indicate it, which will make the code more readable:
enum TreeProcessType {
  kDefault,
  kUpdate
};

khotilov (Member, Author)

Thanks, I've added a TreeProcessType enum.

There are many imaginable ways one could update/refresh/modify/torture the trees, using various samples of data. It could be interesting to design a robust set of essential building blocks for such exercises (as in your "unix philosophy" approach). The existing modular plugin system for updaters is already a good start. My addition, however, so far mostly addresses certain practical needs.

tqchen (Member) commented Dec 1, 2016

Please fix the lint error: https://travis-ci.org/dmlc/xgboost/jobs/180001971

khotilov (Member, Author) commented Dec 4, 2016

@tqchen : The Travis checks finally finished and are passing (why is it taking over a day now?). And I've rebased this PR one more time.

AbhishekSinghVerma

What is the date or R package version of the planned release that will contain this commit? In other words, when can I expect this change to be available in the CRAN R package?

khotilov (Member, Author)

@AbhishekSinghVerma The current CRAN release does contain this PR.

lock bot locked as resolved and limited conversation to collaborators on Jan 19, 2019