[R-package] Output structure of predict() #5223
Comments
Hey @mayer79, sorry it took so long to get back to you! 😆 I laughed at "just to get warm". And thanks very much for the small reproducible example. It made it much easier to understand exactly what you were asking and how to investigate it. It took me a while to respond because I wanted to prove to myself that I had the right understanding. I believe that it's:
First, for the SHAP values. Since SHAP values should sum to the raw prediction, you can confirm that the row-wise sums of each contiguous block match the "raw score" predictions.
num_features <- ncol(X)
num_shap_cols_per_class <- num_features + 1
preds_contrib <- predict(fit, head(X), type = "contrib")
preds_raw <- predict(fit, head(X), type = "raw")
# class 1
rowSums(preds_contrib[, 1:num_shap_cols_per_class])
# [1] -0.7321704 -0.7598126 -0.7598126 -0.7598126 -0.7321704 -0.7321704
preds_raw[, 1]
# [1] -0.7321704 -0.7598126 -0.7598126 -0.7598126 -0.7321704 -0.7321704
# class 2
rowSums(preds_contrib[, (num_shap_cols_per_class + 1):(num_shap_cols_per_class * 2)])
# [1] -1.293772 -1.293772 -1.293772 -1.294546 -1.293772 -1.294546
preds_raw[, 2]
# [1] -1.293772 -1.293772 -1.293772 -1.294546 -1.293772 -1.294546
# class 3
rowSums(preds_contrib[, (num_shap_cols_per_class * 2 + 1):(num_shap_cols_per_class * 3)])
# [1] -1.293439 -1.293907 -1.293907 -1.293907 -1.293439 -1.293439
preds_raw[, 3]
# [1] -1.293439 -1.293907 -1.293907 -1.293907 -1.293439 -1.293439
The leaf multiclass predictions in the R package don't follow that same pattern. Those are ordered by tree.
library(data.table)
preds_raw <- predict(fit, head(X), type = "raw")
preds_leaf <- predict(fit, head(X), type = "leaf")
# with three classes, each boosting iteration produces 3 trees
# this is why the tree indices (integer unique IDs) skip by 3
#
# * first class: 0, 3
# * second class: 1, 4
# * third class: 2, 5
#
# create a table mapping those to the leaf indices produced by predict(..., type = "leaf")
firstRowDT <- data.table::data.table(
    target_class = c(1, 2, 3, 1, 2, 3)
    , tree_index = c(0, 1, 2, 3, 4, 5)
    , leaf_index = preds_leaf[1, ]
)
#    target_class tree_index leaf_index
# 1:            1          0          2
# 2:            2          1          0
# 3:            3          2          0
# 4:            1          3          0
# 5:            2          4          2
# 6:            3          5          5
# next, dump the model information, which describes every node
# (including its predicted value for samples falling into it, and its tree_index and leaf_index)
modelDT <- lightgbm::lgb.model.dt.tree(fit)
# join them together
joinedDT <- merge(
    x = firstRowDT
    , y = modelDT[, .(tree_index, leaf_index, leaf_value)]
    , by = c("tree_index", "leaf_index")
    , all.x = TRUE
)
joinedDT[, .(target_class, tree_index, leaf_index, leaf_value)]
#    target_class tree_index leaf_index  leaf_value
# 1:            1          0          2 -0.89861229
# 2:            2          1          0 -1.19861229
# 3:            3          2          0 -1.19861229
# 4:            1          3          0  0.15379967
# 5:            2          4          2 -0.06515416
# 6:            3          5          5 -0.09482663
# get predictions by summing leaf values from trees belonging to the same class
joinedDT[
    , sum(leaf_value)
    , by = target_class
][["V1"]]
# [1] -0.7321704 -1.2937720 -1.2934389
preds_raw[1, ]
# [1] -0.7321704 -1.2937720 -1.2934389
I know that example is a bit complicated, but the core idea is this... I grabbed the first row of the
If you have time and interest, we'd welcome a contribution adding a note to LightGBM/R-package/R/lgb.Booster.R (Line 751 in 44fe591, the type documentation).
If not, then just let me know and I'd be happy to add that to the docs.
Ingenious! Thanks a lot. I think we should indeed explain this briefly in the help page.
I'll make this addition to the docs later today.
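The tree-ordered layout of the leaf output walked through above can also be written as plain index arithmetic. This is a minimal sketch in base R, assuming that the columns are ordered by tree and that each boosting iteration adds one tree per class, as in the example. `leaf_cols_for_class()` is a hypothetical helper for illustration, not part of lightgbm.

```r
# Hypothetical helper (not part of lightgbm): 1-based column positions in a
# predict(..., type = "leaf") matrix that belong to one class, assuming
# columns are ordered by tree and trees cycle through the classes.
leaf_cols_for_class <- function(class_id, num_class, num_iterations) {
    seq(from = class_id, by = num_class, length.out = num_iterations)
}

num_class <- 3L
num_iterations <- 2L

cols_by_class <- lapply(
    X = seq_len(num_class)
    , FUN = leaf_cols_for_class
    , num_class = num_class
    , num_iterations = num_iterations
)

# subtract 1 to turn column positions into 0-based tree_index values
tree_index_by_class <- lapply(cols_by_class, function(cols) cols - 1L)
# class 1: 0, 3
# class 2: 1, 4
# class 3: 2, 5
```

With these positions, `preds_leaf[, cols_by_class[[1]]]` would select the leaf indices for the first class, under the same ordering assumption.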
How do I read the output of predict() in multiclass settings and R for a specific example with k = 2 trees, m = 3 classes and p = 4 features?
predict(...): Clear, one column per class. Just to get warm.
predict(..., predleaf = TRUE): The first k columns give me the tree node indices for the first class. The next k columns those for the second class etc? Or is the order different?
predict(..., predcontrib = TRUE): The first p + 1 columns in the output give me the SHAP values of the p features (and a BIAS) for the first class etc?
Thus, we always get all results for the first class, then those for the second class etc?
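For the SHAP part of the question, the answer in the comments confirms the class-by-class layout. A minimal sketch of that layout for p = 4 features and m = 3 classes, in base R: `contrib_cols_for_class()` is a hypothetical helper and `fake_contrib` is a synthetic stand-in matrix, both used only to illustrate the column arithmetic; neither comes from lightgbm.

```r
# Hypothetical helper (not part of lightgbm): column range of one class in a
# contribution matrix, assuming each class owns a contiguous block of
# (num_features + 1) columns (the features, then the bias term).
contrib_cols_for_class <- function(class_id, num_features) {
    block_size <- num_features + 1L
    seq(from = (class_id - 1L) * block_size + 1L, length.out = block_size)
}

num_features <- 4L
num_class <- 3L

contrib_cols_for_class(1L, num_features)  # 1 2 3 4 5
contrib_cols_for_class(2L, num_features)  # 6 7 8 9 10
contrib_cols_for_class(3L, num_features)  # 11 12 13 14 15

# Synthetic stand-in for a contribution matrix (2 rows, 15 columns).
# Real values would come from predict(); these only show the reshaping.
fake_contrib <- matrix(seq_len(2L * 15L), nrow = 2L)

# one column of per-class row sums for each class
sums_by_class <- sapply(
    X = seq_len(num_class)
    , FUN = function(class_id) {
        cols <- contrib_cols_for_class(class_id, num_features)
        rowSums(fake_contrib[, cols, drop = FALSE])
    }
)
dim(sums_by_class)  # 2 3
```

On a real model, each column of `sums_by_class` should match the corresponding column of the raw-score predictions, which is the check performed by hand in the comments.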