Use new predict function for R. #6819

Merged Jun 11, 2021 (30 commits)

Changes from 22 commits
10 changes: 3 additions & 7 deletions R-package/R/callbacks.R
@@ -263,9 +263,6 @@ cb.reset.parameters <- function(new_params) {
#' \itemize{
#' \item \code{best_score} the evaluation score at the best iteration
#' \item \code{best_iteration} at which boosting iteration the best score has occurred (1-based index)
#' \item \code{best_ntreelimit} to use with the \code{ntreelimit} parameter in \code{predict}.
#' It differs from \code{best_iteration} in multiclass or random forest settings.
#' }
#'
#' The same values are also stored as xgb-attributes:
#' \itemize{
@@ -498,13 +495,12 @@ cb.cv.predict <- function(save_models = FALSE) {
rep(NA_real_, N)
}

ntreelimit <- NVL(env$basket$best_ntreelimit,
env$end_iteration * env$num_parallel_tree)
iterationrange <- c(0, NVL(env$basket$best_iteration, env$end_iteration))
if (NVL(env$params[['booster']], '') == 'gblinear') {
ntreelimit <- 0 # must be 0 for gblinear
iterationrange <- c(0, 0) # must be 0 for gblinear
}
for (fd in env$bst_folds) {
pr <- predict(fd$bst, fd$watchlist[[2]], ntreelimit = ntreelimit, reshape = TRUE)
pr <- predict(fd$bst, fd$watchlist[[2]], iterationrange = iterationrange, reshape = TRUE)
if (is.matrix(pred)) {
pred[fd$index, ] <- pr
} else {
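For context, a minimal sketch of how an old call site maps onto the new parameter, using the bundled agaricus data. This follows the 0-based convention shown at this stage of the PR; the review discussion further down led to a switch to 1-based indexing before the merge.

library(xgboost)
data(agaricus.train, package = "xgboost")
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
               nrounds = 5, objective = "binary:logistic", verbose = 0)
## old style, as removed in this diff: use the first 3 trees
p_old <- predict(bst, agaricus.train$data, ntreelimit = 3)
## new style, as added in this diff: boosting rounds [0, 3)
p_new <- predict(bst, agaricus.train$data, iterationrange = c(0, 3))
all.equal(p_old, p_new)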
3 changes: 2 additions & 1 deletion R-package/R/utils.R
@@ -178,7 +178,8 @@ xgb.iter.eval <- function(booster_handle, watchlist, iter, feval = NULL) {
} else {
res <- sapply(seq_along(watchlist), function(j) {
w <- watchlist[[j]]
preds <- predict(booster_handle, w, outputmargin = TRUE, ntreelimit = 0) # predict using all trees
## predict using all trees
preds <- predict(booster_handle, w, outputmargin = TRUE, iterationrange = c(0, 0))
eval_res <- feval(preds, w)
out <- eval_res$value
names(out) <- paste0(evnames[j], "-", eval_res$metric)
181 changes: 105 additions & 76 deletions R-package/R/xgb.Booster.R
@@ -168,8 +168,7 @@ xgb.Booster.complete <- function(object, saveraw = TRUE) {
#' @param outputmargin whether the prediction should be returned in the form of the original
#' untransformed sum of predictions from boosting iterations' results. E.g., setting
#' \code{outputmargin=TRUE} for logistic regression would result in predictions for log-odds
#' instead of probabilities.
#' @param ntreelimit limit the number of model's trees or boosting iterations used in prediction (see Details).
#' It will use all the trees by default (\code{NULL} value).
#' @param ntreelimit Deprecated, use \code{iterationrange} instead.
#' @param predleaf whether predict leaf index.
#' @param predcontrib whether to return feature contributions to individual predictions (see Details).
#' @param approxcontrib whether to use a fast approximation for feature contributions (see Details).
@@ -179,16 +178,16 @@ xgb.Booster.complete <- function(object, saveraw = TRUE) {
#' or predinteraction flags is TRUE.
#' @param training whether the prediction result is used for training. For the dart booster,
#' predicting in training mode will perform dropout.
#' @param iterationrange Specifies which layers of trees are used in prediction. For example, if a
#' random forest is trained with 100 rounds, specifying `iterationrange = c(0, 20)` means only the
#' forests built during the half-open interval of rounds [0, 20) are used for this prediction.
#' The index is 0-based (unlike R vectors).
Collaborator:

Should we use a 0-based index here?

@jameslamb What's the convention for R packages that wrap native code? Do users expect to use 1-based indexing?

Collaborator:

Also, please add a note that iteration_range=(0, 0) indicates the use of all trees.

Contributor:

In general, I'm in favor of parameters like this being 1-based and having the documentation clearly indicate that. I think that's more friendly for R users.

But, to be fair, in {lightgbm} we have not been very consistent about this. Keyword arguments in the package's interface expect 1-based values, but {lightgbm} won't look inside parameters passed through the params list and subtract 1 from any parameters that are indices.

So I think there is not really a "right" answer here, and it's more important to:

1. be consistent (as much as possible)
2. over-communicate in the documentation (always say whether a parameter is 1-based or 0-based)

Member Author:

I will try to use a 1-based index.

Member Author:

Changed to use a 1-based index. Thanks for the suggestions!
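A minimal sketch of the conversion this decision implies; `to_native_range` is a hypothetical helper (not part of this PR) mapping a user-facing 1-based, inclusive range onto the 0-based, half-open range the native library expects:

to_native_range <- function(iterationrange) {
  ## c(1, 20) in user terms covers the first 20 boosting rounds,
  ## i.e. native rounds [0, 20)
  c(begin = iterationrange[1] - 1L, end = iterationrange[2])
}
to_native_range(c(1L, 20L))  # begin = 0, end = 20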

#' @param strict_shape When set to TRUE, the output shape is invariant to the model type.
#' @param ... Parameters passed to \code{predict.xgb.Booster}
#'
#' @details
#' Note that \code{ntreelimit} is not necessarily equal to the number of boosting iterations
#' and it is not necessarily equal to the number of trees in a model.
#' E.g., in a random forest-like model, \code{ntreelimit} would limit the number of trees.
#' But for multiclass classification, while there are multiple trees per iteration,
#' \code{ntreelimit} limits the number of boosting iterations.
#'
#' Also note that \code{ntreelimit} would currently do nothing for predictions from gblinear,
#' Note that \code{iterationrange} would currently do nothing for predictions from gblinear,
#' since gblinear doesn't keep its boosting history.
#'
#' One possible practical application of the \code{predleaf} option is to use the model
Expand Down Expand Up @@ -253,7 +252,7 @@ xgb.Booster.complete <- function(object, saveraw = TRUE) {
#' # use all trees by default
#' pred <- predict(bst, test$data)
#' # use only the 1st tree
#' pred1 <- predict(bst, test$data, ntreelimit = 1)
#' pred1 <- predict(bst, test$data, iterationrange = c(0, 1))
#'
#' # Predicting tree leafs:
#' # the result is an nsamples X ntrees matrix
Expand Down Expand Up @@ -305,7 +304,7 @@ xgb.Booster.complete <- function(object, saveraw = TRUE) {
#' all.equal(pred, pred_labels)
#' # prediction from using only 5 iterations should result
#' # in the same error as seen in iteration 5:
#' pred5 <- predict(bst, as.matrix(iris[, -5]), ntreelimit=5)
#' pred5 <- predict(bst, as.matrix(iris[, -5]), iterationrange = c(0, 5))
#' sum(pred5 != lb)/length(lb)
#'
#'
Expand All @@ -319,7 +318,7 @@ xgb.Booster.complete <- function(object, saveraw = TRUE) {
#' lb <- test$label
#' dtest <- xgb.DMatrix(test$data, label=lb)
#' err <- sapply(1:25, function(n) {
#' pred <- predict(bst, dtest, ntreelimit=n)
#' pred <- predict(bst, dtest, iterationrange=c(0, n))
#' sum((pred > 0.5) != lb)/length(lb)
#' })
#' plot(err, type='l', ylim=c(0,0.1), xlab='#trees')
Expand All @@ -328,7 +327,7 @@ xgb.Booster.complete <- function(object, saveraw = TRUE) {
#' @export
predict.xgb.Booster <- function(object, newdata, missing = NA, outputmargin = FALSE, ntreelimit = NULL,
predleaf = FALSE, predcontrib = FALSE, approxcontrib = FALSE, predinteraction = FALSE,
reshape = FALSE, training = FALSE, ...) {
reshape = FALSE, training = FALSE, iterationrange = NULL, strict_shape = FALSE, ...) {

object <- xgb.Booster.complete(object, saveraw = FALSE)
if (!inherits(newdata, "xgb.DMatrix"))
Expand All @@ -337,81 +336,111 @@ predict.xgb.Booster <- function(object, newdata, missing = NA, outputmargin = FA
!is.null(colnames(newdata)) &&
!identical(object[["feature_names"]], colnames(newdata)))
stop("Feature names stored in `object` and `newdata` are different!")
if (is.null(ntreelimit))
ntreelimit <- NVL(object$best_ntreelimit, 0)
if (NVL(object$params[['booster']], '') == 'gblinear')

if (NVL(object$params[['booster']], '') == 'gblinear' || is.null(ntreelimit))
ntreelimit <- 0
if (ntreelimit < 0)
stop("ntreelimit cannot be negative")
if (ntreelimit != 0 && is.null(iterationrange)) {
## only ntreelimit, initialize iteration range
iterationrange <- c(0, 0)
} else if (ntreelimit == 0 && !is.null(iterationrange)) {
## only iteration range, do nothing
} else if (ntreelimit != 0 && !is.null(iterationrange)) {
## both are specified, let libxgboost throw an error
} else {
## no limit is supplied, use best
if (is.null(object$best_iteration)) {
iterationrange <- c(0, 0)
} else {
iterationrange <- c(0, as.integer(object$best_iteration))
}
}
## Handle zero-length values.
box <- function(val) {
  if (length(val) == 0) {
    cval <- vector(, 1)
    cval[0] <- val  # assigning at index 0 is a no-op in R; cval keeps its length-1 default
    return(cval)
  }
  return(val)
}
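## Why box() matters for the JSON round trip below: with auto_unbox = TRUE,
## a zero-length value serializes as an empty array, while a length-1 value
## serializes as a scalar, e.g.
##   jsonlite::toJSON(list(x = logical(0)), auto_unbox = TRUE)  # {"x":[]}
##   jsonlite::toJSON(list(x = FALSE), auto_unbox = TRUE)       # {"x":false}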

option <- 0L + 1L * as.logical(outputmargin) + 2L * as.logical(predleaf) + 4L * as.logical(predcontrib) +
8L * as.logical(approxcontrib) + 16L * as.logical(predinteraction)
## We set strict_shape to TRUE then drop the dimensions conditionally
args <- list(
training = box(training),
strict_shape = box(TRUE),
iteration_begin = box(as.integer(iterationrange[1])),
iteration_end = box(as.integer(iterationrange[2])),
ntree_limit = box(as.integer(ntreelimit)),
type = box(as.integer(0))
)

set_type <- function(type) {
if (args$type != 0) {
stop("One type of prediction at a time.")
}
return(box(as.integer(type)))
}
if (outputmargin) {
args$type <- set_type(1)
}
if (predcontrib) {
args$type <- set_type(if (approxcontrib) 3 else 2)
}
if (predinteraction) {
args$type <- set_type(if (approxcontrib) 5 else 4)
}
if (predleaf) {
args$type <- set_type(6)
}
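## Prediction type codes assembled above, as passed to the native predict API:
## 0 = value, 1 = output margin, 2 = feature contribution, 3 = approximate
## contribution, 4 = feature interaction, 5 = approximate interaction, 6 = leaf index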

ret <- .Call(XGBoosterPredict_R, object$handle, newdata, option[1],
as.integer(ntreelimit), as.integer(training))
predts <- .Call(
XGBoosterPredictFromDMatrix_R, object$handle, newdata, jsonlite::toJSON(args, auto_unbox = TRUE)
)
names(predts) <- c("shape", "results")
shape <- predts$shape
ret <- predts$results

n_ret <- length(ret)
n_row <- nrow(newdata)
npred_per_case <- n_ret / n_row
if (n_row != shape[1]) {
stop("Incorrect predict shape.")
}

if (n_ret %% n_row != 0)
stop("prediction length ", n_ret, " is not multiple of nrows(newdata) ", n_row)
arr <- array(data = ret, dim = rev(shape))
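## Note rev(shape): the native API reports a row-major shape, while R arrays
## are column-major, so the dimension order is reversed when building `arr`.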

if (predleaf) {
ret <- if (n_ret == n_row) {
matrix(ret, ncol = 1)
} else {
matrix(ret, nrow = n_row, byrow = TRUE)
}
} else if (predcontrib) {
n_col1 <- ncol(newdata) + 1
n_group <- npred_per_case / n_col1
cnames <- if (!is.null(colnames(newdata))) c(colnames(newdata), "BIAS") else NULL
ret <- if (n_ret == n_row) {
matrix(ret, ncol = 1, dimnames = list(NULL, cnames))
} else if (n_group == 1) {
matrix(ret, nrow = n_row, byrow = TRUE, dimnames = list(NULL, cnames))
} else {
arr <- aperm(
a = array(
data = ret,
dim = c(n_col1, n_group, n_row),
dimnames = list(cnames, NULL, NULL)
),
perm = c(2, 3, 1) # [group, row, col]
)
lapply(seq_len(n_group), function(g) arr[g, , ])
cnames <- if (!is.null(colnames(newdata))) c(colnames(newdata), "BIAS") else NULL
if (predcontrib) {
dimnames(arr) <- list(cnames, NULL, NULL)
if (!strict_shape) {
arr <- aperm(a = arr, perm = c(2, 3, 1)) # [group, row, col]
}
} else if (predinteraction) {
n_col1 <- ncol(newdata) + 1
n_group <- npred_per_case / n_col1^2
cnames <- if (!is.null(colnames(newdata))) c(colnames(newdata), "BIAS") else NULL
ret <- if (n_ret == n_row) {
matrix(ret, ncol = 1, dimnames = list(NULL, cnames))
} else if (n_group == 1) {
aperm(
a = array(
data = ret,
dim = c(n_col1, n_col1, n_row),
dimnames = list(cnames, cnames, NULL)
),
perm = c(3, 1, 2)
)
} else {
arr <- aperm(
a = array(
data = ret,
dim = c(n_col1, n_col1, n_group, n_row),
dimnames = list(cnames, cnames, NULL, NULL)
),
perm = c(3, 4, 1, 2) # [group, row, col1, col2]
)
lapply(seq_len(n_group), function(g) arr[g, , , ])
dimnames(arr) <- list(cnames, cnames, NULL, NULL)
if (!strict_shape) {
arr <- aperm(a = arr, perm = c(3, 4, 1, 2)) # [group, row, col, col]
}
} else if (reshape && npred_per_case > 1) {
ret <- matrix(ret, nrow = n_row, byrow = TRUE)
}
return(ret)

if (!strict_shape) {
n_groups <- shape[2]
if (predleaf) {
arr <- matrix(arr, nrow = n_row, byrow = TRUE)
} else if (predcontrib && n_groups != 1) {
arr <- lapply(seq_len(n_groups), function(g) arr[g, , ])
} else if (predinteraction && n_groups != 1) {
arr <- lapply(seq_len(n_groups), function(g) arr[g, , , ])
} else if (!reshape && n_groups != 1) {
arr <- ret
} else if (reshape && n_groups != 1) {
arr <- matrix(arr, ncol = n_groups, byrow = TRUE)
}
arr <- drop(arr)
if (length(dim(arr)) == 1) {
arr <- as.vector(arr)
} else if (length(dim(arr)) == 2) {
arr <- as.matrix(arr)
}
}
return(arr)
}

#' @rdname predict.xgb.Booster
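To illustrate the shape handling above, a short sketch with a multiclass model (runnable only against a build that already includes this PR; the strict-shape dims follow the reversed-shape convention in the code above):

library(xgboost)
lb <- as.numeric(iris$Species) - 1
bst <- xgboost(data = as.matrix(iris[, -5]), label = lb, nrounds = 5,
               objective = "multi:softprob", num_class = 3, verbose = 0)
x <- as.matrix(iris[, -5])
length(predict(bst, x))                    # flat vector: 150 * 3 = 450 values
dim(predict(bst, x, reshape = TRUE))       # 150 x 3 matrix, one column per class
dim(predict(bst, x, strict_shape = TRUE))  # array whose shape is invariant to model type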
4 changes: 1 addition & 3 deletions R-package/R/xgb.cv.R
@@ -101,9 +101,7 @@
#' parameter or randomly generated.
#' \item \code{best_iteration} iteration number with the best evaluation metric value
#' (only available with early stopping).
#' \item \code{best_ntreelimit} the \code{ntreelimit} value corresponding to the best iteration,
#' which could further be used in \code{predict} method
#' (only available with early stopping).
#' \item \code{best_ntreelimit} Deprecated; use \code{best_iteration} instead.
#' \item \code{pred} CV prediction values available when \code{prediction} is set.
#' It is either vector or matrix (see \code{\link{cb.cv.predict}}).
#' \item \code{models} a list of the CV folds' models. It is only available with the explicit
3 changes: 0 additions & 3 deletions R-package/R/xgb.train.R
@@ -171,9 +171,6 @@
#' explicitly passed.
#' \item \code{best_iteration} iteration number with the best evaluation metric value
#' (only available with early stopping).
#' \item \code{best_ntreelimit} the \code{ntreelimit} value corresponding to the best iteration,
#' which could further be used in \code{predict} method
#' (only available with early stopping).
#' \item \code{best_score} the best evaluation metric value during early stopping.
#' (only available with early stopping).
#' \item \code{feature_names} names of the training dataset features
2 changes: 2 additions & 0 deletions R-package/src/init.c
@@ -30,6 +30,7 @@ extern SEXP XGBoosterSerializeToBuffer_R(SEXP handle);
extern SEXP XGBoosterUnserializeFromBuffer_R(SEXP handle, SEXP raw);
extern SEXP XGBoosterModelToRaw_R(SEXP);
extern SEXP XGBoosterPredict_R(SEXP, SEXP, SEXP, SEXP, SEXP);
extern SEXP XGBoosterPredictFromDMatrix_R(SEXP, SEXP, SEXP);
extern SEXP XGBoosterSaveModel_R(SEXP, SEXP);
extern SEXP XGBoosterSetAttr_R(SEXP, SEXP, SEXP);
extern SEXP XGBoosterSetParam_R(SEXP, SEXP, SEXP);
@@ -63,6 +64,7 @@ static const R_CallMethodDef CallEntries[] = {
{"XGBoosterUnserializeFromBuffer_R", (DL_FUNC) &XGBoosterUnserializeFromBuffer_R, 2},
{"XGBoosterModelToRaw_R", (DL_FUNC) &XGBoosterModelToRaw_R, 1},
{"XGBoosterPredict_R", (DL_FUNC) &XGBoosterPredict_R, 5},
{"XGBoosterPredictFromDMatrix_R", (DL_FUNC) &XGBoosterPredictFromDMatrix_R, 3},
{"XGBoosterSaveModel_R", (DL_FUNC) &XGBoosterSaveModel_R, 2},
{"XGBoosterSetAttr_R", (DL_FUNC) &XGBoosterSetAttr_R, 3},
{"XGBoosterSetParam_R", (DL_FUNC) &XGBoosterSetParam_R, 3},
39 changes: 39 additions & 0 deletions R-package/src/xgboost_R.cc
@@ -374,6 +374,45 @@ SEXP XGBoosterPredict_R(SEXP handle, SEXP dmat, SEXP option_mask,
return ret;
}

SEXP XGBoosterPredictFromDMatrix_R(SEXP handle, SEXP dmat, SEXP json_config) {
SEXP r_out_shape;
SEXP r_out_result;
SEXP r_out;

R_API_BEGIN();
char const *c_json_config = CHAR(asChar(json_config));

bst_ulong out_dim;
bst_ulong const *out_shape;
float const *out_result;
CHECK_CALL(XGBoosterPredictFromDMatrix(R_ExternalPtrAddr(handle),
R_ExternalPtrAddr(dmat), c_json_config,
&out_shape, &out_dim, &out_result));

r_out_shape = PROTECT(allocVector(INTSXP, out_dim));
size_t len = 1;
for (size_t i = 0; i < out_dim; ++i) {
INTEGER(r_out_shape)[i] = out_shape[i];
len *= out_shape[i];
}
r_out_result = PROTECT(allocVector(REALSXP, len));
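/* The native API returns float predictions; widen each element into R's
 * double (REALSXP) storage. */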

#pragma omp parallel for
for (omp_ulong i = 0; i < len; ++i) {
REAL(r_out_result)[i] = out_result[i];
}

r_out = PROTECT(allocVector(VECSXP, 2));

SET_VECTOR_ELT(r_out, 0, r_out_shape);
SET_VECTOR_ELT(r_out, 1, r_out_result);

R_API_END();
UNPROTECT(3);

return r_out;
}

SEXP XGBoosterLoadModel_R(SEXP handle, SEXP fname) {
R_API_BEGIN();
CHECK_CALL(XGBoosterLoadModel(R_ExternalPtrAddr(handle), CHAR(asChar(fname))));
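For reference, a sketch of the JSON configuration XGBoosterPredictFromDMatrix_R receives, as assembled by the R wrapper earlier in this diff (values illustrative):

args <- list(
  training = FALSE, strict_shape = TRUE,
  iteration_begin = 0L, iteration_end = 0L,
  ntree_limit = 0L, type = 0L
)
jsonlite::toJSON(args, auto_unbox = TRUE)
# {"training":false,"strict_shape":true,"iteration_begin":0,"iteration_end":0,"ntree_limit":0,"type":0}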
12 changes: 11 additions & 1 deletion R-package/src/xgboost_R.h
@@ -164,7 +164,7 @@ XGB_DLL SEXP XGBoosterBoostOneIter_R(SEXP handle, SEXP dtrain, SEXP grad, SEXP h
XGB_DLL SEXP XGBoosterEvalOneIter_R(SEXP handle, SEXP iter, SEXP dmats, SEXP evnames);

/*!
* \brief make prediction based on dmat
* \brief (Deprecated) make prediction based on dmat
* \param handle handle
* \param dmat data matrix
* \param option_mask output_margin:1 predict_leaf:2
Expand All @@ -173,6 +173,16 @@ XGB_DLL SEXP XGBoosterEvalOneIter_R(SEXP handle, SEXP iter, SEXP dmats, SEXP evn
*/
XGB_DLL SEXP XGBoosterPredict_R(SEXP handle, SEXP dmat, SEXP option_mask,
SEXP ntree_limit, SEXP training);

/*!
* \brief Run prediction on DMatrix, replacing `XGBoosterPredict_R`
* \param handle handle
* \param dmat data matrix
* \param json_config See `XGBoosterPredictFromDMatrix` in xgboost c_api.h
*
 * \return A list of two vectors: the first holds the output shape, the second the prediction results.
*/
XGB_DLL SEXP XGBoosterPredictFromDMatrix_R(SEXP handle, SEXP dmat, SEXP json_config);
/*!
* \brief load model from existing file
* \param handle handle