diff --git a/R-package/R/xgb.DMatrix.R b/R-package/R/xgb.DMatrix.R
index 429cf3f0422c..280fcf52ee3e 100644
--- a/R-package/R/xgb.DMatrix.R
+++ b/R-package/R/xgb.DMatrix.R
@@ -9,12 +9,13 @@
 #' method (`tree_method = "hist"`, which is the default algorithm), but is not usable for the
 #' sorted-indices method (`tree_method = "exact"`), nor for the approximate method
 #' (`tree_method = "approx"`).
+#'
 #' @param data Data from which to create a DMatrix, which can then be used for fitting models or
 #' for getting predictions out of a fitted model.
 #'
-#' Supported input types are as follows:\itemize{
-#' \item `matrix` objects, with types `numeric`, `integer`, or `logical`.
-#' \item `data.frame` objects, with columns of types `numeric`, `integer`, `logical`, or `factor`.
+#' Supported input types are as follows:
+#' - `matrix` objects, with types `numeric`, `integer`, or `logical`.
+#' - `data.frame` objects, with columns of types `numeric`, `integer`, `logical`, or `factor`
 #'
 #' Note that xgboost uses base-0 encoding for categorical types, hence `factor` types (which use base-1
 #' encoding') will be converted inside the function call. Be aware that the encoding used for `factor`
@@ -23,33 +24,14 @@
 #' was constructed.
 #'
 #' Other column types are not supported.
-#' \item CSR matrices, as class `dgRMatrix` from package `Matrix`.
-#' \item CSC matrices, as class `dgCMatrix` from package `Matrix`. These are **not** supported for
-#' 'xgb.QuantileDMatrix'.
-#' \item Single-row CSR matrices, as class `dsparseVector` from package `Matrix`, which is interpreted
-#' as a single row (only when making predictions from a fitted model).
-#' \item Text files in a supported format, passed as a `character` variable containing the URI path to
-#' the file, with an optional format specifier.
-#'
-#' These are **not** supported for `xgb.QuantileDMatrix`. Supported formats are:\itemize{
-#' \item XGBoost's own binary format for DMatrices, as produced by [xgb.DMatrix.save()].
-#' \item SVMLight (a.k.a. LibSVM) format for CSR matrices. This format can be signaled by suffix
-#' `?format=libsvm` at the end of the file path. It will be the default format if not
-#' otherwise specified.
-#' \item CSV files (comma-separated values). This format can be specified by adding suffix
-#' `?format=csv` at the end ofthe file path. It will **not** be auto-deduced from file extensions.
-#' }
+#' - CSR matrices, as class `dgRMatrix` from package `Matrix`.
+#' - CSC matrices, as class `dgCMatrix` from package `Matrix`.
 #'
-#' Be aware that the format of the file will not be auto-deduced - for example, if a file is named 'file.csv',
-#' it will not look at the extension or file contents to determine that it is a comma-separated value.
-#' Instead, the format must be specified following the URI format, so the input to `data` should be passed
-#' like this: `"file.csv?format=csv"` (or `"file.csv?format=csv&label_column=0"` if the first column
-#' corresponds to the labels).
+#' These are **not** supported by `xgb.QuantileDMatrix`.
+#' - XGBoost's own binary format for DMatrices, as produced by [xgb.DMatrix.save()].
+#' - Single-row CSR matrices, as class `dsparseVector` from package `Matrix`, which is interpreted
+#' as a single row (only when making predictions from a fitted model).
 #'
-#' For more information about passing text files as input, see the articles
-#' \href{https://xgboost.readthedocs.io/en/stable/tutorials/input_format.html}{Text Input Format of DMatrix} and
-#' \href{https://xgboost.readthedocs.io/en/stable/python/python_intro.html#python-data-interface}{Data Interface}.
-#' }
 #' @param label Label of the training data. For classification problems, should be passed encoded as
 #' integers with numeration starting at zero.
 #' @param weight Weight for each instance.
@@ -95,15 +77,9 @@
 #' @param label_lower_bound Lower bound for survival training.
 #' @param label_upper_bound Upper bound for survival training.
 #' @param feature_weights Set feature weights for column sampling.
-#' @param data_split_mode When passing a URI (as R `character`) as input, this signals
-#' whether to split by row or column. Allowed values are `"row"` and `"col"`.
-#'
-#' In distributed mode, the file is split accordingly; otherwise this is only an indicator on
-#' how the file was split beforehand. Default to row.
-#'
-#' This is not used when `data` is not a URI.
-#' @return An 'xgb.DMatrix' object. If calling 'xgb.QuantileDMatrix', it will have additional
-#' subclass 'xgb.QuantileDMatrix'.
+#' @param data_split_mode Not used yet. This parameter is for distributed training, which is not yet available for the R package.
+#' @return An 'xgb.DMatrix' object. If calling `xgb.QuantileDMatrix`, it will have additional
+#' subclass `xgb.QuantileDMatrix`.
 #'
 #' @details
 #' Note that DMatrix objects are not serializable through R functions such as [saveRDS()] or [save()].
@@ -145,6 +121,9 @@ xgb.DMatrix <- function(
   if (!is.null(group) && !is.null(qid)) {
     stop("Either one of 'group' or 'qid' should be NULL")
   }
+  if (data_split_mode != "row") {
+    stop("'data_split_mode' is not supported yet.")
+  }
   nthread <- as.integer(NVL(nthread, -1L))
   if (typeof(data) == "character") {
     if (length(data) > 1) {
diff --git a/R-package/man/xgb.DMatrix.Rd b/R-package/man/xgb.DMatrix.Rd
index 2cfa2e713038..23a24dec4226 100644
--- a/R-package/man/xgb.DMatrix.Rd
+++ b/R-package/man/xgb.DMatrix.Rd
@@ -45,9 +45,11 @@ xgb.QuantileDMatrix(
 \item{data}{Data from which to create a DMatrix, which can then be used for fitting models or
 for getting predictions out of a fitted model.
 
-Supported input types are as follows:\itemize{
+Supported input types are as follows:
+\itemize{
 \item \code{matrix} objects, with types \code{numeric}, \code{integer}, or \code{logical}.
-\item \code{data.frame} objects, with columns of types \code{numeric}, \code{integer}, \code{logical}, or \code{factor}.
+\item \code{data.frame} objects, with columns of types \code{numeric}, \code{integer}, \code{logical}, or \code{factor}
+}
 
 Note that xgboost uses base-0 encoding for categorical types, hence \code{factor} types (which use base-1
 encoding') will be converted inside the function call. Be aware that the encoding used for \code{factor}
@@ -56,32 +58,16 @@ responsibility to ensure that factor columns have the same levels as the ones fr
 was constructed.
 
 Other column types are not supported.
+\itemize{
 \item CSR matrices, as class \code{dgRMatrix} from package \code{Matrix}.
-\item CSC matrices, as class \code{dgCMatrix} from package \code{Matrix}. These are \strong{not} supported for
-'xgb.QuantileDMatrix'.
-\item Single-row CSR matrices, as class \code{dsparseVector} from package \code{Matrix}, which is interpreted
-as a single row (only when making predictions from a fitted model).
-\item Text files in a supported format, passed as a \code{character} variable containing the URI path to
-the file, with an optional format specifier.
-
-These are \strong{not} supported for \code{xgb.QuantileDMatrix}. Supported formats are:\itemize{
-\item XGBoost's own binary format for DMatrices, as produced by \code{\link[=xgb.DMatrix.save]{xgb.DMatrix.save()}}.
-\item SVMLight (a.k.a. LibSVM) format for CSR matrices. This format can be signaled by suffix
-\code{?format=libsvm} at the end of the file path. It will be the default format if not
-otherwise specified.
-\item CSV files (comma-separated values). This format can be specified by adding suffix
-\code{?format=csv} at the end ofthe file path. It will \strong{not} be auto-deduced from file extensions.
+\item CSC matrices, as class \code{dgCMatrix} from package \code{Matrix}.
 }
 
-Be aware that the format of the file will not be auto-deduced - for example, if a file is named 'file.csv',
-it will not look at the extension or file contents to determine that it is a comma-separated value.
-Instead, the format must be specified following the URI format, so the input to \code{data} should be passed
-like this: \code{"file.csv?format=csv"} (or \code{"file.csv?format=csv&label_column=0"} if the first column
-corresponds to the labels).
-
-For more information about passing text files as input, see the articles
-\href{https://xgboost.readthedocs.io/en/stable/tutorials/input_format.html}{Text Input Format of DMatrix} and
-\href{https://xgboost.readthedocs.io/en/stable/python/python_intro.html#python-data-interface}{Data Interface}.
+These are \strong{not} supported by \code{xgb.QuantileDMatrix}.
+\itemize{
+\item XGBoost's own binary format for DMatrices, as produced by \code{\link[=xgb.DMatrix.save]{xgb.DMatrix.save()}}.
+\item Single-row CSR matrices, as class \code{dsparseVector} from package \code{Matrix}, which is interpreted
+as a single row (only when making predictions from a fitted model).
 }}
 
 \item{label}{Label of the training data. For classification problems, should be passed encoded as
@@ -144,13 +130,7 @@ not be saved, so make sure that \code{factor} columns passed to \code{predict} h
 
 \item{feature_weights}{Set feature weights for column sampling.}
 
-\item{data_split_mode}{When passing a URI (as R \code{character}) as input, this signals
-whether to split by row or column. Allowed values are \code{"row"} and \code{"col"}.
-
-In distributed mode, the file is split accordingly; otherwise this is only an indicator on
-how the file was split beforehand. Default to row.
-
-This is not used when \code{data} is not a URI.}
+\item{data_split_mode}{Not used yet. This parameter is for distributed training, which is not yet available for the R package.}
 
 \item{ref}{The training dataset that provides quantile information, needed when creating
 validation/test dataset with \code{\link[=xgb.QuantileDMatrix]{xgb.QuantileDMatrix()}}. Supplying the training DMatrix
@@ -163,8 +143,8 @@ applied to the validation/test data}
 This is only supported when constructing a QuantileDMatrix.}
 }
 \value{
-An 'xgb.DMatrix' object. If calling 'xgb.QuantileDMatrix', it will have additional
-subclass 'xgb.QuantileDMatrix'.
+An 'xgb.DMatrix' object. If calling \code{xgb.QuantileDMatrix}, it will have additional
+subclass \code{xgb.QuantileDMatrix}.
 }
 \description{
 Construct an 'xgb.DMatrix' object from a given data source, which can then be passed to functions