Add data split mode to DMatrix MetaInfo #8568

Merged
merged 27 commits into from
Dec 25, 2022
Changes from 21 commits
Commits
40252ee
Add data split mode to DMatrix MetaInfo
rongou Dec 7, 2022
7c35c40
Merge remote-tracking branch 'upstream/master' into data-split-param
rongou Dec 8, 2022
26ed1a9
remove dsplit training param
rongou Dec 8, 2022
d3fda24
fix dmatrix validation
rongou Dec 8, 2022
8e797f7
fix python
rongou Dec 8, 2022
e12f361
Merge remote-tracking branch 'upstream/master' into data-split-param
rongou Dec 12, 2022
8f7ac3e
fix dsplit for local mode
rongou Dec 12, 2022
fa7a670
fix java bulid
rongou Dec 12, 2022
afc5fa0
fix R package
rongou Dec 12, 2022
31b7112
fix demo
rongou Dec 12, 2022
32d7fcc
fix line too long
rongou Dec 12, 2022
c857cd9
fix r doc
rongou Dec 12, 2022
aa0c26c
update roxgen
rongou Dec 12, 2022
cbd1a42
Merge remote-tracking branch 'upstream/master' into data-split-param
rongou Dec 13, 2022
d7830cb
Merge remote-tracking branch 'upstream/master' into data-split-param
rongou Dec 15, 2022
c9ee1d6
Merge remote-tracking branch 'upstream/master' into data-split-param
rongou Dec 15, 2022
6782dd9
add XGDMatrixCreateFromFileV2
rongou Dec 15, 2022
86226e0
add a test for v2
rongou Dec 16, 2022
914df2a
Merge remote-tracking branch 'upstream/master' into data-split-param
rongou Dec 16, 2022
bde1e4c
add need_split to json config
rongou Dec 16, 2022
55f8aa4
Merge remote-tracking branch 'upstream/master' into data-split-param
rongou Dec 19, 2022
9002705
Merge remote-tracking branch 'upstream/master' into data-split-param
rongou Dec 20, 2022
c80a3ae
change to uri
rongou Dec 20, 2022
58ae574
remove need_split as a parameter
rongou Dec 20, 2022
f6148a3
fix python
rongou Dec 20, 2022
da7d545
fix dask test
rongou Dec 20, 2022
417dc18
Merge remote-tracking branch 'upstream/master' into data-split-param
rongou Dec 21, 2022
1 change: 0 additions & 1 deletion doc/tutorials/saving_model.rst
@@ -203,7 +203,6 @@ Will print out something similar to (not actual output as it's too long for demo
"learner_train_param": {
"booster": "gbtree",
"disable_default_eval_metric": "0",
"dsplit": "auto",
"objective": "reg:squarederror"
},
"metrics": [],
17 changes: 17 additions & 0 deletions include/xgboost/c_api.h
@@ -126,12 +126,29 @@ XGB_DLL int XGBGetGlobalConfig(char const **out_config);

/*!
* \brief load a data matrix
* \deprecated since 1.7.3
Member

I think the next big one would be 2.0 unless something major comes up.

Contributor Author

Done.

* \param fname the name of the file
* \param silent whether print messages during loading
* \param out a loaded data matrix
* \return 0 when success, -1 when failure happens
*/
XGB_DLL int XGDMatrixCreateFromFile(const char *fname, int silent, DMatrixHandle *out);

/*!
* \brief load a data matrix
* \param config JSON encoded parameters for DMatrix construction. Accepted fields are:
* - filename: The name of the file.
* - silent (optional): Whether to print message during loading. Default to true.
* - need_split (optional): Whether to split the file. Default to true in distributed mode, false
* otherwise.
* - data_split_mode (optional): Whether to split by row or column. If need_split is true, the
* file is split accordingly; if false, this is only an indicator on how the file was split
* beforehand. Default to row.
* \param out a loaded data matrix
* \return 0 when success, -1 when failure happens
*/
XGB_DLL int XGDMatrixCreateFromFileV2(char const *config, DMatrixHandle *out);
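
To make the documented config concrete, here is a minimal sketch of how a caller might assemble the JSON string passed as `config`. The helper name `make_dmatrix_file_config` is hypothetical, the sketch does not call into XGBoost, and the field names are taken from the doc comment in this revision of the diff (they may differ in later commits).

```python
import json
from typing import Optional


def make_dmatrix_file_config(
    filename: str,
    silent: bool = True,
    need_split: Optional[bool] = None,
    data_split_mode: int = 0,
) -> bytes:
    """Hypothetical helper: encode the documented fields as UTF-8 JSON.

    Booleans are written as integers because the C++ side in this diff
    reads `silent` and `need_split` as JSON integers.
    """
    config = {
        "filename": filename,
        "silent": int(silent),
        "data_split_mode": data_split_mode,  # 0 = row, 1 = col in this diff
    }
    # Leave `need_split` out unless the caller sets it, so that the
    # library-side default applies.
    if need_split is not None:
        config["need_split"] = int(need_split)
    return json.dumps(config).encode("utf-8")
```

Omitting `need_split` by default keeps the decision with the library, which is the behavior the doc comment describes for distributed versus local mode.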
Member

Have you considered making it a proper API for URIs, for instance FromURI? Users actually need some of the URI formats, like file.txt?format=csv. XGBoost/dmlc-core doesn't guess the format, which is a source of bugs when users pass in only the file name.

Also, do you plan to introduce need_split for other input sources as well? If not, we can make it a URI parameter and limit its use to this function. Otherwise, we need an additional parameter in all language bindings.

Contributor Author

Changed to URI.

For need_split, we can get rid of it if we are just going to preserve the current behavior, which is to split for distributed training and not otherwise. If users want more flexibility, they can always use another API to load the data. For distributed training, most people are probably not using the file API anyway.

Contributor Author

Removed it as a parameter.
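
For readers unfamiliar with the URI form discussed in this thread, the sketch below builds a string such as file.txt?format=csv. The helper `build_data_uri` is hypothetical and not part of XGBoost. The optional `#` cache suffix follows the `#` handling visible in `DMatrix::Load` in this diff; the exact grammar accepted by dmlc-core is an assumption here.

```python
def build_data_uri(path: str, file_format: str, cache_prefix: str = "") -> str:
    """Hypothetical helper: attach an explicit format to a data file path.

    Stating the format explicitly avoids relying on any guessing when
    only a bare file name is passed.
    """
    uri = f"{path}?format={file_format}"
    if cache_prefix:
        # A single `#` separates the file part from the cache file prefix.
        uri += f"#{cache_prefix}"
    return uri
```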


/**
* @example c-api-demo.c
*/
16 changes: 9 additions & 7 deletions include/xgboost/data.h
@@ -40,9 +40,7 @@ enum class DataType : uint8_t {

enum class FeatureType : uint8_t { kNumerical = 0, kCategorical = 1 };

enum class DataSplitMode : int {
kAuto = 0, kCol = 1, kRow = 2, kNone = 3
};
enum class DataSplitMode : int { kRow = 0, kCol = 1 };

/*!
* \brief Meta information about dataset, always sit in memory.
@@ -60,6 +58,8 @@ class MetaInfo {
uint64_t num_nonzero_{0}; // NOLINT
/*! \brief label of each instance */
linalg::Tensor<float, 2> labels;
/*! \brief data split mode */
DataSplitMode data_split_mode{DataSplitMode::kRow};
/*!
* \brief the index of begin and end of a group
* needed when the learning task is ranking.
@@ -544,16 +544,18 @@ class DMatrix {
* \brief Load DMatrix from URI.
* \param uri The URI of input.
* \param silent Whether print information during loading.
* \param data_split_mode Mode to read in part of the data, divided among the workers in distributed mode.
* \param data_split_mode In distributed mode, split the input according to this mode; otherwise,
* it's just an indicator of how the input was split beforehand.
* \param file_format The format type of the file, used for dmlc::Parser::Create.
* By default "auto" will be able to load in both local binary file.
* \param page_size Page size for external memory.
* \return The created DMatrix.
*/
static DMatrix* Load(const std::string& uri,
bool silent,
DataSplitMode data_split_mode,
const std::string& file_format = "auto");
bool silent = true,
DataSplitMode data_split_mode = DataSplitMode::kRow,
const std::string& file_format = "auto",
bool need_split = false);

/**
* \brief Creates a new DMatrix from an external data adapter.
10 changes: 10 additions & 0 deletions python-package/xgboost/core.py
@@ -10,6 +10,7 @@
import warnings
from abc import ABC, abstractmethod
from collections.abc import Mapping
from enum import IntEnum, unique
from functools import wraps
from inspect import Parameter, signature
from typing import (
@@ -608,6 +609,13 @@ def inner_f(*args: Any, **kwargs: Any) -> _T:
_deprecate_positional_args = require_keyword_args(False)


@unique
class DataSplitMode(IntEnum):
    """Supported data split mode for DMatrix."""
    ROW = 0
    COL = 1


class DMatrix: # pylint: disable=too-many-instance-attributes,too-many-public-methods
"""Data Matrix used in XGBoost.

@@ -635,6 +643,7 @@ def __init__(
label_upper_bound: Optional[ArrayLike] = None,
feature_weights: Optional[ArrayLike] = None,
enable_categorical: bool = False,
data_split_mode: DataSplitMode = DataSplitMode.ROW,
) -> None:
"""Parameters
----------
@@ -728,6 +737,7 @@ def __init__(
feature_names=feature_names,
feature_types=feature_types,
enable_categorical=enable_categorical,
data_split_mode=data_split_mode,
)
assert handle is not None
self.handle = handle
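
The `DataSplitMode` enum added to core.py is an `IntEnum`, so it converts directly to the integer the C API expects. The sketch below redefines the enum locally so that it runs without XGBoost installed; `to_c_value` is a hypothetical helper used only for illustration.

```python
from enum import IntEnum, unique


@unique
class DataSplitMode(IntEnum):
    """Local copy of the enum from this diff, for illustration only."""
    ROW = 0
    COL = 1


def to_c_value(mode: int) -> int:
    """Hypothetical helper: validate a mode and return its integer value.

    Constructing the enum from a raw integer raises ValueError for
    anything other than 0 or 1.
    """
    return int(DataSplitMode(mode))
```

Per the `__init__` change in this diff, the mode is passed as the `data_split_mode` keyword of `DMatrix` and defaults to `DataSplitMode.ROW`.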
15 changes: 11 additions & 4 deletions python-package/xgboost/data.py
@@ -23,6 +23,7 @@
from .core import (
_LIB,
DataIter,
DataSplitMode,
DMatrix,
_check_call,
_cuda_array_interface,
@@ -865,13 +866,18 @@ def _from_uri(
missing: Optional[FloatCompatible],
feature_names: Optional[FeatureNames],
feature_types: Optional[FeatureTypes],
data_split_mode: DataSplitMode = DataSplitMode.ROW,
) -> DispatchedDataBackendReturnType:
_warn_unused_missing(data, missing)
handle = ctypes.c_void_p()
data = os.fspath(os.path.expanduser(data))
_check_call(_LIB.XGDMatrixCreateFromFile(c_str(data),
ctypes.c_int(1),
ctypes.byref(handle)))
args = {
"filename": str(data),
"data_split_mode": int(data_split_mode),
}
config = bytes(json.dumps(args), "utf-8")
_check_call(_LIB.XGDMatrixCreateFromFileV2(config,
ctypes.byref(handle)))
return handle, feature_names, feature_types


@@ -938,6 +944,7 @@ def dispatch_data_backend(
feature_names: Optional[FeatureNames],
feature_types: Optional[FeatureTypes],
enable_categorical: bool = False,
data_split_mode: DataSplitMode = DataSplitMode.ROW,
) -> DispatchedDataBackendReturnType:
'''Dispatch data for DMatrix.'''
if not _is_cudf_ser(data) and not _is_pandas_series(data):
@@ -953,7 +960,7 @@ def dispatch_data_backend(
if _is_numpy_array(data):
return _from_numpy_array(data, missing, threads, feature_names, feature_types)
if _is_uri(data):
return _from_uri(data, missing, feature_names, feature_types)
return _from_uri(data, missing, feature_names, feature_types, data_split_mode)
if _is_list(data):
return _from_list(data, missing, threads, feature_names, feature_types)
if _is_tuple(data):
33 changes: 24 additions & 9 deletions src/c_api/c_api.cc
@@ -206,17 +206,32 @@ XGB_DLL int XGBGetGlobalConfig(const char** json_str) {
}

XGB_DLL int XGDMatrixCreateFromFile(const char *fname, int silent, DMatrixHandle *out) {
API_BEGIN();
auto data_split_mode = DataSplitMode::kNone;
if (collective::IsFederated()) {
LOG(CONSOLE) << "XGBoost federated mode detected, not splitting data among workers";
} else if (collective::IsDistributed()) {
LOG(CONSOLE) << "XGBoost distributed mode detected, will split data among workers";
data_split_mode = DataSplitMode::kRow;
}
xgboost_CHECK_C_ARG_PTR(fname);
xgboost_CHECK_C_ARG_PTR(out);
*out = new std::shared_ptr<DMatrix>(DMatrix::Load(fname, silent != 0, data_split_mode));

Json config{Object()};
config["filename"] = std::string{fname};
config["silent"] = silent;
std::string config_str;
Json::Dump(config, &config_str);
return XGDMatrixCreateFromFileV2(config_str.c_str(), out);
}

XGB_DLL int XGDMatrixCreateFromFileV2(const char *config, DMatrixHandle *out) {
API_BEGIN();
xgboost_CHECK_C_ARG_PTR(config);
xgboost_CHECK_C_ARG_PTR(out);

auto jconfig = Json::Load(StringView{config});
std::string filename = RequiredArg<String>(jconfig, "filename", __func__);
auto silent = static_cast<bool>(OptionalArg<Integer, int64_t>(jconfig, "silent", 1));
auto need_split = static_cast<bool>(OptionalArg<Integer, int64_t>(
jconfig, "need_split", (collective::IsDistributed() && !collective::IsFederated()) ? 1 : 0));
auto data_split_mode =
static_cast<DataSplitMode>(OptionalArg<Integer, int64_t>(jconfig, "data_split_mode", 0));

*out = new std::shared_ptr<DMatrix>(
DMatrix::Load(filename, silent, data_split_mode, "auto", need_split));
API_END();
}
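
The default handling in `XGDMatrixCreateFromFileV2` can be restated as a small Python sketch. This mirrors the logic shown in this revision of the diff only; `resolve_load_args` and `LoadArgs` are hypothetical names, and the two booleans stand in for `collective::IsDistributed()` and `collective::IsFederated()`.

```python
import json
from typing import NamedTuple


class LoadArgs(NamedTuple):
    filename: str
    silent: bool
    need_split: bool
    data_split_mode: int


def resolve_load_args(config: str, is_distributed: bool, is_federated: bool) -> LoadArgs:
    """Hypothetical restatement of the argument defaults in this diff."""
    jconfig = json.loads(config)
    if "filename" not in jconfig:
        raise ValueError("`filename` is required")
    # Split by default only for distributed, non-federated training.
    default_split = 1 if (is_distributed and not is_federated) else 0
    return LoadArgs(
        filename=jconfig["filename"],
        silent=bool(jconfig.get("silent", 1)),
        need_split=bool(jconfig.get("need_split", default_split)),
        data_split_mode=int(jconfig.get("data_split_mode", 0)),  # 0 = row
    )
```

An explicit `need_split` in the config always wins over the computed default, which is what lets a caller mark a file as already split.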

13 changes: 1 addition & 12 deletions src/cli_main.cc
@@ -112,10 +112,8 @@ struct CLIParam : public XGBoostParameter<CLIParam> {
DMLC_DECLARE_FIELD(name_pred).set_default("pred.txt")
.describe("Name of the prediction file.");
DMLC_DECLARE_FIELD(dsplit).set_default(0)
.add_enum("auto", 0)
.add_enum("row", 0)
.add_enum("col", 1)
.add_enum("row", 2)
.add_enum("none", 3)
.describe("Data split mode.");
DMLC_DECLARE_FIELD(ntree_limit).set_default(0).set_lower_bound(0)
.describe("(Deprecated) Use iteration_begin/iteration_end instead.");
@@ -158,15 +156,6 @@ struct CLIParam : public XGBoostParameter<CLIParam> {
if (name_pred == "stdout") {
save_period = 0;
}
if (dsplit == static_cast<int>(DataSplitMode::kAuto)) {
if (collective::IsFederated()) {
dsplit = static_cast<int>(DataSplitMode::kNone);
} else if (collective::IsDistributed()) {
dsplit = static_cast<int>(DataSplitMode::kRow);
} else {
dsplit = static_cast<int>(DataSplitMode::kNone);
}
}
}
};
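
One way to read this change: the old four-valued `dsplit` (auto, col, row, none) mixed two questions, whether to split and along which axis. The sketch below translates a legacy value into the (`need_split`, `data_split_mode`) pair used elsewhere in this diff. It is an interpretation for illustration, not code from the PR; `translate_legacy_dsplit` is a hypothetical name, and the diff does not show the CLI performing this translation.

```python
from typing import Tuple

# New DataSplitMode values from this diff.
ROW, COL = 0, 1


def translate_legacy_dsplit(
    dsplit: str, is_distributed: bool, is_federated: bool
) -> Tuple[bool, int]:
    """Hypothetical mapping from a legacy dsplit name to (need_split, mode)."""
    if dsplit == "auto":
        # Removed logic: federated -> none, distributed -> row, else none.
        dsplit = "row" if (is_distributed and not is_federated) else "none"
    if dsplit == "row":
        return True, ROW
    if dsplit == "col":
        return True, COL
    if dsplit == "none":
        # No splitting; the mode is then only an indicator and defaults to row.
        return False, ROW
    raise ValueError(f"unknown dsplit value: {dsplit!r}")
```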

19 changes: 11 additions & 8 deletions src/data/data.cc
@@ -782,19 +782,21 @@ DMatrix *TryLoadBinary(std::string fname, bool silent) {
}

DMatrix* DMatrix::Load(const std::string& uri, bool silent, DataSplitMode data_split_mode,
const std::string& file_format) {
CHECK(data_split_mode == DataSplitMode::kRow ||
data_split_mode == DataSplitMode::kCol ||
data_split_mode == DataSplitMode::kNone)
<< "Precondition violated; data split mode can only be 'row', 'col', or 'none'";
const std::string& file_format, bool need_split) {
if (need_split) {
LOG(CONSOLE) << "Splitting data among workers";
} else {
LOG(CONSOLE) << "Not splitting data among workers";
}

std::string fname, cache_file;
size_t dlm_pos = uri.find('#');
if (dlm_pos != std::string::npos) {
cache_file = uri.substr(dlm_pos + 1, uri.length());
fname = uri.substr(0, dlm_pos);
CHECK_EQ(cache_file.find('#'), std::string::npos)
<< "Only one `#` is allowed in file path for cache file specification.";
if (data_split_mode == DataSplitMode::kRow) {
if (need_split && data_split_mode == DataSplitMode::kRow) {
std::ostringstream os;
std::vector<std::string> cache_shards = common::Split(cache_file, ':');
for (size_t i = 0; i < cache_shards.size(); ++i) {
@@ -828,7 +830,7 @@ DMatrix* DMatrix::Load(const std::string& uri, bool silent, DataSplitMode data_s
}

int partid = 0, npart = 1;
if (data_split_mode == DataSplitMode::kRow) {
if (need_split && data_split_mode == DataSplitMode::kRow) {
partid = collective::GetRank();
npart = collective::GetWorldSize();
} else {
@@ -887,7 +889,7 @@ DMatrix* DMatrix::Load(const std::string& uri, bool silent, DataSplitMode data_s
* since partitioned data not knowing the real number of features. */
collective::Allreduce<collective::Operation::kMax>(&dmat->Info().num_col_, 1);

if (data_split_mode == DataSplitMode::kCol) {
if (need_split && data_split_mode == DataSplitMode::kCol) {
if (!cache_file.empty()) {
LOG(FATAL) << "Column-wise data split is not support for external memory.";
}
@@ -898,6 +900,7 @@ DMatrix* DMatrix::Load(const std::string& uri, bool silent, DataSplitMode data_s
delete dmat;
return sliced;
} else {
dmat->Info().data_split_mode = data_split_mode;
return dmat;
}
}
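
Two pieces of `DMatrix::Load` shown in this diff can be sketched in Python: the `#` cache-file split of the URI and the choice of partition for row-wise loading. Both function names are hypothetical. The `else` branch of the partition choice is elided in the diff, so the sketch assumes it falls back to the initial values `partid = 0, npart = 1`.

```python
from typing import Tuple


def split_cache_suffix(uri: str) -> Tuple[str, str]:
    """Split `uri` at the first `#` into (fname, cache_file)."""
    fname, _, cache_file = uri.partition("#")
    if "#" in cache_file:
        raise ValueError(
            "Only one `#` is allowed in file path for cache file specification."
        )
    return fname, cache_file


def row_partition(
    need_split: bool, split_by_row: bool, rank: int, world_size: int
) -> Tuple[int, int]:
    """Return (partid, npart) for reading the input file."""
    if need_split and split_by_row:
        # Each worker reads only its own shard of the rows.
        return rank, world_size
    # Assumption: without row splitting, every worker reads the whole file.
    return 0, 1
```

With `need_split` false, a row-mode DMatrix is read in full by each worker and `data_split_mode` only records how the data was divided beforehand.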
1 change: 1 addition & 0 deletions src/data/simple_dmatrix.cc
@@ -65,6 +65,7 @@ DMatrix* SimpleDMatrix::SliceCol(std::size_t start, std::size_t size) {
out->Info() = this->Info().Copy();
out->Info().num_nonzero_ = h_offset.back();
}
out->Info().data_split_mode = DataSplitMode::kCol;
return out;
}
