diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md index 6f75b56ce848..a406a4358478 100644 --- a/CONTRIBUTORS.md +++ b/CONTRIBUTORS.md @@ -76,3 +76,5 @@ List of Contributors * [Andy Adinets](https://github.com/canonizer) * [Henry Gouk](https://github.com/henrygouk) * [Pierre de Sahb](https://github.com/pdesahb) +* [liuliang01](https://github.com/liuliang01) + - liuliang01 added support for the qid column for LibSVM input format. This makes ranking task easier in distributed setting. diff --git a/dmlc-core b/dmlc-core index 1c422a99786b..459ab734d15a 160000 --- a/dmlc-core +++ b/dmlc-core @@ -1 +1 @@ -Subproject commit 1c422a99786ba677db953a565807973154f6d374 +Subproject commit 459ab734d15acd68fd437abf845c7c1730b5a38f diff --git a/doc/get_started/index.md b/doc/get_started/index.md index 13d843ad6d61..b3bd52310ec9 100644 --- a/doc/get_started/index.md +++ b/doc/get_started/index.md @@ -5,6 +5,7 @@ on the demo dataset on a binary classification task. ## Links to Helpful Other Resources - See [Installation Guide](../build.md) on how to install xgboost. +- See [Text Input Format](../input_format.md) on using text format for specifying training/testing data. - See [How to pages](../how_to/index.md) on various tips on using xgboost. - See [Tutorials](../tutorials/index.md) on tutorials on specific tasks. - See [Learning to use XGBoost by Examples](../../demo) for more code examples. diff --git a/doc/how_to/index.md b/doc/how_to/index.md index c0e5f72b1059..347a98582cc7 100644 --- a/doc/how_to/index.md +++ b/doc/how_to/index.md @@ -6,6 +6,7 @@ This page contains guidelines to use and develop XGBoost. - [How to Install XGBoost](../build.md) ## Use XGBoost in Specific Ways +- [Text input format](../input_format.md) - [Parameter tuning guide](param_tuning.md) - [Use out of core computation for large dataset](external_memory.md) - [Use XGBoost GPU algorithms](../gpu/index.md) diff --git a/doc/input_format.md b/doc/input_format.md index fcefc5eae815..64fa6cbf8fc7 100644 --- a/doc/input_format.md +++ b/doc/input_format.md @@ -2,7 +2,9 @@ Text Input Format of DMatrix ============================ ## Basic Input Format -As we have mentioned, XGBoost takes LibSVM format. For training or predicting, XGBoost takes an instance file with the format as below: +XGBoost currently supports two text formats for ingesting data: LibSVM and CSV. The rest of this document will describe the LibSVM format. (See [here](https://en.wikipedia.org/wiki/Comma-separated_values) for a description of the CSV format.) + +For training or predicting, XGBoost takes an instance file with the format as below: train.txt ``` @@ -14,13 +16,12 @@ train.txt ``` Each line represent a single instance, and in the first line '1' is the instance label,'101' and '102' are feature indices, '1.2' and '0.03' are feature values. In the binary classification case, '1' is used to indicate positive samples, and '0' is used to indicate negative samples. We also support probability values in [0,1] as label, to indicate the probability of the instance being positive. -Additional Information ----------------------- -Note: these additional information are only applicable to single machine version of the package. +Auxiliary Files for Additional Information +------------------------------------------ +**Note: all information below is applicable only to single-node version of the package.** If you'd like to perform distributed training with multiple nodes, skip to the next section. ### Group Input Format -As XGBoost supports accomplishing [ranking task](../demo/rank), we support the group input format. In ranking task, instances are categorized into different groups in real world scenarios, for example, in the learning to rank web pages scenario, the web page instances are grouped by their queries. Except the instance file mentioned in the group input format, XGBoost need an file indicating the group information. For example, if the instance file is the "train.txt" shown above, -and the group file is as below: +For [ranking task](../demo/rank), XGBoost supports the group input format. In ranking task, instances are categorized into *query groups* in real world scenarios. For example, in the learning to rank web pages scenario, the web page instances are grouped by their queries. XGBoost requires an file that indicates the group information. For example, if the instance file is the "train.txt" shown above, the group file should be named "train.txt.group" and be of the following format: train.txt.group ``` @@ -28,10 +29,10 @@ train.txt.group 3 ``` This means that, the data set contains 5 instances, and the first two instances are in a group and the other three are in another group. The numbers in the group file are actually indicating the number of instances in each group in the instance file in order. -While configuration, you do not have to indicate the path of the group file. If the instance file name is "xxx", XGBoost will check whether there is a file named "xxx.group" in the same directory and decides whether to read the data as group input format. +At the time of configuration, you do not have to indicate the path of the group file. If the instance file name is "xxx", XGBoost will check whether there is a file named "xxx.group" in the same directory. ### Instance Weight File -XGBoost supports providing each instance an weight to differentiate the importance of instances. For example, if we provide an instance weight file for the "train.txt" file in the example as below: +Instances in the training data may be assigned weights to differentiate relative importance among them. For example, if we provide an instance weight file for the "train.txt" file in the example as below: train.txt.weight ``` @@ -41,10 +42,12 @@ train.txt.weight 1 0.5 ``` -It means that XGBoost will emphasize more on the first and fourth instanceļ¼Œ that is to say positive instances while training. -The configuration is similar to configuring the group information. If the instance file name is "xxx", XGBoost will check whether there is a file named "xxx.weight" in the same directory and if there is, will use the weights while training models. Weights will be included into an "xxx.buffer" file that is created by XGBoost automatically. If you want to update the weights, you need to delete the "xxx.buffer" file prior to launching XGBoost. +It means that XGBoost will emphasize more on the first and fourth instance (i.e. the positive instances) while training. +The configuration is similar to configuring the group information. If the instance file name is "xxx", XGBoost will look for a file named "xxx.weight" in the same directory. If the file exists, the instance weights will be extracted and used at the time of training. + +NOTE. If you choose to save the training data as a binary buffer (using [save_binary()](http://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.DMatrix.save_binary)), keep in mind that the resulting binary buffer file will include the instance weights. To update the weights, use [the set_weight() function](http://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.DMatrix.set_weight). -### Initial Margin file +### Initial Margin File XGBoost supports providing each instance an initial margin prediction. For example, if we have a initial prediction using logistic regression for "train.txt" file, we can create the following file: train.txt.base_margin @@ -54,3 +57,37 @@ train.txt.base_margin 3.4 ``` XGBoost will take these values as initial margin prediction and boost from that. An important note about base_margin is that it should be margin prediction before transformation, so if you are doing logistic loss, you will need to put in value before logistic transformation. If you are using XGBoost predictor, use pred_margin=1 to output margin values. + +Embedding additional information inside LibSVM file +--------------------------------------------------- +**This section is applicable to both single- and multiple-node settings.** + +### Query ID Columns +This is most useful for [ranking task](../demo/rank), where the instances are grouped into query groups. You may embed query group ID for each instance in the LibSVM file by adding a token of form `qid:xx` in each row: + +train.txt +``` +1 qid:1 101:1.2 102:0.03 +0 qid:1 1:2.1 10001:300 10002:400 +0 qid:2 0:1.3 1:0.3 +1 qid:2 0:0.01 1:0.3 +0 qid:3 0:0.2 1:0.3 +1 qid:3 3:-0.1 10:-0.3 +0 qid:3 6:0.2 10:0.15 +``` +Keep in mind the following restrictions: +* It is not allowed to specify query ID's for some instances but not for others. Either every row is assigned query ID's or none at all. +* The rows have to be sorted in ascending order by the query IDs. So, for instance, you may not have one row having large query ID than any of the following rows. + +### Instance weights +You may specify instance weights in the LibSVM file by appending each instance label with the corresponding weight in the form of `[label]:[weight]`, as shown by the following example: + +train.txt +``` +1:1.0 101:1.2 102:0.03 +0:0.5 1:2.1 10001:300 10002:400 +0:0.5 0:1.3 1:0.3 +1:1.0 0:0.01 1:0.3 +0:0.5 0:0.2 1:0.3 +``` +where the negative instances are assigned half weights compared to the positive instances. diff --git a/include/xgboost/data.h b/include/xgboost/data.h index aabdbc531714..467572fa70b0 100644 --- a/include/xgboost/data.h +++ b/include/xgboost/data.h @@ -53,6 +53,8 @@ class MetaInfo { std::vector group_ptr_; /*! \brief weights of each instance, optional */ std::vector weights_; + /*! \brief session-id of each instance, optional */ + std::vector qids_; /*! * \brief initialized margins, * if specified, xgboost will start from this init margin @@ -60,7 +62,9 @@ class MetaInfo { */ std::vector base_margin_; /*! \brief version flag, used to check version of this info */ - static const int kVersion = 1; + static const int kVersion = 2; + /*! \brief version that introduced qid field */ + static const int kVersionQidAdded = 2; /*! \brief default constructor */ MetaInfo() = default; /*! @@ -136,6 +140,9 @@ struct Entry { inline static bool CmpValue(const Entry& a, const Entry& b) { return a.fvalue < b.fvalue; } + inline bool operator==(const Entry& other) const { + return (this->index == other.index && this->fvalue == other.fvalue); + } }; /*! diff --git a/src/c_api/c_api.cc b/src/c_api/c_api.cc index 569fff016abe..865af8930a5c 100644 --- a/src/c_api/c_api.cc +++ b/src/c_api/c_api.cc @@ -141,6 +141,8 @@ class NativeDataIter : public dmlc::Parser { block_.offset = dmlc::BeginPtr(offset_); block_.label = dmlc::BeginPtr(label_); block_.weight = dmlc::BeginPtr(weight_); + block_.qid = nullptr; + block_.field = nullptr; block_.index = dmlc::BeginPtr(index_); block_.value = dmlc::BeginPtr(value_); bytes_read_ += offset_.size() * sizeof(size_t) + diff --git a/src/data/data.cc b/src/data/data.cc index 137fd1f06fb3..a8bf41c46655 100644 --- a/src/data/data.cc +++ b/src/data/data.cc @@ -28,6 +28,7 @@ void MetaInfo::Clear() { labels_.clear(); root_index_.clear(); group_ptr_.clear(); + qids_.clear(); weights_.clear(); base_margin_.clear(); } @@ -40,6 +41,7 @@ void MetaInfo::SaveBinary(dmlc::Stream *fo) const { fo->Write(&num_nonzero_, sizeof(num_nonzero_)); fo->Write(labels_); fo->Write(group_ptr_); + fo->Write(qids_); fo->Write(weights_); fo->Write(root_index_); fo->Write(base_margin_); @@ -48,13 +50,18 @@ void MetaInfo::SaveBinary(dmlc::Stream *fo) const { void MetaInfo::LoadBinary(dmlc::Stream *fi) { int version; CHECK(fi->Read(&version, sizeof(version)) == sizeof(version)) << "MetaInfo: invalid version"; - CHECK_EQ(version, kVersion) << "MetaInfo: invalid format"; + CHECK(version >= 1 && version <= kVersion) << "MetaInfo: unsupported file version"; CHECK(fi->Read(&num_row_, sizeof(num_row_)) == sizeof(num_row_)) << "MetaInfo: invalid format"; CHECK(fi->Read(&num_col_, sizeof(num_col_)) == sizeof(num_col_)) << "MetaInfo: invalid format"; CHECK(fi->Read(&num_nonzero_, sizeof(num_nonzero_)) == sizeof(num_nonzero_)) << "MetaInfo: invalid format"; CHECK(fi->Read(&labels_)) << "MetaInfo: invalid format"; CHECK(fi->Read(&group_ptr_)) << "MetaInfo: invalid format"; + if (version >= kVersionQidAdded) { + CHECK(fi->Read(&qids_)) << "MetaInfo: invalid format"; + } else { // old format doesn't contain qid field + qids_.clear(); + } CHECK(fi->Read(&weights_)) << "MetaInfo: invalid format"; CHECK(fi->Read(&root_index_)) << "MetaInfo: invalid format"; CHECK(fi->Read(&base_margin_)) << "MetaInfo: invalid format"; diff --git a/src/data/simple_csr_source.cc b/src/data/simple_csr_source.cc index 05a3f6aa5135..816404fad676 100644 --- a/src/data/simple_csr_source.cc +++ b/src/data/simple_csr_source.cc @@ -4,6 +4,7 @@ */ #include #include +#include #include "./simple_csr_source.h" namespace xgboost { @@ -26,6 +27,10 @@ void SimpleCSRSource::CopyFrom(DMatrix* src) { } void SimpleCSRSource::CopyFrom(dmlc::Parser* parser) { + // use qid to get group info + const uint64_t default_max = std::numeric_limits::max(); + uint64_t last_group_id = default_max; + bst_uint group_size = 0; this->Clear(); while (parser->Next()) { const dmlc::RowBlock& batch = parser->Value(); @@ -35,6 +40,19 @@ void SimpleCSRSource::CopyFrom(dmlc::Parser* parser) { if (batch.weight != nullptr) { info.weights_.insert(info.weights_.end(), batch.weight, batch.weight + batch.size); } + if (batch.qid != nullptr) { + info.qids_.insert(info.qids_.end(), batch.qid, batch.qid + batch.size); + // get group + for (size_t i = 0; i < batch.size; ++i) { + const uint64_t cur_group_id = batch.qid[i]; + if (last_group_id == default_max || last_group_id != cur_group_id) { + info.group_ptr_.push_back(group_size); + } + last_group_id = cur_group_id; + ++group_size; + } + } + // Remove the assertion on batch.index, which can be null in the case that the data in this // batch is entirely sparse. Although it's true that this indicates a likely issue with the // user's data workflows, passing XGBoost entirely sparse data should not cause it to fail. @@ -56,7 +74,14 @@ void SimpleCSRSource::CopyFrom(dmlc::Parser* parser) { page_.offset.push_back(page_.offset[top - 1] + batch.offset[i + 1] - batch.offset[0]); } } + if (last_group_id != default_max) { + if (group_size > info.group_ptr_.back()) { + info.group_ptr_.push_back(group_size); + } + } this->info.num_nonzero_ = static_cast(page_.data.size()); + // Either every row has query ID or none at all + CHECK(info.qids_.empty() || info.qids_.size() == info.num_row_); } void SimpleCSRSource::LoadBinary(dmlc::Stream* fi) { diff --git a/src/data/sparse_page_source.cc b/src/data/sparse_page_source.cc index 7541ac9411f1..ddac4a9415f0 100644 --- a/src/data/sparse_page_source.cc +++ b/src/data/sparse_page_source.cc @@ -122,6 +122,10 @@ void SparsePageSource::Create(dmlc::Parser* src, constexpr double kStep = 4.0; size_t tick_expected = static_cast(kStep); + const uint64_t default_max = std::numeric_limits::max(); + uint64_t last_group_id = default_max; + bst_uint group_size = 0; + while (src->Next()) { const dmlc::RowBlock& batch = src->Value(); if (batch.label != nullptr) { @@ -130,6 +134,18 @@ void SparsePageSource::Create(dmlc::Parser* src, if (batch.weight != nullptr) { info.weights_.insert(info.weights_.end(), batch.weight, batch.weight + batch.size); } + if (batch.qid != nullptr) { + info.qids_.insert(info.qids_.end(), batch.qid, batch.qid + batch.size); + // get group + for (size_t i = 0; i < batch.size; ++i) { + const uint64_t cur_group_id = batch.qid[i]; + if (last_group_id == default_max || last_group_id != cur_group_id) { + info.group_ptr_.push_back(group_size); + } + last_group_id = cur_group_id; + ++group_size; + } + } info.num_row_ += batch.size; info.num_nonzero_ += batch.offset[batch.size] - batch.offset[0]; for (size_t i = batch.offset[0]; i < batch.offset[batch.size]; ++i) { @@ -153,6 +169,11 @@ void SparsePageSource::Create(dmlc::Parser* src, } } } + if (last_group_id != default_max) { + if (group_size > info.group_ptr_.back()) { + info.group_ptr_.push_back(group_size); + } + } if (page->data.size() != 0) { writer.PushWrite(std::move(page)); @@ -162,6 +183,8 @@ void SparsePageSource::Create(dmlc::Parser* src, dmlc::Stream::Create(name_info.c_str(), "w")); int tmagic = kMagic; fo->Write(&tmagic, sizeof(tmagic)); + // Either every row has query ID or none at all + CHECK(info.qids_.empty() || info.qids_.size() == info.num_row_); info.SaveBinary(fo.get()); } LOG(CONSOLE) << "SparsePageSource: Finished writing to " << name_info; diff --git a/tests/cpp/data/test_metainfo.cc b/tests/cpp/data/test_metainfo.cc index f2d5d2ae4aef..41bc7b9793d2 100644 --- a/tests/cpp/data/test_metainfo.cc +++ b/tests/cpp/data/test_metainfo.cc @@ -1,5 +1,9 @@ // Copyright by Contributors +#include #include +#include +#include +#include "../../../src/data/simple_csr_source.h" #include "../helpers.h" @@ -49,7 +53,7 @@ TEST(MetaInfo, SaveLoadBinary) { info.SaveBinary(fs); delete fs; - ASSERT_EQ(GetFileSize(tmp_file), 76) + ASSERT_EQ(GetFileSize(tmp_file), 84) << "Expected saved binary file size to be same as object size"; fs = dmlc::Stream::Create(tmp_file.c_str(), "r"); @@ -61,3 +65,61 @@ TEST(MetaInfo, SaveLoadBinary) { std::remove(tmp_file.c_str()); } + +TEST(MetaInfo, LoadQid) { + std::string tmp_file = TempFileName(); + { + std::unique_ptr fs( + dmlc::Stream::Create(tmp_file.c_str(), "w")); + dmlc::ostream os(fs.get()); + os << R"qid(3 qid:1 1:1 2:1 3:0 4:0.2 5:0 + 2 qid:1 1:0 2:0 3:1 4:0.1 5:1 + 1 qid:1 1:0 2:1 3:0 4:0.4 5:0 + 1 qid:1 1:0 2:0 3:1 4:0.3 5:0 + 1 qid:2 1:0 2:0 3:1 4:0.2 5:0 + 2 qid:2 1:1 2:0 3:1 4:0.4 5:0 + 1 qid:2 1:0 2:0 3:1 4:0.1 5:0 + 1 qid:2 1:0 2:0 3:1 4:0.2 5:0 + 2 qid:3 1:0 2:0 3:1 4:0.1 5:1 + 3 qid:3 1:1 2:1 3:0 4:0.3 5:0 + 4 qid:3 1:1 2:0 3:0 4:0.4 5:1 + 1 qid:3 1:0 2:1 3:1 4:0.5 5:0)qid"; + os.set_stream(nullptr); + } + std::unique_ptr dmat( + xgboost::DMatrix::Load(tmp_file, true, false, "libsvm")); + std::remove(tmp_file.c_str()); + + const xgboost::MetaInfo& info = dmat->Info(); + const std::vector expected_qids{1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3}; + const std::vector expected_group_ptr{0, 4, 8, 12}; + CHECK(info.qids_ == expected_qids); + CHECK(info.group_ptr_ == expected_group_ptr); + CHECK_GE(info.kVersion, info.kVersionQidAdded); + + const std::vector expected_offset{ + 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60 + }; + const std::vector expected_data{ + {1, 1}, {2, 1}, {3, 0}, {4, 0.2}, {5, 0}, + {1, 0}, {2, 0}, {3, 1}, {4, 0.1}, {5, 1}, + {1, 0}, {2, 1}, {3, 0}, {4, 0.4}, {5, 0}, + {1, 0}, {2, 0}, {3, 1}, {4, 0.3}, {5, 0}, + {1, 0}, {2, 0}, {3, 1}, {4, 0.2}, {5, 0}, + {1, 1}, {2, 0}, {3, 1}, {4, 0.4}, {5, 0}, + {1, 0}, {2, 0}, {3, 1}, {4, 0.1}, {5, 0}, + {1, 0}, {2, 0}, {3, 1}, {4, 0.2}, {5, 0}, + {1, 0}, {2, 0}, {3, 1}, {4, 0.1}, {5, 1}, + {1, 1}, {2, 1}, {3, 0}, {4, 0.3}, {5, 0}, + {1, 1}, {2, 0}, {3, 0}, {4, 0.4}, {5, 1}, + {1, 0}, {2, 1}, {3, 1}, {4, 0.5}, {5, 0} + }; + dmlc::DataIter* iter = dmat->RowIterator(); + iter->BeforeFirst(); + CHECK(iter->Next()); + const xgboost::SparsePage& batch = iter->Value(); + CHECK_EQ(batch.base_rowid, 0); + CHECK(batch.offset == expected_offset); + CHECK(batch.data == expected_data); + CHECK(!iter->Next()); +}