Use ptr from mmap for GHistIndexMatrix and ColumnMatrix. #9315

Merged: 9 commits, Jun 27, 2023
2 changes: 2 additions & 0 deletions doc/c.rst
@@ -33,6 +33,8 @@ DMatrix
.. doxygengroup:: DMatrix
:project: xgboost

.. _c_streaming:

Streaming
---------

3 changes: 3 additions & 0 deletions doc/tutorials/dask.rst
@@ -54,6 +54,9 @@ on a dask cluster:
y = da.random.random(size=(num_obs, 1), chunks=(1000, 1))

dtrain = xgb.dask.DaskDMatrix(client, X, y)
# or
# dtrain = xgb.dask.DaskQuantileDMatrix(client, X, y)
# `DaskQuantileDMatrix` is available for the `hist` and `gpu_hist` tree methods.

output = xgb.dask.train(
client,
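
For reference, a self-contained sketch of the same flow using the quantile variant; the
cluster size, data shapes, and parameters below are illustrative assumptions rather than
recommendations:

.. code-block:: python

    from dask import array as da
    from dask.distributed import Client, LocalCluster
    import xgboost as xgb

    if __name__ == "__main__":
        with LocalCluster(n_workers=2) as cluster, Client(cluster) as client:
            num_obs, num_features = 10_000, 20
            X = da.random.random((num_obs, num_features), chunks=(1000, num_features))
            y = da.random.random((num_obs, 1), chunks=(1000, 1))

            # Cheaper to construct than DaskDMatrix, but only supported by the
            # hist-based tree methods.
            dtrain = xgb.dask.DaskQuantileDMatrix(client, X, y)
            output = xgb.dask.train(
                client,
                {"tree_method": "hist", "objective": "reg:squarederror"},
                dtrain,
                num_boost_round=10,
            )
            booster = output["booster"]
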
81 changes: 70 additions & 11 deletions doc/tutorials/external_memory.rst
@@ -22,6 +22,15 @@ GPU-based training algorithm. We will introduce them in the following sections.

The feature is still experimental as of 2.0. The performance is not well optimized.

The external memory support has gone through multiple iterations and is still under heavy
development. Like the :py:class:`~xgboost.QuantileDMatrix` with
:py:class:`~xgboost.DataIter`, XGBoost loads data batch-by-batch using a custom iterator
supplied by the user. However, unlike the :py:class:`~xgboost.QuantileDMatrix`, external
memory does not concatenate the batches unless the GPU is used (it uses a hybrid approach,
more details follow). Instead, it caches all batches in external memory and fetches them
on demand. Go to the end of the document for a comparison between `QuantileDMatrix` and
external memory.
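
To give a sense of the interface before the detailed walk-through below, here is a minimal
sketch of such an iterator, using synthetic in-memory batches purely for illustration (a
real iterator would load each batch from files or another out-of-core source):

.. code-block:: python

    import os

    import numpy as np
    import xgboost

    class Iterator(xgboost.DataIter):
        """Yield data batch-by-batch; XGBoost caches each batch in external memory."""

        def __init__(self, n_batches: int) -> None:
            self._n_batches = n_batches
            self._it = 0
            rng = np.random.default_rng(0)
            # Toy batches standing in for data loaded from external storage.
            self._batches = [
                (rng.random((1024, 16)), rng.random(1024)) for _ in range(n_batches)
            ]
            # `cache_prefix` tells XGBoost where to place the on-disk cache.
            super().__init__(cache_prefix=os.path.join(".", "cache"))

        def next(self, input_data) -> int:
            """Feed the next batch to XGBoost; return 0 when there is none left."""
            if self._it == self._n_batches:
                return 0
            X, y = self._batches[self._it]
            input_data(data=X, label=y)
            self._it += 1
            return 1

        def reset(self) -> None:
            """Called by XGBoost before starting a new pass over the batches."""
            self._it = 0

    Xy = xgboost.DMatrix(Iterator(4))  # external-memory DMatrix built from the iterator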

*************
Data Iterator
*************
@@ -113,10 +122,11 @@
External memory is supported by GPU algorithms (i.e. when ``tree_method`` is set to
``gpu_hist``). However, the algorithm used for GPU is different from the one used for
CPU. When training on a CPU, the tree method iterates through all batches from external
memory for each step of the tree construction algorithm. On the other hand, the GPU
algorithm uses a hybrid approach. It iterates through the data at the beginning of
each iteration and concatenates all batches into one in GPU memory. To reduce overall
memory usage, users can utilize subsampling. The GPU hist tree method supports
`gradient-based sampling`, enabling users to set a low sampling rate without compromising
accuracy.

.. code-block:: python

@@ -134,6 +144,8 @@ see `this paper <https://arxiv.org/abs/2005.09148>`_.
When the GPU runs out of memory during iteration on external memory, users might
receive a segfault instead of an OOM exception.

.. _ext_remarks:

*******
Remarks
*******
@@ -142,17 +154,64 @@
When using external memory with XGBoost, data is divided into smaller chunks so that only
a fraction of it needs to be stored in memory at any given time. It's important to note
that this method only applies to the predictor data (``X``), while other data, like labels
and internal runtime structures are concatenated. This means that memory reduction is most
effective when dealing with wide datasets where ``X`` is significantly larger in size
compared to other data like ``y``, while it has little impact on slim datasets.

As one might expect, fetching data on demand puts significant pressure on the storage
device. Today's computing devices can process far more data than a storage device can read
in the same amount of time; the gap spans orders of magnitude. A GPU is capable of
processing hundreds of gigabytes of floating-point data in a split second. On the other
hand, a four-lane NVMe storage device connected to a PCIe-4 slot usually has about 6GB/s
of data transfer rate. As a result, the training is likely to be severely bounded by your
storage device. Before adopting the external memory solution, some back-of-envelope
calculations might help you see whether it's viable. For instance, suppose your NVMe drive
can transfer 4GB of data per second (a fairly practical number) and you have 100GB of data
in the compressed XGBoost cache (which corresponds to a dense float32 numpy array of
roughly 200GB). A tree with depth 8 needs at least 16 iterations through the data when the
parameters are right. You need about 14 minutes to train a single tree, without accounting
for other overheads and assuming the computation overlaps with the IO. If your dataset
happens to be TB-scale, then you might need thousands of trees to get a generalized
model. These calculations can help you estimate the expected training time.

However, sometimes we can ameliorate this limitation. One should also consider that the OS
(mostly talking about the Linux kernel) can usually cache the data in host memory. It only
evicts pages when new data comes in and there's no room left. In practice, at least some
portion of the data can persist in host memory throughout the entire training
session. We are aware of this cache when optimizing the external memory fetcher. The
compressed cache is usually smaller than the raw input data, especially when the input is
dense without any missing values. If the host memory can fit a significant portion of this
compressed cache, then the performance should be decent after initialization. Our
development so far focuses on two fronts of optimization for external memory:

- Avoid iterating through the data whenever appropriate.
- If the OS can cache the data, the performance should be close to in-core training.

Starting with XGBoost 2.0, the implementation of external memory uses ``mmap``. It is not
tested against system errors like disconnected network devices (`SIGBUS`). In the face of
a bus error, you will see a hard crash and need to clean up the cache files. If the
training session is likely to take a long time and you are using solutions like NVMe-oF,
we recommend checkpointing your model periodically. Also, it's worth noting that most
tests have been conducted on Linux distributions.
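
One hedged way to checkpoint without extra machinery is to train in slices and save the
booster between slices. A sketch, assuming ``params`` and the external-memory ``Xy``
matrix from the earlier sections:

.. code-block:: python

    import xgboost as xgb

    booster = None
    rounds_per_slice = 50
    for i in range(10):  # 10 * 50 = 500 boosting rounds in total
        booster = xgb.train(
            params,  # assumed training parameters
            Xy,  # assumed external-memory DMatrix
            num_boost_round=rounds_per_slice,
            xgb_model=booster,  # resume from the previous slice
        )
        # A crash (e.g. a bus error from a lost storage device) now loses at
        # most one slice of boosting rounds.
        booster.save_model(f"checkpoint-{(i + 1) * rounds_per_slice}.json")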

Another important point to keep in mind is that creating the initial cache for XGBoost may
take some time. The interface to external memory is through custom iterators, which we
cannot assume to be thread-safe. Therefore, initialization is performed sequentially. Using
`xgboost.config_context` with `verbosity=2` can give you some information on what XGBoost
is doing during the wait if you don't mind the extra output.
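
For example, a small sketch, assuming the iterator ``it`` from the earlier sections:

.. code-block:: python

    import xgboost as xgb

    # Raise verbosity only around the expensive cache construction so progress
    # messages are printed while the DMatrix is being initialized.
    with xgb.config_context(verbosity=2):
        Xy = xgb.DMatrix(it)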

*******************************
Compared to the QuantileDMatrix
*******************************

Passing an iterator to the :py:class:`~xgboost.QuantileDMatrix` enables direct
construction of `QuantileDMatrix` with data chunks. On the other hand, if it's passed to
:py:class:`~xgboost.DMatrix`, it instead enables the external memory feature. The
:py:class:`~xgboost.QuantileDMatrix` concatenates the data in memory after compression and
doesn't fetch data during training. On the other hand, the external memory `DMatrix`
fetches data batches from external memory on demand. Use the `QuantileDMatrix` (with an
iterator if necessary) when you can fit most of your data in memory. Training would be an
order of magnitude faster than using external memory.
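
In code, the difference is only in which class the iterator is passed to. A sketch, with
``it`` being a user-defined :py:class:`~xgboost.DataIter`:

.. code-block:: python

    import xgboost as xgb

    # In-core: batches are compressed and concatenated up front; nothing is
    # fetched from the iterator during training.
    Xy_inmem = xgb.QuantileDMatrix(it)

    # External memory: batches are cached on disk and fetched on demand
    # during training.
    Xy_ext = xgb.DMatrix(it)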

****************
Text File Inputs
****************
18 changes: 9 additions & 9 deletions doc/tutorials/index.rst
@@ -11,22 +11,22 @@ See `Awesome XGBoost <https://github.com/dmlc/xgboost/tree/master/demo>`_ for mo

model
saving_model
learning_to_rank
dart
monotonic
feature_interaction_constraint
aft_survival_analysis
categorical
multioutput
rf
kubernetes
Distributed XGBoost with XGBoost4J-Spark <https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html>
Distributed XGBoost with XGBoost4J-Spark-GPU <https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_gpu_tutorial.html>
dask
spark_estimator
ray
dart
monotonic
rf
feature_interaction_constraint
learning_to_rank
aft_survival_analysis
external_memory
c_api_tutorial
input_format
param_tuning
external_memory
custom_metric_obj
categorical
multioutput
43 changes: 43 additions & 0 deletions doc/tutorials/param_tuning.rst
@@ -58,3 +58,46 @@ This can affect the training of XGBoost model, and there are two ways to improve

- In such a case, you cannot re-balance the dataset
- Set parameter ``max_delta_step`` to a finite number (say 1) to help convergence


*********************
Reducing Memory Usage
*********************

If you are using an HPO library like :py:class:`sklearn.model_selection.GridSearchCV`,
please control the number of threads it can use. It's best to let XGBoost run in parallel
instead of asking `GridSearchCV` to run multiple experiments at the same time (see the
sketch after the copy examples below). For instance, creating a fold of data for cross
validation can consume a significant amount of memory:

.. code-block:: python

# This creates a copy of dataset. X and X_train are both in memory at the same time.

# This happens for every thread at the same time if you run `GridSearchCV` with
# `n_jobs` larger than 1

X_train, X_test, y_train, y_test = train_test_split(X, y)

.. code-block:: python

df = pd.DataFrame()
# This creates a new copy of the dataframe, even if you specify the inplace parameter
new_df = df.drop(...)

.. code-block:: python

array = np.array(...)
# This may or may not make a copy of the data, depending on the type of the data
array.astype(np.float32)

.. code-block:: python

# NumPy uses double precision (float64) by default, do you actually need it?
array = np.array(...)
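
Coming back to the hyper-parameter search itself, here is a hedged sketch of letting
XGBoost, rather than the search, use the available threads; ``X`` and ``y`` are assumed to
exist and the grid is illustrative:

.. code-block:: python

    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBRegressor

    search = GridSearchCV(
        XGBRegressor(tree_method="hist", n_jobs=-1),  # XGBoost uses all cores
        param_grid={"max_depth": [4, 6, 8], "learning_rate": [0.1, 0.3]},
        n_jobs=1,  # fit one candidate at a time to avoid concurrent data copies
    )
    search.fit(X, y)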

You can find more specific memory-reduction practices scattered through the documents,
for instance: :doc:`/tutorials/dask`, :doc:`/gpu/index`, and :doc:`/contrib/scaling`.
However, before going into these, being conscious about making data copies is a good
starting point. It usually consumes a lot more memory than people expect.
11 changes: 3 additions & 8 deletions rabit/include/rabit/internal/io.h
@@ -19,8 +19,7 @@
#include "rabit/internal/utils.h"
#include "rabit/serializable.h"

namespace rabit {
namespace utils {
namespace rabit::utils {
/*! \brief re-use definition of dmlc::SeekStream */
using SeekStream = dmlc::SeekStream;
/**
@@ -31,9 +30,6 @@ struct MemoryFixSizeBuffer : public SeekStream {
// similar to SEEK_END in libc
static std::size_t constexpr kSeekEnd = std::numeric_limits<std::size_t>::max();

protected:
MemoryFixSizeBuffer() = default;

public:
/**
* @brief Ctor
@@ -68,7 +64,7 @@ struct MemoryFixSizeBuffer : public SeekStream {
* @brief Current position in the buffer (stream).
*/
std::size_t Tell() override { return curr_ptr_; }
virtual bool AtEnd() const { return curr_ptr_ == buffer_size_; }
[[nodiscard]] virtual bool AtEnd() const { return curr_ptr_ == buffer_size_; }

protected:
/*! \brief in memory buffer */
@@ -119,6 +115,5 @@ struct MemoryBufferStream : public SeekStream {
/*! \brief current pointer */
size_t curr_ptr_;
}; // class MemoryBufferStream
} // namespace utils
} // namespace rabit
} // namespace rabit::utils
#endif // RABIT_INTERNAL_IO_H_
79 changes: 68 additions & 11 deletions src/common/column_matrix.cc
@@ -1,16 +1,27 @@
/*!
* Copyright 2017-2022 by XGBoost Contributors
/**
* Copyright 2017-2023, XGBoost Contributors
* \brief Utility for fast column-wise access
*/
#include "column_matrix.h"

namespace xgboost {
namespace common {
#include <algorithm> // for transform
#include <cstddef> // for size_t
#include <cstdint> // for uint64_t, uint8_t
#include <limits> // for numeric_limits
#include <type_traits> // for remove_reference_t
#include <vector> // for vector

#include "../data/gradient_index.h" // for GHistIndexMatrix
#include "io.h" // for AlignedResourceReadStream, AlignedFileWriteStream
#include "xgboost/base.h" // for bst_feaature_t
#include "xgboost/span.h" // for Span

namespace xgboost::common {
void ColumnMatrix::InitStorage(GHistIndexMatrix const& gmat, double sparse_threshold) {
auto const nfeature = gmat.Features();
const size_t nrow = gmat.Size();
// identify type of each column
type_.resize(nfeature);
type_ = common::MakeFixedVecWithMalloc(nfeature, ColumnType{});

uint32_t max_val = std::numeric_limits<uint32_t>::max();
for (bst_feature_t fid = 0; fid < nfeature; ++fid) {
@@ -34,7 +45,7 @@ void ColumnMatrix::InitStorage(GHistIndexMatrix const& gmat, double sparse_thres

// want to compute storage boundary for each feature
// using variants of prefix sum scan
feature_offsets_.resize(nfeature + 1);
feature_offsets_ = common::MakeFixedVecWithMalloc(nfeature + 1, std::size_t{0});
size_t accum_index = 0;
feature_offsets_[0] = accum_index;
for (bst_feature_t fid = 1; fid < nfeature + 1; ++fid) {
@@ -49,17 +60,63 @@ void ColumnMatrix::InitStorage(GHistIndexMatrix const& gmat, double sparse_thres
SetTypeSize(gmat.MaxNumBinPerFeat());
auto storage_size =
feature_offsets_.back() * static_cast<std::underlying_type_t<BinTypeSize>>(bins_type_size_);
index_.resize(storage_size, 0);

index_ = common::MakeFixedVecWithMalloc(storage_size, std::uint8_t{0});

if (!all_dense_column) {
row_ind_.resize(feature_offsets_[nfeature]);
row_ind_ = common::MakeFixedVecWithMalloc(feature_offsets_[nfeature], std::size_t{0});
}

// store least bin id for each feature
index_base_ = const_cast<uint32_t*>(gmat.cut.Ptrs().data());

any_missing_ = !gmat.IsDense();

missing_flags_.clear();
missing_ = MissingIndicator{0, false};
}

// IO procedures for external memory.
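// NOTE: The read order below must match the write order in ColumnMatrix::Write.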
bool ColumnMatrix::Read(AlignedResourceReadStream* fi, uint32_t const* index_base) {
if (!common::ReadVec(fi, &index_)) {
return false;
}
if (!common::ReadVec(fi, &type_)) {
return false;
}
if (!common::ReadVec(fi, &row_ind_)) {
return false;
}
if (!common::ReadVec(fi, &feature_offsets_)) {
return false;
}

if (!common::ReadVec(fi, &missing_.storage)) {
return false;
}
missing_.InitView();

index_base_ = index_base;
if (!fi->Read(&bins_type_size_)) {
return false;
}
if (!fi->Read(&any_missing_)) {
return false;
}
return true;
}

std::size_t ColumnMatrix::Write(AlignedFileWriteStream* fo) const {
std::size_t bytes{0};

bytes += common::WriteVec(fo, index_);
bytes += common::WriteVec(fo, type_);
bytes += common::WriteVec(fo, row_ind_);
bytes += common::WriteVec(fo, feature_offsets_);
bytes += common::WriteVec(fo, missing_.storage);

bytes += fo->Write(bins_type_size_);
bytes += fo->Write(any_missing_);

return bytes;
}
} // namespace common
} // namespace xgboost
} // namespace xgboost::common