Rewrite approx.
Save cuts.

Prototype on fetching.

Copy the code.

Simple test.

Add gpair to batch parameter.

Add hessian to batch parameter.

Move.

Pass hessian into sketching.

Extract a push page function.

Make private.

Lint.

Revert debug.

Simple DMatrix.

Regenerate the index.

ama.

Clang tidy.

Retain page.

Fix.

Lint.

Tidy.

Integer backed enum.

Convert to uint32_t.

Prototype for saving gidx.

Save cuts.

Prototype on fetching.

Copy the code.

Simple test.

Add gpair to batch parameter.

Add hessian to batch parameter.

Move.

Pass hessian into sketching.

Extract a push page function.

Make private.

Lint.

Revert debug.

Simple DMatrix.

Initial port.

Pass in hessian.

Init column sampler.

Unused code.

Use ctx.

Merge sampling.

Use ctx in partition.

Fix init root.

Force regenerate the sketch.

Create a ctx.

Get it compile.

Don't use const method.

Use page id.

Pass in base row id.

Pass the cut instead.

Small fixes.

Debug.

Fix bin size.

Debug.

Fixes.

Debug.

Fix empty partition.

Remove comment.

Lint.

Fix tests compilation.

Remove check.

Merge some fixes.

fix.

Fix fetching.

lint.

Extract expand entry.

Lint.

Fix unittests.

Fix windows build.

Fix comparison.

Make const.

Note.

const.

Fix reduce hist.

Fix sparse data.

Avoid implicit conversion.

private.

mem leak.

Remove skip initialization.

Use maximum space.

demo.

lint.

File link tags.

ama.

Fix redefinition.

Fix ranking.

use npy.

Comment.

Tune it down.

Specify the tree method.

Get rid of the duplicated partitioner.

Allocate task.

Tests.

make batches.

Log.

Remove span.

Revert "make batches."

This reverts commit 33f7072.

small cleanup.

Lint.

Revert demo.

Better make batches.

Demo.

Test for grow policy.

Test feature weights.

small cleanup.

Remove iterator in evaluation.

Fix dask test.

Pass n_threads.

Start implementation for categorical data.

Fix.

Add apply split.

Enumerate splits.

Enable sklearn.

Works.

d_step.

update.

Pass feature types into index.

Search cut.

Add test.

As cat.

Fix cut.

Extract some tests.

Fix.

Interesting case.

Add Python tests.

Cleanup.

Revert "Interesting case."

This reverts commit 6bbaac2.

Bin.

Fix.

Dispatch.

Remove subtraction trick.

Lint

Use multiple buffers.

Revert "Use multiple buffers."

This reverts commit 2849f57.

Test for external memory.

Format.

Partition based categorical split.

Remove debug code.

Fix.

Lint.

Fix test.

Fix demo.

Fix.

Add test.

Remove use of omp func.

name.

Fix.

test.

Make LCG impl compliant to std.

Fix test.

Constexpr.

Use unsigned type.

osx

More test.
trivialfis committed Nov 2, 2021
1 parent a133211 commit c56cbab
Showing 70 changed files with 2,158 additions and 718 deletions.
1 change: 1 addition & 0 deletions amalgamation/xgboost-all0.cc
@@ -57,6 +57,7 @@
 #include "../src/tree/updater_refresh.cc"
 #include "../src/tree/updater_sync.cc"
 #include "../src/tree/updater_histmaker.cc"
+#include "../src/tree/updater_approx.cc"
 #include "../src/tree/constraints.cc"

 // linear
4 changes: 2 additions & 2 deletions demo/guide-python/categorical.py
@@ -1,5 +1,5 @@
-"""Experimental support for categorical data. After 1.5 XGBoost `gpu_hist` tree method
-has experimental support for one-hot encoding based tree split.
+"""Experimental support for categorical data. After 1.6 XGBoost `gpu_hist` and `approx`
+tree method have experimental support for one-hot encoding based tree split.
 In before, users need to run an encoder themselves before passing the data into XGBoost,
 which creates a sparse matrix and potentially increase memory usage. This demo showcases
16 changes: 13 additions & 3 deletions doc/parameter.rst
@@ -232,15 +232,25 @@ Parameters for Tree Booster
 - Constraints for interaction representing permitted interactions. The constraints must
   be specified in the form of a nest list, e.g. ``[[0, 1], [2, 3, 4]]``, where each inner
   list is a group of indices of features that are allowed to interact with each other.
-  See tutorial for more information
+  See tutorial for more information.

-Additional parameters for ``hist`` and ``gpu_hist`` tree method
-================================================================
+Additional parameters for ``hist`` and ``gpu_hist`` and ``approx`` tree method
+==============================================================================

 * ``single_precision_histogram``, [default=``false``]

   - Use single precision to build histograms instead of double precision.

+Additional parameters for ``approx`` tree method
+================================================
+
+* ``max_cat_to_onehot``
+
+  - A threshold for deciding whether XGBoost should use one-hot encoding based split for
+    categorical data. When number of categories is lesser than the threshold then one-hot
+    encoding is chosen, otherwise the categories will be partitioned into children nodes.
+    Only relevant for regression and binary classification and `approx` tree method.
+
 Additional parameters for Dart Booster (``booster=dart``)
 =========================================================

9 changes: 6 additions & 3 deletions include/xgboost/learner.h
@@ -11,15 +11,16 @@
 #include <dmlc/any.h>
 #include <xgboost/base.h>
 #include <xgboost/feature_map.h>
-#include <xgboost/predictor.h>
 #include <xgboost/generic_parameters.h>
 #include <xgboost/host_device_vector.h>
 #include <xgboost/model.h>
+#include <xgboost/predictor.h>
+#include <xgboost/task.h>

-#include <utility>
 #include <map>
 #include <memory>
 #include <string>
+#include <utility>
 #include <vector>

namespace xgboost {
@@ -307,11 +308,13 @@ struct LearnerModelParam
   uint32_t num_feature { 0 };
   /* \brief number of classes, if it is multi-class classification */
   uint32_t num_output_group { 0 };
+  /* \brief Current task, determined by objective. */
+  Task task{Task::kRegression};

   LearnerModelParam() = default;
   // As the old `LearnerModelParamLegacy` is still used by binary IO, we keep
   // this one as an immutable copy.
-  LearnerModelParam(LearnerModelParamLegacy const& user_param, float base_margin);
+  LearnerModelParam(LearnerModelParamLegacy const& user_param, float base_margin, Task t);
   /* \brief Whether this parameter is initialized with LearnerModelParamLegacy. */
   bool Initialized() const { return num_feature != 0; }
 };
6 changes: 6 additions & 0 deletions include/xgboost/objective.h
@@ -13,6 +13,7 @@
 #include <xgboost/model.h>
 #include <xgboost/generic_parameters.h>
 #include <xgboost/host_device_vector.h>
+#include <xgboost/task.h>

 #include <vector>
 #include <utility>
@@ -72,6 +73,11 @@ class ObjFunction : public Configurable {
   virtual bst_float ProbToMargin(bst_float base_score) const {
     return base_score;
   }
+  /*!
+   * \brief Return task of this objective.
+   */
+  virtual enum Task Task() const = 0;
+
   /*!
    * \brief Create an objective function according to name.
    * \param tparam Generic parameters.
17 changes: 17 additions & 0 deletions include/xgboost/task.h
@@ -0,0 +1,17 @@
/*!
* Copyright 2015-2021 by XGBoost Contributors
*/
#include <cinttypes>
#ifndef XGBOOST_TASK_H_
#define XGBOOST_TASK_H_
namespace xgboost {
enum class Task : uint8_t {
kRegression = 0,
kBinary = 1,
kClassification = 2,
kSurvival = 3,
kRanking = 4,
kOther = 5,
};
}
#endif // XGBOOST_TASK_H_
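The new `Task` enum above is deliberately backed by `uint8_t` (see "Integer backed enum" in the commit log), so a task kind is a stable one-byte code. A rough Python mirror, for illustration only — the class and member names below are transliterations, not part of any XGBoost Python API:

```python
from enum import IntEnum


class Task(IntEnum):
    # Mirrors include/xgboost/task.h: each task kind maps to a stable small
    # integer, so the value fits in a single byte (uint8_t in the C++ enum).
    REGRESSION = 0
    BINARY = 1
    CLASSIFICATION = 2
    SURVIVAL = 3
    RANKING = 4
    OTHER = 5


# Integer-backed: converts losslessly to and from its numeric code.
print(Task.RANKING.value)  # 4
print(Task(2).name)        # CLASSIFICATION
```

An integer-backed enum like this is cheap to store in model parameters and to pass across an ABI boundary, which is why the commit converts it to an explicit fixed-width type.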
15 changes: 7 additions & 8 deletions include/xgboost/tree_updater.h
@@ -11,16 +11,17 @@
 #include <dmlc/registry.h>
 #include <xgboost/base.h>
 #include <xgboost/data.h>
-#include <xgboost/tree_model.h>
 #include <xgboost/generic_parameters.h>
 #include <xgboost/host_device_vector.h>
-#include <xgboost/model.h>
 #include <xgboost/linalg.h>
+#include <xgboost/model.h>
+#include <xgboost/task.h>
+#include <xgboost/tree_model.h>

 #include <functional>
-#include <vector>
-#include <utility>
 #include <string>
+#include <utility>
+#include <vector>

namespace xgboost {

@@ -83,16 +84,14 @@ class TreeUpdater : public Configurable {
  * \param name Name of the tree updater.
  * \param tparam A global runtime parameter
  */
-  static TreeUpdater* Create(const std::string& name, GenericParameter const* tparam);
+  static TreeUpdater* Create(const std::string& name, GenericParameter const* tparam, Task task);
 };

 /*!
  * \brief Registry entry for tree updater.
  */
 struct TreeUpdaterReg
-    : public dmlc::FunctionRegEntryBase<TreeUpdaterReg,
-                                        std::function<TreeUpdater* ()> > {
-};
+    : public dmlc::FunctionRegEntryBase<TreeUpdaterReg, std::function<TreeUpdater*(Task task)> > {};

/*!
* \brief Macro to register tree updater.
@@ -219,8 +219,12 @@ abstract class XGBoostRegressorSuiteBase extends FunSuite with PerTest {
}
}

-class XGBoostCpuRegressorSuite extends XGBoostRegressorSuiteBase {
+class XGBoostCpuRegressorSuiteApprox extends XGBoostRegressorSuiteBase {
+  override protected val treeMethod: String = "approx"
+}
+
+class XGBoostCpuRegressorSuiteHist extends XGBoostRegressorSuiteBase {
   override protected val treeMethod: String = "hist"
 }

@GpuTestSuite
3 changes: 3 additions & 0 deletions plugin/example/custom_obj.cc
@@ -34,6 +34,9 @@ class MyLogistic : public ObjFunction {
   void Configure(const std::vector<std::pair<std::string, std::string> >& args) override {
     param_.UpdateAllowUnknown(args);
   }
+
+  enum Task Task() const override { return Task::kRegression; }
+
   void GetGradient(const HostDeviceVector<bst_float> &preds,
                    const MetaInfo &info,
                    int iter,
15 changes: 14 additions & 1 deletion python-package/xgboost/sklearn.py
@@ -257,6 +257,16 @@ def inner(y_score: np.ndarray, dmatrix: DMatrix) -> Tuple[str, float]:
         This parameter replaces `early_stopping_rounds` in :py:meth:`fit` method.
+    max_cat_to_onehot : bool
+        .. versionadded:: 1.6.0
+
+        A threshold for deciding whether XGBoost should use one-hot encoding based split
+        for categorical data. When number of categories is lesser than the threshold then
+        one-hot encoding is chosen, otherwise the categories will be partitioned into
+        children nodes. Only relevant for regression and binary classification and
+        `approx` tree method.
     kwargs : dict, optional
         Keyword arguments for XGBoost Booster object. Full documentation of
         parameters can be found here:
@@ -473,6 +483,7 @@ def __init__(
         enable_categorical: bool = False,
         eval_metric: Optional[Union[str, List[str], Callable]] = None,
         early_stopping_rounds: Optional[int] = None,
+        max_cat_to_onehot: Optional[int] = None,
         **kwargs: Any
     ) -> None:
         if not SKLEARN_INSTALLED:
@@ -511,6 +522,7 @@ def __init__(
         self.enable_categorical = enable_categorical
         self.eval_metric = eval_metric
         self.early_stopping_rounds = early_stopping_rounds
+        self.max_cat_to_onehot = max_cat_to_onehot
         if kwargs:
             self.kwargs = kwargs

@@ -779,7 +791,8 @@ def _duplicated(parameter: str) -> None:
             else early_stopping_rounds
         )

-        if self.enable_categorical and params.get("tree_method", None) != "gpu_hist":
+        tree_method = params.get("tree_method", None)
+        if self.enable_categorical and tree_method not in ("gpu_hist", "approx"):
             raise ValueError(
                 "Experimental support for categorical data is not implemented for"
                 " current tree method yet."
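The sklearn-wrapper change above widens the categorical-data guard to accept both `gpu_hist` and the rewritten `approx` tree method. A standalone sketch of that guard — the function name is hypothetical; in XGBoost the check lives inline in the estimator's `fit` path:

```python
def check_categorical_support(params: dict, enable_categorical: bool) -> None:
    # Same rule as the diff above: experimental categorical support is only
    # wired up for the gpu_hist and approx tree methods.
    tree_method = params.get("tree_method", None)
    if enable_categorical and tree_method not in ("gpu_hist", "approx"):
        raise ValueError(
            "Experimental support for categorical data is not implemented for"
            " current tree method yet."
        )


check_categorical_support({"tree_method": "approx"}, True)  # accepted
try:
    check_categorical_support({"tree_method": "hist"}, True)
except ValueError as err:
    print(err)  # hist is not supported for categorical data in this commit
```

Note the guard only fires when `enable_categorical` is set; plain numeric training is unaffected regardless of tree method.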
14 changes: 12 additions & 2 deletions src/common/categorical.h
@@ -5,11 +5,12 @@
#ifndef XGBOOST_COMMON_CATEGORICAL_H_
#define XGBOOST_COMMON_CATEGORICAL_H_

+#include "bitfield.h"
 #include "xgboost/base.h"
 #include "xgboost/data.h"
-#include "xgboost/span.h"
 #include "xgboost/parameter.h"
-#include "bitfield.h"
+#include "xgboost/span.h"
+#include "xgboost/task.h"

namespace xgboost {
namespace common {
@@ -47,6 +48,15 @@ inline void CheckCat(bst_cat_t cat) {
       "should be non-negative.";
 }

+/*!
+ * \brief Whether should we use onehot encoding for categorical data.
+ */
+inline bool UseOneHot(uint32_t n_cats, uint32_t max_cat_to_onehot, Task task) {
+  bool use_one_hot =
+      n_cats < max_cat_to_onehot || (task != Task::kRegression && task != Task::kBinary);
+  return use_one_hot;
+}
+
 struct IsCatOp {
   XGBOOST_DEVICE bool operator()(FeatureType ft) {
     return ft == FeatureType::kCategorical;
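`UseOneHot` above encodes the rule documented for `max_cat_to_onehot`: one-hot splits when the category count is below the threshold, partition-based splits otherwise, and always one-hot for tasks other than regression and binary classification, since partitioning is only implemented for those two. A Python paraphrase of the same predicate, with the task passed as a plain string purely for illustration:

```python
def use_one_hot(n_cats: int, max_cat_to_onehot: int, task: str) -> bool:
    # One-hot when the category count is under the threshold, or whenever the
    # task is neither regression nor binary classification (partition-based
    # categorical splits are only implemented for those two tasks).
    return n_cats < max_cat_to_onehot or task not in ("regression", "binary")


print(use_one_hot(3, 4, "regression"))   # True: 3 < 4
print(use_one_hot(16, 4, "regression"))  # False: partition into child nodes
print(use_one_hot(16, 4, "ranking"))     # True: partitioning unavailable
```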
8 changes: 8 additions & 0 deletions src/common/common.h
@@ -188,6 +188,14 @@ std::vector<Idx> ArgSort(Container const &array, Comp comp = std::less<V>{})
   XGBOOST_PARALLEL_STABLE_SORT(result.begin(), result.end(), op);
   return result;
 }
+
+/**
+ * Last index of a group in a CSR style of index pointer.
+ */
+template <typename Idx, typename Indptr>
+XGBOOST_DEVICE size_t LastOf(Idx group, common::Span<Indptr> indptr) {
+  return indptr[group + 1] - 1;
+}
 } // namespace common
 } // namespace xgboost
 #endif // XGBOOST_COMMON_COMMON_H_
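`LastOf` relies on the CSR convention that group `g` owns the half-open index range `[indptr[g], indptr[g + 1])`. A Python equivalent, valid for non-empty groups:

```python
def last_of(group: int, indptr: list) -> int:
    # CSR-style index pointer: group g covers [indptr[g], indptr[g + 1]),
    # so its last element sits at indptr[g + 1] - 1 (assumes the group is
    # non-empty, matching the unguarded C++ helper above).
    return indptr[group + 1] - 1


# Three groups with sizes 3, 0 and 2 over 5 elements.
indptr = [0, 3, 3, 5]
print(last_of(0, indptr))  # 2: last entry of the first group
print(last_of(2, indptr))  # 4: last entry of the third group
```

Like the C++ template, the sketch does not guard against empty groups: for group 1 it would return index 2, which belongs to group 0.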