XGBoost JSON Schema, Version 1.0

Preface

This document contains an exhaustive description of the XGBoost JSON schema, a mapping between XGBoost object classes and JSON objects and arrays. We aim to store a complete representation of all XGBoost objects.

Representation of dmlc::Parameter objects

Every object of a subclass of dmlc::Parameter is to be represented as a JSON object. We will create a bridge method to seamlessly convert dmlc::Parameter objects into JSON objects, so that proper value types are used.

For example, consider the parameter class

struct MyParam : public dmlc::Parameter<MyParam> {
  float learning_rate;
  int num_hidden;
  int activation;
  std::string name;
  DMLC_DECLARE_PARAMETER(MyParam) {
    DMLC_DECLARE_FIELD(num_hidden);
    DMLC_DECLARE_FIELD(learning_rate);
    DMLC_DECLARE_FIELD(activation).add_enum("relu", 1).add_enum("sigmoid", 2);
    DMLC_DECLARE_FIELD(name);
  }
};

and an object created by the following initialization code:

MyParam param;
param.learning_rate = 0.1f;
param.num_hidden = 10;
param.activation = 2;  // sigmoid
param.name = "MyNet";

This collection is naturally expressed as the following JSON object:

{
  "learning_rate": 0.1,
  "num_hidden": 10,
  "activation": "sigmoid",
  "name": "MyNet"
}
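For illustration, here is a minimal sketch of such a bridge for MyParam, written against the nlohmann/json library (an assumption; the document does not prescribe a JSON library). Note how the enum-valued activation is converted back to the name declared via add_enum():

#include <map>
#include <string>
#include <nlohmann/json.hpp>

// Hypothetical bridge for MyParam (not part of XGBoost): emit properly
// typed JSON values and map the enum code back to its declared name.
nlohmann::json MyParamToJson(const MyParam& param) {
  static const std::map<int, std::string> activation_names{
      {1, "relu"}, {2, "sigmoid"}};
  return nlohmann::json{
      {"learning_rate", param.learning_rate},  // JSON number
      {"num_hidden", param.num_hidden},        // JSON integer
      {"activation", activation_names.at(param.activation)},
      {"name", param.name}};
}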

Notations

In the following sections, the schema for each XGBoost class is shown as a JSON object. Fields whose keys are marked in italic are optional and may be absent in some models. A hyper-linked value indicates that the value shall be the JSON representation of another XGBoost class. An italicized value indicates that the value shall be of a primitive type (string, integer, floating-point, etc.). Every mention of floating-point refers to single-precision floating-point (32-bit), and every mention of integer refers to a 32-bit integer, unless explicitly stated otherwise.

Full content of the schema

Note: Click :ref:`here <example>` for a minimal example of the current schema.

XGBoostModel

This is the root object for the XGBoost model.

{
  "version" : [1, 0],
  "learner" : Learner
}
Learner

{
  "learner_train_param" : LearnerTrainParam,
  "gradient_booster" : GradientBooster,
  "eval_metrics" : [ array of Metric ],
  "objective" : Objective
}

The learner_train_param field stores (hyper)parameters used for training.

The gradient_booster field stores a gradient-boosted ensemble consisting of models of a certain type (e.g. tree, linear).

The eval_metrics field is used to store evaluation metrics.

The objective field stores the objective (loss) function used to train the ensemble model.
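As a sketch of how a consumer might read this root object (assuming the nlohmann/json library and a hypothetical file name model.json, neither of which is prescribed by this document):

#include <fstream>
#include <stdexcept>
#include <string>
#include <nlohmann/json.hpp>
using json = nlohmann::json;

int main() {
  std::ifstream in("model.json");  // hypothetical file name
  json model = json::parse(in);

  // "version" is a [major, minor] pair; reject anything outside the
  // 1.x series described by this document.
  if (model["version"].at(0).get<int>() != 1) {
    throw std::runtime_error("unsupported XGBoost JSON schema version");
  }
  // The name field identifies which GradientBooster subclass was stored.
  std::string booster_name =
      model["learner"]["gradient_booster"]["name"].get<std::string>();
  return 0;
}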

LearnerTrainParam

This class is a subclass of dmlc::Parameter.

{
  "seed": integer,
  "seed_per_iteration": boolean,
  "dsplit": string,
  "tree_method": string,
  "disable_default_eval_metric": boolean,
  "base_score" : floating-point,
  "num_feature" : integer,
  "num_class" : integer,
  "gpu_id": integer,
  "n_gpus": integer
}

The dsplit field indicates the data partitioning mode for distributed learning. Its value should be one of auto, col, and row. The value should be set to auto when only a single node is used for training.

The tree_method field is the choice of tree construction algorithm; its value should be one of auto, approx, exact, hist, gpu_exact, and gpu_hist. The value should be set to auto when the base learner is not a decision tree (e.g. linear model).

The num_class field is used only for the multi-class classification task, where it indicates the number of output classes.

The gpu_id and n_gpus fields are used to set the GPU(s) to use for training and prediction. If no GPU is used, the fields should be omitted. Note: after the planned refactor of GPU device management facilities, we should have only one copy of gpu_id and n_gpus across the whole XGBoost codebase, namely one residing in LearnerTrainParam.

GradientBooster

Currently, we may choose one of the three subclasses for the gradient boosted ensemble:

  • GBTree: decision tree models
  • Dart: DART (Dropouts meet Multiple Additive Regression Trees) models
  • GBLinear: linear models

We can determine which subclass was used by looking at the name field of each subclass.

Metric

string

For the time being, every metric is fully specified by a single string. In the future, we may want to add extra parameters to some metrics. When that happens, we should add subclasses of Metric.

The string must be a valid metric name as specified by the parameter doc.

Objective

Currently, we may choose one of the following subclasses for the objective function:

  • SoftmaxMultiClassObj
  • HingeObj
  • RegLossObj
  • LambdaRankObj
  • PairwiseRankObj
  • LambdaRankObjNDCG
  • LambdaRankObjMAP
  • PoissonRegression
  • CoxRegression
  • GammaRegression
  • TweedieRegression

We can determine which subclass was used by looking at the name field of each subclass.

GBTree

The GBTree class stores an ensemble of decision trees that are produced via gradient boosting. It is a subclass of GradientBooster.

{
  "name" : "GBTree",
  "num_boosting_round" : integer,
  "gbtree_train_param" : GBTreeTrainParam,
  "updater_train_param" : TreeTrainParam,
  "updaters" : [ array of TreeUpdater ],
  "model" : GBTreeModel
}

The num_boosting_round field stores the number of boosting rounds performed. This number is different from the number of trees if num_parallel_tree of GBTreeTrainParam is greater than 1.
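For example, 10 boosting rounds with num_parallel_tree = 3 yield 30 trees (and 30 × num_class trees for a multi-class model). A sketch of the corresponding consistency check, assuming one tree per output group per parallel tree per round (nlohmann/json assumed, as before):

#include <cassert>
#include <cstddef>
#include <nlohmann/json.hpp>
using json = nlohmann::json;

// Sketch: the stored tree count should equal
//   num_boosting_round * num_parallel_tree * num_output_group,
// assuming one tree per output group per parallel tree per round.
void CheckTreeCount(const json& gbtree) {
  int rounds = gbtree["num_boosting_round"].get<int>();
  int parallel = gbtree["gbtree_train_param"]["num_parallel_tree"].get<int>();
  const json& model = gbtree["model"];
  int groups = model["model_param"]["num_output_group"].get<int>();
  int expected = rounds * parallel * groups;
  assert(model["model_param"]["num_trees"].get<int>() == expected);
  assert(model["trees"].size() == static_cast<std::size_t>(expected));
}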

The gbtree_train_param field is the list of training parameters specific to GBTree. The updater_train_param field gives the training parameters that are common to all updaters in the updaters field.

The updaters field is the sequence of tree updaters that were used in training the tree ensemble model.

GBTreeTrainParam

This class is a subclass of dmlc::Parameter.

{
  "num_parallel_tree": integer,
  "updater_seq": [ array of string ],
  "process_type": string,
  "predictor": string
}

The num_parallel_tree field denotes the number of parallel trees constructed during each iteration. It is used to support boosted random forest.

The updater_seq field stores the list of updater names that was provided at the beginning of training. This field may not necessarily match the sequence given in the updaters field of GBTree or Dart.

The process_type field denotes whether to create new trees (default) or to update existing trees (update) during the boosting process. The field's value must be either default or update. Keep in mind that update is highly experimental; most use cases will use default.

Dart

The Dart class stores an ensemble of decision trees that are produced via gradient boosting, with dropouts at training time. This class is a subclass of GBTree and hence contains all fields that GBTree contains. It is a subclass of GradientBooster.

{
  "name" : "Dart",
  "gbtree_train_param" : GBTreeTrainParam,
  "dart_train_param" : DartTrainParam,
  "updater_train_param" : TreeTrainParam,
  "num_boosting_round" : integer,
  "updaters" : [ array of TreeUpdater ],
  "model" : GBTreeModel,
  "weight_drop" : [ array of floating-point ]
}

In addition to gbtree_train_param, this class also has dart_train_param, the set of training parameters specific to Dart.

The num_boosting_round field stores the number of boosting rounds performed. This number is different from the number of trees if num_parallel_tree of GBTreeTrainParam is greater than 1.

The updaters field is the sequence of tree updaters that were used in training the tree ensemble model.

The weight_drop field stores the weights assigned to individual trees. The weights should be used at training time.

DartTrainParam

This class is a subclass of dmlc::Parameter.

{
  "sample_type": string,
  "normalize_type": string,
  "rate_drop": floating-point,
  "one_drop": boolean,
  "skip_drop": floating-point,
  "learning_rate": floating-point
}

The meaning of these parameters is to be found in the parameter doc.

The sample_type field must be either uniform or weighted.

The normalize_type field must be either tree or forest.

TreeUpdater

Currently, we may choose one of the nine subclasses for the tree updater:

  • ColMaker: corresponds to grow_colmaker in the updater sequence
  • HistMaker: corresponds to grow_histmaker in the updater sequence
  • QuantileHistMaker: corresponds to grow_quantile_histmaker in the updater sequence
  • GPUMaker: corresponds to grow_gpu in the updater sequence
  • GPUHistMaker: corresponds to grow_gpu_hist in the updater sequence
  • TreePruner: corresponds to prune in the updater sequence
  • TreeSyncher: corresponds to sync in the updater sequence
  • SketchMaker: corresponds to grow_skmaker in the updater sequence
  • TreeRefresher: corresponds to refresh in the updater sequence

We can determine which subclass was used by looking at the name field of each subclass.

Note: DistColMaker has not been maintained for a while and thus excluded.

GBTreeModel

The GBTreeModel class is the list of regression trees, plus the model parameters.

{
  "model_param" : GBTreeModelParam,
  "trees" : [ array of RegTree ],
  "tree_info" : [ array of integer ]
}

tree_info is a reserved field, retained for the sake of compatibility with the current binary serialization method.

GBTreeModelParam

This class is a subclass of dmlc::Parameter.

{
  "num_trees": integer,
  "num_feature" : integer,
  "num_output_group" : integer
}

The num_output_group field is the size of the prediction per instance. This value is set to 1 for all tasks except multi-class classification, where num_output_group must be set to the number of classes. It must be identical to the value of the num_class field of LearnerTrainParam.

Note. num_roots and size_leaf_vector have been omitted due to deprecation.

RegTree

{
  "tree_param" : TreeParam,
  "nodes" : [ array of Node ],
  "stats" : [ array of NodeStat ]
}

The first node in the nodes array specifies the root node.

The nodes array specifies an adjacency list for a directed acyclic binary tree. Each node has either zero or two outgoing edges, and every node except the root has exactly one incoming edge. Cycles are not allowed.

TreeParam

This class is a subclass of dmlc::Parameter.

{
  "num_nodes": integer,
  "num_deleted" : integer,
  "num_feature": integer
}

The num_deleted field is optional and indicates that some node IDs are marked deleted and thus should be re-used for creating new nodes. This exists since the pruning method leaves gaps in node IDs. When omitted, num_deleted is assumed to be zero. This field may be deprecated in the future.

Note. num_roots and size_leaf_vector have been omitted due to deprecation. max_depth is removed because it is not used anywhere in the codebase.

Node

We may choose one of the two subclasses for the node class:

  • LeafNode: leaf node (no child node, real output)
  • TestNode: non-leaf node (two child nodes, test condition)

We distinguish the two types of node by whether the node representation is a JSON array (test node) or a single floating-point number (leaf node).

LeafNode

Each leaf node is represented as a single floating-point number:

floating-point (leaf_output)

The leaf_output field specifies the real-valued output associated with the leaf node.

TestNode

Each test node is represented as a JSON array of a fixed size, storing the following fields:

[
  integer (child_left_id),
  integer (child_right_id),
  unsigned integer (feature_id),
  floating-point (threshold),
  boolean (default_left)
]

The feature_id and threshold fields specify the feature ID and threshold used in the test node, where the test is of the form data[feature_id] < threshold. The child_left_id and child_right_id fields specify the nodes to be taken in a tree traversal when the test data[feature_id] < threshold is true and false, respectively. The node IDs are 0-based offsets into the nodes array of RegTree. The default_left field indicates the default direction in a tree traversal when the feature value for feature_id is missing.
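To make the traversal rule concrete, here is a sketch of single-tree prediction over the nodes array (a hypothetical helper, assuming the nlohmann/json library and that missing feature values are encoded as NaN in the input vector):

#include <cmath>
#include <vector>
#include <nlohmann/json.hpp>
using json = nlohmann::json;

// Sketch: walk one RegTree. A node is either a 5-element array (test node)
// or a bare number (leaf node), as described above.
float PredictOneTree(const json& nodes, const std::vector<float>& x) {
  int nid = 0;  // the first entry of "nodes" is the root
  while (nodes.at(nid).is_array()) {
    const json& n = nodes.at(nid);
    float fvalue = x.at(n.at(2).get<unsigned>());  // feature_id
    bool go_left = std::isnan(fvalue)
                       ? n.at(4).get<bool>()              // default_left
                       : (fvalue < n.at(3).get<float>()); // threshold test
    nid = go_left ? n.at(0).get<int>() : n.at(1).get<int>();
  }
  return nodes.at(nid).get<float>();  // leaf node: real-valued output
}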

NodeStat

Statistics for each node are represented as a JSON array of a fixed size, storing the following fields:

[
  floating-point (loss_chg),
  floating-point (sum_hess),
  floating-point (base_weight),
  64-bit integer (instance_cnt)
]

Note. leaf_child_cnt has been omitted because it is only internally used by the tree pruner. For serialization / deserialization, leaf_child_cnt should always be set to 0.
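Since the two children of a test node partition the instances that reach it, the children's instance_cnt values should sum to the parent's. A sketch of that sanity check over the parallel nodes and stats arrays (nlohmann/json assumed; the tree is also assumed to contain no deleted nodes):

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <nlohmann/json.hpp>
using json = nlohmann::json;

// Sketch: for every test node, the children's instance counts (index 3 of
// each NodeStat array) must add up to the parent's count.
void CheckInstanceCounts(const json& nodes, const json& stats) {
  for (std::size_t nid = 0; nid < nodes.size(); ++nid) {
    if (!nodes.at(nid).is_array()) continue;  // leaf node: nothing to check
    int left = nodes.at(nid).at(0).get<int>();
    int right = nodes.at(nid).at(1).get<int>();
    assert(stats.at(left).at(3).get<std::int64_t>() +
               stats.at(right).at(3).get<std::int64_t>() ==
           stats.at(nid).at(3).get<std::int64_t>());
  }
}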

GBLinear

The GBLinear class stores an ensemble of linear models that are produced via gradient boosting. It is a subclass of GradientBooster.

{
  "name" : "GBLinear",
  "num_boosting_round" : integer,
  "gblinear_train_param" : GBLinearTrainParam,
  "model": GBLinearModel,
  "updater": LinearUpdater
}

The num_boosting_round field stores the number of boosting rounds performed.

GBLinearTrainParam

This class is a subclass of dmlc::Parameter.

{
  "updater" : string,
  "tolerance" : floating-point
}

The updater field is the name of the linear updater used for training. Its value must match that of updater in GBLinear.

The tolerance field is the threshold for early stopping: training is terminated when the largest weight update is smaller than this threshold. Setting it to zero disables early stopping.

Note. max_row_perbatch is omitted because it is deprecated.

GBLinearModel

{
  "model_param" : GBLinearModelParam,
  "weight" : [ array of floating-point ]
}

The weight field stores the final coefficients of the combined linear model, after all boosting rounds. Currently, the linear booster does not store coefficients of individual boosting rounds.
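The document does not pin down the layout of weight. Assuming XGBoost's internal convention of (num_feature + 1) × num_output_group entries stored row by row, with the bias terms in the final row (an assumption, not part of this schema), a prediction sketch looks like:

#include <vector>
#include <nlohmann/json.hpp>
using json = nlohmann::json;

// Sketch: margin for one output group. ASSUMES weight is laid out as
// (num_feature + 1) rows x num_output_group columns, bias in the last row.
float LinearMargin(const json& model, const std::vector<float>& x, int group) {
  int num_feature = model["model_param"]["num_feature"].get<int>();
  int num_group = model["model_param"]["num_output_group"].get<int>();
  const json& w = model["weight"];
  float margin = w.at(num_feature * num_group + group).get<float>();  // bias
  for (int i = 0; i < num_feature; ++i) {
    margin += w.at(i * num_group + group).get<float>() * x.at(i);
  }
  return margin;
}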

GBLinearModelParam

This class is a subclass of dmlc::Parameter.

{
  "num_feature" : integer,
  "num_output_group" : integer
}

The num_output_group field is the size of the prediction per instance. This value is set to 1 for all tasks except multi-class classification, where num_output_group must be set to the number of classes.

LinearUpdater

Currently, we may choose one of the three subclasses for the linear updater:

  • CoordinateUpdater: coordinate descent
  • GPUCoordinateUpdater: coordinate descent, running on GPU
  • ShotgunUpdater: parallel coordinate descent (shotgun)

We can determine which subclass was used by looking at the name field of each subclass.

SoftmaxMultiClassObj

This class is a subclass of Objective.

{
  "name" : "SoftmaxMultiClassObj",
  "num_class" : integer,
  "output_prob" : boolean
}

The num_class field must have the same value as num_class in LearnerTrainParam.

The output_prob field determines whether the loss function should produce class index (false) or probability distribution (true).
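As an illustration of the two output modes (a generic sketch, not XGBoost's own code; raw per-class margins are assumed as input):

#include <algorithm>
#include <cmath>
#include <vector>

// Sketch: output_prob == true  -> softmax probability distribution;
//         output_prob == false -> index of the highest-scoring class.
std::vector<float> Softmax(std::vector<float> margins) {
  float max_margin = *std::max_element(margins.begin(), margins.end());
  float total = 0.0f;
  for (float& m : margins) {
    m = std::exp(m - max_margin);  // subtract the max for numerical stability
    total += m;
  }
  for (float& m : margins) m /= total;
  return margins;
}

int ArgMax(const std::vector<float>& margins) {
  return static_cast<int>(std::distance(
      margins.begin(), std::max_element(margins.begin(), margins.end())));
}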

HingeObj

This class is a subclass of Objective.

{
  "name" : "HingeObj"
}

RegLossObj

This class is a subclass of Objective.

{
  "name" : "RegLossObj",
  "loss_type" : string,
  "scale_pos_weight": floating-point
}

The loss_type field must be one of the following: LinearSquareLoss, LogisticRegression, LogisticClassification and LogisticRaw.

LambdaRankObj

This class is a subclass of Objective.

{
  "name" : "LambdaRankObj",
  "num_pairsample": integer,
  "fix_list_weight": floating-point
}

The num_pairsample specifies the number of pairs to sample (per instance) to compute the pairwise ranking loss.

The fix_list_weight field is the normalization factor for the weight of each query group. If set to 0, it has no effect.

PairwiseRankObj

This class is a subclass of Objective.

{
  "name" : "PairwiseRankObj",
  "num_pairsample": integer,
  "fix_list_weight": floating-point
}

The num_pairsample specifies the number of pairs to sample (per instance) to compute the pairwise ranking loss.

The fix_list_weight field is the normalization factor for the weight of each query group. If set to 0, it has no effect.

LambdaRankObjNDCG

This class is a subclass of Objective.

{
  "name" : "LambdaRankObjNDCG",
  "num_pairsample": integer,
  "fix_list_weight": floating-point
}

LambdaRankObjMAP

This class is a subclass of Objective.

{
  "name" : "LambdaRankObjMAP",
  "num_pairsample": integer,
  "fix_list_weight": floating-point
}

PoissonRegression

This class is a subclass of Objective.

{
  "name" : "PoissonRegression",
  "max_delta_step": floating-point
}

CoxRegression

This class is a subclass of Objective.

{
  "name" : "CoxRegression"
}

GammaRegression

This class is a subclass of Objective.

{
  "name" : "GammaRegression"
}

TweedieRegression

This class is a subclass of Objective.

{
  "name" : "TweedieRegression",
  "tweedie_variance_power" : floaing-point
}

ColMaker

This class is a subclass of TreeUpdater.

{
  "name" : "ColMaker",
  "split_evaluator" : [ array of SplitEvaluator ]
}

HistMaker

This class is a subclass of TreeUpdater.

{
  "name" : "HistMaker"
}

QuantileHistMaker

This class is a subclass of TreeUpdater.

{
  "name" : "QuantileHistMaker",
  "split_evaluator" : [ array of SplitEvaluator ]
}

GPUMaker

This class is a subclass of TreeUpdater.

{
  "name" : "GPUMaker"
}

GPUHistMaker

This class is a subclass of TreeUpdater.

{
  "name" : "GPUHistMaker",
  "single_precision_histogram": boolean,
  "gpu_batch_nrows": integer
}

TreePruner

This class is a subclass of TreeUpdater.

{
  "name" : "TreePruner"
}

TreeSyncher

This class is a subclass of TreeUpdater.

{
  "name" : "TreeSyncher"
}

SketchMaker

This class is a subclass of TreeUpdater.

{
  "name" : "SketchMaker"
}

TreeRefresher

This class is a subclass of TreeUpdater.

{
  "name" : "TreeRefresher"
}

TreeTrainParam

This class is a subclass of dmlc::Parameter.

{
  "learning_rate": floating-point,
  "min_split_loss": floating-point,
  "max_depth": integer,
  "max_leaves": integer,
  "max_bin": integer,
  "grow_policy": string,
  "min_child_weight": floating-point,
  "reg_lambda": floating-point,
  "reg_alpha": floating-point,
  "default_direction": string,
  "max_delta_step": floating-point,
  "subsample": floating-point
  "colsample_bynode": floating-point,
  "colsample_bylevel": floating-point,
  "colsample_bytree": floating-point,
  "opt_dense_col": floating-point,
  "sketch_eps": floating-point,
  "sketch_ratio": floating-point,
  "parallel_option": integer,
  "cache_opt": boolean,
  "refresh_leaf": boolean,
  "monotone_constraints": [ array of integer ],
  "split_evaluator": [ array of string ],
  "sparse_threshold": floating-point,
  "enable_feature_grouping": boolean,
  "max_conflict_rate": floating-point,
  "max_search_group": integer
}

The grow_policy field can be either depthwise or lossguide and dictates how the tree should grow. This option is only relevant for the QuantileHistMaker updater.

The default_direction field must be one of learn, left, or right. The value indicates how to handle missing values in splits. The learn option (default) assigns missing values to either the left or the right child, whichever yields the lower loss.

The parallel_option field must be 0, 1, or 2. If set to 0, the ColMaker updater will assign different features to threads. If set to 1, the updater will assign different thresholds within each feature to threads. If set to 2, a heuristic will automatically choose the better of the two parallelization strategies. Note that this parameter is only meaningful when the ColMaker updater is used (i.e. tree_method is set to exact).

The cache_opt field enables cache optimization in the ColMaker and HistMaker updaters.

The split_evaluator field is the sequence of split evaluators used in training. The currently available split evaluators are elastic_net, monotonic, and interaction. For most uses, the field should be set to ["elastic_net", "monotonic", "interaction"]. Note that this field is only available for the ColMaker and QuantileHistMaker updaters.

SplitEvaluator

Currently, we may choose one of the three subclasses for the split evaluator:

  • ElasticNet
  • MonotonicConstraint
  • InteractionConstraint

We can determine which subclass was used by looking at the name field of each subclass.

ElasticNet

This is a subclass of SplitEvaluator.

{
  "name" : "ElasticNet",
  "reg_lambda" : floating-point,
  "reg_alpha" : floating-point
}

MonotonicConstraint

This is a subclass of SplitEvaluator.

{
  "name" : "MonotonicConstraint",
  "monotone_constraints" : [ array of integer ]
}

InteractionConstraint

This is a subclass of SplitEvaluator.

{
  "name" : "InteractionConstraint",
  "interaction_constraints" : string,
  "num_feature" : integer
}

See the tutorial on feature interaction constraints for the meaning of the interaction_constraints field.

CoordinateUpdater

This class is a subclass of LinearUpdater.

{
  "name" : "CoordinateUpdater",
  "train_param" : CoordinateTrainParam
}

GPUCoordinateUpdater

This class is a subclass of LinearUpdater.

{
  "name" : "GPUCoordinateUpdater",
  "train_param" : CoordinateTrainParam
}
CoordinateTrainParam

This class is a subclass of dmlc::Parameter.

{
  "learning_rate": floating-point,
  "reg_lambda": floating-point,
  "reg_alpha": floating-point,
  "feature_selector": string,
  "top_k": integer,
  "reg_lambda_denorm": floating-point,
  "reg_alpha_denorm": floating-point
}

The feature_selector must be one of the following: cyclic, shuffle, random, greedy, and thrifty.

ShotgunUpdater

This class is a subclass of LinearUpdater.

{
  "name" : "ShotgunUpdater",
  "train_param" : ShotgunTrainParam
}
ShotgunTrainParam

This class is a subclass of dmlc::Parameter.

{
  "learning_rate": floating-point,
  "reg_lambda": floating-point,
  "reg_alpha": floating-point,
  "feature_selector": string,
  "reg_lambda_denorm": floating-point,
  "reg_alpha_denorm": floating-point
}

The feature_selector must be one of the following: cyclic, shuffle, random, greedy, and thrifty.

Minimal example

{
  "version" : [1, 0],
  "learner" : {
    "learner_train_param" : {
      "seed": 0,
      "seed_per_iteration": false,
      "dsplit": "auto",
      "tree_method": "hist",
      "disable_default_eval_metric": false,
      "base_score": 0.5,
      "num_feature" : 126
    },
    "gradient_booster" : {
      "name" : "GBTree",
      "num_boosting_round" : 2,
      "gbtree_train_param" : {
        "num_parallel_tree" : 1,
        "updater_seq" : [ "grow_quantile_histmaker" ],
        "process_type" : "default",
        "predictor" : "cpu_predictor"
      },
      "updater_train_param" : {
        "learning_rate": 0.1,
        "min_split_loss": 0.0,
        "max_depth": 6,
        "max_leaves": 0,
        "max_bin": 256,
        "grow_policy": "depthwise",
        "min_child_weight": 1.0,
        "reg_lambda": 1.0,
        "reg_alpha": 0.0,
        "default_direction": "learn",
        "max_delta_step": 0.0,
        "subsample": 1.0,
        "colsample_bynode": 1.0,
        "colsample_bylevel": 1.0,
        "colsample_bytree": 1.0,
        "opt_dense_col": 1.0,
        "sketch_eps": 0.03,
        "sketch_ratio": 2.0,
        "parallel_option": 0,
        "cache_opt": true,
        "refresh_leaf": false,
        "monotone_constraints": [],
        "split_evaluator": [ "elastic_net", "monotonic", "interaction" ],
        "sparse_threshold": 0.2,
        "enable_feature_grouping": false,
        "max_conflict_rate": 0.0,
        "max_search_group": 100
      },
      "updaters" : [
        {
          "name" : "QuantileHistMaker",
          "split_evaluator" : [
            {
              "name" : "ElasticNet",
              "reg_lambda" : 1.0,
              "reg_alpha" : 0.0
            },
            {
              "name" : "MonotonicConstraint",
              "monotone_constraints" : []
            },
            {
              "name" : "InteractionConstraint",
              "interaction_constraints" : "",
              "num_feature" : 126
            }
          ]
        }
      ],
      "model" : {
        "model_param" : {
          "num_trees" : 2,
          "num_feature" : 126,
          "num_output_group" : 1
        },
        "trees" : [
          {
            "tree_param" : {
              "num_nodes": 9,
              "num_feature" : 126
            },
            "nodes" : [
              [1, 2,  28,  0.0,  true],
              [3, 4,  55,  0.5, false],
              [5, 6, 108,  1.0,  true],
               1.8,
              -1.9,
              [7, 8,  66, -0.5,  true],
               1.87,
              -1.99,
               0.94
            ],
            "stats" : [
              [200.0, 1635.2,  0.2, 4000],
              [150.2,  922.8,  1.1, 2200],
              [300.4,  712.5, -1.5, 1800],
              [  0.0,  808.3,  0.0, 2000],
              [  0.0,  114.5,  0.0,  200],
              [100.1,  698.0, -1.8, 1600],
              [  0.0,   14.5,  0.0,  200],
              [  0.0,  686.8,  0.0, 1500],
              [  0.0,   11.2,  0.0,  100]
            ]
          },
          {
            "tree_param" : {
              "num_nodes": 3,
              "num_feature" : 126
            },
            "nodes" : [
              [1, 2, 5, 0.5, false],
               1.0,
              -1.0
            ],
            "stats" : [
              [335.0, 135.2,  0.6, 4000],
              [  0.0,  88.3,  0.0, 3000],
              [  0.0,  46.9,  0.0, 1000]
            ]
          }
        ]
      }
    },
    "eval_metrics" : [ "auc" ],
    "objective" : {
      "name" : "RegLossObj",
      "loss_type" : "LogisticClassification",
      "scale_pos_weight": 1.0
    }
  }
}