Merged
555 changes: 0 additions & 555 deletions doc/model.schema

This file was deleted.

73 changes: 57 additions & 16 deletions doc/tutorials/saving_model.rst
Original file line number Diff line number Diff line change
@@ -2,6 +2,12 @@
Introduction to Model IO
########################

**Contents**

.. contents::
:backlinks: none
:local:

Since 2.1.0, the default model format for XGBoost is UBJSON. It is used for serializing
models to file, serializing models to a buffer, and for memory snapshots (pickle and the
like).
@@ -229,25 +235,58 @@ Difference between saving model and dumping model
XGBoost has a function called ``dump_model`` in the Booster class, which lets you
export the model in a readable format like ``text``, ``json``, or ``dot`` (graphviz). The
primary use case is model interpretation and visualization; the dump is not supposed
to be loaded back to XGBoost. The JSON version has a `schema
<https://github.com/dmlc/xgboost/blob/master/doc/dump.schema>`__. See next section for
more info.
to be loaded back to XGBoost.

**********
Categories
**********

Since 3.1, the category encoding from the training dataframe is stored in the booster to
provide test-time re-coding support; see :ref:`cat-recode` for more information on how the
re-coder works. Below, we briefly explain the JSON format of the serialized category
index.

The categories are saved in a JSON object named "cats" under the gbtree model. It contains
three keys:

- feature_segments

This is a CSR-like pointer array marking where each feature's categories begin and
end. It starts with zero and ends with the total number of categories across all
features. For example:

.. code-block:: python

feature_segments = [0, 3, 3, 5]

The ``feature_segments`` list represents a dataset with two categorical features and one
numerical feature. The first feature contains three categories, the second feature is
numerical and thus has no categories, and the last feature includes two categories.
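As a sketch, the per-feature category counts can be recovered from adjacent differences of this pointer (using the ``feature_segments`` example above):

```python
# Recover per-feature category counts from the CSR-like pointer.
feature_segments = [0, 3, 3, 5]

# Adjacent differences give the number of categories for each feature.
counts = [end - start for start, end in zip(feature_segments, feature_segments[1:])]
print(counts)  # [3, 0, 2]: 3 categories, numerical (none), 2 categories
```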

- sorted_idx

***********
JSON Schema
***********
This array stores the sorted indices (``argsort``) of categories across all features,
segmented by ``feature_segments``. Given a feature with categories ``["b", "c", "a"]``,
the sorted index is ``[2, 0, 1]``.
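The per-feature sorted index can be reproduced with a plain argsort; a minimal sketch for the example above:

```python
categories = ["b", "c", "a"]

# argsort: indices that would sort the categories lexicographically.
sorted_idx = sorted(range(len(categories)), key=lambda i: categories[i])
print(sorted_idx)  # [2, 0, 1]
```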

Another important feature of JSON format is a documented `schema
<https://json-schema.org/>`__, based on which one can easily reuse the output model from
XGBoost. Here is the JSON schema for the output model (not serialization, which will not
be stable as noted above). For an example of parsing XGBoost tree model, see
``/demo/json-model``. Please notice the "weight_drop" field used in "dart" booster.
XGBoost does not scale tree leaf directly, instead it saves the weights as a separated
array.
- enc

.. include:: ../model.schema
:code: json
This is an array with length equal to the number of features, storing all the categories
in the same order as the input dataframe. The storage schema depends on whether the
categories are strings (XGBoost also supports numerical categories, such as integers). For
string categories, we use a schema similar to the Arrow format for a string array. The
categories of each feature are represented by two arrays, ``offsets`` and ``values``; the
format is also similar to a CSR matrix. The ``values`` field is a ``uint8`` array storing
the characters from all category names. Given a feature with three categories,
``["bb", "c", "a"]``, the ``values`` field is ``[98, 98, 99, 97]``. The ``offsets`` field
then segments the ``values`` array like a CSR pointer: ``[0, 2, 3, 4]``. We chose not to
store ``values`` as a JSON string to avoid handling special characters and string
encodings. The category names are stored exactly as given by the dataframe.
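Decoding this layout back into category names is a matter of slicing ``values`` by adjacent ``offsets``; a minimal sketch using the example above:

```python
offsets = [0, 2, 3, 4]
values = [98, 98, 99, 97]  # concatenated UTF-8 bytes of "bb", "c", "a"

# Slice the flat byte array by adjacent offsets and decode each slice.
names = [
    bytes(values[offsets[i] : offsets[i + 1]]).decode("utf-8")
    for i in range(len(offsets) - 1)
]
print(names)  # ['bb', 'c', 'a']
```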

As for numerical categories, the ``enc`` entry contains two keys: ``type`` and
``values``. The ``type`` field is an integer ID identifying the type of the categories,
such as 64-bit integer or 32-bit floating point (note that they are all stored as ``f32``
inside a decision tree). The exact mapping between the type and the integer ID is internal
but stable. The ``values`` field is an array storing all the categories of a feature.
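Putting the pieces together, the serialized index might look like the following sketch. This is illustrative only: the exact field layout, the per-feature ``enc`` representation, and the numerical ``type`` ID shown here are assumptions, not the authoritative format.

```python
# Hypothetical shape of the "cats" object for three features: a string
# categorical feature ["bb", "c", "a"], a numerical feature, and an integer
# categorical feature [7, 3]. The type ID (0) is a placeholder; the real
# type-to-ID mapping is internal to XGBoost.
cats = {
    "feature_segments": [0, 3, 3, 5],
    # Assumed layout: argsort local to each feature's segment.
    "sorted_idx": [2, 0, 1, 1, 0],
    "enc": [
        {"offsets": [0, 2, 3, 4], "values": [98, 98, 99, 97]},  # string categories
        None,  # numerical feature: no categories (assumed representation)
        {"type": 0, "values": [7, 3]},  # numerical (integer) categories
    ],
}

# The total category count matches the end of the pointer.
total = cats["feature_segments"][-1]
print(total)  # 5
```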

*************
Brief History
@@ -258,4 +297,6 @@ Brief History
- Later in XGBoost 1.6.0, additional support for Universal Binary JSON was introduced as
an optimization for more efficient model IO.
- UBJSON has been set to default in 2.1.
- The old binary format was removed in 3.1.
- The JSON schema file is no longer maintained and was removed in 3.2. The underlying
  schema of the model is unchanged.
6 changes: 5 additions & 1 deletion include/xgboost/json.h
@@ -30,7 +30,11 @@ class Value {
}

public:
/*!\brief Simplified implementation of LLVM RTTI. */
/**
* @brief Simplified implementation of LLVM RTTI.
*
* @note The integer ID must be kept stable.
*/
enum class ValueKind : std::int64_t {
kString = 0,
kNumber = 1,
2 changes: 0 additions & 2 deletions ops/conda_env/aarch64_test.yml
@@ -21,8 +21,6 @@ dependencies:
- cmake
- ninja
- boto3
- jsonschema
- boto3
- awscli
- numba
- llvmlite
1 change: 0 additions & 1 deletion ops/conda_env/linux_cpu_test.yml
@@ -30,7 +30,6 @@ dependencies:
- pytest-cov
- python-kubernetes
- urllib3
- jsonschema
- boto3
- awscli
- py-ubjson
1 change: 0 additions & 1 deletion ops/conda_env/macos_cpu_test.yml
@@ -24,7 +24,6 @@ dependencies:
- pytest-timeout
- python-kubernetes
- urllib3
- jsonschema
- boto3
- awscli
- loky>=3.5.1
1 change: 0 additions & 1 deletion ops/conda_env/win64_test.yml
@@ -11,7 +11,6 @@ dependencies:
- pytest
- boto3
- hypothesis
- jsonschema
- cupy>=13.2
- python-graphviz
- pip
4 changes: 0 additions & 4 deletions python-package/xgboost/testing/__init__.py
@@ -190,10 +190,6 @@ def no_dask_cudf() -> PytestSkip:
return no_mod("dask_cudf")


def no_json_schema() -> PytestSkip:
return no_mod("jsonschema")


def no_graphviz() -> PytestSkip:
return no_mod("graphviz")

38 changes: 0 additions & 38 deletions tests/python/test_basic_models.py
@@ -245,44 +245,6 @@ def test_feature_names_validation(self):
bst = xgb.train([], dm2)
bst.predict(dm2) # success

@pytest.mark.skipif(**tm.no_json_schema())
def test_json_dump_schema(self):
import jsonschema

def validate_model(parameters):
X = np.random.random((100, 30))
y = np.random.randint(0, 4, size=(100,))

parameters["num_class"] = 4
m = xgb.DMatrix(X, y)

booster = xgb.train(parameters, m)
dump = booster.get_dump(dump_format="json")

for i in range(len(dump)):
jsonschema.validate(instance=json.loads(dump[i]), schema=schema)

path = os.path.dirname(
os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
)
doc = os.path.join(path, "doc", "dump.schema")
with open(doc, "r") as fd:
schema = json.load(fd)

parameters = {
"tree_method": "hist",
"booster": "gbtree",
"objective": "multi:softmax",
}
validate_model(parameters)

parameters = {
"tree_method": "hist",
"booster": "dart",
"objective": "multi:softmax",
}
validate_model(parameters)

def test_special_model_dump_characters(self) -> None:
params = {"objective": "reg:squarederror", "max_depth": 3}
feature_names = ['"feature 0"', "\tfeature\n1", """feature "2"."""]
38 changes: 0 additions & 38 deletions tests/python/test_model_io.py
@@ -122,44 +122,6 @@ def test_categorical_model_io(self) -> None:
predt_1 = booster.predict(Xy)
np.testing.assert_allclose(predt_0, predt_1)

@pytest.mark.skipif(**tm.no_json_schema())
def test_json_io_schema(self) -> None:
import jsonschema

model_path = "test_json_schema.json"
path = os.path.dirname(
os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
)
doc = os.path.join(path, "doc", "model.schema")
with open(doc, "r") as fd:
schema = json.load(fd)
parameters = {"tree_method": "hist", "booster": "gbtree"}
jsonschema.validate(instance=json_model(model_path, parameters), schema=schema)
os.remove(model_path)

parameters = {"tree_method": "hist", "booster": "dart"}
jsonschema.validate(instance=json_model(model_path, parameters), schema=schema)
os.remove(model_path)

try:
dtrain, _ = tm.load_agaricus(__file__)
xgb.train({"objective": "foo"}, dtrain, num_boost_round=1)
except ValueError as e:
e_str = str(e)
beg = e_str.find("Objective candidate")
end = e_str.find("Stack trace")
e_str = e_str[beg:end]
e_str = e_str.strip()
splited = e_str.splitlines()
objectives = [s.split(": ")[1] for s in splited]
j_objectives = schema["properties"]["learner"]["properties"]["objective"][
"oneOf"
]
objectives_from_schema = set()
for j_obj in j_objectives:
objectives_from_schema.add(j_obj["properties"]["name"]["const"])
assert set(objectives) == objectives_from_schema

def test_with_pathlib(self) -> None:
"""Saving and loading model files from paths."""
save_path = Path("model.ubj")