Merged
555 changes: 0 additions & 555 deletions doc/model.schema

This file was deleted.

73 changes: 57 additions & 16 deletions doc/tutorials/saving_model.rst
Original file line number Diff line number Diff line change
@@ -2,6 +2,12 @@
Introduction to Model IO
########################

**Contents**

.. contents::
:backlinks: none
:local:

Since 2.1.0, the default model format for XGBoost is UBJSON. It is used for serializing
models to file, serializing models to a buffer, and for memory snapshots (pickle and the
like).
@@ -229,25 +235,58 @@ Difference between saving model and dumping model
XGBoost has a function called ``dump_model`` in the Booster class, which lets you
export the model in a readable format like ``text``, ``json``, or ``dot`` (graphviz). The
primary use case is model interpretation and visualization; the dump is not supposed
to be loaded back to XGBoost. The JSON version has a `schema
<https://github.com/dmlc/xgboost/blob/master/doc/dump.schema>`__. See next section for
more info.
to be loaded back to XGBoost.

**********
Categories
**********

Since 3.1, the category encoding from the training dataframe is stored in the booster to
provide test-time re-coding support; see :ref:`cat-recode` for more information on how the
re-coder works. Below, we briefly explain the JSON format of the serialized category
index.

The categories are saved in a JSON object named "cats" under the gbtree model. It contains
three keys:

- feature_segments

This is a CSR-like pointer array marking where each feature's categories begin and
end. It starts with zero and ends with the total number of categories across all
features. For example:

.. code-block:: python

feature_segments = [0, 3, 3, 5]

The ``feature_segments`` list represents a dataset with two categorical features and one
numerical feature. The first feature contains three categories, the second feature is
numerical and thus has no categories, and the last feature includes two categories.
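As a sketch, the per-feature category counts can be recovered from adjacent differences of this pointer (using the ``feature_segments`` example above):

```python
# Recover per-feature category counts from the CSR-like pointer.
feature_segments = [0, 3, 3, 5]

# Adjacent differences give the number of categories for each feature.
counts = [end - start for start, end in zip(feature_segments, feature_segments[1:])]
print(counts)  # [3, 0, 2]: 3 categories, numerical (none), 2 categories
```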

- sorted_idx

***********
JSON Schema
***********
This array stores the sorted indices (``argsort``) of categories across all features,
segmented by ``feature_segments``. Given a feature with categories ``["b", "c", "a"]``,
the sorted index is ``[2, 0, 1]``.
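The per-feature sorted index can be reproduced with a plain argsort; a minimal sketch for the example above:

```python
categories = ["b", "c", "a"]

# argsort: indices that would sort the categories lexicographically.
sorted_idx = sorted(range(len(categories)), key=lambda i: categories[i])
print(sorted_idx)  # [2, 0, 1]
```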

Another important feature of JSON format is a documented `schema
<https://json-schema.org/>`__, based on which one can easily reuse the output model from
XGBoost. Here is the JSON schema for the output model (not serialization, which will not
be stable as noted above). For an example of parsing XGBoost tree model, see
``/demo/json-model``. Please notice the "weight_drop" field used in "dart" booster.
XGBoost does not scale tree leaf directly, instead it saves the weights as a separated
array.
- enc

.. include:: ../model.schema
:code: json
This is an array with length equal to the number of features, storing all the categories
in the same order as the input dataframe. The storage schema depends on whether the
categories are strings (XGBoost also supports numerical categories, such as integers). For
string categories, we use a schema similar to the Arrow format for a string array. The
categories of each feature are represented by two arrays, ``offsets`` and ``values``; the
format is also similar to a CSR matrix. The ``values`` field is a ``uint8`` array storing
the characters from all category names. Given a feature with three categories,
``["bb", "c", "a"]``, the ``values`` field is ``[98, 98, 99, 97]``. The ``offsets`` field
then segments the ``values`` array like a CSR pointer: ``[0, 2, 3, 4]``. We chose not to
store ``values`` as a JSON string to avoid handling special characters and string
encodings. The category names are stored exactly as given by the dataframe.
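Decoding this layout back into category names is a matter of slicing ``values`` by adjacent ``offsets``; a minimal sketch using the example above:

```python
offsets = [0, 2, 3, 4]
values = [98, 98, 99, 97]  # concatenated UTF-8 bytes of "bb", "c", "a"

# Slice the flat byte array by adjacent offsets and decode each slice.
names = [
    bytes(values[offsets[i] : offsets[i + 1]]).decode("utf-8")
    for i in range(len(offsets) - 1)
]
print(names)  # ['bb', 'c', 'a']
```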

As for numerical categories, the ``enc`` entry contains two keys: ``type`` and
``values``. The ``type`` field is an integer ID identifying the type of the categories,
such as 64-bit integer or 32-bit floating point (note that they are all stored as ``f32``
inside a decision tree). The exact mapping between the type and the integer ID is internal
but stable. The ``values`` field is an array storing all the categories of a feature.
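Putting the pieces together, the serialized index might look like the following sketch. This is illustrative only: the exact field layout, the per-feature ``enc`` representation, and the numerical ``type`` ID shown here are assumptions, not the authoritative format.

```python
# Hypothetical shape of the "cats" object for three features: a string
# categorical feature ["bb", "c", "a"], a numerical feature, and an integer
# categorical feature [7, 3]. The type ID (0) is a placeholder; the real
# type-to-ID mapping is internal to XGBoost.
cats = {
    "feature_segments": [0, 3, 3, 5],
    # Assumed layout: argsort local to each feature's segment.
    "sorted_idx": [2, 0, 1, 1, 0],
    "enc": [
        {"offsets": [0, 2, 3, 4], "values": [98, 98, 99, 97]},  # string categories
        None,  # numerical feature: no categories (assumed representation)
        {"type": 0, "values": [7, 3]},  # numerical (integer) categories
    ],
}

# The total category count matches the end of the pointer.
total = cats["feature_segments"][-1]
print(total)  # 5
```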

*************
Brief History
@@ -258,4 +297,6 @@ Brief History
- Later in XGBoost 1.6.0, additional support for Universal Binary JSON was introduced as
an optimization for more efficient model IO.
- UBJSON has been set to default in 2.1.
- The old binary format was removed in 3.1.
- The JSON schema file is no longer maintained and was removed in 3.2. The underlying
  schema of the model is unchanged.
6 changes: 5 additions & 1 deletion include/xgboost/json.h
@@ -30,7 +30,11 @@ class Value {
}

public:
/*!\brief Simplified implementation of LLVM RTTI. */
/**
* @brief Simplified implementation of LLVM RTTI.
*
* @note The integer ID must be kept stable.
*/
enum class ValueKind : std::int64_t {
kString = 0,
kNumber = 1,
2 changes: 0 additions & 2 deletions ops/conda_env/aarch64_test.yml
@@ -21,8 +21,6 @@ dependencies:
- cmake
- ninja
- boto3
- jsonschema
- boto3
- awscli
- numba
- llvmlite
1 change: 0 additions & 1 deletion ops/conda_env/linux_cpu_test.yml
@@ -30,7 +30,6 @@ dependencies:
- pytest-cov
- python-kubernetes
- urllib3
- jsonschema
- boto3
- awscli
- py-ubjson
1 change: 0 additions & 1 deletion ops/conda_env/macos_cpu_test.yml
@@ -24,7 +24,6 @@ dependencies:
- pytest-timeout
- python-kubernetes
- urllib3
- jsonschema
- boto3
- awscli
- loky>=3.5.1
1 change: 0 additions & 1 deletion ops/conda_env/win64_test.yml
@@ -11,7 +11,6 @@ dependencies:
- pytest
- boto3
- hypothesis
- jsonschema
- cupy>=13.2
- python-graphviz
- pip
4 changes: 0 additions & 4 deletions python-package/xgboost/testing/__init__.py
@@ -190,10 +190,6 @@ def no_dask_cudf() -> PytestSkip:
return no_mod("dask_cudf")


def no_json_schema() -> PytestSkip:
return no_mod("jsonschema")


def no_graphviz() -> PytestSkip:
return no_mod("graphviz")

38 changes: 0 additions & 38 deletions tests/python/test_basic_models.py
@@ -245,44 +245,6 @@ def test_feature_names_validation(self):
bst = xgb.train([], dm2)
bst.predict(dm2) # success

@pytest.mark.skipif(**tm.no_json_schema())
def test_json_dump_schema(self):
import jsonschema

def validate_model(parameters):
X = np.random.random((100, 30))
y = np.random.randint(0, 4, size=(100,))

parameters["num_class"] = 4
m = xgb.DMatrix(X, y)

booster = xgb.train(parameters, m)
dump = booster.get_dump(dump_format="json")

for i in range(len(dump)):
jsonschema.validate(instance=json.loads(dump[i]), schema=schema)

path = os.path.dirname(
os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
)
doc = os.path.join(path, "doc", "dump.schema")
with open(doc, "r") as fd:
schema = json.load(fd)

parameters = {
"tree_method": "hist",
"booster": "gbtree",
"objective": "multi:softmax",
}
validate_model(parameters)

parameters = {
"tree_method": "hist",
"booster": "dart",
"objective": "multi:softmax",
}
validate_model(parameters)

def test_special_model_dump_characters(self) -> None:
params = {"objective": "reg:squarederror", "max_depth": 3}
feature_names = ['"feature 0"', "\tfeature\n1", """feature "2"."""]
38 changes: 0 additions & 38 deletions tests/python/test_model_io.py
@@ -122,44 +122,6 @@ def test_categorical_model_io(self) -> None:
predt_1 = booster.predict(Xy)
np.testing.assert_allclose(predt_0, predt_1)

@pytest.mark.skipif(**tm.no_json_schema())
def test_json_io_schema(self) -> None:
import jsonschema

model_path = "test_json_schema.json"
path = os.path.dirname(
os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
)
doc = os.path.join(path, "doc", "model.schema")
with open(doc, "r") as fd:
schema = json.load(fd)
parameters = {"tree_method": "hist", "booster": "gbtree"}
jsonschema.validate(instance=json_model(model_path, parameters), schema=schema)
os.remove(model_path)

parameters = {"tree_method": "hist", "booster": "dart"}
jsonschema.validate(instance=json_model(model_path, parameters), schema=schema)
os.remove(model_path)

try:
dtrain, _ = tm.load_agaricus(__file__)
xgb.train({"objective": "foo"}, dtrain, num_boost_round=1)
except ValueError as e:
e_str = str(e)
beg = e_str.find("Objective candidate")
end = e_str.find("Stack trace")
e_str = e_str[beg:end]
e_str = e_str.strip()
splited = e_str.splitlines()
objectives = [s.split(": ")[1] for s in splited]
j_objectives = schema["properties"]["learner"]["properties"]["objective"][
"oneOf"
]
objectives_from_schema = set()
for j_obj in j_objectives:
objectives_from_schema.add(j_obj["properties"]["name"]["const"])
assert set(objectives) == objectives_from_schema

def test_with_pathlib(self) -> None:
"""Saving and loading model files from paths."""
save_path = Path("model.ubj")