Implement JSON IO for XGBoost.
* Split up model IO and serialization.
* Add JSON for both model IO and serialisation.
* Add tests for JSON IO in both cxx and Python.
* Rigorous tests for training continuation.
* Add basic documentation for the serialisation format.
* Enabled save/load config in Python pickle.
trivialfis committed Dec 10, 2019
1 parent e089e16 commit 076f58c
Showing 41 changed files with 1,979 additions and 452 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -74,6 +74,7 @@ tags
*.class
target
*.swp
.gdb_history

# cpp tests and gcov generated files
*.gcov
4 changes: 3 additions & 1 deletion amalgamation/xgboost-all0.cc
@@ -25,7 +25,9 @@
// gbms
#include "../src/gbm/gbm.cc"
#include "../src/gbm/gbtree.cc"
#include "../src/gbm/gbtree_model.cc"
#include "../src/gbm/gblinear.cc"
#include "../src/gbm/gblinear_model.cc"

// data
#include "../src/data/data.cc"
@@ -44,8 +46,8 @@
#endif

// trees
#include "../src/tree/split_evaluator.cc"
#include "../src/tree/param.cc"
#include "../src/tree/split_evaluator.cc"
#include "../src/tree/tree_model.cc"
#include "../src/tree/tree_updater.cc"
#include "../src/tree/updater_colmaker.cc"
1 change: 1 addition & 0 deletions doc/tutorials/index.rst
@@ -10,6 +10,7 @@ See `Awesome XGBoost <https://github.com/dmlc/xgboost/tree/master/demo>`_ for mo
:caption: Contents:

model
saving_model
Distributed XGBoost with AWS YARN <aws_yarn>
kubernetes
Distributed XGBoost with XGBoost4J-Spark <https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html>
197 changes: 197 additions & 0 deletions doc/tutorials/saving_model.rst
@@ -0,0 +1,197 @@
########################
Introduction to Model IO
########################

In XGBoost 1.0.0, we introduced experimental support for using `JSON
<https://www.json.org/json-en.html>`_ to save and load XGBoost models and the related
hyper-parameters for training, aiming to replace the old binary internal format with an
open format that can be easily reused.  Support for the binary format will continue until
the JSON format is no longer experimental and has satisfactory performance.  This tutorial
aims to share some basic insights into the JSON serialisation method used in XGBoost.
Unless explicitly mentioned otherwise, the following sections assume you are using the
experimental JSON format, which can be enabled by passing
``enable_experimental_json_serialization=True`` as a training parameter, or by providing a
file name with a ``.json`` extension when saving/loading a model:
``booster.save_model('model.json')``.  More details below.

Before we get started, note that XGBoost is a gradient boosting library with a focus on
tree models, which means that inside XGBoost there are two distinct parts: the model, and
the algorithms used to build it.  If you come from the deep learning community, then it
should be clear to you that there is a difference between the neural network structure,
composed of weights with fixed tensor operations, and the optimizers used to train it.

So when one calls ``booster.save_model``, XGBoost saves the trees, some model parameters
like the number of input columns in the trained trees, and the objective function, which
together represent the concept of a "model" in XGBoost.  As for why we save the objective
as part of the model: the objective controls the transformation of the global bias (called
``base_score`` in XGBoost).  Users can share this model with others for prediction or
evaluation, or continue the training with a different set of hyper-parameters, etc.
However, this is not the end of the story.  There are cases where we need to save
something more than just the model itself.  For example, in distributed training XGBoost
performs a checkpointing operation, or for some reason your favorite distributed computing
framework decides to copy the model from one worker to another and continue the training
there.  In such cases, the serialisation output is required to contain enough information
to continue the previous training without the user providing any parameters again.  We
consider such a scenario a memory snapshot (or memory-based serialisation method) and
distinguish it from the normal model IO operation.  In Python, this can be invoked by
pickling the ``Booster`` object, while in R the same can be achieved by accessing
``bst$raw``.  Please refer to the corresponding language binding document for the precise
API (as this feature is quite new, please open an issue if you can't find the appropriate
documents, or better, a PR).

.. note::

  The old binary format doesn't distinguish between the model and the raw memory
  serialisation format; it's a mix of everything, which is part of the reason why we want
  to replace it with a more robust serialisation method.  The JVM package has its own
  memory-based serialisation methods, which may lead to some inconsistencies in the
  output model.  It's a known issue we are trying to address.

To enable JSON format support for model IO (saving only the trees and objective), provide
a file name with ``.json`` as the file extension:

.. code-block:: python

  bst.save_model('model_file_name.json')
To enable JSON as the memory-based serialisation format instead, pass
``enable_experimental_json_serialization`` as a training parameter.  In Python this can be
done by:

.. code-block:: python

  bst = xgboost.train({'enable_experimental_json_serialization': True}, dtrain)
  with open('filename', 'wb') as fd:
      pickle.dump(bst, fd)
Notice that ``filename`` is passed to the Python built-in function ``open``, not to
XGBoost.  Hence the parameter ``enable_experimental_json_serialization`` is required to
enable the JSON format.  As the name suggests, memory-based serialisation captures many
things internal to XGBoost, so it's only suitable for checkpoints, which don't require a
stable output format.  That being said, loading a pickled booster (memory snapshot) in a
different XGBoost version may lead to errors or undefined behavior.  But we promise a
stable output format for the binary model and the JSON model (once it's no longer
experimental), as they are designed to be reusable.  This scheme fits, as Python itself
doesn't guarantee that pickled bytecode can be used across different Python versions.

***************************
Custom objective and metric
***************************

XGBoost accepts user-provided objective and metric functions as an extension.  These
functions are not saved in the model file, as they are a language-dependent feature.  With
Python, users can pickle the model to include these functions in the saved binary.  One
drawback is that the output from pickle is not a stable serialization format and doesn't
work across different Python or XGBoost versions, not to mention different language
environments.  Another way to work around this limitation is to provide these functions
again after the model is loaded.  If the customized function is useful, please consider
making a PR to implement it inside XGBoost; this way we can have your function working
with different language bindings.

********************************************************
Saving and Loading the internal parameters configuration
********************************************************

XGBoost's ``C API`` and ``Python API`` support saving and loading the internal
configuration directly as a JSON string.  In the Python package:

.. code-block:: python

  bst = xgboost.train(...)
  config = bst.save_config()
  print(config)

This will print out something similar to (not the actual output, as it's too long for
demonstration):

.. code-block:: json

  {
    "Learner": {
      "generic_parameter": {
        "enable_experimental_json_serialization": "0",
        "gpu_id": "0",
        "gpu_page_size": "0",
        "n_jobs": "0",
        "random_state": "0",
        "seed": "0",
        "seed_per_iteration": "0"
      },
      "gradient_booster": {
        "gbtree_train_param": {
          "num_parallel_tree": "1",
          "predictor": "gpu_predictor",
          "process_type": "default",
          "tree_method": "gpu_hist",
          "updater": "grow_gpu_hist",
          "updater_seq": "grow_gpu_hist"
        },
        "name": "gbtree",
        "updater": {
          "grow_gpu_hist": {
            "gpu_hist_train_param": {
              "debug_synchronize": "0",
              "gpu_batch_nrows": "0",
              "single_precision_histogram": "0"
            },
            "train_param": {
              "alpha": "0",
              "cache_opt": "1",
              "colsample_bylevel": "1",
              "colsample_bynode": "1",
              "colsample_bytree": "1",
              "default_direction": "learn",
              "enable_feature_grouping": "0",
              "eta": "0.300000012",
              "gamma": "0",
              "grow_policy": "depthwise",
              "interaction_constraints": "",
              "lambda": "1",
              "learning_rate": "0.300000012",
              "max_bin": "256",
              "max_conflict_rate": "0",
              "max_delta_step": "0",
              "max_depth": "6",
              "max_leaves": "0",
              "max_search_group": "100",
              "refresh_leaf": "1",
              "sketch_eps": "0.0299999993",
              "sketch_ratio": "2",
              "subsample": "1"
            }
          }
        }
      },
      "learner_train_param": {
        "booster": "gbtree",
        "disable_default_eval_metric": "0",
        "dsplit": "auto",
        "objective": "reg:squarederror"
      },
      "metrics": [],
      "objective": {
        "name": "reg:squarederror",
        "reg_loss_param": {
          "scale_pos_weight": "1"
        }
      }
    },
    "version": [1, 0, 0]
  }
You can load it back into a model generated by the same version of XGBoost by:

.. code-block:: python

  bst.load_config(config)

This way users can study the internal representation more closely.

************
Future Plans
************

Right now, using the JSON format incurs a longer serialisation time; we have been working
on optimizing the JSON implementation to close the gap between the binary and JSON
formats.  You can track the progress in `#5046 <https://github.com/dmlc/xgboost/pull/5046>`_.
Another important item for JSON format support is a stable and documented `schema
<https://json-schema.org/>`_, based on which one can easily reuse the saved model.
64 changes: 41 additions & 23 deletions include/xgboost/c_api.h
@@ -428,15 +428,15 @@ XGB_DLL int XGBoosterPredict(BoosterHandle handle,
const float **out_result);

/*!
* \brief load model from existing file
* \brief Load model from existing file
* \param handle handle
* \param fname file name
* \return 0 when success, -1 when failure happens
*/
XGB_DLL int XGBoosterLoadModel(BoosterHandle handle,
const char *fname);
/*!
* \brief save model into existing file
* \brief Save model into existing file
* \param handle handle
* \param fname file name
* \return 0 when success, -1 when failure happens
@@ -464,6 +464,45 @@ XGB_DLL int XGBoosterLoadModelFromBuffer(BoosterHandle handle,
XGB_DLL int XGBoosterGetModelRaw(BoosterHandle handle,
bst_ulong *out_len,
const char **out_dptr);

/*!
* \brief Initialize the booster from rabit checkpoint.
* This is used in distributed training API.
* \param handle handle
* \param version The output version of the model.
* \return 0 when success, -1 when failure happens
*/
XGB_DLL int XGBoosterLoadRabitCheckpoint(BoosterHandle handle,
int* version);

/*!
* \brief Save the current checkpoint to rabit.
* \param handle handle
* \return 0 when success, -1 when failure happens
*/
XGB_DLL int XGBoosterSaveRabitCheckpoint(BoosterHandle handle);


/*!
* \brief Save XGBoost's internal configuration into a JSON document.
* \param handle handle to Booster object.
 * \param out_len Length of the output string.
 * \param out_str A valid pointer to an array of characters.  The character array is
 *                allocated and managed by XGBoost, while the pointer to that array needs
 *                to be managed by the caller.
* \return 0 when success, -1 when failure happens
*/
XGB_DLL int XGBoosterSaveJsonParameters(BoosterHandle handle,
bst_ulong *out_len,
char const** out_str);
/*!
* \brief Load XGBoost's internal configuration from a JSON document.
* \param handle handle to Booster object.
* \param json_parameters string representation of a JSON document.
* \return 0 when success, -1 when failure happens
*/
XGB_DLL int XGBoosterLoadJsonParameters(BoosterHandle handle,
char const* json_parameters);

/*!
* \brief dump model, return array of strings representing model dump
* \param handle handle
@@ -570,25 +609,4 @@ XGB_DLL int XGBoosterSetAttr(BoosterHandle handle,
XGB_DLL int XGBoosterGetAttrNames(BoosterHandle handle,
bst_ulong* out_len,
const char*** out);

// --- Distributed training API----
// NOTE: functions in rabit/c_api.h will be also available in libxgboost.so
/*!
* \brief Initialize the booster from rabit checkpoint.
* This is used in distributed training API.
* \param handle handle
* \param version The output version of the model.
* \return 0 when success, -1 when failure happens
*/
XGB_DLL int XGBoosterLoadRabitCheckpoint(
BoosterHandle handle,
int* version);

/*!
* \brief Save the current checkpoint to rabit.
* \param handle handle
* \return 0 when success, -1 when failure happens
*/
XGB_DLL int XGBoosterSaveRabitCheckpoint(BoosterHandle handle);

#endif // XGBOOST_C_API_H_
2 changes: 1 addition & 1 deletion include/xgboost/gbm.h
@@ -32,7 +32,7 @@ struct LearnerModelParam;
/*!
* \brief interface of gradient boosting model.
*/
class GradientBooster {
class GradientBooster : public Model, public Configurable {
protected:
GenericParameter const* generic_param_;

9 changes: 8 additions & 1 deletion include/xgboost/json_io.h
@@ -55,6 +55,12 @@ class JsonReader {
} cursor_;

StringView raw_str_;
bool initialized_;

public:
size_t Pos() const { return cursor_.Pos(); }
size_t Length() const { return raw_str_.size(); }
bool Initialized() const { return initialized_; }

protected:
void SkipSpaces();
@@ -109,8 +115,9 @@

public:
explicit JsonReader(StringView str) :
raw_str_{str} {}
raw_str_{str}, initialized_{true} {}

JsonReader() : initialized_{false} {};
virtual ~JsonReader() = default;

Json Load();
12 changes: 1 addition & 11 deletions include/xgboost/learner.h
@@ -45,24 +45,14 @@ class Json;
*
* \endcode
*/
class Learner : public Model, public rabit::Serializable {
class Learner : public Model, public Configurable, public rabit::Serializable {
public:
/*! \brief virtual destructor */
~Learner() override;
/*!
* \brief Configure Learner based on set parameters.
*/
virtual void Configure() = 0;
/*!
* \brief load model from stream
* \param fi input stream.
*/
void Load(dmlc::Stream* fi) override = 0;
/*!
* \brief save model to stream.
* \param fo output stream
*/
void Save(dmlc::Stream* fo) const override = 0;
/*!
* \brief update the model for one iteration
* With the specified objective function.