Implement JSON IO for XGBoost.
* Split up model IO and serialization.
* Add JSON for both model IO and serialisation.
* Add tests for JSON IO in both cxx and Python.
* Rigorous tests for training continuation.
* Add basic documentation for the serialisation format.
* Enabled save/load config in Python pickle.
trivialfis committed Dec 10, 2019
1 parent e089e16 commit 076f58c
Showing 41 changed files with 1,979 additions and 452 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -74,6 +74,7 @@ tags
*.class
target
*.swp
.gdb_history

# cpp tests and gcov generated files
*.gcov
4 changes: 3 additions & 1 deletion amalgamation/xgboost-all0.cc
@@ -25,7 +25,9 @@
// gbms
#include "../src/gbm/gbm.cc"
#include "../src/gbm/gbtree.cc"
#include "../src/gbm/gbtree_model.cc"
#include "../src/gbm/gblinear.cc"
#include "../src/gbm/gblinear_model.cc"

// data
#include "../src/data/data.cc"
@@ -44,8 +46,8 @@
#endif

// trees
#include "../src/tree/split_evaluator.cc"
#include "../src/tree/param.cc"
#include "../src/tree/split_evaluator.cc"
#include "../src/tree/tree_model.cc"
#include "../src/tree/tree_updater.cc"
#include "../src/tree/updater_colmaker.cc"
1 change: 1 addition & 0 deletions doc/tutorials/index.rst
@@ -10,6 +10,7 @@ See `Awesome XGBoost <https://github.com/dmlc/xgboost/tree/master/demo>`_ for mo
:caption: Contents:

model
saving_model
Distributed XGBoost with AWS YARN <aws_yarn>
kubernetes
Distributed XGBoost with XGBoost4J-Spark <https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html>
197 changes: 197 additions & 0 deletions doc/tutorials/saving_model.rst
@@ -0,0 +1,197 @@
########################
Introduction to Model IO
########################

In XGBoost 1.0.0, we introduced experimental support for using `JSON
<https://www.json.org/json-en.html>`_ to save and load XGBoost models and the related
hyper-parameters for training, aiming to replace the old binary internal format with an
open format that can be easily reused.  Support for the binary format will continue until
the JSON format is no longer experimental and has satisfactory performance.  This tutorial
aims to share some basic insights into the JSON serialisation method used in XGBoost.
Unless explicitly mentioned otherwise, the following sections assume you are using the
experimental JSON format, which can be enabled by passing
``enable_experimental_json_serialization=True`` as a training parameter, or by providing a
file name with a ``.json`` extension when saving/loading a model:
``booster.save_model('model.json')``.  More details below.

Before we get started, note that XGBoost is a gradient boosting library with a focus on
tree models, which means that inside XGBoost there are two distinct parts: the model, and
the algorithms used to build it.  If you come from the deep learning community, then it
should be clear to you that there is a difference between the neural network structure,
composed of weights with fixed tensor operations, and the optimizers used to train it.

So when one calls ``booster.save_model``, XGBoost saves the trees, some model parameters
like the number of input columns in the trained trees, and the objective function, which
together represent the concept of a "model" in XGBoost.  As for why we save the objective
as part of the model: the objective controls the transformation of the global bias (called
``base_score`` in XGBoost).  Users can share this model with others for prediction or
evaluation, or continue the training with a different set of hyper-parameters, etc.
However, this is not the end of the story.  There are cases where we need to save
something more than just the model itself.  For example, in distributed training XGBoost
performs a checkpointing operation, or for some reason your favorite distributed computing
framework decides to copy the model from one worker to another and continue the training
there.  In such cases, the serialisation output is required to contain enough information
to continue the previous training without the user providing any parameters again.  We
consider such a scenario a memory snapshot (or memory-based serialisation method) and
distinguish it from the normal model IO operation.  In Python, this can be invoked by
pickling the ``Booster`` object, while in R the same can be achieved by accessing
``bst$raw``.  Please refer to the corresponding language binding document for the precise
API (as this feature is quite new, please open an issue if you can't find the appropriate
documents, or better, a PR).

.. note::

  The old binary format doesn't distinguish between the model and the raw memory
  serialisation format; it's a mix of everything, which is part of the reason why we want
  to replace it with a more robust serialisation method.  The JVM package has its own
  memory-based serialisation methods, which may lead to some inconsistencies in the
  output model.  It's a known issue we are trying to address.

To enable JSON format support for model IO (saving only the trees and objective), provide
a file name with ``.json`` as the file extension:

.. code-block:: python

  bst.save_model('model_file_name.json')
To enable JSON as the memory-based serialisation format instead, pass
``enable_experimental_json_serialization`` as a training parameter.  In Python this can be
done by:

.. code-block:: python

  bst = xgboost.train({'enable_experimental_json_serialization': True}, dtrain)
  with open('filename', 'wb') as fd:
      pickle.dump(bst, fd)
Notice that ``filename`` is passed to the Python built-in function ``open``, not to
XGBoost.  Hence the parameter ``enable_experimental_json_serialization`` is required to
enable the JSON format.  As the name suggests, memory-based serialisation captures many
things internal to XGBoost, so it's only suitable for checkpoints, which don't require a
stable output format.  That being said, loading a pickled booster (memory snapshot) in a
different XGBoost version may lead to errors or undefined behavior.  But we promise a
stable output format for the binary model and the JSON model (once it's no longer
experimental), as they are designed to be reusable.  This scheme fits, as Python itself
doesn't guarantee that pickled bytecode can be used across different Python versions.

***************************
Custom objective and metric
***************************

XGBoost accepts user-provided objective and metric functions as an extension.  These
functions are not saved in the model file, as they are a language-dependent feature.  With
Python, users can pickle the model to include these functions in the saved binary.  One
drawback is that the output from pickle is not a stable serialization format and doesn't
work across different Python or XGBoost versions, not to mention different language
environments.  Another way to work around this limitation is to provide these functions
again after the model is loaded.  If the customized function is useful, please consider
making a PR to implement it inside XGBoost; this way we can have your function working
with different language bindings.

********************************************************
Saving and Loading the internal parameters configuration
********************************************************

XGBoost's ``C API`` and ``Python API`` support saving and loading the internal
configuration directly as a JSON string.  In the Python package:

.. code-block:: python

  bst = xgboost.train(...)
  config = bst.save_config()
  print(config)

This will print out something similar to (not the actual output, as it's too long for
demonstration):

.. code-block:: json

  {
    "Learner": {
      "generic_parameter": {
        "enable_experimental_json_serialization": "0",
        "gpu_id": "0",
        "gpu_page_size": "0",
        "n_jobs": "0",
        "random_state": "0",
        "seed": "0",
        "seed_per_iteration": "0"
      },
      "gradient_booster": {
        "gbtree_train_param": {
          "num_parallel_tree": "1",
          "predictor": "gpu_predictor",
          "process_type": "default",
          "tree_method": "gpu_hist",
          "updater": "grow_gpu_hist",
          "updater_seq": "grow_gpu_hist"
        },
        "name": "gbtree",
        "updater": {
          "grow_gpu_hist": {
            "gpu_hist_train_param": {
              "debug_synchronize": "0",
              "gpu_batch_nrows": "0",
              "single_precision_histogram": "0"
            },
            "train_param": {
              "alpha": "0",
              "cache_opt": "1",
              "colsample_bylevel": "1",
              "colsample_bynode": "1",
              "colsample_bytree": "1",
              "default_direction": "learn",
              "enable_feature_grouping": "0",
              "eta": "0.300000012",
              "gamma": "0",
              "grow_policy": "depthwise",
              "interaction_constraints": "",
              "lambda": "1",
              "learning_rate": "0.300000012",
              "max_bin": "256",
              "max_conflict_rate": "0",
              "max_delta_step": "0",
              "max_depth": "6",
              "max_leaves": "0",
              "max_search_group": "100",
              "refresh_leaf": "1",
              "sketch_eps": "0.0299999993",
              "sketch_ratio": "2",
              "subsample": "1"
            }
          }
        }
      },
      "learner_train_param": {
        "booster": "gbtree",
        "disable_default_eval_metric": "0",
        "dsplit": "auto",
        "objective": "reg:squarederror"
      },
      "metrics": [],
      "objective": {
        "name": "reg:squarederror",
        "reg_loss_param": {
          "scale_pos_weight": "1"
        }
      }
    },
    "version": [1, 0, 0]
  }
You can load it back into a model generated by the same version of XGBoost by:

.. code-block:: python

  bst.load_config(config)

This way users can study the internal representation more closely.

************
Future Plans
************

Right now, using the JSON format incurs a longer serialisation time; we have been working
on optimizing the JSON implementation to close the gap between the binary and JSON
formats.  You can track the progress in `#5046 <https://github.com/dmlc/xgboost/pull/5046>`_.
Another important item for JSON format support is a stable and documented `schema
<https://json-schema.org/>`_, based on which one can easily reuse the saved model.
64 changes: 41 additions & 23 deletions include/xgboost/c_api.h
@@ -428,15 +428,15 @@ XGB_DLL int XGBoosterPredict(BoosterHandle handle,
const float **out_result);

/*!
* \brief load model from existing file
* \brief Load model from existing file
* \param handle handle
* \param fname file name
* \return 0 when success, -1 when failure happens
*/
XGB_DLL int XGBoosterLoadModel(BoosterHandle handle,
const char *fname);
/*!
* \brief save model into existing file
* \brief Save model into existing file
* \param handle handle
* \param fname file name
* \return 0 when success, -1 when failure happens
@@ -464,6 +464,45 @@ XGB_DLL int XGBoosterLoadModelFromBuffer(BoosterHandle handle,
XGB_DLL int XGBoosterGetModelRaw(BoosterHandle handle,
bst_ulong *out_len,
const char **out_dptr);

/*!
* \brief Initialize the booster from rabit checkpoint.
* This is used in distributed training API.
* \param handle handle
* \param version The output version of the model.
* \return 0 when success, -1 when failure happens
*/
XGB_DLL int XGBoosterLoadRabitCheckpoint(BoosterHandle handle,
int* version);

/*!
* \brief Save the current checkpoint to rabit.
* \param handle handle
* \return 0 when success, -1 when failure happens
*/
XGB_DLL int XGBoosterSaveRabitCheckpoint(BoosterHandle handle);


/*!
* \brief Save XGBoost's internal configuration into a JSON document.
* \param handle handle to Booster object.
 * \param out_len Length of the output string.
 * \param out_str A valid pointer to an array of characters.  The character array is
 *                allocated and managed by XGBoost, while the pointer to that array needs
 *                to be managed by the caller.
* \return 0 when success, -1 when failure happens
*/
XGB_DLL int XGBoosterSaveJsonParameters(BoosterHandle handle,
bst_ulong *out_len,
char const** out_str);
/*!
* \brief Load XGBoost's internal configuration from a JSON document.
* \param handle handle to Booster object.
* \param json_parameters string representation of a JSON document.
* \return 0 when success, -1 when failure happens
*/
XGB_DLL int XGBoosterLoadJsonParameters(BoosterHandle handle,
char const* json_parameters);

/*!
* \brief dump model, return array of strings representing model dump
* \param handle handle
@@ -570,25 +609,4 @@ XGB_DLL int XGBoosterSetAttr(BoosterHandle handle,
XGB_DLL int XGBoosterGetAttrNames(BoosterHandle handle,
bst_ulong* out_len,
const char*** out);

// --- Distributed training API----
// NOTE: functions in rabit/c_api.h will be also available in libxgboost.so
/*!
* \brief Initialize the booster from rabit checkpoint.
* This is used in distributed training API.
* \param handle handle
* \param version The output version of the model.
* \return 0 when success, -1 when failure happens
*/
XGB_DLL int XGBoosterLoadRabitCheckpoint(
BoosterHandle handle,
int* version);

/*!
* \brief Save the current checkpoint to rabit.
* \param handle handle
* \return 0 when success, -1 when failure happens
*/
XGB_DLL int XGBoosterSaveRabitCheckpoint(BoosterHandle handle);

#endif // XGBOOST_C_API_H_
2 changes: 1 addition & 1 deletion include/xgboost/gbm.h
@@ -32,7 +32,7 @@ struct LearnerModelParam;
/*!
* \brief interface of gradient boosting model.
*/
class GradientBooster {
class GradientBooster : public Model, public Configurable {
protected:
GenericParameter const* generic_param_;

9 changes: 8 additions & 1 deletion include/xgboost/json_io.h
@@ -55,6 +55,12 @@ class JsonReader {
} cursor_;

StringView raw_str_;
bool initialized_;

public:
size_t Pos() const { return cursor_.Pos(); }
size_t Length() const { return raw_str_.size(); }
bool Initialized() const { return initialized_; }

protected:
void SkipSpaces();
@@ -109,8 +115,9 @@

public:
explicit JsonReader(StringView str) :
raw_str_{str} {}
raw_str_{str}, initialized_{true} {}

JsonReader() : initialized_{false} {};
virtual ~JsonReader() = default;

Json Load();
12 changes: 1 addition & 11 deletions include/xgboost/learner.h
@@ -45,24 +45,14 @@ class Json;
*
* \endcode
*/
class Learner : public Model, public rabit::Serializable {
class Learner : public Model, public Configurable, public rabit::Serializable {
public:
/*! \brief virtual destructor */
~Learner() override;
/*!
* \brief Configure Learner based on set parameters.
*/
virtual void Configure() = 0;
/*!
* \brief load model from stream
* \param fi input stream.
*/
void Load(dmlc::Stream* fi) override = 0;
/*!
* \brief save model to stream.
* \param fo output stream
*/
void Save(dmlc::Stream* fo) const override = 0;
/*!
* \brief update the model for one iteration
* With the specified objective function.