Commit 076f58c (1 parent: e089e16)

* Split up model IO and serialization.
* Add JSON for both model IO and serialisation.
* Add tests for JSON IO in both cxx and Python.
* Rigorous tests for training continuation.
* Add basic documentation for the serialisation format.
* Enabled save/load config in Python pickle.

Showing 41 changed files with 1,979 additions and 452 deletions.
@@ -74,6 +74,7 @@ tags
 *.class
 target
 *.swp
+.gdb_history

 # cpp tests and gcov generated files
 *.gcov
@@ -0,0 +1,197 @@
########################
Introduction to Model IO
########################

In XGBoost 1.0.0, we introduced experimental support for using `JSON
<https://www.json.org/json-en.html>`_ to save/load XGBoost models and the related
hyper-parameters for training, aiming to replace the old binary internal format with an
open format that can be easily reused.  Support for the binary format will continue until
the JSON format is no longer experimental and has satisfactory performance.  This
tutorial aims to share some basic insights into the JSON serialisation method used in
XGBoost.  Unless explicitly mentioned, the following sections assume you are using the
experimental JSON format, which can be enabled by passing
``enable_experimental_json_serialization=True`` as a training parameter, or by providing
a file name with a ``.json`` extension when saving/loading the model:
``booster.save_model('model.json')``.  More details below.

Before we get started, recall that XGBoost is a gradient boosting library with a focus on
tree models, which means that inside XGBoost there are two distinct parts: the model, and
the algorithms used to build it.  If you come from the Deep Learning community, the
analogous distinction is between the neural network structure composed of weights with
fixed tensor operations, and the optimizers used to train it.

So when one calls ``booster.save_model``, XGBoost saves the trees, some model parameters
like the number of input columns in the trained trees, and the objective function, which
together represent the concept of "model" in XGBoost.  As for why we save the objective
as part of the model: the objective controls the transformation of the global bias
(called ``base_score`` in XGBoost).  Users can share this model with others for
prediction, evaluation, or to continue training with a different set of
hyper-parameters, etc.  However, this is not the end of the story.  There are cases where
we need to save more than just the model itself.  For example, in distributed training,
XGBoost performs checkpointing operations.  Or, for some reason, your favorite
distributed computing framework decides to copy the model from one worker to another and
continue the training there.  In such cases, the serialisation output is required to
contain enough information to continue the previous training without the user providing
any parameters again.  We consider such a scenario a memory snapshot (or memory based
serialisation method) and distinguish it from the normal model IO operation.  In Python,
this can be invoked by pickling the ``Booster`` object, while in R the same can be
achieved by accessing ``bst$raw``.  Please refer to the corresponding language binding
document for the precise API (as this feature is quite new, please open an issue if you
can't find appropriate documents, or better, a PR).

.. note::

  The old binary format doesn't distinguish between the model and the raw memory
  serialisation format; it's a mix of everything, which is part of the reason why we
  want to replace it with a more robust serialisation method.  The JVM package has its
  own memory based serialisation methods, which may lead to some inconsistencies in the
  output model.  It's a known issue we are trying to address.

To enable JSON format support for model IO (saving only the trees and objective),
provide a file name with ``.json`` as the file extension:

.. code-block:: python

  bst.save_model('model_file_name.json')

To enable JSON as the memory based serialisation format instead, pass
``enable_experimental_json_serialization`` as a training parameter.  In Python this can
be done by:

.. code-block:: python

  bst = xgboost.train({'enable_experimental_json_serialization': True}, dtrain)
  with open('filename', 'wb') as fd:
      pickle.dump(bst, fd)

Notice that ``filename`` here is handled by the Python built-in function ``open``, not by
XGBoost.  Hence the parameter ``enable_experimental_json_serialization`` is required to
enable the JSON format.  As the name suggests, memory based serialisation captures many
things internal to XGBoost, so it's only suitable for checkpoints, which don't require a
stable output format.  That being said, loading a pickled booster (memory snapshot) in a
different XGBoost version may lead to errors or undefined behavior.  However, we do
guarantee a stable output format for the binary model and the JSON model (once it's no
longer experimental), as they are designed to be reusable.  This scheme fits, as Python
itself doesn't guarantee that pickled bytecode can be used across Python versions.
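
Restoring such a checkpoint is just the reverse ``pickle`` call.  Below is a minimal,
dependency-free sketch of the round trip, where a plain dictionary stands in for the
``Booster`` object and the file name ``checkpoint.pkl`` is only illustrative:

```python
import pickle

# Stand-in for a trained Booster: pickling captures the full in-memory state,
# including hyper-parameters, so nothing needs to be re-supplied on reload.
snapshot = {'trees': ['tree0', 'tree1'], 'objective': 'reg:squarederror'}

with open('checkpoint.pkl', 'wb') as fd:
    pickle.dump(snapshot, fd)

# Later, possibly on another worker, the checkpoint is restored as-is.
with open('checkpoint.pkl', 'rb') as fd:
    restored = pickle.load(fd)

assert restored == snapshot
```

With a real ``Booster`` the same two calls apply, and the caveat above still holds: the
restored object is only guaranteed to work with the XGBoost version that wrote it.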

***************************
Custom objective and metric
***************************

XGBoost accepts user provided objective and metric functions as an extension.  These
functions are not saved in the model file, as they are a language dependent feature.
With Python, users can pickle the model to include these functions in the saved binary.
One drawback is that the output from pickle is not a stable serialisation format and
doesn't work across different Python versions or XGBoost versions, not to mention
different language environments.  Another way to work around this limitation is to
provide these functions again after the model is loaded.  If the customized function is
useful, please consider making a PR for implementing it inside XGBoost; this way we can
have your functions working with different language bindings.
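
As an illustration of what must be re-supplied after loading, here is a sketch of a
custom squared-error objective.  The function name is hypothetical; in XGBoost proper
the second argument is a ``DMatrix`` and the labels come from ``dtrain.get_label()``,
but plain lists are used here so the example is self-contained:

```python
# Hypothetical custom objective: XGBoost expects a function returning the
# per-sample gradient and hessian of the loss w.r.t. the raw prediction.
# For 1/2 * (pred - label)^2 these are (pred - label) and 1 respectively.
def squared_error_obj(preds, labels):
    grad = [p - y for p, y in zip(preds, labels)]
    hess = [1.0 for _ in preds]
    return grad, hess

grad, hess = squared_error_obj([0.5, 2.0], [1.0, 1.0])
print(grad)  # [-0.5, 1.0]
```

Because this function lives in your script rather than in the model file, it has to be
passed to the training call again after ``load_model``.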

********************************************************
Saving and Loading the internal parameters configuration
********************************************************

XGBoost's ``C API`` and ``Python API`` support saving and loading the internal
configuration directly as a JSON string.  In the Python package:

.. code-block:: python

  bst = xgboost.train(...)
  config = bst.save_config()
  print(config)

This will print out something similar to (not the actual output, as it's too long for
demonstration):

.. code-block:: json

  {
    "Learner": {
      "generic_parameter": {
        "enable_experimental_json_serialization": "0",
        "gpu_id": "0",
        "gpu_page_size": "0",
        "n_jobs": "0",
        "random_state": "0",
        "seed": "0",
        "seed_per_iteration": "0"
      },
      "gradient_booster": {
        "gbtree_train_param": {
          "num_parallel_tree": "1",
          "predictor": "gpu_predictor",
          "process_type": "default",
          "tree_method": "gpu_hist",
          "updater": "grow_gpu_hist",
          "updater_seq": "grow_gpu_hist"
        },
        "name": "gbtree",
        "updater": {
          "grow_gpu_hist": {
            "gpu_hist_train_param": {
              "debug_synchronize": "0",
              "gpu_batch_nrows": "0",
              "single_precision_histogram": "0"
            },
            "train_param": {
              "alpha": "0",
              "cache_opt": "1",
              "colsample_bylevel": "1",
              "colsample_bynode": "1",
              "colsample_bytree": "1",
              "default_direction": "learn",
              "enable_feature_grouping": "0",
              "eta": "0.300000012",
              "gamma": "0",
              "grow_policy": "depthwise",
              "interaction_constraints": "",
              "lambda": "1",
              "learning_rate": "0.300000012",
              "max_bin": "256",
              "max_conflict_rate": "0",
              "max_delta_step": "0",
              "max_depth": "6",
              "max_leaves": "0",
              "max_search_group": "100",
              "refresh_leaf": "1",
              "sketch_eps": "0.0299999993",
              "sketch_ratio": "2",
              "subsample": "1"
            }
          }
        }
      },
      "learner_train_param": {
        "booster": "gbtree",
        "disable_default_eval_metric": "0",
        "dsplit": "auto",
        "objective": "reg:squarederror"
      },
      "metrics": [],
      "objective": {
        "name": "reg:squarederror",
        "reg_loss_param": {
          "scale_pos_weight": "1"
        }
      }
    },
    "version": [1, 0, 0]
  }

You can load it back into a model generated by the same version of XGBoost with:

.. code-block:: python

  bst.load_config(config)

This way users can study the internal representation more closely.
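
Since the configuration is plain JSON, it can also be inspected or modified with
standard tools before being passed back to ``load_config``.  A small sketch, using a
trimmed-down stand-in for the real ``bst.save_config()`` output:

```python
import json

# Trimmed-down stand-in for the string returned by bst.save_config();
# note that the real output stores all parameter values as strings.
config = '''
{
  "Learner": {
    "learner_train_param": {"booster": "gbtree", "objective": "reg:squarederror"},
    "gradient_booster": {"name": "gbtree"}
  },
  "version": [1, 0, 0]
}
'''

parsed = json.loads(config)
objective = parsed["Learner"]["learner_train_param"]["objective"]
print(objective)  # reg:squarederror

# Edit a value, then serialise back before handing it to load_config.
parsed["Learner"]["learner_train_param"]["objective"] = "reg:logistic"
new_config = json.dumps(parsed)
```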

************
Future Plans
************

Right now, using the JSON format incurs a longer serialisation time; we have been working
on optimizing the JSON implementation to close the gap between the binary format and the
JSON format.  You can track the progress in `#5046
<https://github.com/dmlc/xgboost/pull/5046>`_.  Another important item for JSON format
support is a stable and documented `schema <https://json-schema.org/>`_, based on which
one can easily reuse the saved model.