RFC: JSON as Next-Generation Model Serialization Format #3980
Comments
I have a few more problems that need to be addressed, mostly related to parameters. Introducing JSON is kind of part of the plan, and it won't conflict with the current proposal (nice job). Will add proper comments as soon as possible. :) And it's really nice to have an organized RFC; it's much better than my original attempt.
... and CBOR
@hcho3 Can you please elaborate a bit more on how we store the tree structure, and give a minimal example?
@tqchen I’ll put up the full schema soon, and as part of that will describe how tree structures will be stored.
Other models
Linear models are also supported, and are actually much easier than tree models. And I have a passion for adding new interesting algorithms.

Deprecating (changing) parameters
The RFC addresses adding new parameters explicitly. But in reality we might also deprecate parameters in the future, due to deprecated algorithm features, duplicated parameters for different components (…), etc.

Backward compatibility
Reading an old model from a newer version of XGBoost. Extra parameters that were removed in the newer version but are present in the old model file can simply be ignored. This corresponds to the forward-compatibility situation of adding new parameters.

Forward compatibility
Reading a new model from an older version of XGBoost. This corresponds to the backward-compatibility situation of adding new parameters. Here the removed parameters, which are not present in the model file, are treated as missing values. Since most parameters have default values, parameters not present in the model file (missing) can also simply be ignored.

Summarize
…

Saving a complete snapshot of XGBoost
Most of the classes (in C++) in XGBoost have their own parameters, for example objectives and split evaluators. Currently these parameters are not saved in a unified way and are, as a matter of fact, rarely saved. My hope is that we add an IO interface to these classes and recurse over all of them when saving/loading XGBoost.

Problems
This way we might actually need a schema-less representation, for example …

Parameter validation
This is a future plan. If there's a typo or an unused parameter in the user-specified model, XGBoost simply ignores it, which might lead to degraded performance or unexpected behavior. @RAMitchell and I want to add extra checks for wrong/unused parameters. It's not yet clear to me how to achieve this, but the basic idea is to let each component be responsible for checking/marking its own unique parameters and let a manager class (maybe …) do the rest.

@hcho3 Feel free to let me know if I can help in any way.
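A rough sketch of the recursive save/load idea described under "Saving a complete snapshot of XGBoost" above; the interface and type names here are hypothetical, not existing XGBoost classes:

```cpp
#include <memory>

struct KeyValueStore;  // hypothetical stand-in for whatever key/value container is chosen

// Every configurable component exposes a uniform IO interface ...
class Serializable {
 public:
  virtual ~Serializable() = default;
  virtual void SaveConfig(KeyValueStore* out) const = 0;  // write own parameters
  virtual void LoadConfig(const KeyValueStore& in) = 0;   // read own parameters
};

// ... and the owning object simply recurses over its members.
class Learner : public Serializable {
 public:
  void SaveConfig(KeyValueStore* out) const override {
    objective_->SaveConfig(out);  // objective saves its own parameters
    gbm_->SaveConfig(out);        // booster saves its own parameters (and recurses further)
  }
  void LoadConfig(const KeyValueStore& in) override {
    objective_->LoadConfig(in);
    gbm_->LoadConfig(in);
  }

 private:
  std::unique_ptr<Serializable> objective_;
  std::unique_ptr<Serializable> gbm_;
};
```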
@trivialfis I like the idea of a complete snapshot, having performed a messy hack to preserve GPU predictor parameters (#3856). That said, we want to do it in a way that minimizes boilerplate code. I will re-read your comment when I complete the draft for the 1.0 schema. Mainly, I'm documenting what's currently being saved.
@trivialfis Also, I used to be afraid of saving more things into the model, because of the future maintenance liability caused by binary serialization. I'm more comfortable with saving everything now that JSON serialization could forestall compatibility headaches.
IMHO we should make XGBoost itself save the hyperparameters used to create and evaluate the model (with the possibility to disable that), and the column names and types, keeping their order (with the possibility to disable that and/or to discard the stored names and use integers starting from 0, without reloading the model). Column names are needed in order to reshape a pandas DataFrame so that its columns have the same order the model was trained on (currently I store this info in separate files). But I still insist on keeping the data strictly schemed; no arbitrary stuff should be saved there, since that would result in bloated models.
@KOLANICH I'm not so sure about that, since column names and types are external to the core C++ codebase. So far, we've had the Python / R / Java wrappers save this information. @trivialfis @RAMitchell What do you think?
It's a viable suggestion that we manage all of this in C++; it may lead to cleaner data management. But that's a topic for another day, since we first have to implement this management code and then implement the IO. Adding this feature later won't bring too many hassles, since we will have very good backward compatibility as explained by @hcho3. A new issue might be desirable after this gets sorted out.

You can make suggestions about what specific items should not be saved into the model file after there is an initial draft (I imagine there will be a long reviewing process; you are welcome to join). Otherwise it's hard to argue what goes into the category of "arbitrary stuff".
@trivialfis @tqchen I've uploaded the full schema as a PDF attachment.
I may want to put up the schema doc in a GitHub repo, so that we can update the LaTeX source upon every schema update and have it automatically compiled into PDF. |
@trivialfis The PDF doc didn’t address your suggestion. Can you take a look and see how the complete snapshot idea can be implemented?
@hcho3 If you put the doc in a GitHub repo I can make PRs. Then you can comment on my changes. :)
Let us just use Markdown for documenting the schema; it should be in the docs eventually.
I really meant arbitrary stuff there. For example, the weather on Venus or a full multi-GiB dataset. If we allowed devs to do such things effortlessly, they would do it, and this would result in heavily bloated model files.

Why not use JSON Schema? It is a machine-readable spec, it can be automatically validated, and it can be rendered into documentation, including a nice interactive one.
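For readers unfamiliar with JSON Schema, here is a rough sketch of how the `NodeStat` object discussed in this RFC might be described in it; this is purely illustrative and not part of the proposal (the project ultimately documented its schema in RST instead):

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "NodeStat",
  "type": "object",
  "properties": {
    "loss_chg":       { "type": "number" },
    "sum_hess":       { "type": "number" },
    "base_weight":    { "type": "number" },
    "leaf_child_cnt": { "type": "integer" },
    "instance_cnt":   { "type": "integer" }
  },
  "required": ["loss_chg", "sum_hess", "base_weight", "leaf_child_cnt"]
}
```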
You can't really prevent the user from saving arbitrary things in their JSON files. What we can do is enumerate which fields are recognized by XGBoost. My schema document does this. Stuff that's not recognized by the schema will be ignored.

I'm not sure how we can support JSON Schema here. For now, I'll be typing up the schema doc in RST.
Yes, we cannot. But we can discourage users from storing their arbitrary data by not providing an API for adding and editing it. Then if someone wants to do that, they have to parse the resulting blob, add the data manually, and serialize it back, so it may well be easier for them to store it in a separate file.
@tqchen I've typeset the schema in RST: https://xgboost-json-schema.readthedocs.io/en/latest/ @trivialfis You can submit a pull request at https://github.com/hcho3/xgboost-json-schema

Update: See hcho3/xgboost-json-schema#3 for a discussion on serializing a more complete snapshot of XGBoost.
Let me try to keep the threads together. Copied from hcho3/xgboost-json-schema#3.

Before I start working on it, please consider deprecating the current way we save the model. From the specified schema, the JSON file largely mirrors what is saved in the current binary format. For example, the draft specifies something like the following for the learner:

```
{
"learner_model_param" : LearnerModelParam,
"predictor_param" : PredictorParam,
"name_obj" : string,
"name_gbm" : string,
"gbm" : GBM,
"attributes" : StringKeyValuePairCollection,
"eval_metrics" : [ array of string ],
"count_poisson_max_delta_step" : floating-point
}
```

Here the draft specifies that we save each component's parameters directly as fields of the learner. What I have in mind instead is something like:

```
{
// This belongs to learner, hence handled by learner
"LearnerTrainParam": { LearnerTrainParam },
// No `LearnerModelParameter`, we don't need it since JSON can save complete model.
"predictor" : "gpu_predictor",
"gpu_predictor" : {
"GPUPredictionParam": { ... }
},
"gbm" : "gbtree",
"gbtree" : { GBM },
// This can also be an array, I won't argue which one is better.
"eval_metrics" : {
"merror": {...},
"mae": {...}
}
}
```

Update: Actually …

For the actual IO, `Learner::Load` could look roughly like this:

```cpp
void Learner::Load(KVStore& kv_store) {
std::string predictor_name = kv_store["predictor"].ToString(); // say "gpu_predictor" or "cpu_predictor"
auto p_predictor = Predictor::Create(predictor_name);
p_predictor->Load(kv_store[predictor_name]);
// (.. other io ...)
KVStore& eval_metrics = kv_store["eval_metrics"];
std::vector<Metric> metrics (eval_metrics.ToMap().size());
for (auto& m : metrics) {
    m.Load(eval_metrics);
}
}
```

Inside each metric (say `mae`), `Load` would look itself up by name:

```cpp
void Load(KVStore& kv_store) {
KVStore self = kv_store["mae"]; // Look up itself by name.
// load parameters from `self`
// load other stuffs if needed.
}
```

Motivation
The reasons I want to do it in this way are: …
The most important one is (2).

What should be saved
As pointed out by @hcho3, we should draft a list of what goes into the final dump file. I will start working on it once these points are approved by the participants.

Possible objections
…
@hcho3 @KOLANICH @tqchen @thvasilo @RAMitchell Can you take a look if time allows?
@trivialfis I cast my vote +1 for your proposal. I think it is possible to use …
@hcho3 Let me give drafting the schema a try. :)
@LeZhengThu No, this thread is just a proposal. It will be a while until this feature is implemented.
@hcho3 Thanks. Hope this feature can go live soon. I like the idea very much.
@hcho3 The last time we talked about JSON, it was suggested that …
One issue that might arise with JSON serialization is the treatment of serialized float values. Presumably XGBoost will be aware that all decimal values in the JSON should be cast as floats; however, I believe most typical JSON parsers will treat the values as doubles, leading to discrepancies for downstream users who don't implement a post-parsing step to cast doubles to floats in the resulting object. As @hcho3 shows, it is possible to do the round-trip using … As @thvasilo points out, any downstream users like SHAP by @slundberg and Treelite will need to be aware of this treatment and implement a post-parsing step to cast any doubles to floats (and then treat them as floats in any further calculations, i.e. …). As this is my first comment in this issue, I hope it is helpful!
2 additional digits are not needed to guarantee that casting the decimal representation will result in the same float, see dmlc#3980 (comment)
@ras44 The JSON schema clearly says that all floating-point values are 32-bit: https://xgboost-json-schema.readthedocs.io/en/latest/#notations
2 additional digits are not needed to guarantee that casting the decimal representation will result in the same float, see #3980 (comment)
@hcho3 Thanks for your note. My main point is that if the JSON dump is to be used by anything other than XGBoost (which will presumably need to implement a custom JSON parser that encodes/decodes floats according to the schema), the user will have to consider multiple aspects in order to produce the same results as XGBoost. It appears this is already happening in issues such as #3960, #4060, #4097, and #4429. I am providing PR #4439 explaining some of the issues and considerations in case it could be of use for people who aim to work with a JSON dump.
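A small, self-contained sketch of the round-trip behavior discussed above (not XGBoost code): serialize a `float` with enough decimal digits for an exact round-trip, parse it back as a `double` the way a typical JSON parser would, then cast back to `float`:

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <limits>

int main() {
  const float original = 0.1234567f;  // a value as XGBoost would hold it internally

  // Serialize with max_digits10 (9 for float) significant digits so that
  // parsing the decimal text back cannot lose information.
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%.*g",
                std::numeric_limits<float>::max_digits10, original);

  // A typical JSON parser hands numbers back as double ...
  const double parsed = std::strtod(buf, nullptr);

  // ... so downstream consumers must cast back to float to match XGBoost bit-for-bit.
  const float recovered = static_cast<float>(parsed);
  assert(recovered == original);
  return 0;
}
```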
@hcho3 I will go ahead and create an initial draft implementation based on the schema.
I will try to use dmlc::JSON, but add abstractions whenever needed.

Thanks for the update. Sorry I haven't had a chance to tackle the implementation yet. Let me know if you need any help or feedback.
I want to revisit some of this discussion around using external dependencies as our JSON engine. After some discussion with @trivialfis and looking at #5046, it seems clear to me that a simple, partial JSON implementation is possible in dmlc-core or XGBoost, but not one that is feature-complete or has good performance. We are looking at thousands of lines of code for a custom implementation with good enough performance. Achieving better performance than the naive implementation may turn out to be critical, as our distributed implementations serialize during training. I am normally strongly against dependencies, but in this case using a header-only JSON dependency may solve more problems than it creates.
I'm inclined toward a customized JSON writer/reader as long as it is well tested; introducing a dependency out of our control could be a liability, with many unused features. Having JSON could also help simplify the rabit recovery protocol: a failed worker can recover from a flexibly sized key/value JSON payload from an adjacent host.
nlohmann/JSON is not especially performant.

@KOLANICH Indeed, it has a nice interface, but neither its memory usage nor its computation is efficient.
Closing, as experimental support with a schema is now mainlined. Further improvements will come as separate PRs.
RFC: JSON as Next-Generation Model Serialization Format
In this RFC document, I propose that the XGBoost project eventually migrate to JSON for serializing decision tree models.
Special thanks to @trivialfis for initially coming up with the idea of using JSON.
Motivation: Why a new serialization format?
The current method of serialization used in XGBoost is a binary dump of model parameters. For instance, the decision tree class (`RegTree`) is serialized by writing its parameter struct and node arrays directly to an output stream, roughly as in the sketch shown below.

The binary serialization method has a few benefits: it is straightforward to implement, has low memory overhead (since we can read from and write to file streams directly), is fast to run (no pre-processing required), and produces compact model binaries.
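A minimal sketch of what such a binary dump looks like, assuming a `dmlc::Stream`-style interface; the member names and stand-in structs here are illustrative, not the exact XGBoost source:

```cpp
#include <cstddef>
#include <vector>
#include <dmlc/io.h>

// Illustrative stand-ins for the real XGBoost types.
struct TreeParam { int num_nodes; /* ... */ };
struct Node      { /* split feature index, split condition, child ids, ... */ };
struct NodeStat  { /* per-node statistics, discussed below */ };

class RegTree {
 public:
  void Save(dmlc::Stream* fo) const {
    fo->Write(&param_, sizeof(TreeParam));                       // fixed-size header
    fo->Write(nodes_.data(), sizeof(Node) * nodes_.size());      // raw node array
    fo->Write(stats_.data(), sizeof(NodeStat) * stats_.size());  // raw per-node statistics
  }
  void Load(dmlc::Stream* fi) {
    fi->Read(&param_, sizeof(TreeParam));
    nodes_.resize(param_.num_nodes);
    stats_.resize(param_.num_nodes);
    fi->Read(nodes_.data(), sizeof(Node) * nodes_.size());       // expects exactly sizeof(Node) * M bytes
    fi->Read(stats_.data(), sizeof(NodeStat) * stats_.size());   // expects exactly sizeof(NodeStat) * M bytes
  }

 private:
  TreeParam param_;
  std::vector<Node> nodes_;
  std::vector<NodeStat> stats_;
};
```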
Unfortunately, the method has one significant shortcoming: it is not possible to add new fields without breaking backward compatibility. Backward compatibility refers to the ability of a new version of XGBoost to read model files produced by older versions of XGBoost. To see why we can't add new fields, let's refer to the decision tree class example shown above. The current `NodeStat` class has four fields:
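A minimal sketch of those four fields, assuming 32-bit floats and integers; the exact declaration in the XGBoost source may differ:

```cpp
struct NodeStat {
  float loss_chg;       // change in loss caused by the split at this node
  float sum_hess;       // sum of hessian values of the instances in this node
  float base_weight;    // weight assigned to the node
  int leaf_child_cnt;   // number of leaf children
};  // four 4-byte fields
```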
The total byte size of `NodeStat` is 16 bytes. Now imagine a fictitious scenario where we want to add a new field called `instance_cnt`. This field would store the number of instances (data points) that are associated with the node:
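A sketch of the extended struct; a 64-bit count is assumed here so that the total comes to the 24 bytes mentioned below:

```cpp
#include <cstdint>

struct NodeStat {
  float loss_chg;
  float sum_hess;
  float base_weight;
  int leaf_child_cnt;
  std::int64_t instance_cnt;  // NEW: number of instances associated with this node
};  // 16 + 8 = 24 bytes
```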
Note that the new version of `NodeStat` is now 24 bytes. We have already broken backward compatibility: when the latest version of XGBoost runs the loading snippet above, it will attempt to read `24 * M` bytes, where `M` is the number of nodes in the decision tree. However, if the saved model was serialized by an old version of XGBoost, there would be only `16 * M` bytes to read in the serialized file! The program will either crash or show some kind of undefined behavior.

What would be the work-around? We can add extra logic in the snippet above to check which version of XGBoost produced the serialized file:
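A rough sketch of what such version-dependent reading logic might look like, continuing the illustrative `RegTree` sketch above; the flag and helper names are hypothetical, especially since (as noted below) current XGBoost does not even store a version number:

```cpp
// Hypothetical work-around: branch on the detected file format when reading
// the per-node statistics.
void RegTree::LoadStats(dmlc::Stream* fi, bool file_has_instance_cnt) {
  stats_.resize(param_.num_nodes);
  if (!file_has_instance_cnt) {
    // Old 16-byte layout, without instance_cnt.
    struct OldNodeStat { float loss_chg, sum_hess, base_weight; int leaf_child_cnt; };
    std::vector<OldNodeStat> old_stats(param_.num_nodes);
    fi->Read(old_stats.data(), sizeof(OldNodeStat) * old_stats.size());
    for (std::size_t i = 0; i < old_stats.size(); ++i) {
      stats_[i].loss_chg       = old_stats[i].loss_chg;
      stats_[i].sum_hess       = old_stats[i].sum_hess;
      stats_[i].base_weight    = old_stats[i].base_weight;
      stats_[i].leaf_child_cnt = old_stats[i].leaf_child_cnt;
      stats_[i].instance_cnt   = -1;  // unknown in old files
    }
  } else {
    // New 24-byte layout, with instance_cnt.
    fi->Read(stats_.data(), sizeof(NodeStat) * stats_.size());
  }
}
```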
That's a lot of lines for reading a single C++ vector. In general, extra logic for backward compatibility can accumulate over time and inflict a maintenance burden on future contributors. To make matters worse, the current XGBoost codebase does not save the version number, so there is no way to query the version of the serialized file. The proposed work-around is not only messy but also not actually feasible.
Proposal: Use JSON serialization to ensure backward and forward compatibility
JSON (JavaScript Object Notation) is a lightweight serialization format built on two fundamental structures: JSON objects (a set of key-value pairs) and JSON arrays (an ordered list of values). In this section, I will argue that it is possible to use JSON as a future-proof method for ensuring backward and forward compatibility, defined as follows:

- Backward compatibility: a new version of XGBoost can read model files produced by older versions.
- Forward compatibility: an older version of XGBoost can read model files produced by newer versions.

We will make use of two useful characteristics of JSON objects: values are looked up by key, so a missing key can be detected reliably; and extra keys that are not recognized can simply be ignored.
Let's return to the `NodeStat` example from the previous section. Version 1 of `NodeStat` can be expressed as a JSON object as in the first snippet below (the values are made up as an example); version 2 would be something like the second snippet.
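Minimal sketches of the two versions; the field values are invented, as the text notes:

```json
{
  "loss_chg": 0.52,
  "sum_hess": 13.0,
  "base_weight": -0.1,
  "leaf_child_cnt": 2
}
```

```json
{
  "loss_chg": 0.52,
  "sum_hess": 13.0,
  "base_weight": -0.1,
  "leaf_child_cnt": 2,
  "instance_cnt": 240
}
```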
Let's first check if backward compatibility holds. The latest version of XGBoost will attempt to read the keys `loss_chg`, `sum_hess`, `base_weight`, `leaf_child_cnt`, and `instance_cnt` from the given JSON file. If the JSON file was produced by an old version, it will only have `loss_chg`, `sum_hess`, `base_weight`, and `leaf_child_cnt`. The JSON parser is able to reliably detect that `instance_cnt` is missing. This is much better than having the program crash upon encountering missing bytes, as the binary parser would. Upon detecting missing keys, we can mark the corresponding values as missing. For the `NodeStat` example, we'd put in -1 to indicate the missing value, since -1 cannot be a valid value for the instance count. For optional fields such as `instance_cnt`, this response is sufficient.

On the other hand, some fields are not so optional. For instance, a brand-new feature could be added that makes use of new key-value pairs. In this case, we will need to throw an error whenever the user attempts to make use of that particular feature. We are still doing better than before, however, since the JSON parser does not have to perform this kind of error handling. The JSON parser would simply mark missing key-value pairs as missing, and the later parts of the codebase that use the missing pairs may choose to report errors. We are providing what is known as graceful degradation, where users can bring old model files and still use basic functionality.
How about forward compatibility? The JSON file given by the later version of XGBoost will contain the extra key `instance_cnt`, which can simply be ignored by the older version. So forward compatibility holds as well.

Freed from future compatibility issues, developers will be able to focus on more interesting development work.
Semantic versioning
Every JSON file produced by XGBoost should have major and minor version fields at the root level:
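A minimal sketch of such root-level version fields; the exact field names below are illustrative, and the schema linked in Appendix A is authoritative:

```json
{
  "major_version": 1,
  "minor_version": 0,
  "...": "..."
}
```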
All versions that have the same major version are compatible with each other. For example, a program using format version 1.1 can read a model with any version of form 1.x. The minor version should be increased whenever a new field is added to a JSON object.
The major version should be kept to 1, unless a major revision occurs that breaks backward and/or forward compatibility. So version 2.x will not be compatible with 1.x.
See Appendix A for the full schema of Version 1.0.
Other benefits of JSON
Compatibility is the largest motivator for me to propose JSON as the next-generation serialization format. However, there are other benefits too. In particular, every parameter class (`dmlc::Parameter`) is represented neatly as a JSON object, since parameter objects are just collections of key-value pairs. We are now free to add more fields to DMLC parameter objects.
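As a quick illustration of why this mapping is natural, here is a typical `dmlc::Parameter` declaration (the struct and field names are made up for this example) and the kind of JSON object it corresponds to:

```cpp
#include <dmlc/parameter.h>

// A hypothetical parameter structure, declared the usual dmlc-core way.
// (Registration via DMLC_REGISTER_PARAMETER in a .cc file is omitted here.)
struct ExampleTrainParam : public dmlc::Parameter<ExampleTrainParam> {
  float learning_rate;
  int max_depth;
  DMLC_DECLARE_PARAMETER(ExampleTrainParam) {
    DMLC_DECLARE_FIELD(learning_rate).set_default(0.3f);
    DMLC_DECLARE_FIELD(max_depth).set_default(6);
  }
};

// Conceptually this maps one-to-one onto a JSON object such as
//   { "learning_rate": 0.3, "max_depth": 6 }
// and adding a new field to the struct simply adds a new key to the object.
```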
Implementation of the proposal with DMLC JSON parser
There are many excellent 3rd-party JSON libraries for C++, e.g. Tencent/RapidJSON. However, we will eschew 3rd-party libraries. As @tqchen pointed out, extra dependencies make it more difficult to port XGBoost to various platforms and targets. In addition, most JSON libraries assume free-form, schema-less objects, where JSON objects and arrays may nest each other arbitrarily (thanks @KOLANICH and @tqchen for pointing this out). This assumption adds unnecessary overhead, since our proposed XGBoost JSON format has a regular schema.
Fortunately, the DMLC-Core repository comes with a built-in JSON parser and serializer. What's more, the DMLC JSON parser (`dmlc::JSONReader`) lets us make assumptions about how JSON objects and arrays are to be nested. For instance, if we expect a JSON object that contains a single key-value pair with the key `foo` and the value an array of integers, we'd write something like the sketch below.

One tricky part is that the built-in helper class for reading structures will throw errors upon encountering missing or extra key-value pairs. I will write a custom helper class as part of the draft implementation of the RFC. The custom helper class will handle missing key-value pairs in a consistent way. Appendix B describes how the missing values are to be represented for each data type.
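A rough sketch of what that could look like with `dmlc::JSONReader` and the `dmlc::JSONObjectReadHelper` helper from dmlc-core; treat the exact calls as an assumption based on dmlc-core's `dmlc/json.h`, not as the final XGBoost implementation:

```cpp
#include <sstream>
#include <string>
#include <vector>
#include <dmlc/json.h>

int main() {
  std::string payload = R"({"foo": [1, 2, 3]})";
  std::istringstream is(payload);
  dmlc::JSONReader reader(&is);

  std::vector<int> foo;
  dmlc::JSONObjectReadHelper helper;
  helper.DeclareField("foo", &foo);  // expect the key "foo" holding an array of integers
  helper.ReadAllFields(&reader);     // errors out if "foo" is missing or an unknown key appears
  return 0;
}
```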
Addressing potential objections
Q. How about space efficiency? JSON will result in bigger model files than the current binary serialization method.
A. There are a few 3rd-party libraries (e.g. MessagePack) that let us encode JSON into a more compact binary form. We will consider adding a plug-in for generating the binary encoding, so that 3rd-party dependencies remain optional. There will still be some overhead for storing integers and floats, but the cost is in my opinion worthwhile; see the previous sections for the compatibility benefits of JSON.
Q. How about read/write performance? All the text operations will slow things down.
A. I admit that it is hard to beat the performance of binary serialization; how can you do better than simply dumping bit patterns from memory? We are really making a trade-off here, trading read/write speed for future extensibility. However, we are able to soften the blow by serializing decision trees in parallel; see microsoft/LightGBM#1083 for an example.
Q. Binary formats are not necessarily closed to future extension. With some care, it is possible to design a new binary format with future extensibility in mind.
A. It is certainly possible to design a binary format to enable future extensions. In fact, had the current version of XGBoost stored the version number in model artifacts, we could have added extra logic for handling multiple formats. However, I still think JSON is superior, for two reasons: 1) designing an extensible binary format takes great precision and care to get right; and 2) extra logic for managing multiple versions in a custom parser can be messy, as we saw in the first section.
Appendix A: Full 1.0 Schema
The full schema can be accessed at https://xgboost-json-schema.readthedocs.io/en/latest/. The source is hosted at https://github.com/hcho3/xgboost-json-schema.
Appendix B: Representation of missing values for various data types
""
nan
std::numeric_limits<int_type>::max()
null
(null object)The text was updated successfully, but these errors were encountered: