RFC: JSON as Next-Generation Model Serialization Format #3980
Comments
I have a few more problems that need to be addressed, mostly related to parameters. Introducing JSON is kind of part of the plan, and it won't conflict with the current proposal (nice job). Will add proper comments as soon as possible. :) And it's really nice to have an organized RFC; it's much better than my original attempt.
... and CBOR
@hcho3 Can you please elaborate a bit more on how we store the tree structure, and give a minimal example?
@tqchen I’ll put up the full schema soon, and as part of that will describe how tree structures will be stored.
Other models
Linear models are also supported, and are actually much easier than tree models. And I have a passion for adding new interesting algorithms.

Deprecating (changing) parameters
The RFC addresses adding new parameters explicitly. But in reality we might also deprecate parameters in the future, due to deprecated algorithm features, duplicated parameters for different components (…), etc.

Backward compatibility
Reading an old model from a newer version of XGBoost. Extra parameters that were removed in the newer version but are present in the old model file can simply be ignored. This corresponds to the forward-compatibility situation of adding new parameters.

Forward compatibility
Reading a new model from an older version of XGBoost. This corresponds to the backward-compatibility situation of adding new parameters. Here the removed parameters, which are not present in the model file, are treated as missing values. Since most parameters have default values, parameters not present in the model file (missing) can also simply be ignored.

Summarize
…

Saving a complete snapshot of XGBoost
Most of the classes (in C++) in XGBoost have their own parameters, for example objectives and split evaluators. Currently these parameters are not saved in a unified way and are, as a matter of fact, rarely saved. My hope is that we add an IO interface to these classes and recurse over all of them when saving/loading XGBoost.

Problems
This way we might actually need a schema-less representation, for example …

Parameter validation
This is a future plan. If there's a typo or an unused parameter in the user-specified model, XGBoost simply ignores it, which might lead to degraded performance or unexpected behavior. @RAMitchell and I want to add extra checks for wrong/unused parameters. It's not yet clear to me how to achieve this, but the basic idea is to let each component be responsible for checking/marking its own unique parameters and let a manager class (maybe …) do the rest.

@hcho3 Feel free to let me know if I can help in any way.
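A rough sketch of the recursive save/load idea described under "Saving a complete snapshot of XGBoost" above; the interface and type names here are hypothetical, not existing XGBoost classes:

```cpp
#include <memory>

struct KeyValueStore;  // hypothetical stand-in for whatever key/value container is chosen

// Every configurable component exposes a uniform IO interface ...
class Serializable {
 public:
  virtual ~Serializable() = default;
  virtual void SaveConfig(KeyValueStore* out) const = 0;  // write own parameters
  virtual void LoadConfig(const KeyValueStore& in) = 0;   // read own parameters
};

// ... and the owning object simply recurses over its members.
class Learner : public Serializable {
 public:
  void SaveConfig(KeyValueStore* out) const override {
    objective_->SaveConfig(out);  // objective saves its own parameters
    gbm_->SaveConfig(out);        // booster saves its own parameters (and recurses further)
  }
  void LoadConfig(const KeyValueStore& in) override {
    objective_->LoadConfig(in);
    gbm_->LoadConfig(in);
  }

 private:
  std::unique_ptr<Serializable> objective_;
  std::unique_ptr<Serializable> gbm_;
};
```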
@trivialfis I like the idea of a complete snapshot, having performed a messy hack to preserve GPU predictor parameters (#3856). That said, we want to do it in a way that minimizes boilerplate code. I will re-read your comment when I complete the draft for the 1.0 schema. Mainly, I'm documenting what's currently being saved.
@trivialfis Also, I used to be afraid of saving more things into the model, because of the future maintenance liability caused by binary serialization. I'm more comfortable with saving everything now that JSON serialization could forestall compatibility headaches.
IMHO we should make XGBoost itself save the hyperparameters used to create and evaluate the model (with the possibility to disable that), and the column names and types, keeping their order (with the possibility to disable that and/or to discard the stored names and use integers starting from 0, without reloading the model). Column names are needed in order to reshape a pandas DataFrame so that its columns have the same order the model was trained on (currently I store this info in separate files). But I still insist on keeping the data strictly schemed; no arbitrary stuff should be saved there, since that would result in bloated models.
@KOLANICH I'm not so sure about that, since column names and types are external to the core C++ codebase. So far, we've had the Python / R / Java wrappers save this information. @trivialfis @RAMitchell What do you think?
It's a viable suggestion that we manage all of this in C++; it may lead to cleaner data management. But that's a topic for another day, since we first have to implement this management code and then implement the IO. Adding this feature later won't bring too many hassles, since we will have very good backward compatibility as explained by @hcho3. A new issue might be desirable after this gets sorted out.

You can make suggestions about what specific items should not be saved into the model file after there is an initial draft (I imagine there will be a long reviewing process; you are welcome to join). Otherwise it's hard to argue what goes into the category of "arbitrary stuff".
@trivialfis @tqchen I've uploaded the full schema as a PDF attachment.
I may want to put up the schema doc in a GitHub repo, so that we can update the LaTeX source upon every schema update and have it automatically compiled into PDF. |
@trivialfis The PDF doc didn’t address your suggestion. Can you take a look and see how the complete snapshot idea can be implemented?
@hcho3 If you put the doc in a GitHub repo I can make PRs. Then you can comment on my changes. :)
Let us just use Markdown for documenting the schema; it should be in the docs eventually.
I really meant arbitrary stuff there. For example, the weather on Venus or a full multi-GiB dataset. If we allowed devs to do such things effortlessly, they would do it, and this would result in heavily bloated model files.

Why not use JSON Schema? It is a machine-readable spec, it can be automatically validated, and it can be rendered into documentation, including a nice interactive one.
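For readers unfamiliar with JSON Schema, here is a rough sketch of how the `NodeStat` object discussed in this RFC might be described in it; this is purely illustrative and not part of the proposal (the project ultimately documented its schema in RST instead):

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "NodeStat",
  "type": "object",
  "properties": {
    "loss_chg":       { "type": "number" },
    "sum_hess":       { "type": "number" },
    "base_weight":    { "type": "number" },
    "leaf_child_cnt": { "type": "integer" },
    "instance_cnt":   { "type": "integer" }
  },
  "required": ["loss_chg", "sum_hess", "base_weight", "leaf_child_cnt"]
}
```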
You can't really prevent the user from saving arbitrary things in their JSON files. What we can do is enumerate which fields are recognized by XGBoost. My schema document does this. Stuff that's not recognized by the schema will be ignored.

I'm not sure how we can support JSON Schema here. For now, I'll be typing up the schema doc in RST.
Yes, we cannot. But we can discourage users from storing their arbitrary data by not providing an API for adding and editing it. Then if someone wants to do that, they have to parse the resulting blob, add the data manually, and serialize it back, so it may well be easier for them to store it in a separate file.
@tqchen I've typeset the schema in RST: https://xgboost-json-schema.readthedocs.io/en/latest/ @trivialfis You can submit a pull request at https://github.com/hcho3/xgboost-json-schema

Update: See hcho3/xgboost-json-schema#3 for a discussion on serializing a more complete snapshot of XGBoost.
Let me try to keep the threads together. Copied from hcho3/xgboost-json-schema#3.

Before I start working on it, please consider deprecating the current way we save the model. From the specified schema, the JSON file largely mirrors what is saved in the current binary format. For example, the draft specifies something like the following for the learner:

```
{
"learner_model_param" : LearnerModelParam,
"predictor_param" : PredictorParam,
"name_obj" : string,
"name_gbm" : string,
"gbm" : GBM,
"attributes" : StringKeyValuePairCollection,
"eval_metrics" : [ array of string ],
"count_poisson_max_delta_step" : floating-point
}
```

Here the draft specifies that we save each component's parameters directly as fields of the learner. What I have in mind instead is something like:

```
{
// This belongs to learner, hence handled by learner
"LearnerTrainParam": { LearnerTrainParam },
// No `LearnerModelParameter`, we don't need it since JSON can save complete model.
"predictor" : "gpu_predictor",
"gpu_predictor" : {
"GPUPredictionParam": { ... }
},
"gbm" : "gbtree",
"gbtree" : { GBM },
// This can also be an array, I won't argue which one is better.
"eval_metrics" : {
"merror": {...},
"mae": {...}
}
}
```

Update: Actually …

For the actual IO, `Learner::Load` could look roughly like this:

```cpp
void Learner::Load(KVStore& kv_store) {
std::string predictor_name = kv_store["predictor"].ToString(); // say "gpu_predictor" or "cpu_predictor"
auto p_predictor = Predictor::Create(predictor_name);
p_predictor->Load(kv_store[predictor_name]);
// (.. other io ...)
KVStore& eval_metrics = kv_store["eval_metrics"];
std::vector<Metric> metrics (eval_metrics.ToMap().size());
for (auto& m : metrics) {
    m.Load(eval_metrics);
}
}
```

Inside each metric (say `mae`), `Load` would look itself up by name:

```cpp
void Load(KVStore& kv_store) {
KVStore self = kv_store["mae"]; // Look up itself by name.
// load parameters from `self`
// load other stuffs if needed.
}
```

Motivation
The reasons I want to do it in this way are: …
The most important one is (2).

What should be saved
As pointed out by @hcho3, we should draft a list of what goes into the final dump file. I will start working on it once these points are approved by the participants.

Possible objections
…
@hcho3 @KOLANICH @tqchen @thvasilo @RAMitchell Can you take a look if time allows?
@trivialfis I cast my vote +1 for your proposal. I think it is possible to use …
@hcho3 Let me give drafting the schema a try. :)
@LeZhengThu No, this thread is just a proposal. It will be a while until this feature is implemented.
@hcho3 Thanks. Hope this feature can go live soon. I like the idea very much.
@hcho3 The last time we talked about JSON, it was suggested that …
One issue that might arise with JSON serialization is the treatment of serialized float values. Presumably XGBoost will be aware that all decimal values in the JSON should be cast as floats; however, I believe most typical JSON parsers will treat the values as doubles, leading to discrepancies for downstream users who don't implement a post-parsing step to cast doubles to floats in the resulting object. As @hcho3 shows, it is possible to do the round-trip using … As @thvasilo points out, any downstream users like SHAP by @slundberg and Treelite will need to be aware of this treatment and implement a post-parsing step to cast any doubles to floats (and then treat them as floats in any further calculations, i.e. …). As this is my first comment in this issue, I hope it is helpful!
2 additional digits are not needed to guarantee that casting the decimal representation will result in the same float, see dmlc#3980 (comment)
@ras44 The JSON schema clearly says that all floating-point values are 32-bit: https://xgboost-json-schema.readthedocs.io/en/latest/#notations
2 additional digits are not needed to guarantee that casting the decimal representation will result in the same float, see #3980 (comment)
@hcho3 Thanks for your note. My main point is that if the JSON dump is to be used by anything other than XGBoost (which will presumably need to implement a custom JSON parser that encodes/decodes floats according to the schema), the user will have to consider multiple aspects in order to produce the same results as XGBoost. It appears this is already happening in issues such as #3960, #4060, #4097, and #4429. I am providing PR #4439 explaining some of the issues and considerations in case it could be of use for people who aim to work with a JSON dump.
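A small, self-contained sketch of the round-trip behavior discussed above (not XGBoost code): serialize a `float` with enough decimal digits for an exact round-trip, parse it back as a `double` the way a typical JSON parser would, then cast back to `float`:

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <limits>

int main() {
  const float original = 0.1234567f;  // a value as XGBoost would hold it internally

  // Serialize with max_digits10 (9 for float) significant digits so that
  // parsing the decimal text back cannot lose information.
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%.*g",
                std::numeric_limits<float>::max_digits10, original);

  // A typical JSON parser hands numbers back as double ...
  const double parsed = std::strtod(buf, nullptr);

  // ... so downstream consumers must cast back to float to match XGBoost bit-for-bit.
  const float recovered = static_cast<float>(parsed);
  assert(recovered == original);
  return 0;
}
```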
@hcho3 I will go ahead and create an initial draft implementation based on the schema.
I will try to use dmlc::JSON, but add abstractions whenever needed.

Thanks for the update. Sorry I haven't had a chance to tackle the implementation yet. Let me know if you need any help or feedback.
I want to revisit some of this discussion around using external dependencies as our JSON engine. After some discussion with @trivialfis and looking at #5046, it seems clear to me that a simple, partial JSON implementation is possible in dmlc-core or XGBoost, but not one that is feature-complete or has good performance. We are looking at thousands of lines of code for a custom implementation with good enough performance. Achieving better performance than the naive implementation may turn out to be critical, as our distributed implementations serialize during training. I am normally strongly against dependencies, but in this case using a header-only JSON dependency may solve more problems than it creates.
I'm inclined toward a customized JSON writer/reader as long as it is well tested; introducing a dependency out of our control could be a liability, with many unused features. Having JSON could also help simplify the rabit recovery protocol: a failed worker can recover from a flexibly sized key/value JSON payload from an adjacent host.
nlohmann/JSON is not especially performant.

@KOLANICH Indeed, it has a nice interface, but neither its memory usage nor its computation is efficient.
Closing, as experimental support with a schema is now mainlined. Further improvements will come as separate PRs.
RFC: JSON as Next-Generation Model Serialization Format
In this RFC document, I propose that the XGBoost project eventually migrate to JSON for serializing decision tree models.
Special thanks to @trivialfis for initially coming up with the idea of using JSON.
Motivation: Why a new serialization format?
The current method of serialization used in XGBoost is a binary dump of model parameters. For instance, the decision tree class (`RegTree`) is serialized by writing its parameter struct and node arrays directly to an output stream, roughly as in the sketch shown below.

The binary serialization method has a few benefits: it is straightforward to implement, has low memory overhead (since we can read from and write to file streams directly), is fast to run (no pre-processing required), and produces compact model binaries.
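A minimal sketch of what such a binary dump looks like, assuming a `dmlc::Stream`-style interface; the member names and stand-in structs here are illustrative, not the exact XGBoost source:

```cpp
#include <cstddef>
#include <vector>
#include <dmlc/io.h>

// Illustrative stand-ins for the real XGBoost types.
struct TreeParam { int num_nodes; /* ... */ };
struct Node      { /* split feature index, split condition, child ids, ... */ };
struct NodeStat  { /* per-node statistics, discussed below */ };

class RegTree {
 public:
  void Save(dmlc::Stream* fo) const {
    fo->Write(&param_, sizeof(TreeParam));                       // fixed-size header
    fo->Write(nodes_.data(), sizeof(Node) * nodes_.size());      // raw node array
    fo->Write(stats_.data(), sizeof(NodeStat) * stats_.size());  // raw per-node statistics
  }
  void Load(dmlc::Stream* fi) {
    fi->Read(&param_, sizeof(TreeParam));
    nodes_.resize(param_.num_nodes);
    stats_.resize(param_.num_nodes);
    fi->Read(nodes_.data(), sizeof(Node) * nodes_.size());       // expects exactly sizeof(Node) * M bytes
    fi->Read(stats_.data(), sizeof(NodeStat) * stats_.size());   // expects exactly sizeof(NodeStat) * M bytes
  }

 private:
  TreeParam param_;
  std::vector<Node> nodes_;
  std::vector<NodeStat> stats_;
};
```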
Unfortunately, the method has one significant shortcoming: it is not possible to add new fields without breaking backward compatibility. Backward compatibility refers to the ability of a new version of XGBoost to read model files produced by older versions of XGBoost. To see why we can't add new fields, let's refer to the decision tree class example shown above. The current `NodeStat` class has four fields:
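A minimal sketch of those four fields, assuming 32-bit floats and integers; the exact declaration in the XGBoost source may differ:

```cpp
struct NodeStat {
  float loss_chg;       // change in loss caused by the split at this node
  float sum_hess;       // sum of hessian values of the instances in this node
  float base_weight;    // weight assigned to the node
  int leaf_child_cnt;   // number of leaf children
};  // four 4-byte fields
```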
The total byte size of `NodeStat` is 16 bytes. Now imagine a fictitious scenario where we want to add a new field called `instance_cnt`. This field would store the number of instances (data points) that are associated with the node:
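A sketch of the extended struct; a 64-bit count is assumed here so that the total comes to the 24 bytes mentioned below:

```cpp
#include <cstdint>

struct NodeStat {
  float loss_chg;
  float sum_hess;
  float base_weight;
  int leaf_child_cnt;
  std::int64_t instance_cnt;  // NEW: number of instances associated with this node
};  // 16 + 8 = 24 bytes
```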
Note that the new version of `NodeStat` is now 24 bytes. We have already broken backward compatibility: when the latest version of XGBoost runs the loading snippet above, it will attempt to read `24 * M` bytes, where `M` is the number of nodes in the decision tree. However, if the saved model was serialized by an old version of XGBoost, there would be only `16 * M` bytes to read in the serialized file! The program will either crash or show some kind of undefined behavior.

What would be the work-around? We can add extra logic in the snippet above to check which version of XGBoost produced the serialized file:
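A rough sketch of what such version-dependent reading logic might look like, continuing the illustrative `RegTree` sketch above; the flag and helper names are hypothetical, especially since (as noted below) current XGBoost does not even store a version number:

```cpp
// Hypothetical work-around: branch on the detected file format when reading
// the per-node statistics.
void RegTree::LoadStats(dmlc::Stream* fi, bool file_has_instance_cnt) {
  stats_.resize(param_.num_nodes);
  if (!file_has_instance_cnt) {
    // Old 16-byte layout, without instance_cnt.
    struct OldNodeStat { float loss_chg, sum_hess, base_weight; int leaf_child_cnt; };
    std::vector<OldNodeStat> old_stats(param_.num_nodes);
    fi->Read(old_stats.data(), sizeof(OldNodeStat) * old_stats.size());
    for (std::size_t i = 0; i < old_stats.size(); ++i) {
      stats_[i].loss_chg       = old_stats[i].loss_chg;
      stats_[i].sum_hess       = old_stats[i].sum_hess;
      stats_[i].base_weight    = old_stats[i].base_weight;
      stats_[i].leaf_child_cnt = old_stats[i].leaf_child_cnt;
      stats_[i].instance_cnt   = -1;  // unknown in old files
    }
  } else {
    // New 24-byte layout, with instance_cnt.
    fi->Read(stats_.data(), sizeof(NodeStat) * stats_.size());
  }
}
```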
That's a lot of lines for reading a single C++ vector. In general, extra logic for backward compatibility can accumulate over time and inflict a maintenance burden on future contributors. To make matters worse, the current XGBoost codebase does not save the version number, so there is no way to query the version of the serialized file. The proposed work-around is not only messy but also not actually feasible.
Proposal: Use JSON serialization to ensure backward and forward compatibility
JSON (JavaScript Object Notation) is a lightweight serialization format built on two fundamental structures: JSON objects (a set of key-value pairs) and JSON arrays (an ordered list of values). In this section, I will argue that it is possible to use JSON as a future-proof method for ensuring backward and forward compatibility, defined as follows:

- Backward compatibility: a new version of XGBoost can read model files produced by older versions.
- Forward compatibility: an older version of XGBoost can read model files produced by newer versions.

We will make use of two useful characteristics of JSON objects: values are looked up by key, so a missing key can be detected reliably; and extra keys that are not recognized can simply be ignored.
Let's return to the `NodeStat` example from the previous section. Version 1 of `NodeStat` can be expressed as a JSON object as in the first snippet below (the values are made up as an example); version 2 would be something like the second snippet.
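Minimal sketches of the two versions; the field values are invented, as the text notes:

```json
{
  "loss_chg": 0.52,
  "sum_hess": 13.0,
  "base_weight": -0.1,
  "leaf_child_cnt": 2
}
```

```json
{
  "loss_chg": 0.52,
  "sum_hess": 13.0,
  "base_weight": -0.1,
  "leaf_child_cnt": 2,
  "instance_cnt": 240
}
```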
Let's first check if backward compatibility holds. The latest version of XGBoost will attempt to read the keys `loss_chg`, `sum_hess`, `base_weight`, `leaf_child_cnt`, and `instance_cnt` from the given JSON file. If the JSON file was produced by an old version, it will only have `loss_chg`, `sum_hess`, `base_weight`, and `leaf_child_cnt`. The JSON parser is able to reliably detect that `instance_cnt` is missing. This is much better than having the program crash upon encountering missing bytes, as the binary parser would. Upon detecting missing keys, we can mark the corresponding values as missing. For the `NodeStat` example, we'd put in -1 to indicate the missing value, since -1 cannot be a valid value for the instance count. For optional fields such as `instance_cnt`, this response is sufficient.

On the other hand, some fields are not so optional. For instance, a brand-new feature could be added that makes use of new key-value pairs. In this case, we will need to throw an error whenever the user attempts to make use of that particular feature. We are still doing better than before, however, since the JSON parser does not have to perform this kind of error handling. The JSON parser would simply mark missing key-value pairs as missing, and the later parts of the codebase that use the missing pairs may choose to report errors. We are providing what is known as graceful degradation, where users can bring old model files and still use basic functionality.
How about forward compatibility? The JSON file given by the later version of XGBoost will contain the extra key `instance_cnt`, which can simply be ignored by the older version. So forward compatibility holds as well.

Freed from future compatibility issues, developers will be able to focus on more interesting development work.
Semantic versioning
Every JSON file produced by XGBoost should have major and minor version fields at the root level:
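A minimal sketch of such root-level version fields; the exact field names below are illustrative, and the schema linked in Appendix A is authoritative:

```json
{
  "major_version": 1,
  "minor_version": 0,
  "...": "..."
}
```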
All versions that have the same major version are compatible with each other. For example, a program using format version 1.1 can read a model with any version of form 1.x. The minor version should be increased whenever a new field is added to a JSON object.
The major version should be kept to 1, unless a major revision occurs that breaks backward and/or forward compatibility. So version 2.x will not be compatible with 1.x.
See Appendix A for the full schema of Version 1.0.
Other benefits of JSON
Compatibility is the largest motivator for me to propose JSON as the next-generation serialization format. However, there are other benefits too. In particular, every parameter class (`dmlc::Parameter`) is represented neatly as a JSON object, since parameter objects are just collections of key-value pairs. We are now free to add more fields to DMLC parameter objects.
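As a quick illustration of why this mapping is natural, here is a typical `dmlc::Parameter` declaration (the struct and field names are made up for this example) and the kind of JSON object it corresponds to:

```cpp
#include <dmlc/parameter.h>

// A hypothetical parameter structure, declared the usual dmlc-core way.
// (Registration via DMLC_REGISTER_PARAMETER in a .cc file is omitted here.)
struct ExampleTrainParam : public dmlc::Parameter<ExampleTrainParam> {
  float learning_rate;
  int max_depth;
  DMLC_DECLARE_PARAMETER(ExampleTrainParam) {
    DMLC_DECLARE_FIELD(learning_rate).set_default(0.3f);
    DMLC_DECLARE_FIELD(max_depth).set_default(6);
  }
};

// Conceptually this maps one-to-one onto a JSON object such as
//   { "learning_rate": 0.3, "max_depth": 6 }
// and adding a new field to the struct simply adds a new key to the object.
```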
Implementation of the proposal with DMLC JSON parser
There are many excellent 3rd-party JSON libraries for C++, e.g. Tencent/RapidJSON. However, we will eschew 3rd-party libraries. As @tqchen pointed out, extra dependencies make it more difficult to port XGBoost to various platforms and targets. In addition, most JSON libraries assume free-form, schema-less objects, where JSON objects and arrays may nest each other arbitrarily (thanks @KOLANICH and @tqchen for pointing this out). This assumption adds unnecessary overhead, since our proposed XGBoost JSON format has a regular schema.
Fortunately, the DMLC-Core repository comes with a built-in JSON parser and serializer. What's more, the DMLC JSON parser (`dmlc::JSONReader`) lets us make assumptions about how JSON objects and arrays are to be nested. For instance, if we expect a JSON object that contains a single key-value pair with the key `foo` and the value an array of integers, we'd write something like the sketch below.

One tricky part is that the built-in helper class for reading structures will throw errors upon encountering missing or extra key-value pairs. I will write a custom helper class as part of the draft implementation of the RFC. The custom helper class will handle missing key-value pairs in a consistent way. Appendix B describes how the missing values are to be represented for each data type.
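A rough sketch of what that could look like with `dmlc::JSONReader` and the `dmlc::JSONObjectReadHelper` helper from dmlc-core; treat the exact calls as an assumption based on dmlc-core's `dmlc/json.h`, not as the final XGBoost implementation:

```cpp
#include <sstream>
#include <string>
#include <vector>
#include <dmlc/json.h>

int main() {
  std::string payload = R"({"foo": [1, 2, 3]})";
  std::istringstream is(payload);
  dmlc::JSONReader reader(&is);

  std::vector<int> foo;
  dmlc::JSONObjectReadHelper helper;
  helper.DeclareField("foo", &foo);  // expect the key "foo" holding an array of integers
  helper.ReadAllFields(&reader);     // errors out if "foo" is missing or an unknown key appears
  return 0;
}
```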
Addressing potential objections
Q. How about space efficiency? JSON will result in bigger model files than the current binary serialization method.
A. There are a few 3rd-party libraries (e.g. MessagePack) that let us encode JSON into a more compact binary form. We will consider adding a plug-in for generating the binary encoding, so that 3rd-party dependencies remain optional. There will still be some overhead for storing integers and floats, but the cost is in my opinion worthwhile; see the previous sections for the compatibility benefits of JSON.
Q. How about read/write performance? All the text operations will slow things down.
A. I admit that it is hard to beat the performance of binary serialization; how can you do better than simply dumping bit patterns from memory? We are really making a trade-off here, trading read/write speed for future extensibility. However, we are able to soften the blow by serializing decision trees in parallel; see microsoft/LightGBM#1083 for an example.
Q. Binary formats are not necessarily closed to future extension. With some care, it is possible to design a new binary format with future extensibility in mind.
A. It is certainly possible to design a binary format to enable future extensions. In fact, had the current version of XGBoost stored the version number in model artifacts, we could have added extra logic for handling multiple formats. However, I still think JSON is superior, for two reasons: 1) designing an extensible binary format takes great precision and care to get right; and 2) extra logic for managing multiple versions in a custom parser can be messy, as we saw in the first section.
Appendix A: Full 1.0 Schema
The full schema can be accessed at https://xgboost-json-schema.readthedocs.io/en/latest/. The source is hosted at https://github.com/hcho3/xgboost-json-schema.
Appendix B: Representation of missing values for various data types
""
nan
std::numeric_limits<int_type>::max()
null
(null object)The text was updated successfully, but these errors were encountered: