[WIP] Initial commit for JSON. #4683

trivialfis · 2019-07-19T11:53:26Z

Currently depends on #4577 . This is a monolithic PR for early review. I will try to split it up into 2 or 3 dependent PRs later once it's ready. I can create stacked PRs if pushing to origin is allowed. :-)

TODOs:

@hcho3 @RAMitchell Sorry for the long wait.

Notes to myself

leaf_child_cnt is not yet omitted.
num_boosted_rounds is not saved.

Merged parts

Initial implementation #4708
Slightly improved version #4739

RAMitchell

The JSON parsing code looks like it should go in dmlc-core. How hard is it for us to upstream it? I am open to having it in xgboost if it's too difficult, but we should at least consider it.

RAMitchell · 2019-07-19T23:44:12Z

doc/json-schema.rst

@@ -0,0 +1,1059 @@
+################################
+XGBoost JSON Schema, Version 1.0


What is the plan to maintain this? I can see this going out of date quickly without some kind of CI enforcement. This looks like we are exposing the complete serialisation format such that someone could construct a working xgboost model completely external to xgboost. This could be a very useful feature, but also with some cost to support. Note that up until now we have not provided a schema for our binary format, only tried to enforce backwards compatibility.

@RAMitchell We can automate the schema enforcement if we express our model schema using JSON Schema. There are tools available to validate JSON files against a given JSON schema.

hcho3 · 2019-07-20T00:38:11Z

@RAMitchell I agree that NIH (JSON parsing code) should eventually be merged into dmlc-core. For now, let's keep it in XGBoost and make sure it works.

hcho3 · 2019-07-20T19:13:48Z

> I can create stacked PRs if pushing to origin is allowed. :-)

Why don’t you create a new branch dev_json and push your code there?

trivialfis · 2019-07-22T08:16:00Z

@hcho3 Jenkins seems to return a 404 on first load. It works fine after a refresh. I have been seeing the error for a few days, so it may not be an isolated incident.

trams

Thank you for your tremendous work

I am very interested in this change. It would simplify writing converters from xgboost to other formats.

General question.
How close is your format to xgboost json dump?

include/xgboost/tree_model.h

src/c_api/c_api.cc

trams · 2019-07-23T00:11:39Z

src/gbm/gblinear_model.cc

+  for (size_t i = 0; i < n_weights; ++i) {
+    auto w = weight[i];
+    if (i != n_weights - 1) {
+      raw += std::to_string(w) + ',';


I am concerned that std::to_string is locale-dependent (https://en.cppreference.com/w/cpp/string/basic_string/to_string),
and some locales (Russian or Ukrainian, for example) use ',' as the decimal separator, which may break the output.

What do you think about using a locale-independent function? (If we can use C++17, we should consider std::to_chars.)

Also, as far as I understand the documentation, std::to_string formats floating-point values the way fprintf (https://en.cppreference.com/w/cpp/io/c/fprintf) does with %f, which by default prints 6 digits after the decimal point. That is not always enough to reconstruct the same number.
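To make the precision concern concrete, here is a minimal sketch (not code from this PR). It relies only on the documented behaviour that std::to_string formats a float as if by "%f". The helper name RoundTripsViaToString is hypothetical, and the check assumes the default "C" locale is active.

```cpp
#include <string>

// Hypothetical helper: returns true if `value` is unchanged after being
// written with std::to_string and read back with std::stof.
inline bool RoundTripsViaToString(float value) {
  // std::to_string uses "%f" formatting: six digits after the decimal point.
  std::string const text = std::to_string(value);
  return std::stof(text) == value;
}
```

Small magnitudes are the obvious failure case: 1e-7f is written as "0.000000" and reads back as zero, while a value such as 0.5f survives.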

Good point. I will look into std::to_chars, thanks for the suggestion.

Implementing std::{from, to}_chars for integers is easy, but floating-point numbers are messy. I think I'll just use std::locale to work around it instead.

In the worst case, we can set the locale at the start of the conversion.

I am more concerned that we should output at least std::numeric_limits<float>::max_digits10 digits so that we can reconstruct the same float, and I do not believe the default does this (I think it outputs only 6 digits instead of 9).
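One C++11-compatible way to address both points (locale and precision) is sketched below. This is an illustration under assumptions, not the approach the PR ends up taking: the helper names FloatToString and StringToFloat are hypothetical, and the streams are imbued with std::locale::classic() so that the decimal separator is always '.'.

```cpp
#include <iomanip>
#include <limits>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: writes a float with the classic "C" locale and
// max_digits10 significant digits, which is enough for an exact round trip.
inline std::string FloatToString(float value) {
  std::ostringstream out;
  out.imbue(std::locale::classic());
  out << std::setprecision(std::numeric_limits<float>::max_digits10) << value;
  return out.str();
}

// Hypothetical helper: reads the value back, again with the classic locale.
inline float StringToFloat(std::string const& text) {
  std::istringstream in(text);
  in.imbue(std::locale::classic());
  float value = 0.0f;
  in >> value;
  return value;
}
```

Imbuing each stream avoids touching the global locale, at the cost of constructing a stream per conversion, so it is probably too slow to call per value in a hot loop without reusing the stream.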

This might be useful, but I have not done C++ for a few years, so it might be outdated: https://www.boost.org/doc/libs/1_59_0/libs/locale/doc/html/std_locales.html

@trivialfis FYI, fmtlib is becoming part of C++20: http://www.zverovich.net/2019/07/23/std-format-cpp20.html

@hcho3 Good to know.

BTW, as a general rule what version of a standard are we using for xgboost?
C++14?

@trams We adhere to C++11 standard. See https://xgboost.readthedocs.io/en/latest/contrib/coding_guide.html#c-coding-guideline

src/tree/tree_model.cc

trivialfis · 2019-07-23T02:18:58Z

> How close is your format to xgboost json dump?

Not close. The dump prints only the trees, but with a much more readable structure, while this format dumps everything in a much more compact way for serialization.

hcho3 · 2019-07-23T03:33:02Z

@trivialfis It looks like the result page is inaccessible to anonymous (not logged-in) users. I got the same 404 when using Incognito Browsing Mode. Let me take a look.

Update: doing a refresh removes the 404. Investigating now.

trams

I rather like this pull request. I have several minor comments

trams · 2019-07-25T17:36:30Z

src/common/common.h

+  size_t i = 0;
+  auto constexpr u0 = static_cast<unsigned char>('0');
+  auto constexpr u10 = static_cast<unsigned char>(10);
+  while (i < size && std::isspace(str[i]) && str[i] != '\0') { ++i; }


Note that std::isspace('\0') is already false (see the bottom of this page https://en.cppreference.com/w/cpp/string/byte/isdigit), so the extra check str[i] != '\0' is redundant.

trams · 2019-07-25T17:38:07Z

src/common/common.h

+  auto constexpr u0 = static_cast<unsigned char>('0');
+  auto constexpr u10 = static_cast<unsigned char>(10);
+  while (i < size && std::isspace(str[i]) && str[i] != '\0') { ++i; }
+  for (;  i < size && std::isdigit(str[i]); ++i) {


According to this https://en.cppreference.com/w/cpp/string/byte/isdigit the standard requires the static cast to unsigned char. I do not understand exactly why...
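For context on the "why": the behaviour of std::isdigit is undefined when its argument is neither EOF nor representable as unsigned char. Plain char may be signed, so a byte such as 0xE9 can arrive as a negative value. A minimal sketch of a safe wrapper follows; the name IsDigit is hypothetical and is not taken from the PR.

```cpp
#include <cctype>

// Hypothetical wrapper: converts to unsigned char before calling
// std::isdigit so that bytes >= 0x80 never reach it as negative ints.
inline bool IsDigit(char c) {
  return std::isdigit(static_cast<unsigned char>(c)) != 0;
}
```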

trams · 2019-07-25T17:39:00Z

src/common/common.h

+  auto const size = str.size();
+  size_t i = 0;
+  auto constexpr u0 = static_cast<unsigned char>('0');
+  auto constexpr u10 = static_cast<unsigned char>(10);


[Minor] I think it is a bit confusing to have two similar looking variables u0 and u10 which represent very different things. I would just use literal 10 in the formula
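Putting the three remarks on this function together, the loop could look roughly like the sketch below. This is only an illustration: the function name ParseUnsigned and the std::uint64_t return type are assumptions, since the full signature is not visible in the hunk, and overflow is not handled.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical sketch of the parsing loop discussed above.
inline std::uint64_t ParseUnsigned(std::string const& str) {
  auto const size = str.size();
  std::size_t i = 0;
  // Skip leading whitespace. No separate '\0' check: std::isspace('\0') is 0.
  while (i < size && std::isspace(static_cast<unsigned char>(str[i]))) { ++i; }
  std::uint64_t result = 0;
  // Accumulate decimal digits, using the literal 10 as the base.
  for (; i < size && std::isdigit(static_cast<unsigned char>(str[i])); ++i) {
    result = result * 10 + static_cast<std::uint64_t>(str[i] - '0');
  }
  return result;
}
```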

src/common/json.cc

trams · 2019-07-25T19:27:23Z

src/gbm/gbtree.cc

+    while (true) {
+      char ch = GetNextChar();
+      while (ch != ']' && ch != -1) { ch = GetNextChar(); }
+      ch = GetNextNonSpaceChar();


As far as I understand, after line 622 we would have
ch == ']' or ch == -1
The first one is not whitespace. The second case indicates that the stream has reached its end.
Either way, we do not need to skip whitespace characters. Or am I missing something here?
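The postcondition of the inner loop can be checked in isolation with a small stand-in reader. Everything below is hypothetical (the class StringReader and the method SkipToClosingBracket are not from the PR); it only mirrors the loop shape from the hunk.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical minimal reader over an in-memory string.
class StringReader {
 public:
  explicit StringReader(std::string text) : text_(std::move(text)) {}

  // Returns the next character, or -1 once the input is exhausted.
  int GetNextChar() {
    if (pos_ >= text_.size()) { return -1; }
    return static_cast<unsigned char>(text_[pos_++]);
  }

  // Consumes input up to and including the next ']'.
  // On return the result is either ']' or -1.
  int SkipToClosingBracket() {
    int ch = GetNextChar();
    while (ch != ']' && ch != -1) { ch = GetNextChar(); }
    return ch;
  }

 private:
  std::string text_;
  std::size_t pos_{0};
};
```

The sketch stores the character in an int rather than a char, so the -1 sentinel stays distinguishable on platforms where plain char is unsigned. Whether a following read is still needed depends on whether the caller wants the character after ']', which this sketch does not decide.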

trams · 2019-07-25T19:40:45Z

src/tree/tree_model.cc

+    AppendFloat(s.loss_chg, false);
+    AppendFloat(s.sum_hess, false);
+    AppendFloat(s.base_weight, false);
+    AppendInt(s.leaf_child_cnt, true);;


Typo: two semicolons

trams · 2019-07-25T19:45:34Z

src/tree/tree_model.cc

+
+  std::string tree_raw;
+  tree_raw.reserve(
+      stats_.size() * std::numeric_limits<Number::Float>::max_digits10 * 2);


I am trying to figure out why you reserve this amount of memory.
It seems each stat has 3 floats and one integer so I would expect that the total length needed is

stats_.size() * ( 2 + float_length*3 + int_length + 3) + 2
2 is []
3 are commas
float_length = std::numeric_limits<Number::Float>::max_digits10 + 1
int_length ...
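The estimate above could be factored into a small helper, sketched here under assumptions: the name EstimateStatsCapacity is hypothetical, Number::Float is assumed to be float, and int_length is left as a parameter because the comment does not specify it.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical helper following the estimate above:
//   n_stats * (2 + float_length * 3 + int_length + 3) + 2
inline std::size_t EstimateStatsCapacity(std::size_t n_stats,
                                         std::size_t int_length) {
  // Assumption: Number::Float is float.
  constexpr std::size_t kFloatLength =
      std::numeric_limits<float>::max_digits10 + 1;
  constexpr std::size_t kBrackets = 2;  // '[' and ']'
  constexpr std::size_t kCommas = 3;    // separators within one stat
  std::size_t const per_stat =
      kBrackets + kFloatLength * 3 + int_length + kCommas;
  return n_stats * per_stat + kBrackets;
}
```

Because std::string::reserve is only a capacity hint, an underestimate (for example, for floats printed with a sign and an exponent) costs a reallocation but does not affect correctness.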

trams · 2019-07-25T19:48:00Z

src/tree/tree_model.cc

+  out["stats"] = JsonRaw(std::move(tree_raw));
+
+  tree_raw.clear();
+  tree_raw.reserve(nodes_.size() * 9 * 2);


I guess 9 comes from the number of digits one needs to save when serializing a float. Why do we have 2 here? We have 5 ints and 1 float below.

I see that the logic for preallocating buffers for arrays is quite complex, and I think this is not the only place we have it. What do you think about extracting it into separate functions?

It is not critical. It is more of an optimization.

tests/cpp/common/test_json.cc

trivialfis · 2019-07-26T04:26:43Z

@trams Thanks for the review. I'm no expert with string manipulations so your suggestions are really helpful.

trivialfis · 2019-08-04T13:45:19Z

Extracting implementation to #4732 for easier review.

trivialfis · 2019-08-12T15:57:30Z

Sorry folks, this is no longer blocking 1.0 as I think it's necessary to improve dmlc::Parameter first before finishing JSON integration. See #4755 . I want to do it right so it might take some more time. Please understand.

RAMitchell reviewed Jul 19, 2019

CodingCat mentioned this pull request Jul 20, 2019

[Roadmap] XGBoost 1.0.0 Roadmap #4680

Closed

9 tasks

trivialfis force-pushed the json-rebase branch from 41b0bc2 to bcf43df Compare July 20, 2019 13:12

trivialfis force-pushed the json-rebase branch from c5bc334 to cc28f33 Compare July 22, 2019 08:03

trivialfis added the 1.0.0 label Jul 22, 2019

trams reviewed Jul 23, 2019

trivialfis force-pushed the json-rebase branch from fcc9386 to 2b087ce Compare July 25, 2019 06:50

trivialfis mentioned this pull request Jul 25, 2019

A simple Json implementation for future use. #4708

Merged

trams reviewed Jul 25, 2019

trivialfis force-pushed the json-rebase branch from fc37c9f to 5e05a3e Compare July 30, 2019 02:50

trivialfis added 2 commits August 4, 2019 07:03

Initial commit for JSON.

4760394

Add integer, serializable, CMake version.

317e124

trivialfis force-pushed the json-rebase branch from c601456 to 317e124 Compare August 4, 2019 11:30

trivialfis mentioned this pull request Aug 4, 2019

Add JSON IO for various components. #4732

Closed

24 tasks

trivialfis mentioned this pull request Aug 6, 2019

Add Json integer, remove specialization. #4739

Merged

trivialfis removed the 1.0.0 label Aug 12, 2019

trivialfis closed this Dec 21, 2019

trivialfis deleted the json-rebase branch December 23, 2019 12:10

lock bot locked as resolved and limited conversation to collaborators Mar 22, 2020
