-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Initial commit for JSON. #4683
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The JSON parsing code looks like it should go in dmlc-core. How hard is it for us to upstream it? I am open to having it in xgboost if it's too difficult, but we should at least consider it.
@@ -0,0 +1,1059 @@ | |||
################################ | |||
XGBoost JSON Schema, Version 1.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the plan to maintain this? I can see this going out of date quickly without some kind of CI enforcement. This looks like we are exposing the complete serialisation format such that someone could construct a working xgboost model completely external to xgboost. This could be a very useful feature, but also with some cost to support. Note that up until now we have not provided a schema for our binary format, only tried to enforce backwards compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RAMitchell We can automate the schema enforcement if we express our model schema using JSON Schema. There are tools available to validate JSON files against a given JSON schema.
@RAMitchell I agree that NIH (JSON parsing code) should eventually be merged into dmlc-core. For now, let's keep it in XGBoost and make sure it works. |
Why don’t you create a new branch dev_json and push your code there? |
@hcho3 Jenkins seems to return a 404 at first load. It works fine after a refresh. I have been experiencing the error for a few days so it might not be an incident. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your tremendous work
I am very interested in this change. It would simply writing converters of xgboost to other formats.
General question.
How close is your format to xgboost json dump?
src/gbm/gblinear_model.cc
Outdated
for (size_t i = 0; i < n_weights; ++i) { | ||
auto w = weight[i]; | ||
if (i != n_weights - 1) { | ||
raw += std::to_string(w) + ','; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am concerned that std::to_string is locale dependent (https://en.cppreference.com/w/cpp/string/basic_string/to_string)
and in some locales (russian or ukrainian) they use ',' as a floating point which may break the output.
What do you think about using a locale independent function (if we can use C++17 we should consider std::to_chars)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also as far as I understood documentation std::to_string just calls fprintf (https://en.cppreference.com/w/cpp/io/c/fprintf) which by default outputs floating point numbers with precision 6 which is not always enough to reconstruct the same number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I will look into std::to_chars
, thanks for the suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementing std::{from, to}_chars
for integer is easy, but floating point number is messy. I think I'm just gonna use std::locale
to work around it instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think worst case we can set locale at the top of conversion.
I am more concern that we should output at least numeric_limit::max_digits10 digits so we can reconstruct the same float and I do not believe that the default does this (I think it outputs only 6 instead of 9 digits)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be useful but I've not done C++ for a few year and might be outdated: https://www.boost.org/doc/libs/1_59_0/libs/locale/doc/html/std_locales.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@trivialfis FYI, fmtlib is becoming part of C++20: http://www.zverovich.net/2019/07/23/std-format-cpp20.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hcho3 Good to know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, as a general rule what version of a standard are we using for xgboost?
C++14?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@trams We adhere to C++11 standard. See https://xgboost.readthedocs.io/en/latest/contrib/coding_guide.html#c-coding-guideline
Not close. The dump prints only trees, but prints with much more readable structure. While this one dumps everything, but in a much more compact way for serialization. |
@trivialfis It looks like the result page is inaccessible from anonymous (not logged-in) users. I got the same 404 when using Incognito Browsing Mode. Let me take a look. Update. Doing a refresh removes 404. Investigating now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rather like this pull request. I have several minor comments
size_t i = 0; | ||
auto constexpr u0 = static_cast<unsigned char>('0'); | ||
auto constexpr u10 = static_cast<unsigned char>(10); | ||
while (i < size && std::isspace(str[i]) && str[i] != '\0') { ++i; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that std::isspace('\0') is already false (see the bottom of this page https://en.cppreference.com/w/cpp/string/byte/isdigit) so the extra check std[i] != '\0' is excessive
auto constexpr u0 = static_cast<unsigned char>('0'); | ||
auto constexpr u10 = static_cast<unsigned char>(10); | ||
while (i < size && std::isspace(str[i]) && str[i] != '\0') { ++i; } | ||
for (; i < size && std::isdigit(str[i]); ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to this https://en.cppreference.com/w/cpp/string/byte/isdigit the standard requires the static cast to unsigned char. I do not understand exactly why...
auto const size = str.size(); | ||
size_t i = 0; | ||
auto constexpr u0 = static_cast<unsigned char>('0'); | ||
auto constexpr u10 = static_cast<unsigned char>(10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Minor] I think it is a bit confusing to have two similar looking variables u0 and u10 which represent very different things. I would just use literal 10 in the formula
while (true) { | ||
char ch = GetNextChar(); | ||
while (ch != ']' && ch != -1) { ch = GetNextChar(); } | ||
ch = GetNextNonSpaceChar(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand after line 622 we would have
ch == ']' or ch == -1
the first one is non white space. The second case indicates that the stream has been at the end.
Either way we do not need to skip non space chars. Or do I miss something here?
AppendFloat(s.loss_chg, false); | ||
AppendFloat(s.sum_hess, false); | ||
AppendFloat(s.base_weight, false); | ||
AppendInt(s.leaf_child_cnt, true);; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: two semicolons
|
||
std::string tree_raw; | ||
tree_raw.reserve( | ||
stats_.size() * std::numeric_limits<Number::Float>::max_digits10 * 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to figure out why you reserve this amount of memory.
It seems each stat has 3 floats and one integer so I would expect that the total length needed is
stats_.size() * ( 2 + float_length*3 + int_length + 3) + 2
2 is []
3 are commas
float_length = std::numeric_limitsNumber::Float::max_digits10 + 1
int_length ...
out["stats"] = JsonRaw(std::move(tree_raw)); | ||
|
||
tree_raw.clear(); | ||
tree_raw.reserve(nodes_.size() * 9 * 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess 9 comes from amount of digits one need to save while serializing float. Why do we have 2 here? We have 5 ints and 1 float below
I see that the logic for preallocating buffers for arrays is quite complex and I think this is not the only place we have it. What do you think about extracting this into separate functions?
It is not very critical. It is rather the optimization.
@trams Thanks for the review. I'm no expert with string manipulations so your suggestions are really helpful. |
Extracting implementation to #4732 for easier review. |
Sorry folks, this is no longer blocking 1.0 as I think it's necessary to improve |
Currently depends on #4577 . This is a monolithic PR for early review. I will try to split it up into 2 or 3 dependent PRs later once it's ready. I can create stacked PRs if pushing to origin is allowed. :-)
TODOs:
default_left
is not saved.@hcho3 @RAMitchell Sorry for the long wait.
Notes to myself
leaf_child_cnt
is not yet omitted.num_boosted_rounds
is not saved.Merged parts
Initial implementation #4708
Slightly improved version #4739