
[WIP] Optimize the JSON implementation. #5046

Closed
wants to merge 6 commits

Conversation

@trivialfis (Member) commented Nov 18, 2019

I reimplemented JSON in XGBoost for better performance. The new version handles numeric-heavy data at about 200 MB per second for both generating and reading (they share the same code, so the perf difference should be negligible). For manually generated large (~1 GB) data it's still about 15% slower than rapidjson due to less efficient memory management (they have a pool memory allocator while I just use a vector). But it should be more than enough for XGBoost IO, as the generated file is normally smaller than 30 MB, which can be processed almost instantly.

Per an offline conversation with @RAMitchell, a fast JSON implementation is non-trivial; one of the critical parts is floating-point IO. For printing floating-point values I adopted Ryū from https://github.com/ulfjack/ryu with some refactoring and added comments. It's blazing fast but also complicated. Most other existing ports are done line by line (see the readme in that repo). So this PR is asking for comments: do we want to proceed with the JSON IO idea?

The to_chars interface is adopted from C++17.
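For reference, this is the rough shape of the standard `std::to_chars` facility being mirrored (a minimal sketch of the C++17 interface, not code from this PR; note that some standard libraries shipped the floating-point overloads well after C++17):

```cpp
#include <charconv>  // std::to_chars, std::to_chars_result
#include <cstdio>

int main() {
  char buf[32];
  // Writes the shortest decimal form that parses back to exactly
  // the same double; no null terminator is appended.
  std::to_chars_result res = std::to_chars(buf, buf + sizeof(buf), 0.25);
  *res.ptr = '\0';  // res.ptr points one past the last written character
  std::printf("%s\n", buf);  // prints "0.25"
  return 0;
}
```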

I haven't removed the old implementation yet. I will merge this into the model IO PR and close this one if the new implementation is approved.

@trams This is the optimization PR.

@RAMitchell (Member) left a comment

2,700 lines is a lot of code. If you are not available, someone else may have to understand and debug this. Let's revisit the idea of using rapidjson in #3980.

@trivialfis (Member, Author)

Just to clarify, the new one will replace the old one, so the net gain will be about 700 lines. (Not to scare others.) ;-)

@trams (Contributor) commented Nov 20, 2019

I still do not quite get why we need a fast JSON implementation. Loading models fast is nice, but I can't imagine a use case where it matters: you load a model once.

Either way, this review is really big (at least for me). It is 2,000 lines of additional code, regardless of whether we remove the previous implementation.

I briefly looked into it, and I think one thing that is missing is benchmarks. I understand why: writing a proper benchmark is extremely hard. If you decide to do it, I suggest taking a look at this amazing (eye-opening for me) talk from StrangeLoop 2019: https://www.thestrangeloop.com/2019/performance-matters.html. The speaker talks about a set of tools for doing statistically sound benchmarks.

P.S. Writing a fast JSON parser is fun.
P.P.S. I do not think I will have enough time to look into this until mid-December.

@trivialfis (Member, Author) commented Nov 20, 2019

Either way, this review is really big (at least for me). It is 2,000 lines of additional code, regardless of whether we remove the previous implementation.

I understand the concern. This PR is mostly for seeing whether others want to add a dependency or do it ourselves; it's easier to draw others' attention with actual code.

I briefly looked into it, and I think one thing that is missing is benchmarks.

I have some specific benchmark snippets for loading/writing XGBoost models that compare a few different implementations, including rapidjson, JSON for Modern C++, simdjson (parsing only, as it has no generation), the one currently living in XGBoost, and this new one. A small summary is given in the PR description. I will try to make a nice table and a repo so everyone can see. The floating-point IO implemented in this PR also helps model dumping (plotting models).
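As an illustration only, the snippets amount to a throughput harness like the sketch below (the `ParseJson` stub is a hypothetical stand-in for whichever parser is being measured, and the timing deliberately ignores warm-up and variance, which is exactly the kind of issue the talk below addresses):

```cpp
#include <chrono>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical stand-in: replace the body with a call into the
// parser under test (rapidjson, this PR's parser, etc.).
std::size_t ParseJson(std::string const& buffer) {
  std::size_t nodes = 0;
  for (char c : buffer) {
    if (c == '{' || c == '[') { ++nodes; }  // trivial placeholder work
  }
  return nodes;
}

int main(int argc, char** argv) {
  if (argc != 2) {
    std::cerr << "usage: bench <file.json>\n";
    return 1;
  }
  // Read the whole document into memory so we measure parsing, not disk IO.
  std::ifstream fin(argv[1], std::ios::binary);
  std::ostringstream ss;
  ss << fin.rdbuf();
  std::string buffer = ss.str();

  auto start = std::chrono::steady_clock::now();
  std::size_t nodes = ParseJson(buffer);
  auto stop = std::chrono::steady_clock::now();

  double sec = std::chrono::duration<double>(stop - start).count();
  double mb = static_cast<double>(buffer.size()) / (1024.0 * 1024.0);
  std::cout << nodes << " nodes, " << mb / sec << " MB/s\n";
  return 0;
}
```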

The speaker talks about a set of tools for doing statistically sound benchmarks.

That's interesting; I just use valgrind. I will try the tools introduced there.

@hcho3 (Collaborator) commented Nov 20, 2019

@trams

I still do not quite get why we need a fast JSON implementation. Loading models fast is nice, but I can't imagine a use case where it matters: you load a model once.

Not quite. According to @RAMitchell, model serialization may become a bottleneck in distributed training. See #3980 (comment)

@trivialfis force-pushed the json-new-impl branch 2 times, most recently from e394810 to edce0c6 on Nov 21, 2019, 16:55
@trivialfis (Member, Author) commented Nov 21, 2019

@hcho3 The PR now removes the old implementation. I will do some further clean-ups, benchmarking, and unit tests. But it would be a good time to discuss the road map for our IO issues.

Beyond the previous RFCs, using JSON for model IO has other benefits:

  • In a distributed setting we have an easier way to visualize the model, as suggested by @chenqin.
  • It allows the easiest external use of a trained model: anyone can load the model in basically any language.
  • We can add an API to extract the internal configuration and load it back. This not only helps users gain insight into their parameters, but also lets them pass a JSON object as an alternative to our old params = {...} training arguments (see the sketch after this list). To me this is very attractive, as structured input is more intuitive and robust.
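To make the last point concrete, such a configuration object could look something like this (a hypothetical layout; the actual field names and nesting would be decided in the model IO work, not by this PR):

```json
{
  "learner": {
    "objective": "binary:logistic",
    "gradient_booster": {
      "name": "gbtree",
      "tree_train_param": {
        "max_depth": 6,
        "eta": 0.3
      }
    }
  }
}
```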

So I prefer that we keep the JSON IO idea. The question is whether we want to do it ourselves or take on an external dependency. With the former, everyone here might have to co-maintain this piece of code even if we upstream it to dmlc-core, as stated by @RAMitchell and in the spirit of @hcho3's bus factor. (And there are lots of buses in my place, so ...)

I prefer my own implementation, because:

  • I understand it, and I can fix any issue with it quickly. The most suitable external implementation I can find is rapidjson, which is 5 times more code than this one. Code not living in this repository doesn't mean it's code we don't need to care about.
  • It's fast. Not as fast as rapidjson (but close), as summarized in the PR description. So far, according to valgrind, memory allocation/free is the bottleneck, which could be fixed with an in-place pool allocator. But even without further optimization it's still faster than loading the file from disk, so I just stopped adding more code...
  • We can upstream it to dmlc-core. The number parsing/writing will help the Parameter class, as this implementation guarantees round-trip reproducible results; a small demonstration of that property is sketched after this list. (Unrelated but interesting fact: to_string doesn't fulfill this requirement, according to https://github.com/miloyip/dtoa-benchmark.) It also helps model dumping and outputting metric evaluation results, as suggested by @hcho3.
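For illustration, here is the round-trip property stated in terms of the standard C++17 <charconv> interface this PR mirrors (a minimal sketch using the standard facility rather than this PR's code; it requires a standard library whose to_chars/from_chars support floating point):

```cpp
#include <charconv>
#include <cstdio>
#include <string>

int main() {
  double x = 0.1;

  // std::to_string uses a fixed six-decimal format, so the output
  // generally does not parse back to the same double.
  std::printf("to_string: %s\n", std::to_string(x).c_str());  // 0.100000

  // A shortest-round-trip printer (Ryu, std::to_chars) emits the
  // shortest string that parses back to exactly the same bits.
  char buf[32];
  auto out = std::to_chars(buf, buf + sizeof(buf), x);
  *out.ptr = '\0';
  std::printf("to_chars:  %s\n", buf);  // 0.1

  double back = 0.0;
  std::from_chars(buf, out.ptr, back);
  std::printf("round-trips: %s\n", x == back ? "yes" : "no");  // yes
  return 0;
}
```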

Cons:

  • It's the first commit, so everything is a bit raw. It will take some more work to mature, as a JSON implementation by itself is not an easy task.
  • I learned many of the concepts here as a newcomer, especially the floating-point IO. I walked through the paper and understand it, but I never tried to finish the proofs.

Other available alternatives whose source code I have looked into:

  • JSON for Modern C++. It has a nice interface but is not performant enough (neither is the old one here).
  • simdjson. It has gained a lot of traction recently, but it's only a parser.
  • sajson. Also only a parser.
  • rapidjson. Suitable and very well written, but the code size is much larger. Also (admittedly not a fair comparison) it has more open issues on GitHub than XGBoost ...

Please share your thoughts.

@trivialfis (Member, Author) commented Nov 22, 2019

@hcho3 I'm not sure what's happening with Jenkins. I tried to SSH in, but it timed out. I can't ping the IP returned by dig; it seems to be disconnected somehow?
