
[WIP] Optimize the JSON implementation. #5046

Closed
wants to merge 6 commits

Conversation

@trivialfis (Member) commented Nov 18, 2019

I reimplemented JSON in XGBoost for better performance. The new version handles numeric-heavy data at about 200 MB per second for both generating and reading (they share the same code, so the perf difference should be negligible). For manually generated large (~1 GB) data it's still about 15% slower than rapidjson due to less efficient memory management (they have a pool memory allocator while I just use a vector). But it should be more than enough for XGBoost IO, as the generated file is normally smaller than 30 MB, which can be processed almost instantly.

Per an offline conversation with @RAMitchell, a fast JSON implementation is non-trivial; one of the critical parts is floating-point IO. For printing floating-point values I adopted Ryū from https://github.com/ulfjack/ryu with some refactoring and added comments. It's blazing fast but also complicated. Most other existing ports are done line by line (see the readme in that repo). So this PR is asking for comments: do we want to proceed with the JSON IO idea?

The to_chars interface is adopted from C++17.
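For reference, this is the rough shape of the standard `std::to_chars` facility being mirrored (a minimal sketch of the C++17 interface, not code from this PR; note that some standard libraries shipped the floating-point overloads well after C++17):

```cpp
#include <charconv>  // std::to_chars, std::to_chars_result
#include <cstdio>

int main() {
  char buf[32];
  // Writes the shortest decimal form that parses back to exactly
  // the same double; no null terminator is appended.
  std::to_chars_result res = std::to_chars(buf, buf + sizeof(buf), 0.25);
  *res.ptr = '\0';  // res.ptr points one past the last written character
  std::printf("%s\n", buf);  // prints "0.25"
  return 0;
}
```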

I haven't removed the old implementation yet. I will merge this into the model IO PR and close this one if the new implementation is approved.

@trams This is the optimization PR.

@RAMitchell (Member) left a comment

2,700 lines is a lot of code. If you are not available, someone else may have to understand and debug this. Let's revisit the idea of using rapidjson in #3980.

@trivialfis (Member, Author)

Just to clarify, the new one will replace the old one, so the net gain will be about 700 lines. (Not to scare others.) ;-)

@trams (Contributor) commented Nov 20, 2019

I still do not quite get why we need a fast JSON implementation. Loading models fast is nice, but I can't imagine a use case where it matters: you load a model once.

Either way, this review is really big (at least for me). It is 2,000 lines of additional code, regardless of whether we remove the previous implementation.

I briefly looked into it, and I think one thing that is missing is benchmarks. I understand why: writing a proper benchmark is extremely hard. If you decide to do it, I suggest taking a look at this amazing (eye-opening for me) talk from StrangeLoop 2019: https://www.thestrangeloop.com/2019/performance-matters.html. The speaker talks about a set of tools for doing statistically sound benchmarks.

P.S. Writing a fast JSON parser is fun.
P.P.S. I do not think I will have enough time to look into this until mid-December.

@trivialfis (Member, Author) commented Nov 20, 2019

Either way, this review is really big (at least for me). It is 2,000 lines of additional code, regardless of whether we remove the previous implementation.

I understand the concern. This PR is mostly for seeing whether others want to add a dependency or do it ourselves; it's easier to draw others' attention with actual code.

I briefly looked into it, and I think one thing that is missing is benchmarks.

I have some specific benchmark snippets for loading/writing XGBoost models that compare a few different implementations, including rapidjson, JSON for Modern C++, simdjson (parsing only, as it has no generation), the one currently living in XGBoost, and this new one. A small summary is given in the PR description. I will try to make a nice table and a repo so everyone can see. The floating-point IO implemented in this PR also helps model dumping (plotting models).
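As an illustration only, the snippets amount to a throughput harness like the sketch below (the `ParseJson` stub is a hypothetical stand-in for whichever parser is being measured, and the timing deliberately ignores warm-up and variance, which is exactly the kind of issue the talk below addresses):

```cpp
#include <chrono>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical stand-in: replace the body with a call into the
// parser under test (rapidjson, this PR's parser, etc.).
std::size_t ParseJson(std::string const& buffer) {
  std::size_t nodes = 0;
  for (char c : buffer) {
    if (c == '{' || c == '[') { ++nodes; }  // trivial placeholder work
  }
  return nodes;
}

int main(int argc, char** argv) {
  if (argc != 2) {
    std::cerr << "usage: bench <file.json>\n";
    return 1;
  }
  // Read the whole document into memory so we measure parsing, not disk IO.
  std::ifstream fin(argv[1], std::ios::binary);
  std::ostringstream ss;
  ss << fin.rdbuf();
  std::string buffer = ss.str();

  auto start = std::chrono::steady_clock::now();
  std::size_t nodes = ParseJson(buffer);
  auto stop = std::chrono::steady_clock::now();

  double sec = std::chrono::duration<double>(stop - start).count();
  double mb = static_cast<double>(buffer.size()) / (1024.0 * 1024.0);
  std::cout << nodes << " nodes, " << mb / sec << " MB/s\n";
  return 0;
}
```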

The speaker talks about a set of tools for doing statistically sound benchmarks.

That's interesting; I just use valgrind. I will try the tools introduced there.

@hcho3 (Collaborator) commented Nov 20, 2019

@trams

I still do not quite get why we need a fast JSON implementation. Loading models fast is nice, but I can't imagine a use case where it matters: you load a model once.

Not quite. According to @RAMitchell, model serialization may become a bottleneck in distributed training. See #3980 (comment)

@trivialfis force-pushed the json-new-impl branch 2 times, most recently from e394810 to edce0c6 on Nov 21, 2019, 16:55
@trivialfis (Member, Author) commented Nov 21, 2019

@hcho3 The PR now removes the old implementation. I will do some further clean-ups, benchmarking, and unit tests. But it would be a good time to discuss the road map for our IO issues.

Beyond the previous RFCs, using JSON for model IO has other benefits:

  • In a distributed setting we have an easier way to visualize the model, as suggested by @chenqin.
  • It allows the easiest external use of a trained model: anyone can load the model in basically any language.
  • We can add an API to extract the internal configuration and load it back. This not only helps users gain insight into their parameters, but also lets them pass a JSON object as an alternative to our old params = {...} training arguments (see the sketch after this list). To me this is very attractive, as structured input is more intuitive and robust.
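To make the last point concrete, such a configuration object could look something like this (a hypothetical layout; the actual field names and nesting would be decided in the model IO work, not by this PR):

```json
{
  "learner": {
    "objective": "binary:logistic",
    "gradient_booster": {
      "name": "gbtree",
      "tree_train_param": {
        "max_depth": 6,
        "eta": 0.3
      }
    }
  }
}
```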

So I prefer that we keep the JSON IO idea. The question is whether we want to do it ourselves or take on an external dependency. With the former, everyone here might have to co-maintain this piece of code even if we upstream it to dmlc-core, as stated by @RAMitchell and in the spirit of @hcho3's bus factor. (And there are lots of buses in my place, so ...)

I prefer my own implementation, because:

  • I understand it, and I can fix any issue with it quickly. The most suitable external implementation I can find is rapidjson, which is 5 times more code than this one. Code not living in this repository doesn't mean it's code we don't need to care about.
  • It's fast. Not as fast as rapidjson (but close), as summarized in the PR description. So far, according to valgrind, memory allocation/free is the bottleneck, which could be fixed with an in-place pool allocator. But even without further optimization it's still faster than loading the file from disk, so I just stopped adding more code...
  • We can upstream it to dmlc-core. The number parsing/writing will help the Parameter class, as this implementation guarantees round-trip reproducible results; a small demonstration of that property is sketched after this list. (Unrelated but interesting fact: to_string doesn't fulfill this requirement, according to https://github.com/miloyip/dtoa-benchmark.) It also helps model dumping and outputting metric evaluation results, as suggested by @hcho3.
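For illustration, here is the round-trip property stated in terms of the standard C++17 <charconv> interface this PR mirrors (a minimal sketch using the standard facility rather than this PR's code; it requires a standard library whose to_chars/from_chars support floating point):

```cpp
#include <charconv>
#include <cstdio>
#include <string>

int main() {
  double x = 0.1;

  // std::to_string uses a fixed six-decimal format, so the output
  // generally does not parse back to the same double.
  std::printf("to_string: %s\n", std::to_string(x).c_str());  // 0.100000

  // A shortest-round-trip printer (Ryu, std::to_chars) emits the
  // shortest string that parses back to exactly the same bits.
  char buf[32];
  auto out = std::to_chars(buf, buf + sizeof(buf), x);
  *out.ptr = '\0';
  std::printf("to_chars:  %s\n", buf);  // 0.1

  double back = 0.0;
  std::from_chars(buf, out.ptr, back);
  std::printf("round-trips: %s\n", x == back ? "yes" : "no");  // yes
  return 0;
}
```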

Cons:

  • It's the first commit, so everything is a bit raw. It will take some more work to mature, as a JSON implementation by itself is not an easy task.
  • I learned many of the concepts here as a newcomer, especially the floating-point IO. I walked through the paper and understand it, but I never tried to finish the proofs.

Other available alternatives whose source code I have looked into:

  • JSON for Modern C++. It has a nice interface but is not performant enough (neither is the old one here).
  • simdjson. It has gained a lot of traction recently, but it's only a parser.
  • sajson. Also only a parser.
  • rapidjson. Suitable and very well written, but the code size is much larger. Also (admittedly not a fair comparison) it has more open issues on GitHub than XGBoost ...

Please share your thoughts.

@trivialfis (Member, Author) commented Nov 22, 2019

@hcho3 I'm not sure what's happening with Jenkins. I tried to SSH in, but it timed out. I can't ping the IP returned by dig; it seems to be disconnected somehow?
