Add JSON IO for various components. #4732
Conversation
@hetong007 Do you have any suggestions for releasing the CRAN source distribution with CMake instead of Make?
Also @pommedeterresautee ;-).
I'm probably the least experienced developer of XGBoost in terms of CMake. Adding @khotilov as another potential source of suggestions.
@trivialfis @hetong007 I'm quite nervous about requiring CMake in the R package. There are lots of different machines the R package gets tested on, and we have to pass every single one of them in order to keep the package on CRAN. See https://cran.r-project.org/web/checks/check_results_xgboost.html for the list of machines. My concern is that not every one of the testing machines would have CMake installed. Of course, we could simply ask users to compile from source and not bother with CRAN, but then we'd lose the ability to install XGBoost with …
@hcho3 Thanks for the link. Let's keep them there for now. Will open a new issue when I have something concrete.
Hi, there is another point to take into account: I have made a PR on … I am wondering if switching to CMake would break things or, even worse, make it harder for newcomers to set up the toolchain to be able to install GitHub code.
@trams Please help review. This is extracted from the previous monolithic PR, with specialization and schema removed. Currently the performance is quite bad: it takes about 1.5s to load and 2.0s to save a 25MB JSON model. Running valgrind, most of the time is spent on memory allocation (std::vector from JsonArray). I plan to add a custom pool allocator for all objects, or use the input string as a buffer, in a later PR. For now I'm going to prioritize correctness and will fix the failing tests. Unsurprisingly, handling those dynamic parameters is difficult, but the general structure for JSON integration should be ready for review.
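For context, a minimal sketch of the pool-allocator idea mentioned above, assuming values are carved out of large contiguous blocks and released all at once. The class name and block size are illustrative, not XGBoost's actual code:

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical bump/pool allocator: hand out chunks from big blocks so each
// JSON node avoids an individual malloc; all memory is released together
// when the pool is destroyed.
class JsonValuePool {
 public:
  explicit JsonValuePool(std::size_t block_bytes = 1 << 20)
      : block_bytes_{block_bytes} {}

  void* Allocate(std::size_t n) {
    // Round the request up so the returned pointer stays suitably aligned.
    constexpr std::size_t kAlign = alignof(std::max_align_t);
    n = (n + kAlign - 1) / kAlign * kAlign;
    if (blocks_.empty() || used_ + n > capacity_) {
      capacity_ = std::max(n, block_bytes_);
      blocks_.emplace_back(new char[capacity_]);
      used_ = 0;
    }
    void* p = blocks_.back().get() + used_;
    used_ += n;
    return p;
  }

 private:
  std::size_t block_bytes_;
  std::size_t capacity_{0};
  std::size_t used_{0};
  std::vector<std::unique_ptr<char[]>> blocks_;
};
```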
@pommedeterresautee Thanks for the information. Let me see if I can simplify those makefiles.
@trivialfis, thank you for mentioning me in this review. As for performance, I recently learned about https://github.com/lemire/simdjson, a very fast JSON parser (~2-3 GB per second). I don't know your reasons for implementing your own JSON parser instead of reusing a third-party one, but either way I suggest diving into their code for tips on how they implemented a high-performance JSON parser (see one of their benchmarks here: https://lemire.me/blog/2019/08/02/json-parsing-simdjson-vs-json-for-modern-c/).
@trams Thanks, I'm aware of simdjson, but requiring AVX is not feasible in XGBoost, and we don't want to add dependencies. I will try simpler methods like iterative parsing, a single allocator, etc. I'm not too worried about that.
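As a toy illustration of the "iterative parsing" direction (not XGBoost's actual parser), recursion can be replaced with an explicit stack so deeply nested input doesn't grow the call stack. The helper below is hypothetical and skips string literals and escapes for brevity:

```cpp
#include <cassert>
#include <stack>
#include <string>

// Hypothetical helper: iteratively checks that braces/brackets in a
// JSON-like string are balanced, using an explicit stack instead of
// recursive descent.
bool BalancedIterative(std::string const& text) {
  std::stack<char> open;
  for (char c : text) {
    if (c == '{' || c == '[') {
      open.push(c);
    } else if (c == '}' || c == ']') {
      if (open.empty()) { return false; }
      char top = open.top();
      open.pop();
      if ((c == '}' && top != '{') || (c == ']' && top != '[')) {
        return false;
      }
    }
  }
  return open.empty();
}

int main() {
  assert(BalancedIterative(R"({"nodes": [1, [2, 3]]})"));
  assert(!BalancedIterative("[{]}"));
  return 0;
}
```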
Emm. I might need to add an …
@RAMitchell Let me try to extract an even smaller part of this first, to push the cudf PR forward.
@hcho3 @RAMitchell The added …
Not a full review. I hope to find some time to finish it in the next few days.
src/tree/tree_model.cc (Outdated)

```cpp
j_node[5] = n.DefaultLeft();
CHECK(IsA<Boolean>(j_node[5]));

v_j_nodes[i] = j_node;
```
Is there an implicit constructor call to Array here? If so, could you make it explicit to make it easier to read?
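For illustration, the explicit form might look like the following, assuming `Array` wraps a `std::vector<Json>` and `Json` has a constructor taking an `Array`, as the surrounding code suggests:

```cpp
// Implicit: conversion from std::vector<Json> happens behind the scenes.
v_j_nodes[i] = j_node;

// Explicit: spell out the Array construction so the reader sees the type.
v_j_nodes[i] = Json(Array(std::move(j_node)));
```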
src/tree/tree_model.cc (Outdated)

```cpp
std::vector<Json> v_j_nodes(nodes_.size());
for (size_t i = 0; i < nodes_.size(); ++i) {
  auto const& n = nodes_[i];
  std::vector<Json> j_node(6);
```
This may be a perfect case for using C++11 initializer_list
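A hedged sketch of that suggestion, assuming node accessors like `n.LeftChild()` and the `Json` wrapper types (`Integer`, `Number`, `Boolean`) seen elsewhere in this diff; only index 5 (`DefaultLeft`) is confirmed by the snippet above, the other fields are illustrative:

```cpp
// Build all six entries at once instead of assigning by index.
std::vector<Json> j_node {
  Json(Integer(n.LeftChild())),   // 0 (assumed layout)
  Json(Integer(n.RightChild())),  // 1 (assumed layout)
  Json(Integer(n.Parent())),      // 2 (assumed layout)
  Json(Integer(n.SplitIndex())),  // 3 (assumed layout)
  Json(Number(n.SplitCond())),    // 4 (assumed layout)
  Json(Boolean(n.DefaultLeft()))  // 5 (matches the snippet above)
};
```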
@trams Thanks for the review. Those are very helpful suggestions. Will get back to this once available.
@RAMitchell To summarize, other than bug fixes, this PR does 4 things: …

So why is it so huge? Because the JSON IO creates a framework for easy testing. It's meant to help create reproducible training, where binary IO is not feasible. Also, with JSON you can access and verify all internal parameters of XGBoost, hence easy testing for the second item.
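To make the testing claim concrete, a minimal sketch of the intended workflow, assuming `Json::Load`/`Json::Dump` entry points and `get<>` accessors along the lines of this PR; the exact field path is illustrative:

```cpp
#include <string>

// Parse a serialized model, inspect an internal parameter, and round-trip it.
std::string buf {R"({"learner": {"gradient_booster": {"name": "gbtree"}}})"};
Json model { Json::Load({buf.c_str(), buf.size()}) };

// With JSON, internal parameters are addressable and checkable in tests:
auto const& name = get<String>(model["learner"]["gradient_booster"]["name"]);
CHECK_EQ(name, "gbtree");

// Serialize again; a reproducible round-trip should yield equivalent JSON.
std::string out;
Json::Dump(model, &out);
```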
@RAMitchell The R tests actually pass on Jenkins ... I'm gonna ignore the error I found.

```r
> require(data.table)
Loading required package: data.table
data.table 1.12.2 using 4 threads (see ?getDTthreads). Latest news: r-datatable.com
> a <- data.table(c=1:3, d=2:4)
> str(a)
Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
 $ c: int 1 2 3
 $ d: int 2 3 4
 - attr(*, ".internal.selfref")=<externalptr>
```

Notice the single quote around 'data.frame'.
* Add tests for JSON IO in both CXX.
* Rigorous tests for training continuation.
* Add basic documentation for the serialization format.
@trivialfis Can we close this, or is there more work to be done in JSON serialization?
Will close it once the R package is updated.
Sorry for being slow.
* `num_feature`, `base_score` and `num_output_group` across XGBoost with `learner_model_param_` (`LearnerModelParam`). This is passed as a pointer to const across XGBoost. Deprecate all duplicated parameters. Mark the original model parameter as legacy.
* `num_output_group` and `num_class`.
* `gbtree` instead of `predictor`, as the cache needs to be in sync between `cpu_predictor` and `gpu_predictor`.

Todos:

* `num_output_group`, `num_feature`, `base_score` across XGBoost.
* `num_output_group` and `num_class`, remove related configurations.
* `gpu_id`. Closes [WIP] Hack for GPU ID config #5072.
* `nthreads`. This should be done together with [WIP] Clarify default number of threads. #4975.
* `gpu_id` and `gpu_predictor` from binary model.
* `dmlc::Parameter`. #4755