Optimized ApplySplit and UpdatePredictCache functions on CPU #5244
Conversation
@SmirnovEgorRu Can you try running this with the URL dataset too?
@hcho3, sure.
Memory consumption is also similar to before (21231488 KB).
(force-pushed 4d18012 → ff3ae2c)
@SmirnovEgorRu Be careful with the change of author in git commit. ;-)
(force-pushed d0f8c3f → 4283e7d, then 4283e7d → 0c1df2b)
I measured data similar to the previous PR.

Higgs dataset:
Parameters: { 'alpha': 0.9, 'max_bin': 256, 'scale_pos_weight': 2, 'learning_rate': 0.1, 'reg_lambda': 1, 'min_child_weight': 0, 'max_depth': 8, 'max_leaves': 2**8, 'objective': 'binary:logistic' }

Airline dataset:
Parameters: { 'alpha': 0.9, 'max_bin': 256, 'scale_pos_weight': 2, 'learning_rate': 0.1, 'reg_lambda': 1, 'min_child_weight': 0, 'max_depth': 8, 'max_leaves': 2**8, 'objective': 'binary:logistic' }

URL dataset:
Parameters: { 'max_depth': 6, 'tree_method': 'hist' }

Distributed mode on Mortgage dataset: I used a local cluster to test the performance of the distributed case.
Extended list of benchmarks:
OMP env: OMP_NUM_THREADS=48 OMP_PLACES={0}:96:1
HW: AWS c5.metal, CLX 8275 @ 3.0GHz, 24 cores per socket, 2 sockets, HT: on, 96 threads in total
@hcho3, I have a question related to CI. Also, as you can see above, I tested many things: scaling by the number of threads, scaling by niter, memory usage, different data sets including dense/sparse cases, distributed mode, and accuracy. I see performance improvements in many cases and no performance degradation; memory consumption even became slightly lower in many cases, and there is no loss of accuracy anywhere. I would also like to propose merging this into the 1.0 release branch, because I see a 1.8x improvement on average across data sets; it should be a good feature for the major release. I understand that the release branch has already been created, but I tried to cover all the problematic areas in my additional tests and I don't see any issues which could affect the product (assuming the CI failures described above are not real issues).
No, it would take a fair amount of time for us to review this PR, and it seems risky to approve such a large PR so close to 1.0. Let us include it in the next 1.1 release. We plan to make a new release every 2 months or so. @trivialfis After 1.0, I am thinking of preparing 1.1 in about 6 weeks. There are a few fixes that I'd like to see. WDYT?
@hcho3 Sorry, missed this one. Agreed. Let me know if I can help. I briefly looked into this PR and need some more time to understand the changes.
* Remove SimpleArray, as it's only used in the column matrix and resize is only called once per tree.
* Reduce the number of parameters, specifically by computing prefetching at compile time (see the sketch below).
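For illustration, a minimal sketch of what "computing prefetching at compile time" can look like; the struct and constant names here are hypothetical, not taken from the actual diff:

```cpp
#include <cstddef>
#include <xmmintrin.h>  // _mm_prefetch

// Hypothetical sketch: prefetch distances are derived from compile-time
// constants, so they can be folded by the compiler instead of being
// threaded through the call chain as runtime function parameters.
struct Prefetch {
  static constexpr size_t kCacheLineSize = 64;   // bytes per cache line
  static constexpr size_t kPrefetchOffset = 10;  // rows ahead of the current one

  // Number of elements of type T per cache line, known at compile time.
  template <typename T>
  static constexpr size_t GetPrefetchStep() {
    return kCacheLineSize / sizeof(T);
  }
};

// Usage inside a hot row loop: hint the hardware about a row we will
// touch kPrefetchOffset iterations from now.
inline void PrefetchRow(const float* base, size_t row, size_t row_size) {
  _mm_prefetch(
      reinterpret_cast<const char*>(base + (row + Prefetch::kPrefetchOffset) * row_size),
      _MM_HINT_T0);
}
```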
Performance improvement is great! Thanks for your wonderful work on hist algorithm optimization. Here are a few things about code structure; please see the inlined comments too.
- DMatrix should rarely be a parameter of the builder, as the builder uses only the gradient index. So I believe most of the pointers/references to DMatrix are unused variables, or are just used for accessing meta info. Please remove them.
- Please reduce the usage of ibegin, iend and offset as function parameters. You can pass a named Span or Range1d as a parameter, then expand the pointer inside the function scope. This way it is clearer what those parameters point to (see the sketch after this list).
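A minimal sketch of the suggested refactor, assuming xgboost::common::Span from include/xgboost/span.h; the function and parameter names are illustrative only:

```cpp
#include <cstddef>
#include <cstdint>
#include <xgboost/span.h>  // xgboost::common::Span

// Before: raw begin/end parameters leave the reader guessing what the
// range refers to.
void BuildHistRaw(const uint32_t* index, size_t ibegin, size_t iend);

// After: a named Span documents the range; the raw pointer is expanded
// inside the function scope where it is actually needed.
void BuildHist(xgboost::common::Span<const uint32_t> row_index_span) {
  const uint32_t* rows = row_index_span.data();  // expand pointer locally
  for (size_t i = 0; i < row_index_span.size(); ++i) {
    // ... use rows[i] ...
  }
}
```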
Codecov Report
@@ Coverage Diff @@
## master #5244 +/- ##
==========================================
+ Coverage 81.66% 83.76% +2.10%
==========================================
Files 11 11
Lines 2389 2409 +20
==========================================
+ Hits 1951 2018 +67
+ Misses 438 391 -47
Continue to review full report at Codecov.
@trivialfis, thank you very much for the review.
@trivialfis, @hcho3, are any changes required to be committed?
I still need to understand the multiply-by-2 expression and try to make it obvious. Will keep you posted.
@trivialfis, added an explanation in the code.
One CI step failed, but the error is urllib.error.URLError: <urlopen error [Errno 54] Connection reset by peer>. Could we restart it?
Codecov Report
@@ Coverage Diff @@
## master #5244 +/- ##
==========================================
- Coverage 83.76% 83.75% -0.02%
==========================================
Files 11 11
Lines 2409 2413 +4
==========================================
+ Hits 2018 2021 +3
- Misses 391 392 +1
Continue to review full report at Codecov.
@trivialfis @RAMitchell, do you have any new comments/concerns?
LGTM. Defer to others if they have any issues.
const uint32_t missing_val = std::numeric_limits<uint32_t>::max();
for (auto rid : rid_span) {
Just so you are aware, iterating through a Span implies boundary checks and may be slower than a standard for loop accessing memory through raw pointers.
If this section is not performance critical, you should absolutely iterate over the span; I try to avoid optimising things that do not have a visible effect on runtime.
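For illustration, the two looping styles under discussion might look like this (the types are assumed to match the rid_span from the diff):

```cpp
#include <cstddef>
#include <xgboost/span.h>  // xgboost::common::Span

void WalkRows(xgboost::common::Span<const size_t> rid_span) {
  // Range-based iteration: Span iterators may bounds-check each access,
  // which can add overhead in hot loops.
  for (auto rid : rid_span) {
    (void)rid;  // ... process row rid ...
  }

  // Raw-pointer loop: no bounds checks, potentially faster, less safe.
  const size_t* rid_ptr = rid_span.data();
  for (size_t i = 0; i < rid_span.size(); ++i) {
    // ... process row rid_ptr[i] ...
  }
}
```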
I tested performance again and I didn't observe a regression vs. the previous version.
So let's keep it as is (additional checks are not bad if they don't affect performance).
GHistRow hist) {
  const size_t size = row_indices.Size();
  const size_t* rid = row_indices.begin;
  const float* pgh = reinterpret_cast<const float*>(gpair.data());
The name pgh is not meaningful: https://google.github.io/styleguide/cppguide.html#General_Naming_Rules
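For instance (packed_gradients is just an illustrative replacement, not a name taken from the PR):

```cpp
// Opaque abbreviation, hard to decipher at the call sites:
const float* pgh = reinterpret_cast<const float*>(gpair.data());

// A name describing what the pointer holds reads much better:
const float* packed_gradients = reinterpret_cast<const float*>(gpair.data());
```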
Overall looks good to me, thanks! Just a suggestion for the future: we might want to do some refactoring of this monolithic updater, like splitting the builder into different components for loss-guided and depth-wise growth, plus an independent component for row partitioning, etc. (a rough sketch follows below). I want to add extra functionality to both gpu_hist and hist, and having nicer code would make our lives easier. ;-)
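A purely hypothetical sketch of such a decomposition; none of these class names exist in the code base, they only illustrate the direction:

```cpp
// Hypothetical decomposition of the monolithic hist builder.
class RowPartitioner {  // independent component for row partitioning
 public:
  void UpdatePosition(/* split info */) { /* move rows to child nodes */ }
};

// Common interface implemented by the tree-growing strategies.
class GrowPolicy {
 public:
  virtual ~GrowPolicy() = default;
  virtual void ExpandTree(RowPartitioner* partitioner) = 0;
};

class DepthWisePolicy : public GrowPolicy {
 public:
  void ExpandTree(RowPartitioner* partitioner) override { /* level by level */ }
};

class LossGuidedPolicy : public GrowPolicy {
 public:
  void ExpandTree(RowPartitioner* partitioner) override { /* best split first */ }
};
```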
@trivialfis, thank you!
This PR includes changes from #5156 and will be rebased after that one is merged.
This commit achieves the same performance as before the revert of #5104: