Optimizations for CPU #4529
Conversation
Thanks for the huge effort! A few questions in general:
@chenqin Hi!
Here I mean mostly vector instruction sets. Currently XGBoost uses SSE2 only, but AVX-512 can be used as well, for example for gradient/hessian computations:
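As a minimal sketch of the idea for a squared-error objective (the function name and data layout here are hypothetical, not XGBoost's actual code):

```cpp
// Illustrative only: per-row gradient/hessian computation for squared-error
// loss using AVX-512 intrinsics. Compile with -mavx512f.
#include <immintrin.h>
#include <cstddef>

void ComputeGradHessAVX512(const float* preds, const float* labels,
                           float* grad, float* hess, std::size_t n) {
  const __m512 ones = _mm512_set1_ps(1.0f);
  std::size_t i = 0;
  // 16 floats per iteration, versus 4 with SSE2.
  for (; i + 16 <= n; i += 16) {
    __m512 p = _mm512_loadu_ps(preds + i);
    __m512 y = _mm512_loadu_ps(labels + i);
    _mm512_storeu_ps(grad + i, _mm512_sub_ps(p, y));  // g = p - y
    _mm512_storeu_ps(hess + i, ones);                 // h = 1 for squared error
  }
  for (; i < n; ++i) {  // scalar tail for the remaining rows
    grad[i] = preds[i] - labels[i];
    hess[i] = 1.0f;
  }
}
```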
One problem here is that the compiler in some cases can't auto-vectorize your code; intrinsic functions help a lot there. Also, if we want to support different instruction sets (for example AVX-512 and SSE2) in one library file, we need separate code paths for them (see the dispatch sketch below). It can be done, but it requires some changes in the build system. I think, if the topic of gradient-boosting optimizations is of interest, we could write an article for the web site (as was done for the GPU). From my side, I can provide materials about this topic. @chenqin, @hcho3 What do you think?
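To make the "separate code paths" point concrete, here is a minimal sketch of runtime dispatch using GCC/Clang's `__builtin_cpu_supports`; the kernel names are hypothetical, and each variant would live in a translation unit built with the matching `-m` flags:

```cpp
// Hypothetical runtime dispatch between ISA-specific kernel variants.
// ComputeGradHessAVX512 would be compiled with -mavx512f,
// ComputeGradHessSSE2 with the x86-64 baseline flags.
#include <cstddef>

void ComputeGradHessAVX512(const float*, const float*, float*, float*, std::size_t);
void ComputeGradHessSSE2(const float*, const float*, float*, float*, std::size_t);

using GradHessFn = void (*)(const float*, const float*, float*, float*, std::size_t);

GradHessFn SelectGradHessKernel() {
  // GCC/Clang builtin; queries CPUID for the AVX-512 Foundation feature.
  if (__builtin_cpu_supports("avx512f")) return ComputeGradHessAVX512;
  return ComputeGradHessSSE2;  // SSE2 is guaranteed on x86-64
}
```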
Thanks for the hard work, this is incredible. ;-) I'm going to work on this along with DMatrix soon, so I will approve as long as all tests pass (as in the current case). @hcho3
Hi @hcho3,
Reviewing now
@hcho3 I'm only approving the PR because the tests passed, and I think optimizing the histogram build will be beneficial to future improvements. But if we merge this, someone needs to do a huge refactor of quantile_hist and clarify how all these pieces fit together. Personally, I would prefer optimizing this by thinking about the whole structure. I think there are lots of opportunities for optimization if we look at how the whole procedure is run: how many of the steps are necessary if we do some preprocessing of the data, like calculating the number of non-zero columns; implementing the sparse histogram you mentioned; whether it's possible to do the grouping into blocks before constructing quantile cuts, etc. Or whether it's possible to implement a prediction cache for multi-class objectives so that the CSR format may not be needed at all.

BTW @RAMitchell, when I mentioned "column matrix", I meant this class: https://github.com/dmlc/xgboost/blob/master/src/common/column_matrix.h

Going to instruction-level optimization is cool, but to me quantile hist has a lot more room for improvement in other places, and vectorization might not be the critical point. I don't have the time to do it myself right now; these are just my opinions.
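For context on the histogram build being discussed: for each tree node, the hist method accumulates gradient statistics into bins of the quantized (cut-indexed) matrix. A schematic sketch of that inner loop, with hypothetical names rather than XGBoost's actual implementation:

```cpp
// Schematic of the hist-method inner loop (illustrative, not XGBoost's code).
// bin_index holds globally offset bin ids (each feature's bins occupy a
// distinct range), so a single flat histogram array suffices.
#include <cstddef>
#include <cstdint>
#include <vector>

struct GradStats { double sum_grad = 0.0, sum_hess = 0.0; };

void BuildHistogram(const std::vector<std::uint32_t>& bin_index,
                    const std::vector<float>& grad,
                    const std::vector<float>& hess,
                    const std::vector<std::size_t>& rows_in_node,
                    std::size_t n_features,
                    std::vector<GradStats>* hist) {
  for (std::size_t r : rows_in_node) {          // rows assigned to this node
    for (std::size_t f = 0; f < n_features; ++f) {
      std::uint32_t bin = bin_index[r * n_features + f];
      (*hist)[bin].sum_grad += grad[r];         // accumulate g into the bin
      (*hist)[bin].sum_hess += hess[r];         // accumulate h into the bin
    }
  }
}
```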
@SmirnovEgorRu Thanks a lot for preparing the pull request. The code is better organized and more comprehensible than #4433, and the benchmark results are nice. I also like how all operations are now batched over multiple nodes. Overall, I'm inclined to approve this pull request.
@trivialfis I agree that more clarification of what each component does is in order.
Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
@hcho3, thanks for the review. I have applied all comments; CI is green.
Looks like the R project website (https://cloud.r-project.org) is currently down, causing Windows R tests to fail. For now, we will ignore the failed tests.
@hcho3, thank you!
@SmirnovEgorRu It's merged now. Thanks!
Thanks for including these changes! Is there an ETA as to when a release can be expected?
Are results (model, predictions) reproducible for the same machine, same random state, and same input?
@arnocandel I tested this on multiple data sets; it was reproducible even on multicore systems.
Hi @hcho3, @trivialfis,
See #4680
It is this PR: #4433, but I have merged all the commits into one and applied the comments from the review.
My next steps here: