Optimized ApplySplit and UpdatePredictCache functions on CPU #5244

Merged: 9 commits, Feb 29, 2020

Contributor

@SmirnovEgorRu SmirnovEgorRu commented Jan 29, 2020

This PR includes the changes from #5156 and will be rebased once that one is merged.
This commit achieves the same performance as before reverting #5104:

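| higgs1m | ApplySplit | EvaluateSplit | BuildHist | SyncHistogram | Prediction |
| --- | --- | --- | --- | --- | --- |
| Master | 33 | 29 | 90 | 26 | 3 |
| Before reverting | 3.7 | 3.5 | 6.2 | 0.0 | 1.6 |
| This PR | 3.65 | 1.3975 | 6.75 | 1.86 | 0.74 |

| airline-ohe | ApplySplit | EvaluateSplit | BuildHist | SyncHistogram | Prediction |
| --- | --- | --- | --- | --- | --- |
| Master | 26 | 27 | 67 | 12 | 2 |
| Before reverting | 9.0 | 6.1 | 28.8 | 0.0 | 0.7 |
| This PR | 4.579 | 0.75 | 29.05 | 0.669 | 0.352 |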

@hcho3
Collaborator

hcho3 commented Jan 29, 2020

@SmirnovEgorRu Can you try running this with the URL dataset too?

@SmirnovEgorRu
Contributor Author

@hcho3, sure,

| Branch | Time of Update, sec |
| --- | --- |
| master | 139.3 |
| this PR | 29.1 |

Memory consumption is also similar to before (21231488 KB).
HW: Xeon E5-2680 v3 @ 2.50GHz, 2 sockets, 14 cores per socket

@trivialfis
Member

trivialfis commented Jan 31, 2020

@SmirnovEgorRu Be careful with the change of author in git commit. ;-)

@SmirnovEgorRu SmirnovEgorRu force-pushed the apply_spliy_opt_2 branch 2 times, most recently from d0f8c3f to 4283e7d Compare February 1, 2020 18:35
@SmirnovEgorRu
Contributor Author

SmirnovEgorRu commented Feb 1, 2020

I collected the same measurements as for the previous PR.

Higgs dataset:

| nthreads (w/ 1000 trees) | 1 | 8 | 24 | 48 | 96 |
| --- | --- | --- | --- | --- | --- |
| This PR, sec | 207.7 | 44.3 | 22.2 | 18.1 | 26.5 |
| Master, sec | 280.9 | 87.4 | 48.4 | 44.8 | 45.3 |
| Memory PR, KB | 1245208 | 1284828 | 1293456 | 1292076 | 1302464 |
| Memory master, KB | 1241536 | 1272884 | 1294728 | 1300284 | 1314112 |
| LogLoss this PR | 0.525144 | 0.525144 | 0.525144 | 0.525144 | 0.525144 |
| LogLoss master | 0.525144 | 0.525144 | 0.525144 | 0.525144 | 0.525144 |

| niter (w/ 48 threads) | 50 | 200 | 500 | 1000 |
| --- | --- | --- | --- | --- |
| This PR, sec | 2.0 | 4.8 | 9.9 | 18.2 |
| Master, sec | 3.5 | 10.7 | 22.7 | 44.8 |
| Memory PR, KB | 1292728 | 1292756 | 1293332 | 1299388 |
| Memory master, KB | 1289544 | 1301496 | 1293892 | 1300076 |
| LogLoss this PR | 0.566303 | 0.543322 | 0.531112 | 0.525144 |
| LogLoss master | 0.566303 | 0.543322 | 0.531112 | 0.525144 |

Parameters:

 { 'alpha':  0.9, 'max_bin': 256, 'scale_pos_weight': 2,
'learning_rate': 0.1, 'reg_lambda': 1, "min_child_weight": 0,
'max_depth': 8,  'max_leaves': 2**8, 'objective': 'binary:logistic' }

Airline dataset:

| nthreads (w/ 1000 trees) | 1 | 8 | 24 | 48 | 96 |
| --- | --- | --- | --- | --- | --- |
| This PR, sec | 953.1 | 187.9 | 83.3 | 60.1 | 64.3 |
| Master, sec | 953.7 | 211.4 | 105.0 | 94.6 | 87.1 |
| Memory PR, KB | 25387904 | 25549072 | 25562984 | 25567208 | 25569168 |
| Memory master, KB | 25388064 | 25548864 | 25562324 | 25567528 | 25572872 |
| LogLoss this PR | 0.461403 | 0.461403 | 0.461403 | 0.461403 | 0.461403 |
| LogLoss master | 0.461403 | 0.461403 | 0.461403 | 0.461403 | 0.461403 |

| niter (w/ 48 threads) | 50 | 200 | 500 | 1000 |
| --- | --- | --- | --- | --- |
| This PR, sec | 27.2872 | 37.7223 | 52.9934 | 73.568467 |
| Master, sec | 28.8187 | 40.0345 | 60.4123 | 94.6476 |
| Memory PR, KB | 25566544 | 25566472 | 25566640 | 25566340 |
| Memory master, KB | 25567788 | 25561672 | 25567108 | 25567068 |
| LogLoss this PR | 0.478638 | 0.469505 | 0.465152 | 0.461403 |
| LogLoss master | 0.478638 | 0.469505 | 0.465152 | 0.461403 |

Parameters:

 { 'alpha':  0.9, 'max_bin': 256, 'scale_pos_weight': 2,
'learning_rate': 0.1, 'reg_lambda': 1, "min_child_weight": 0,
'max_depth': 8,  'max_leaves': 2**8, 'objective': 'binary:logistic' }

URL dataset:

| nthreads (10 iter) | 8 | 24 | 48 | 96 |
| --- | --- | --- | --- | --- |
| This PR, sec | 58.7 | 40.7 | 43.1 | 49.9 |
| Master, sec | 60.7 | 41.4 | 43.8 | 51.2 |
| Memory PR, GB | 18.84 | 20.17 | 22.25 | 26.30 |
| Memory master, GB | 18.84 | 20.16 | 22.23 | 26.28 |

Parameters:

{'max_depth': 6,'tree_method':'hist'}

Distributed mode on Mortgage data set:

I used a local cluster to test performance in the distributed case.

| Mortgage 2000Q1 | 2 workers, 24 threads per worker | 48 workers, 2 threads per worker |
| --- | --- | --- |
| Master, sec | 273.7 | 198.0 |
| This PR, sec | 253.0 | 193.5 |

| Mortgage 2000Q1 | 2 workers, 24 threads per worker | 48 workers, 2 threads per worker |
| --- | --- | --- |
| Master, rmse | 9.33264 | 9.31267 |
| This PR, rmse | 9.33264 | 9.31267 |

Extended list of benchmarks:

| Dataset | higgs1m | Letters | Airline-ohe | MNIST | MSRank-30K | Mortgage |
| --- | --- | --- | --- | --- | --- | --- |
| Before reverting, sec | 15.5 | 10.3 | 55.3 | 69.5 | 99.1 | 18.3 |
| Current PR, sec | 14.7 | 10.1 | 59.4 | 80.5 | 111.6 | 19.0 |
| Master, sec | 40.2 | 15.5 | 91.0 | 97.1 | 180.4 | 37.9 |
| Gain this PR vs. master | 2.7 | 1.5 | 1.5 | 1.2 | 1.6 | 2.0 |

| Dataset | higgs1m | Letters | Airline-ohe | MNIST | MSRank-30K | Mortgage |
| --- | --- | --- | --- | --- | --- | --- |
| LogLoss\RMSE, this PR | 0.525167381247577 | 0.016770285209655 | 0.461402758989979 | 0.072397198881340 | 0.802009880542755 | 0.096547365188599 |
| LogLoss\RMSE, master | 0.525167381247577 | 0.016770285209655 | 0.461402758989979 | 0.072397198881340 | 0.802009880542755 | 0.096547365188599 |

OMP env: OMP_NUM_THREADS=48 OMP_PLACES={0}:96:1

HW

AWS c5.metal, CLX 8275 @3.0GHz, 24 cores per socket, 2 sockets, HT: on, 96 threads in total

@SmirnovEgorRu
Contributor Author

@hcho3, I have a question related to CI.
I see a failure in one Python test with std::bad_alloc. I tried to reproduce it locally, but all tests pass and peak memory usage is only ~600MB. I suspect it is a sporadic problem; this test also passed before my small code refactoring. Could you please restart it?

Also, as you can see above, I tested many things: scaling by threads, scaling by niter, memory usage, different data sets (dense and sparse), distributed mode and accuracy.

I see performance improvements in many cases and no degradation; memory consumption even became slightly lower in many cases, and there is no loss of accuracy in any case. I would also like to propose merging this into the 1.0 release branch: I see a 1.8x improvement on average across data sets, which would make a good feature for the major release. I understand that the release branch has already been created, but I tried to cover all the problematic areas in my additional tests, and I don't see any issues that could affect the product (provided the CI failure described above is not a real issue).
What is your opinion? If I need to check anything else, I'm ready to do so.

| Dataset | higgs1m | Letters | Airline-ohe | MNIST | MSRank-30K | Mortgage | Average gain |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Gain this PR vs. master | 2.7 | 1.5 | 1.5 | 1.2 | 1.6 | 2.0 | 1.8 |
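
The average gain can be rechecked from the timings in the extended list of benchmarks. The following is a minimal C++ sketch, not code from this PR; the function names `Speedups` and `MeanGain` are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-dataset speedup: master time divided by PR time (higher is better).
std::vector<double> Speedups(const std::vector<double>& master_sec,
                             const std::vector<double>& pr_sec) {
  assert(master_sec.size() == pr_sec.size());
  std::vector<double> gains;
  gains.reserve(master_sec.size());
  for (std::size_t i = 0; i < master_sec.size(); ++i) {
    gains.push_back(master_sec[i] / pr_sec[i]);
  }
  return gains;
}

// Arithmetic mean of the per-dataset speedups.
double MeanGain(const std::vector<double>& gains) {
  assert(!gains.empty());
  double sum = 0.0;
  for (double g : gains) sum += g;
  return sum / static_cast<double>(gains.size());
}
```

Given the "Master, sec" and "Current PR, sec" rows of that list, `MeanGain(Speedups(master, pr))` evaluates to about 1.77, which is consistent with the rounded 1.8 reported here.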

@hcho3
Collaborator

hcho3 commented Feb 2, 2020

No, it would take a fair amount of time for us to review this PR, and it seems risky to approve this large magnitude PR so close to 1.0. Let us include it in the next 1.1 release. We plan to make a new release every 2 months or so.

@trivialfis After 1.0, I am thinking of preparing 1.1 in about 6 weeks. There are a few fixes that I’d like to see. WDYT?

@trivialfis
Member

@hcho3 Sorry, missed this one. Agreed. Let me know if I can help.

Briefly looked into this PR, need some more time to understand the changes.

* Remove SimpleArray as it's only used in column matrix, and resize is only
called once per tree.

* Reduce the number of parameters, specifically by computing prefetching at
compile time.
Member

@trivialfis trivialfis left a comment

Performance improvement is great! Thanks for your wonderful work on hist algorithm optimization. Here are a few things about code structure, please see inlined comments too.

  • DMatrix should rarely be a parameter of the builder, since the builder uses only the gradient index. I believe most of the pointers/references to DMatrix are unused variables or are used only to access meta info. Please remove them.
  • Please reduce the usage of ibegin, iend and offset as function parameters. You can pass a named Span or your Range1d as a parameter, then expand the pointer inside the function scope. That makes it clearer what those parameters point to.
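
One way to read the second suggestion is sketched below in C++. This is an illustration only: `RowSpan`, `MakeSpan`, `CountBelowRaw` and `CountBelow` are hypothetical names, not types or functions from the code base, and the real Span and Range1d classes have richer interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal view type, standing in for a Span/Range1d-like class.
struct RowSpan {
  const std::size_t* data;
  std::size_t size;
};

// Before: loosely related parameters that the caller must keep in sync.
std::size_t CountBelowRaw(const std::size_t* rows, std::size_t ibegin,
                          std::size_t iend, std::size_t threshold) {
  std::size_t n = 0;
  for (std::size_t i = ibegin; i < iend; ++i) {
    if (rows[i] < threshold) ++n;
  }
  return n;
}

// After: one named view; the raw pointer is expanded inside the function.
std::size_t CountBelow(RowSpan rows, std::size_t threshold) {
  const std::size_t* p = rows.data;
  const std::size_t size = rows.size;
  std::size_t n = 0;
  for (std::size_t i = 0; i < size; ++i) {
    if (p[i] < threshold) ++n;
  }
  return n;
}

// Builds a view over the sub-range [ibegin, iend) of a vector.
RowSpan MakeSpan(const std::vector<std::size_t>& v, std::size_t ibegin,
                 std::size_t iend) {
  assert(ibegin <= iend && iend <= v.size());
  return RowSpan{v.data() + ibegin, iend - ibegin};
}
```

Both versions compute the same result; the second keeps the pointer and its length together, so a call site such as `CountBelow(MakeSpan(rows, 1, 5), 4)` states the range once.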

src/common/hist_util.cc (outdated, resolved)
src/common/hist_util.cc (outdated, resolved)
src/common/hist_util.cc (outdated, resolved)
src/common/hist_util.h (resolved)
src/tree/updater_quantile_hist.cc (outdated, resolved)
src/tree/updater_quantile_hist.cc (outdated, resolved)
src/tree/updater_quantile_hist.cc (outdated, resolved)
@mli
Member

mli commented Feb 15, 2020

Codecov Report

Merging #5244 into master will increase coverage by 2.10%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #5244      +/-   ##
==========================================
+ Coverage   81.66%   83.76%   +2.10%     
==========================================
  Files          11       11              
  Lines        2389     2409      +20     
==========================================
+ Hits         1951     2018      +67     
+ Misses        438      391      -47     
Impacted Files Coverage Δ
python-package/xgboost/libpath.py 55.55% <0.00%> (-4.45%) ⬇️
python-package/xgboost/sklearn.py 90.88% <0.00%> (+0.96%) ⬆️
python-package/xgboost/dask.py 90.30% <0.00%> (+2.67%) ⬆️
python-package/xgboost/tracker.py 93.97% <0.00%> (+15.66%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fe8d72b...3df344f. Read the comment docs.

@SmirnovEgorRu
Contributor Author

@trivialfis, thank you very much for the review.
I have applied your comments, e.g. replaced ibegin and iend with Span in hist building.

@SmirnovEgorRu
Contributor Author

@trivialfis, @hcho3, are any more changes required?

@trivialfis
Member

I still need to understand the multiply-by-2 expression and try to make it obvious. Will keep you posted.

src/common/row_set.h (resolved)
src/common/row_set.h (outdated, resolved)
src/common/row_set.h (outdated, resolved)
src/common/hist_util.cc (outdated, resolved)
src/common/hist_util.cc (outdated, resolved)
src/tree/updater_quantile_hist.cc (outdated, resolved)
src/tree/updater_quantile_hist.cc (outdated, resolved)
src/tree/updater_quantile_hist.cc (resolved)
src/tree/updater_quantile_hist.h (resolved)
tests/cpp/common/test_partition_builder.cc (resolved)
@SmirnovEgorRu
Contributor Author

@trivialfis, added explanation in the code.

@SmirnovEgorRu
Contributor Author

One CI step failed, but the error is

urllib.error.URLError: <urlopen error [Errno 54] Connection reset by peer>

Could we restart this?

@codecov-io

Codecov Report

Merging #5244 into master will decrease coverage by 0.01%.
The diff coverage is 88.57%.

@@            Coverage Diff             @@
##           master    #5244      +/-   ##
==========================================
- Coverage   83.76%   83.75%   -0.02%     
==========================================
  Files          11       11              
  Lines        2409     2413       +4     
==========================================
+ Hits         2018     2021       +3     
- Misses        391      392       +1
Impacted Files Coverage Δ
python-package/xgboost/dask.py 90.3% <ø> (ø) ⬆️
python-package/xgboost/sklearn.py 90.88% <100%> (ø) ⬆️
python-package/xgboost/libpath.py 55.55% <33.33%> (ø) ⬆️
python-package/xgboost/__init__.py 86.36% <0%> (-2.53%) ⬇️

Powered by Codecov. Last update 208d251...e973ecc. Read the comment docs.

@SmirnovEgorRu
Contributor Author

@trivialfis @RAMitchell, do you have any new comments/concerns?

Member

@RAMitchell RAMitchell left a comment

LGTM. Defer to others if they have any issues.


const uint32_t missing_val = std::numeric_limits<uint32_t>::max();

for (auto rid : rid_span) {
Member

Just so you are aware, iterating through a Span implies boundary checks and may be slower than a standard for loop and accessing memory with pointers.

If this section is not performance critical you should absolutely iterate over span. I try to avoid optimising things that do not have visible effect on runtime.

Contributor Author

I tested performance again and did not observe a regression vs. the previous version.
So let's keep it as is (additional checks are not bad if they don't affect performance).
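
The trade-off discussed in this thread can be illustrated with a small C++ sketch. `CheckedView` is a hypothetical stand-in for a Span-like class with bounds-checked access, not the project's actual Span; whether the extra branch is measurable depends on the loop and the compiler, as the measurement above suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical view with bounds-checked element access.
class CheckedView {
 public:
  CheckedView(const std::size_t* data, std::size_t size)
      : data_(data), size_(size) {
    assert(data != nullptr || size == 0);
  }
  std::size_t size() const { return size_; }
  const std::size_t* data() const { return data_; }
  // Every access pays for one comparison; out-of-range access aborts.
  std::size_t operator[](std::size_t i) const {
    if (i >= size_) std::abort();
    return data_[i];
  }

 private:
  const std::size_t* data_;
  std::size_t size_;
};

// Checked iteration: safe, with one extra branch per element.
std::size_t SumChecked(const CheckedView& rows) {
  std::size_t sum = 0;
  for (std::size_t i = 0; i < rows.size(); ++i) sum += rows[i];
  return sum;
}

// Raw-pointer iteration: no per-element check; correctness rests on size().
std::size_t SumRaw(const CheckedView& rows) {
  const std::size_t* p = rows.data();
  const std::size_t n = rows.size();
  std::size_t sum = 0;
  for (std::size_t i = 0; i < n; ++i) sum += p[i];
  return sum;
}
```

Both loops return the same sum; the checked version is the reasonable default unless profiling shows that the loop is hot.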

GHistRow hist) {
const size_t size = row_indices.Size();
const size_t* rid = row_indices.begin;
const float* pgh = reinterpret_cast<const float*>(gpair.data());
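
The cast above treats the gradient pairs as a flat array of floats. A minimal C++ sketch of how such a layout can be indexed follows, assuming each pair stores two floats (gradient, then hessian) back to back, so that row r occupies positions 2*r and 2*r+1. `BuildHistSketch` and `row_bin` are hypothetical; the sketch takes the interleaved floats directly instead of casting, and uses one bin per row for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Accumulate gradient/hessian sums per histogram bin.
//   pgh     : interleaved floats [g0, h0, g1, h1, ...] (assumed layout)
//   rid     : row indices that belong to the current tree node
//   row_bin : hypothetical one-bin-per-row index (a real gradient index
//             stores one bin per feature per row)
//   hist    : interleaved per-bin sums [G0, H0, G1, H1, ...]
void BuildHistSketch(const std::vector<float>& pgh,
                     const std::vector<std::size_t>& rid,
                     const std::vector<std::uint32_t>& row_bin,
                     std::vector<double>* hist) {
  for (std::size_t r : rid) {
    assert(r < row_bin.size());
    const std::size_t idx_gh = 2 * r;            // start of row r's (g, h)
    const std::size_t idx_bin = 2 * row_bin[r];  // start of the bin's (G, H)
    assert(idx_gh + 1 < pgh.size());
    assert(idx_bin + 1 < hist->size());
    (*hist)[idx_bin] += pgh[idx_gh];
    (*hist)[idx_bin + 1] += pgh[idx_gh + 1];
  }
}
```

The factor of 2 appears twice for the same reason: both the input and the histogram are assumed to hold two values per logical element.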

Member

@trivialfis trivialfis left a comment

Overall looks good to me, thanks! Just a suggestion for the future, we might want to do some refactoring to this monolithic updater, like splitting up the builder into different components for loss guided, depthwise, an independent component for row partitioning etc. I want to add extra functionality to both gpu hist and hist, having nicer code would make our lives easier. ;-)

@trivialfis trivialfis merged commit 1b97eaf into dmlc:master Feb 29, 2020
@SmirnovEgorRu
Contributor Author

@trivialfis, thank you!
Yes, the builder is currently too large. Let's think about how it can be refactored in the future :)

@lock lock bot locked as resolved and limited conversation to collaborators Jun 24, 2020