Optimizations of pre-processing for 'hist' tree method #4310
Conversation
for (size_t i = 0; i < info.num_row_; ++i) {
  if (gpair[i].GetHess() >= 0.0f && coin_flip(rnd)) {
-   row_indices.push_back(i);
+   p_row_indices[j++] = i;
Why is adding values to the underlying array directly performance friendly?
There are two points:
- push_back() has at least one "if" comparing the current capacity with the new size, even if reserve() was called. This adds branching and slows the loop down.
- Constructions like this:
if (a[i] > c)
  b[j++] = a[i];
can be easily vectorized with AVX-512 instructions (vcompressps/vcompresspd, plus mask instructions for the comparison), even by compiler auto-vectorization. I know that out of the box XGBoost currently supports only SSE4.2, but AVX-512 may become available in the future. With push_back() this doesn't work.
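The difference between the two loop shapes can be sketched as follows; CompressIndices is a hypothetical stand-in for the partition loop in this PR, not the actual XGBoost code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the partition loop discussed above: writing
// through a raw pointer with a running index has no capacity check inside
// the loop body, so a compiler can auto-vectorize it (on AVX-512, the
// conditional store maps to vcompressps/vcompresspd); the equivalent
// push_back() loop cannot be vectorized this way.
inline std::size_t CompressIndices(const float* hess, std::size_t n,
                                   std::size_t* out) {
  std::size_t j = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (hess[i] >= 0.0f) {
      out[j++] = i;  // direct store, no capacity branch
    }
  }
  return j;  // number of kept rows
}
```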
I am very interested in trying to recompile with AVX-512 and testing it in the real world.
I see failing tests for R, but judging by the log messages, these errors don't seem to depend on my commit. May I assume that the tests pass?
I have retriggered the test; it should pass this time.
@CodingCat thanks. Looks like it is okay now.
@SmirnovEgorRu Thanks for your contribution. I made a first pass and overall it looks good. I notice that you use pre-fetching and blocking quite often (e.g. many references to block). Is there a reference I can look at as to why it improves performance? I'd like to learn more about this technique, so that in a future pull request I can propose a suitable abstraction to make this kind of optimization easy. (I noticed that the optimized code is not as readable, due to heavy use of loop variables and temporary buffers. A suitable abstraction should ideally make the code easy to inspect.)
cc @trivialfis
@@ -71,7 +71,7 @@ class ColumnMatrix {
   // construct column matrix from GHistIndexMatrix
   inline void Init(const GHistIndexMatrix& gmat,
                    double sparse_threshold) {
-    const auto nfeature = static_cast<bst_uint>(gmat.cut.row_ptr.size() - 1);
+    const int32_t nfeature = static_cast<int32_t>(gmat.cut.row_ptr.size() - 1);
@SmirnovEgorRu Why the change from bst_uint to int32_t?
Windows requires that the loop counter inside a "#pragma omp parallel for" have a signed integer type. Before, nfeature wasn't used in a parallel loop, but now it is, so it was changed from unsigned to signed.
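For illustration, here is a minimal sketch of that constraint (SumColumns is a hypothetical example, not code from this PR): MSVC implements OpenMP 2.0, whose parallel for requires a signed loop index, so unsigned counters have to be cast before use.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical example of the MSVC constraint: OpenMP 2.0 (the version MSVC
// supports) requires the index of a "#pragma omp parallel for" loop to be a
// signed integer, so the count is cast from size_t to int32_t up front.
inline double SumColumns(const std::vector<double>& col_sums) {
  const int32_t nfeature = static_cast<int32_t>(col_sums.size());
  double total = 0.0;
#pragma omp parallel for reduction(+ : total) schedule(static)
  for (int32_t fid = 0; fid < nfeature; ++fid) {  // signed index: OK on MSVC
    total += col_sums[fid];
  }
  return total;
}
```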
@@ -59,7 +59,6 @@ class RowSetCollection {
   }
   // clear up things
   inline void Clear() {
-    row_indices_.clear();
I think we still need this line. Clearing a RowSetCollection should also clear all row indices stored within.
I did some measurements with all my optimizations applied: the InitData() function spends ~15% of the tree-building time, and this block of code accounts for all of it:
for (size_t i = 0; i < info.num_row_; ++i) {
  if (gpair[i].GetHess() >= 0.0f) {
    row_indices.push_back(i);
There are two reasons: std::vector is not a performance-friendly object (push_back contains at least one "if" to check the current size), and this code is sequential and cannot be parallelized.
I have changed this code to use a C-style partition:
j = 0;
for (size_t i = 0; i < info.num_row_; ++i) {
  if (gpair[i].GetHess() >= 0.0f) {
    p_row_indices[j++] = i;
But in that case I would need to call row_indices_->resize(num_row) after every row_indices_->clear(), and resize(num_row) is a very expensive operation when the current size is 0: it fills the whole buffer with the default value (0 in our case) in a non-performant way. So I deleted clear() and call resize(nrows) only once.
After my changes InitData() spends less than 1% of the time.
P.S. std::vector really is not performance friendly :)
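The resize-once pattern described here can be sketched as below; RowIndexBuffer and its members are hypothetical names, assuming hessians arrive as a raw float array.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of the resize-once pattern: the buffer is sized to
// num_row a single time, then reused with a C-style partition on every
// call. clear() + resize() on each iteration would zero-fill the whole
// buffer again, which is exactly the cost removed in this PR.
class RowIndexBuffer {
 public:
  explicit RowIndexBuffer(std::size_t num_row) : indices_(num_row) {}

  // Returns the number of rows kept (the logical size); the physical size
  // of the buffer never changes.
  std::size_t Partition(const float* hess, std::size_t num_row) {
    std::size_t j = 0;
    std::size_t* p_row_indices = indices_.data();
    for (std::size_t i = 0; i < num_row; ++i) {
      if (hess[i] >= 0.0f) p_row_indices[j++] = i;
    }
    return j;
  }

 private:
  std::vector<std::size_t> indices_;
};
```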
} else {
  for (const auto &batch : p_fmat->GetRowBatches()) {
    const size_t size = batch.Size();
    const size_t block_size = 512;
What's the rule for choosing this number (512)? Is this number hardware-dependent?
No, but this value influences the size of the allocated buffer: it should not be too large on the one hand, and not too small on the other (the useful work of an OMP task should exceed the OMP overhead). By experiment, 512 was chosen as optimal.
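The blocking arithmetic this buffer size implies can be sketched as follows (helper names are illustrative): the row range is cut into ceil(size / block_size) blocks, and the last block is clamped to the batch size.

```cpp
#include <cstddef>

// Illustrative helpers for the blocking scheme: with block_size = 512 the
// batch is split into ceil(size / block_size) blocks, and each block gets
// a half-open [begin, end) range clamped to the batch size.
inline std::size_t NumBlocks(std::size_t size, std::size_t block_size) {
  return size / block_size + !!(size % block_size);  // ceiling division
}

inline void BlockRange(std::size_t block, std::size_t block_size,
                       std::size_t size,
                       std::size_t* begin, std::size_t* end) {
  *begin = block * block_size;
  *end = (*begin + block_size < size) ? (*begin + block_size) : size;
}
```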
size_t rbegin = 0;
size_t prev_sum = 0;

for (const auto &batch : p_fmat->GetRowBatches()) {
Note to @trivialfis, @CodingCat: we are not really sure whether the 'hist' method works when external memory is enabled (i.e. GetRowBatches() returns more than one batch). We will need to test 'hist' against external memory in a future pull request.
@hcho3 Absolutely agree.
AFAICT, it doesn't work in the distributed setting, even after @trivialfis's fix in 0.82.
index.resize(row_ptr.back());

CHECK_GT(cut.cut.size(), 0U);
CHECK_EQ(cut.row_ptr.back(), cut.cut.size());
@SmirnovEgorRu Why are we removing this line?
Now row_ptr is allocated to its final length from the start, so for a non-final batch back() would return garbage.
src/tree/updater_quantile_hist.cc
Outdated
if (gpair[i].GetHess() >= 0.0f) {
  row_indices.push_back(i);
MemStackAllocator<int, 128> buff(this->nthread_);
int* p_buff = buff.Get();
Suggested change:
- int* p_buff = buff.Get();
+ bool* p_buff = buff.Get();
committed, thanks
src/tree/updater_quantile_hist.cc
Outdated
for (size_t i = 0; i < info.num_row_; ++i) {
  if (gpair[i].GetHess() >= 0.0f) {
    row_indices.push_back(i);
MemStackAllocator<int, 128> buff(this->nthread_);
Suggested change:
- MemStackAllocator<int, 128> buff(this->nthread_);
+ MemStackAllocator<bool, 128> buff(this->nthread_);
@SmirnovEgorRu Did you intend to use int to avoid false sharing? Or can we use bool instead?
False sharing should not affect this place
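The per-thread flag pattern under discussion can be sketched like this (a plain std::vector stands in for MemStackAllocator, and explicit chunking replaces omp_get_thread_num() so the sketch stays self-contained): each thread writes only its own slot, and a short serial reduction combines the slots afterwards.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the per-thread flag pattern: each thread scans its own chunk
// of the hessian array and records a flag in its own slot, then a cheap
// serial loop reduces the slots. Adjacent int slots share cache lines, but
// each slot is written at most a handful of times, so false sharing is
// negligible here, as noted above.
inline bool HasNegativeHess(const float* hess, std::size_t n, int nthread) {
  std::vector<int> flags(nthread, 0);  // stand-in for MemStackAllocator
#pragma omp parallel for schedule(static) num_threads(nthread)
  for (int tid = 0; tid < nthread; ++tid) {
    const std::size_t begin = tid * n / nthread;
    const std::size_t end = (tid + 1) * n / nthread;
    for (std::size_t i = begin; i < end; ++i) {
      if (hess[i] < 0.0f) flags[tid] = 1;  // each thread owns one slot
    }
  }
  bool found = false;
  for (int tid = 0; tid < nthread; ++tid) {
    if (flags[tid]) found = true;
  }
  return found;
}
```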
src/tree/updater_quantile_hist.cc
Outdated
bool has_heg_hess = false;
for (size_t tid = 0; tid < this->nthread_; ++tid) {
  if (p_buff[tid] == true) {
Suggested change:
- if (p_buff[tid] == true) {
+ if (p_buff[tid]) {
committed
src/tree/updater_quantile_hist.cc
Outdated
}
}

bool has_heg_hess = false;
Suggested change:
- bool has_heg_hess = false;
+ bool has_neg_hess = false;
committed
if (has_heg_hess) {
  size_t j = 0;
  for (size_t i = 0; i < info.num_row_; ++i) {
    if (gpair[i].GetHess() >= 0.0f) {
Note to @trivialfis, @RAMitchell: I wonder if we should just forbid negative Hessian values. All the non-negativity checks in the updater codebase are not only eyesores but also reduce parallelism.
@hcho3
Sorry for the delay, will review shortly.
This PR is very edifying, thanks a lot! Some notes and questions are inlined in the comments. We will need to make more adjustments to the code structure in the future. The optimization is awesome (big thanks again), but I usually try to measure how much time I need to spend understanding each line of code (time spent reading / number of lines), and I think this ratio is getting higher and higher.
@SmirnovEgorRu For the AVX-512 optimization, would function multi-versioning be applicable here in the future? I should try to learn more about it.
@hcho3 I'm not sure about the implications of forbidding negative Hessians.
n_ = n;
}

T& operator[](size_t idx) {
What is the advantage of using SimpleArray over std::vector? We still have one extra indirection each time we index the array.
@SmirnovEgorRu I'm wondering about this too. SimpleArray doesn't use a dynamic table to amortize allocation/deallocation time; can you add a note on under what sizes it is more efficient than a dynamic table? Or is it there for some other reason?
size_t group_ind = 0;
if (use_group_ind) {
  group_ind = this->SearchGroupIndFromBaseRow(group_ptr, batch.base_rowid);
if (use_group_ind) {
@hcho3 We need to change the name of these Init functions.
sketchs[entry.index].Push(entry.fvalue, info.GetWeight(w_idx));
#pragma omp barrier
#pragma omp for schedule(static)
for (int32_t icol = 0; icol < static_cast<int32_t>(ncol); ++icol) {
The number of columns is usually small, ~3000 is quite an extreme case, ~120 might be the more common case. Is it preferred to use OMP here?
OMP must be here :)
Each OMP task will push around 512*nthreads/ncols elements, which is enough to gain from OMP. Also, the OMP threads have already been created on line 117.
const size_t block_size_iter = block_size * nthread;
const size_t n_blocks = size / block_size_iter + !!(size % block_size_iter);

std::vector<std::vector<std::pair<float, float>>> buff(nthread);
Maybe add a note like "buffer for sketches to increase cache usage", or whatever the actual reason is? It's sometimes easy to forget the pain of being a newcomer. ;-)
The buffers are not needed for better cache locality. They are needed so that each thread can collect its local feature values and then push them to the global queue without conflicts.
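That staging idea can be sketched as follows; SharedSketch and FlushThreadBuffer are hypothetical names for illustration, not the actual quantile-sketch API.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch of the per-thread staging pattern: each thread
// collects (feature value, weight) pairs in its own buffer and hands them
// to the shared structure in one step, so per-element pushes never contend
// on the shared state.
struct SharedSketch {
  std::vector<std::pair<float, float>> entries;
  void PushBatch(const std::vector<std::pair<float, float>>& buff) {
    // In the real code this step is synchronized per column; the append
    // here only illustrates the batched hand-off.
    entries.insert(entries.end(), buff.begin(), buff.end());
  }
};

inline void FlushThreadBuffer(SharedSketch* sketch,
                              std::vector<std::pair<float, float>>* buff) {
  sketch->PushBatch(*buff);
  buff->clear();  // capacity is kept, so the buffer is reused cheaply
}
```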
@@ -28,6 +28,38 @@
 #include "../common/column_matrix.h"

 namespace xgboost {

 template<typename T, size_t MaxStackSize>
 class MemStackAllocator {
As a side note, the allocator is actually used as a sort of memory buffer in this PR (I haven't gone through the big PR yet). So can we make it a std-style allocator and use something like std::vector<int, StackAllocator> instead? Or we could implement our own data structure that can switch allocators at runtime, so that your SimpleArray and this StackAllocator can be organized into a more appropriate structure in the future.
std::vector is very inefficient, but in this case it would probably work.
If you bring back TBB, it has a std-style ScalableAllocator which could be useful here.
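Based on how MemStackAllocator is used in this PR, its behavior can be sketched roughly like this (details are illustrative, not the PR's actual implementation): small requests live in a fixed on-stack array, larger ones fall back to the heap.

```cpp
#include <cstddef>

// Rough sketch of a MemStackAllocator-style buffer: allocations of up to
// MaxStackSize elements use the embedded array (no heap traffic), larger
// ones fall back to new[]. This illustrates the idea only; it is not the
// code from the PR.
template <typename T, std::size_t MaxStackSize>
class StackBuffer {
 public:
  explicit StackBuffer(std::size_t n)
      : ptr_(n <= MaxStackSize ? stack_ : new T[n]) {}
  ~StackBuffer() {
    if (ptr_ != stack_) delete[] ptr_;  // free only the heap fallback
  }
  T* Get() { return ptr_; }

  StackBuffer(const StackBuffer&) = delete;
  StackBuffer& operator=(const StackBuffer&) = delete;

 private:
  T stack_[MaxStackSize];
  T* ptr_;
};
```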
@SmirnovEgorRu I have a slightly off-topic question: do you use hyper-threading in your experiments? In my work I found that xgboost can perform very poorly and with high variance when using all threads (the default nthread setting). This is my CPU:
I am wondering if this is a problem that affects others and, if so, how we can solve it. @Laurae2 did you also experience this in your xgboost experiments?
@RAMitchell xgboost's fast histogram has poor scalability, so hyper-threading is beneficial only with a small number of threads (such as on a small desktop or a laptop). For instance, on a large enough dataset my i7-7700HQ gets a 40% performance boost from hyper-threading, but my dual E5-2699v4 (and my dual Xeon 6154) sees anywhere from a 50% slowdown to a 2x slowdown (the bigger the dataset, the smaller the slowdown). I recommend (@SmirnovEgorRu also recommends) disabling hyper-threading. Another solution is to use OpenMP environment variables to pin threads to specific cores when using xgboost with many threads. Also note whether you are using gcc or icc: I found that icc provides the same performance as recent gcc versions for xgboost (before @SmirnovEgorRu's performance improvements); I did not test later versions of xgboost on icc. More info here: https://sites.google.com/view/lauraepp/benchmarks/intel-compiler-may-2018
Co-Authored-By: SmirnovEgorRu <egor.smirnov@intel.com>
@RAMitchell, I agree with @Laurae2: in the case of XGBoost it is better to disable HyperThreading. However, after all my changes (#4278) I observed that HyperThreading can provide improvements for some data sets; this fact needs further investigation.
Codecov Report
@@ Coverage Diff @@
## master #4310 +/- ##
==========================================
+ Coverage 67.82% 68.01% +0.19%
==========================================
Files 132 132
Lines 12201 12363 +162
==========================================
+ Hits 8275 8409 +134
- Misses 3926 3954 +28
Continue to review full report at Codecov.
@hcho3 @CodingCat @trivialfis,
LGTM
LGTM. A future PR should forbid negative Hessian values and simplify preprocessing code here.
LGTM. Let's move on. ;-)
This commit is part of this pull request.
Performance before this commit:
Performance results for these changes:
HW: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 2 sockets, 14 cores per socket.