POC implementation of deterministic GPU histogram. #5361
Conversation
Experimenting on the Year Prediction dataset with two consecutive runs:

Non-deterministic histogram build: [output not preserved]
Deterministic histogram build: [output not preserved]

There are a few things to notice here. First, the result is quite different between pre-rounding and no pre-rounding. Second, the F score is reproducible over a small number of runs (I haven't tried a large number of trees yet; only 45 trees are built here), while the original code shows variance even over a small number of runs. There is still a discrepancy in the saved model.
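For readers following along, here is a minimal host-side sketch of the pre-rounding idea (the function names mirror those in the diffs below; the test harness itself is hypothetical). Truncating every addend onto a fixed grid before accumulation makes each partial sum exact, so any summation order produces identical bits:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Smallest power of two >= n * max_abs; adding then subtracting it
// truncates an addend onto a fixed absolute grid.
float CreateRoundingFactor(float max_abs, int n) {
  int exponent = 0;
  std::frexp(max_abs * static_cast<float>(n), &exponent);
  return std::ldexp(1.0f, exponent);
}

// Must not be compiled with -ffast-math, or the add/subtract is elided.
float TruncateWithRoundingFactor(float rounding, float x) {
  return (rounding + x) - rounding;
}

int main() {
  std::mt19937 rng(0);
  std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
  std::vector<float> values(1 << 20);
  for (auto& v : values) v = dist(rng);

  float rounding = CreateRoundingFactor(1.0f, static_cast<int>(values.size()));

  // Truncated values all live on the same grid, so every partial sum is
  // exactly representable and the order of accumulation no longer matters.
  float forward = 0.0f, shuffled = 0.0f;
  for (float v : values) forward += TruncateWithRoundingFactor(rounding, v);
  std::shuffle(values.begin(), values.end(), rng);
  for (float v : values) shuffled += TruncateWithRoundingFactor(rounding, v);

  assert(forward == shuffled);  // bitwise identical
  std::printf("sum = %.9g\n", forward);
}
```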
@@ -76,6 +76,20 @@ void TestDeviceSketch(bool use_external_memory) {
    ASSERT_LT(fabs(hmat_cpu.Values()[i] - hmat_gpu.Values()[i]), eps * nrows);
  }

  // Determinstic
Note: typo in "Determinstic".
src/tree/gpu_hist/histogram.cu
Outdated
 * @param max max(sum(positive gradient) + sum(abs(negative gradient)))
 * @param n   The number of elements for the sum, or an upper bound estimate
 */
DEV_INLINE __host__ float CreateRoundingFactor(float max, int n) {
I was expecting you to do this with doubles; that would give an extra 29 bits of precision and might resolve your accuracy issues. We can make the check so that the truncation is only done if GradientSumT is double precision.
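A rough sketch of the gating suggested here, under the assumption that GradientSumT is the accumulation-type template parameter used by the histogram code (the helper name MaybeTruncate is made up for illustration):

```cpp
#include <type_traits>

// Only pay for the pre-rounding when the accumulator is double precision,
// which carries 29 more mantissa bits than float (53 vs. 24).
template <typename GradientSumT>
__host__ __device__ GradientSumT MaybeTruncate(GradientSumT rounding,
                                               GradientSumT x) {
  if (std::is_same<GradientSumT, double>::value) {
    return (rounding + x) - rounding;  // truncate onto the fixed grid
  }
  return x;  // single precision: left untouched
}
```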
Hmm, I still get some non-determinism somewhere. It needs more investigation.
I was able to obtain a bit-by-bit reproducible result (saved model) on the 1M-row HIGGS dataset, but started to get a discrepancy on the ~5M-row YearPrediction dataset.
There's a minor discrepancy on the 5M-row YearPrediction dataset on my laptop but not on my desktop. On 1M+ HIGGS rows both have a discrepancy, arguably much smaller than with the non-deterministic hist. I added a new parameter so that users can switch the feature off as a fallback if the accuracy drops too much on certain datasets. But so far on YearPrediction and HIGGS the accuracy loss is negligible.
@RAMitchell After getting […]
Awesome! Maybe uninitialised memory in […]?
That's not the cause.
@RAMitchell Time for training on Year Prediction is up from […].
Actually, maybe I can pre-compute the rounding.

Update: […]

Update: […]
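One plausible reading of "pre-compute the rounding" (a sketch only; GradPair stands in for xgboost's GradientPair, and the once-per-iteration placement is my assumption): reduce the gradients once up front, derive the factor, and hand it to every histogram launch instead of deriving it in-kernel.

```cpp
#include <thrust/execution_policy.h>
#include <thrust/reduce.h>
#include <cmath>

// Stand-in for xgboost's GradientPair.
struct GradPair {
  float grad;
  float hess;
};

// Component-wise max of absolute values.
struct AbsMax {
  __host__ __device__ GradPair operator()(GradPair a, GradPair b) const {
    return {fmaxf(fabsf(a.grad), fabsf(b.grad)),
            fmaxf(fabsf(a.hess), fabsf(b.hess))};
  }
};

__host__ float CreateRoundingFactor(float max_abs, int n) {
  int exponent = 0;
  frexpf(max_abs * static_cast<float>(n), &exponent);
  return ldexpf(1.0f, exponent);  // 2^ceil(log2(n * max_abs))
}

// Called once per boosting iteration; the result is then passed to every
// histogram kernel launch instead of being recomputed inside the kernel.
GradPair PrecomputeRounding(const GradPair* d_gpair, int n) {
  GradPair max_abs = thrust::reduce(thrust::device, d_gpair, d_gpair + n,
                                    GradPair{0.f, 0.f}, AbsMax{});
  return {CreateRoundingFactor(max_abs.grad, n),
          CreateRoundingFactor(max_abs.hess, n)};
}
```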
Force-pushed from 7ac4a95 to 4e5c737.
Codecov Report
@@ Coverage Diff @@
## master #5361 +/- ##
==========================================
+ Coverage 83.75% 84.07% +0.31%
==========================================
Files 11 11
Lines 2413 2411 -2
==========================================
+ Hits 2021 2027 +6
+ Misses 392 384 -8
I don't feel comfortable doing it with single precision. I think there is a mistake somewhere if the program slows down after doing the truncation in the histogram kernel. An addition and a subtraction are practically nothing for a GPU.
src/tree/gpu_hist/histogram.cu
Outdated
};  // NOLINT

dh::XGBCachingDeviceAllocator<char> alloc;
auto positive_sum = thrust::reduce(
Not sure if it matters, but it is possible to fuse these two reductions into a single one, e.g. apply a transform iterator to the gradient pair inputs that returns two GradientPairs (pos/neg) and define an operator to sum them. Alternatively, just use the absolute sum, although this is slightly less tight and we could lose up to one bit of precision.
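A sketch of the fusion described above (GradPair, PosNeg, and FusedSums are illustrative names, not the PR's code): a transform iterator splits each gradient pair by sign, and a single thrust::reduce accumulates both partial sums in one pass.

```cpp
#include <thrust/execution_policy.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/reduce.h>

// Stand-in for xgboost's GradientPair.
struct GradPair {
  float grad;
  float hess;
};

// Positive and negative sums carried through one reduction together.
struct PosNeg {
  GradPair pos;
  GradPair neg;
};

// Split each input pair into its positive and negative parts.
struct SplitSigns {
  __host__ __device__ PosNeg operator()(GradPair g) const {
    PosNeg out{};  // zero-initialized
    if (g.grad >= 0.f) out.pos.grad = g.grad; else out.neg.grad = g.grad;
    if (g.hess >= 0.f) out.pos.hess = g.hess; else out.neg.hess = g.hess;
    return out;
  }
};

struct SumPosNeg {
  __host__ __device__ PosNeg operator()(PosNeg a, PosNeg b) const {
    return {{a.pos.grad + b.pos.grad, a.pos.hess + b.pos.hess},
            {a.neg.grad + b.neg.grad, a.neg.hess + b.neg.hess}};
  }
};

// One pass over the gradients instead of two separate reductions.
PosNeg FusedSums(const GradPair* d_gpair, int n) {
  auto it = thrust::make_transform_iterator(d_gpair, SplitSigns{});
  return thrust::reduce(thrust::device, it, it + n, PosNeg{}, SumPosNeg{});
}
```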
@RAMitchell I tried using cub::DeviceReduce with two CUDA streams for this; the end result doesn't seem to be any different. As for the absolute sum, it failed a Windows test for GPU ranking when using float.
src/tree/gpu_hist/histogram.cuh
Outdated
GradientPair CreateRoundingFactor(common::Span<GradientPair> gpair);

DEV_INLINE float TruncateWithRoundingFactor(float rounding_factor, float x) {
I don't understand how this function could be so expensive that it slows down the algorithm if we use it in the histogram kernel.
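To make the cost concrete, here is a hedged sketch of where the truncation would sit in a simplified histogram kernel (the kernel shape and names are illustrative, not the PR's actual gpu_hist kernel); it adds exactly one addition and one subtraction per atomic update:

```cpp
__device__ __forceinline__ float TruncateWithRoundingFactor(float rounding,
                                                            float x) {
  return (rounding + x) - rounding;  // one add and one subtract per value
}

// Illustrative shape of a histogram update, one thread per gradient entry.
__global__ void HistogramKernel(const float* __restrict__ grad,
                                const int* __restrict__ bin_idx, int n,
                                float rounding, float* __restrict__ hist) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    // Truncating before the atomicAdd keeps every contribution on the same
    // fixed grid, so the accumulated bins are order-independent.
    atomicAdd(&hist[bin_idx[i]], TruncateWithRoundingFactor(rounding, grad[i]));
  }
}
```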
@RAMitchell I pushed the change for computing the truncation on the fly. There are other functions involved, like GetHess, GetGrad, and the constructors; I'm not sure if the compiler can optimize them away.
* Use pre-rounding based method to obtain reproducible floating point summation.
* GPU Hist for regression and classification are bit-by-bit reproducible.
* Add doc.
* Switch to thrust reduce for `node_sum_gradient`.
Force-pushed from 076f41d to 1065c72.
I think we need a Python-level integration test: something like training twice on the same data (with different DMatrix objects) and checking that the results are bitwise exact, just so non-determinism doesn't creep in from other areas.

Other than that, amazing work!
Done.
@sh1ng @pseudotensor you might find this feature useful.