POC implementation of deterministic GPU histogram. #5361
Conversation
Experimenting on the Year Prediction dataset with two consecutive runs:

Non-deterministic histogram build: [output not preserved]
Deterministic histogram build: [output not preserved]

There are a few things to notice here. First, the result is quite different between pre-rounding and no pre-rounding. Second, the F score is reproducible over a small number of runs (I haven't tried a large number of trees yet; only 45 trees are built here), while the original code shows variance even over a small number of runs. There is still a discrepancy in the saved model.
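For readers following along, here is a minimal host-side sketch of the pre-rounding idea (the function names mirror those in the diffs below; the test harness itself is hypothetical). Truncating every addend onto a fixed grid before accumulation makes each partial sum exact, so any summation order produces identical bits:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Smallest power of two >= n * max_abs; adding then subtracting it
// truncates an addend onto a fixed absolute grid.
float CreateRoundingFactor(float max_abs, int n) {
  int exponent = 0;
  std::frexp(max_abs * static_cast<float>(n), &exponent);
  return std::ldexp(1.0f, exponent);
}

// Must not be compiled with -ffast-math, or the add/subtract is elided.
float TruncateWithRoundingFactor(float rounding, float x) {
  return (rounding + x) - rounding;
}

int main() {
  std::mt19937 rng(0);
  std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
  std::vector<float> values(1 << 20);
  for (auto& v : values) v = dist(rng);

  float rounding = CreateRoundingFactor(1.0f, static_cast<int>(values.size()));

  // Truncated values all live on the same grid, so every partial sum is
  // exactly representable and the order of accumulation no longer matters.
  float forward = 0.0f, shuffled = 0.0f;
  for (float v : values) forward += TruncateWithRoundingFactor(rounding, v);
  std::shuffle(values.begin(), values.end(), rng);
  for (float v : values) shuffled += TruncateWithRoundingFactor(rounding, v);

  assert(forward == shuffled);  // bitwise identical
  std::printf("sum = %.9g\n", forward);
}
```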
@@ -76,6 +76,20 @@ void TestDeviceSketch(bool use_external_memory) {
    ASSERT_LT(fabs(hmat_cpu.Values()[i] - hmat_gpu.Values()[i]), eps * nrows);
  }

  // Determinstic
Note: typo in "Determinstic".
src/tree/gpu_hist/histogram.cu
Outdated
 * @param max max(sum(positive gradient) + sum(abs(negative gradient)))
 * @param n   The number of elements for the sum, or an upper bound estimate
 */
DEV_INLINE __host__ float CreateRoundingFactor(float max, int n) {
I was expecting you to do this with doubles; that would give an extra 29 bits of precision and might resolve your accuracy issues. We can make the check so that the truncation is only done if GradientSumT is double precision.
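A rough sketch of the gating suggested here, under the assumption that GradientSumT is the accumulation-type template parameter used by the histogram code (the helper name MaybeTruncate is made up for illustration):

```cpp
#include <type_traits>

// Only pay for the pre-rounding when the accumulator is double precision,
// which carries 29 more mantissa bits than float (53 vs. 24).
template <typename GradientSumT>
__host__ __device__ GradientSumT MaybeTruncate(GradientSumT rounding,
                                               GradientSumT x) {
  if (std::is_same<GradientSumT, double>::value) {
    return (rounding + x) - rounding;  // truncate onto the fixed grid
  }
  return x;  // single precision: left untouched
}
```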
Hmm, I still get some non-determinism somewhere. It needs more investigation.
I was able to obtain a bit-by-bit reproducible result (saved model) on the 1M-row HIGGS dataset, but started to get a discrepancy on the ~5M-row YearPrediction dataset.
There's a minor discrepancy on the 5M-row YearPrediction dataset on my laptop but not on my desktop. On 1M+ HIGGS rows both have a discrepancy, arguably much smaller than with the non-deterministic hist. I added a new parameter so that users can switch the feature off as a fallback if the accuracy drops too much on certain datasets. But so far on YearPrediction and HIGGS the accuracy loss is negligible.
@RAMitchell After getting […]
Awesome! Maybe uninitialised memory in […]?
That's not the cause.
@RAMitchell Time for training on Year Prediction is up from […].
Actually, maybe I can pre-compute the rounding.

Update: […]

Update: […]
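One plausible reading of "pre-compute the rounding" (a sketch only; GradPair stands in for xgboost's GradientPair, and the once-per-iteration placement is my assumption): reduce the gradients once up front, derive the factor, and hand it to every histogram launch instead of deriving it in-kernel.

```cpp
#include <thrust/execution_policy.h>
#include <thrust/reduce.h>
#include <cmath>

// Stand-in for xgboost's GradientPair.
struct GradPair {
  float grad;
  float hess;
};

// Component-wise max of absolute values.
struct AbsMax {
  __host__ __device__ GradPair operator()(GradPair a, GradPair b) const {
    return {fmaxf(fabsf(a.grad), fabsf(b.grad)),
            fmaxf(fabsf(a.hess), fabsf(b.hess))};
  }
};

__host__ float CreateRoundingFactor(float max_abs, int n) {
  int exponent = 0;
  frexpf(max_abs * static_cast<float>(n), &exponent);
  return ldexpf(1.0f, exponent);  // 2^ceil(log2(n * max_abs))
}

// Called once per boosting iteration; the result is then passed to every
// histogram kernel launch instead of being recomputed inside the kernel.
GradPair PrecomputeRounding(const GradPair* d_gpair, int n) {
  GradPair max_abs = thrust::reduce(thrust::device, d_gpair, d_gpair + n,
                                    GradPair{0.f, 0.f}, AbsMax{});
  return {CreateRoundingFactor(max_abs.grad, n),
          CreateRoundingFactor(max_abs.hess, n)};
}
```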
Force-pushed from 7ac4a95 to 4e5c737.
Codecov Report
@@ Coverage Diff @@
## master #5361 +/- ##
==========================================
+ Coverage 83.75% 84.07% +0.31%
==========================================
Files 11 11
Lines 2413 2411 -2
==========================================
+ Hits 2021 2027 +6
+ Misses 392 384 -8
I don't feel comfortable doing it with single precision. I think there is a mistake somewhere if the program slows down after doing the truncation in the histogram kernel. An addition and a subtraction are practically nothing for a GPU.
src/tree/gpu_hist/histogram.cu
Outdated
};  // NOLINT

dh::XGBCachingDeviceAllocator<char> alloc;
auto positive_sum = thrust::reduce(
Not sure if it matters, but it is possible to fuse these two reductions into a single one, e.g. apply a transform iterator to the gradient pair inputs that returns two GradientPairs (pos/neg) and define an operator to sum them. Alternatively, just use the absolute sum, although this is slightly less tight and we could lose up to one bit of precision.
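A sketch of the fusion described above (GradPair, PosNeg, and FusedSums are illustrative names, not the PR's code): a transform iterator splits each gradient pair by sign, and a single thrust::reduce accumulates both partial sums in one pass.

```cpp
#include <thrust/execution_policy.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/reduce.h>

// Stand-in for xgboost's GradientPair.
struct GradPair {
  float grad;
  float hess;
};

// Positive and negative sums carried through one reduction together.
struct PosNeg {
  GradPair pos;
  GradPair neg;
};

// Split each input pair into its positive and negative parts.
struct SplitSigns {
  __host__ __device__ PosNeg operator()(GradPair g) const {
    PosNeg out{};  // zero-initialized
    if (g.grad >= 0.f) out.pos.grad = g.grad; else out.neg.grad = g.grad;
    if (g.hess >= 0.f) out.pos.hess = g.hess; else out.neg.hess = g.hess;
    return out;
  }
};

struct SumPosNeg {
  __host__ __device__ PosNeg operator()(PosNeg a, PosNeg b) const {
    return {{a.pos.grad + b.pos.grad, a.pos.hess + b.pos.hess},
            {a.neg.grad + b.neg.grad, a.neg.hess + b.neg.hess}};
  }
};

// One pass over the gradients instead of two separate reductions.
PosNeg FusedSums(const GradPair* d_gpair, int n) {
  auto it = thrust::make_transform_iterator(d_gpair, SplitSigns{});
  return thrust::reduce(thrust::device, it, it + n, PosNeg{}, SumPosNeg{});
}
```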
@RAMitchell I tried using cub::DeviceReduce with two CUDA streams for this; the end result doesn't seem to be any different. As for the absolute sum, it failed a Windows test for GPU ranking when using float.
src/tree/gpu_hist/histogram.cuh
Outdated
GradientPair CreateRoundingFactor(common::Span<GradientPair> gpair);

DEV_INLINE float TruncateWithRoundingFactor(float rounding_factor, float x) {
I don't understand how this function could be so expensive that it slows down the algorithm if we use it in the histogram kernel.
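To make the cost concrete, here is a hedged sketch of where the truncation would sit in a simplified histogram kernel (the kernel shape and names are illustrative, not the PR's actual gpu_hist kernel); it adds exactly one addition and one subtraction per atomic update:

```cpp
__device__ __forceinline__ float TruncateWithRoundingFactor(float rounding,
                                                            float x) {
  return (rounding + x) - rounding;  // one add and one subtract per value
}

// Illustrative shape of a histogram update, one thread per gradient entry.
__global__ void HistogramKernel(const float* __restrict__ grad,
                                const int* __restrict__ bin_idx, int n,
                                float rounding, float* __restrict__ hist) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    // Truncating before the atomicAdd keeps every contribution on the same
    // fixed grid, so the accumulated bins are order-independent.
    atomicAdd(&hist[bin_idx[i]], TruncateWithRoundingFactor(rounding, grad[i]));
  }
}
```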
@RAMitchell I pushed the change for computing the truncation on the fly. There are other functions involved, like GetHess, GetGrad, and the constructors; I'm not sure if the compiler can optimize them away.
* Use pre-rounding based method to obtain reproducible floating point summation.
* GPU Hist for regression and classification are bit-by-bit reproducible.
* Add doc.
* Switch to thrust reduce for `node_sum_gradient`.
Force-pushed from 076f41d to 1065c72.
I think we need a Python-level integration test: something like training twice on the same data (with different DMatrix objects) and checking that the results are bitwise exact, just so non-determinism doesn't creep in from other areas.

Other than that, amazing work!
Done.
@sh1ng @pseudotensor you might find this feature useful.