
[REVIEW] Fix experimental RF backend crashes and add tests #3117

Merged: 10 commits into rapidsai:branch-0.17 on Nov 12, 2020

Conversation

@JohnZed (Contributor) commented Nov 5, 2020:

Relates to #3107. We're reading past the bounds of the column ids array and getting garbage values,
which eventually leads to memory errors. With this patch, the two test cases no longer crash (though
they still fail their accuracy checks, as expected). Other tests at the Python layer pass smoothly as well.

Closes #3107
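
For context, the failure mode has roughly the following shape. This is a minimal CUDA sketch with hypothetical names (histKernel, colids, n_sampled_cols), not the actual kernels.cuh code:

// Hypothetical sketch of the bug described above: the column offset derived
// from blockIdx.y can walk past the end of the sampled-column ids array, so
// `col` becomes garbage and the histogram index `col * nbins + i` lands out
// of bounds.
__global__ void histKernel(int* hist, const int* colids, int n_sampled_cols,
                           int nbins, int colStart) {
  int col_offset = colStart + blockIdx.y;
  if (col_offset >= n_sampled_cols) return;  // conceptually, the missing guard
  int col = colids[col_offset];
  for (int i = threadIdx.x; i < nbins; i += blockDim.x) {
    atomicAdd(&hist[col * nbins + i], 1);
  }
}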

@JohnZed JohnZed requested review from a team as code owners November 5, 2020 07:38
@GPUtester (Contributor) commented:

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

Comment on lines 357 to 362
if (col < 0) {
  // Debug trace: a negative col means the column-ids lookup above has
  // already gone out of bounds; dump the indices involved.
  printf(
    "indexing q with %d, col: %d, nbins: %d, i: %d, from colStart: %d + "
    "blockIdx.y: %d\n",
    col * nbins + i, col, nbins, i, colStart, blockIdx.y);
}

See #3118 for a proposed abstraction for automated bounds checking.
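
For illustration only, such an abstraction could take the shape of a device-side checked accessor along these lines (a hypothetical sketch, not the actual #3118 proposal; the BOUNDS_CHECK macro is assumed):

// Hypothetical checked-span sketch: with BOUNDS_CHECK defined, an
// out-of-range index prints a diagnostic and traps instead of silently
// reading garbage.
template <typename T>
struct CheckedSpan {
  T* data;
  int len;
  __device__ T& operator[](int i) const {
#ifdef BOUNDS_CHECK
    if (i < 0 || i >= len) {
      printf("out-of-bounds index %d (len = %d)\n", i, len);
      asm("trap;");
    }
#endif
    return data[i];
  }
};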

@hcho3 (Contributor) commented Nov 5, 2020:

I've verified that this PR indeed fixes issue #3107. Thanks!

@JohnZed JohnZed requested a review from a team as a code owner November 6, 2020 01:31
@teju85 (Member) left a comment:

Thank you John for catching this!

Two outdated review threads on cpp/src/decisiontree/batched-levelalgo/kernels.cuh were resolved.
@@ -394,6 +394,7 @@ struct ClsTraits {
dim3 grid(b.n_blks_for_rows, colBlks, batchSize);
size_t smemSize = sizeof(int) * binSize + sizeof(DataT) * nbins;
smemSize += sizeof(int);
smemSize += 3 * sizeof(DataT*); // Room for alignment in worst case
A reviewer (Member) suggested computing the smem size with explicit alignment instead:

constexpr size_t kMaxSize = std::max(sizeof(int), sizeof(DataT));
size_t smemSize = alignTo(sizeof(int) * binSize, kMaxSize);
smemSize += alignTo(sizeof(DataT) * nbins, kMaxSize);
smemSize += alignTo(sizeof(int), kMaxSize);
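
For readers outside the codebase: alignTo here would round a size up to the next multiple of the alignment, along these lines (a sketch under that assumption; the actual helper lives in the cuML sources):

// Round `size` up to the next multiple of `align`; assumes `align` is a
// power of two, which holds for sizeof(int) and sizeof(DataT) here.
constexpr size_t alignTo(size_t size, size_t align) {
  return (size + align - 1) & ~(align - 1);
}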

@JohnZed (Contributor, Author) replied on Nov 6, 2020:

We will also need to cover the MAE-only alignments for another 4 items here... (my previous comment was incorrect, but the following still holds) I think this is getting to be a lot of code duplication for the memory size calculation, and I'd be curious whether we can just live with an overestimate.

The corresponding computation on the regression side:

dim3 grid(b.n_blks_for_rows, n_col_blks, batchSize);
auto nbins = b.params.n_bins;
size_t smemSize = 7 * nbins * sizeof(DataT) + nbins * sizeof(int);
smemSize += sizeof(int);
smemSize += 7 * sizeof(DataT*);  // Room for alignment in worst case
A reviewer (Member): the same comment as above for classification applies here.

@JohnZed (Contributor, Author) replied:

I wonder if there is any shortcut that would help here... right now it looks like we'd have to duplicate much of the allocation logic from kernels.cuh here and keep the two files in sync forever, with no tests in place to ensure we haven't messed up. Are we at risk of running out of shared memory here?
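
For a rough sense of scale, under illustrative assumptions (nbins = 128, DataT = double; these are not figures from the PR), the regression formula above stays far below the 48 KB default per-block shared-memory limit:

#include <cstdio>

int main() {
  const size_t nbins = 128;                  // illustrative value
  size_t smem = 7 * nbins * sizeof(double)   // 7168 bytes
              + nbins * sizeof(int)          // 512 bytes
              + sizeof(int)                  // 4 bytes
              + 7 * sizeof(double*);         // 56 bytes of worst-case padding
  std::printf("%zu bytes (~%.1f KB)\n", smem, smem / 1024.0);  // ~7.6 KB
  return 0;
}

So the overestimate costs only a few tens of bytes per block, which is negligible at these sizes.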

@JohnZed (Contributor, Author) added:

I cleaned this up a bit in my latest push but didn't go all the way to the precise solution due to concerns above... let's discuss further if you think the memory overhead is an issue.

Another outdated review thread on cpp/src/decisiontree/batched-levelalgo/kernels.cuh was resolved.
@codecov-io commented Nov 6, 2020:

Codecov Report

Merging #3117 (6d68a09) into branch-0.17 (25b29f5) will not change coverage.
The diff coverage is n/a.


@@             Coverage Diff              @@
##           branch-0.17    #3117   +/-   ##
============================================
  Coverage        69.38%   69.38%           
============================================
  Files              193      193           
  Lines            14704    14704           
============================================
  Hits             10203    10203           
  Misses            4501     4501           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 25b29f5...6d68a09.

@JohnZed JohnZed changed the title [WIP] Demo and possible patch for RF crash [REVIEW] Fix experimental RF backend crashes and add tests Nov 6, 2020
@JohnZed JohnZed added the 3 - Ready for Review and 4 - Waiting on Reviewer labels, then removed the 3 - Ready for Review label, on Nov 6, 2020
@teju85 (Member) left a comment:

Thanks John for the fix. Changes LGTM.

@JohnZed (Contributor, Author) commented Nov 10, 2020:

rerun tests

@JohnZed (Contributor, Author) commented Nov 11, 2020:

rerun tests

@JohnZed (Contributor, Author) commented Nov 11, 2020:

@teju85 are you ok if I go ahead and merge this? Happy to discuss or update more if you prefer

@teju85 (Member) commented Nov 11, 2020:

Changes LGTM John. We just need an approval from python side. (We can make the smem size calculation stricter, if needed, later)

@hcho3 (Contributor) commented Nov 11, 2020:

Can we merge this soon? My fix #3132 is blocked by this change.

@dantegd dantegd merged commit 2219a75 into rapidsai:branch-0.17 Nov 12, 2020
Labels: 4 - Waiting on Reviewer
Linked issue: [BUG] raft::cuda_error thrown from experimental RF backend (#3107)
6 participants