
[REVIEW] Fix experimental RF backend crashes and add tests #3117

Merged: 10 commits into rapidsai:branch-0.17 on Nov 12, 2020

Conversation

@JohnZed (Contributor) commented Nov 5, 2020:

Relates to #3107. We're reading past the bounds of the column ids array and getting garbage values,
which eventually leads to memory errors. With this patch, the two test cases no longer crash (though
they still fail their accuracy checks, as expected). Other tests at the Python layer pass smoothly as well.

Closes #3107
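
For context, the failure mode has roughly the following shape. This is a minimal CUDA sketch with hypothetical names (histKernel, colids, n_sampled_cols), not the actual kernels.cuh code:

// Hypothetical sketch of the bug described above: the column offset derived
// from blockIdx.y can walk past the end of the sampled-column ids array, so
// `col` becomes garbage and the histogram index `col * nbins + i` lands out
// of bounds.
__global__ void histKernel(int* hist, const int* colids, int n_sampled_cols,
                           int nbins, int colStart) {
  int col_offset = colStart + blockIdx.y;
  if (col_offset >= n_sampled_cols) return;  // conceptually, the missing guard
  int col = colids[col_offset];
  for (int i = threadIdx.x; i < nbins; i += blockDim.x) {
    atomicAdd(&hist[col * nbins + i], 1);
  }
}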

@JohnZed JohnZed requested review from a team as code owners November 5, 2020 07:38
@GPUtester (Contributor) commented:

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

Comment on lines 357 to 362
if (col < 0) {
  // Debug trace: a negative col means the column-ids lookup above has
  // already gone out of bounds; dump the indices involved.
  printf(
    "indexing q with %d, col: %d, nbins: %d, i: %d, from colStart: %d + "
    "blockIdx.y: %d\n",
    col * nbins + i, col, nbins, i, colStart, blockIdx.y);
}

See #3118 for a proposed abstraction for automated bounds checking.
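
For illustration only, such an abstraction could take the shape of a device-side checked accessor along these lines (a hypothetical sketch, not the actual #3118 proposal; the BOUNDS_CHECK macro is assumed):

// Hypothetical checked-span sketch: with BOUNDS_CHECK defined, an
// out-of-range index prints a diagnostic and traps instead of silently
// reading garbage.
template <typename T>
struct CheckedSpan {
  T* data;
  int len;
  __device__ T& operator[](int i) const {
#ifdef BOUNDS_CHECK
    if (i < 0 || i >= len) {
      printf("out-of-bounds index %d (len = %d)\n", i, len);
      asm("trap;");
    }
#endif
    return data[i];
  }
};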

@hcho3 (Contributor) commented Nov 5, 2020:

I've verified that this PR indeed fixes issue #3107. Thanks!

@JohnZed JohnZed requested a review from a team as a code owner November 6, 2020 01:31
@teju85 (Member) left a comment:

Thank you John for catching this!

Two outdated review threads on cpp/src/decisiontree/batched-levelalgo/kernels.cuh were resolved.
@@ -394,6 +394,7 @@ struct ClsTraits {
dim3 grid(b.n_blks_for_rows, colBlks, batchSize);
size_t smemSize = sizeof(int) * binSize + sizeof(DataT) * nbins;
smemSize += sizeof(int);
smemSize += 3 * sizeof(DataT*); // Room for alignment in worst case
A reviewer (Member) suggested computing the smem size with explicit alignment instead:

constexpr size_t kMaxSize = std::max(sizeof(int), sizeof(DataT));
size_t smemSize = alignTo(sizeof(int) * binSize, kMaxSize);
smemSize += alignTo(sizeof(DataT) * nbins, kMaxSize);
smemSize += alignTo(sizeof(int), kMaxSize);
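
For readers outside the codebase: alignTo here would round a size up to the next multiple of the alignment, along these lines (a sketch under that assumption; the actual helper lives in the cuML sources):

// Round `size` up to the next multiple of `align`; assumes `align` is a
// power of two, which holds for sizeof(int) and sizeof(DataT) here.
constexpr size_t alignTo(size_t size, size_t align) {
  return (size + align - 1) & ~(align - 1);
}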

@JohnZed (Contributor, Author) replied on Nov 6, 2020:

We will also need to cover the MAE-only alignments for another 4 items here... (my previous comment was incorrect, but the following still holds) I think this is getting to be a lot of code duplication for the memory size calculation, and I'd be curious whether we can just live with an overestimate.

The corresponding computation on the regression side:

dim3 grid(b.n_blks_for_rows, n_col_blks, batchSize);
auto nbins = b.params.n_bins;
size_t smemSize = 7 * nbins * sizeof(DataT) + nbins * sizeof(int);
smemSize += sizeof(int);
smemSize += 7 * sizeof(DataT*);  // Room for alignment in worst case
A reviewer (Member): the same comment as above for classification applies here.

@JohnZed (Contributor, Author) replied:

I wonder if there is any shortcut that would help here... right now it looks like we'd have to duplicate much of the allocation logic from kernels.cuh here and keep the two files in sync forever, with no tests in place to ensure we haven't messed up. Are we at risk of running out of shared memory here?
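
For a rough sense of scale, under illustrative assumptions (nbins = 128, DataT = double; these are not figures from the PR), the regression formula above stays far below the 48 KB default per-block shared-memory limit:

#include <cstdio>

int main() {
  const size_t nbins = 128;                  // illustrative value
  size_t smem = 7 * nbins * sizeof(double)   // 7168 bytes
              + nbins * sizeof(int)          // 512 bytes
              + sizeof(int)                  // 4 bytes
              + 7 * sizeof(double*);         // 56 bytes of worst-case padding
  std::printf("%zu bytes (~%.1f KB)\n", smem, smem / 1024.0);  // ~7.6 KB
  return 0;
}

So the overestimate costs only a few tens of bytes per block, which is negligible at these sizes.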

@JohnZed (Contributor, Author) added:

I cleaned this up a bit in my latest push but didn't go all the way to the precise solution due to concerns above... let's discuss further if you think the memory overhead is an issue.

Another outdated review thread on cpp/src/decisiontree/batched-levelalgo/kernels.cuh was resolved.
@codecov-io commented Nov 6, 2020:

Codecov Report

Merging #3117 (6d68a09) into branch-0.17 (25b29f5) will not change coverage.
The diff coverage is n/a.


@@             Coverage Diff              @@
##           branch-0.17    #3117   +/-   ##
============================================
  Coverage        69.38%   69.38%           
============================================
  Files              193      193           
  Lines            14704    14704           
============================================
  Hits             10203    10203           
  Misses            4501     4501           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 25b29f5...6d68a09.

@JohnZed JohnZed changed the title [WIP] Demo and possible patch for RF crash [REVIEW] Fix experimental RF backend crashes and add tests Nov 6, 2020
@JohnZed JohnZed added the 3 - Ready for Review and 4 - Waiting on Reviewer labels, then removed the 3 - Ready for Review label, on Nov 6, 2020
@teju85 (Member) left a comment:

Thanks John for the fix. Changes LGTM.

@JohnZed (Contributor, Author) commented Nov 10, 2020:

rerun tests

@JohnZed (Contributor, Author) commented Nov 11, 2020:

rerun tests

@JohnZed (Contributor, Author) commented Nov 11, 2020:

@teju85 are you ok if I go ahead and merge this? Happy to discuss or update more if you prefer

@teju85 (Member) commented Nov 11, 2020:

Changes LGTM John. We just need an approval from python side. (We can make the smem size calculation stricter, if needed, later)

@hcho3 (Contributor) commented Nov 11, 2020:

Can we merge this soon? My fix #3132 is blocked by this change.

@dantegd dantegd merged commit 2219a75 into rapidsai:branch-0.17 Nov 12, 2020
Labels: 4 - Waiting on Reviewer
Linked issue: [BUG] raft::cuda_error thrown from experimental RF backend (#3107)
6 participants