-
Notifications
You must be signed in to change notification settings - Fork 549
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Fix experimental RF backend crashes and add tests #3117
Conversation
Please update the changelog in order to start CI tests. View the gpuCI docs here. |
if (col < 0) { | ||
printf( | ||
"indexing q with %d, col: %d, nbins: %d, i: %d, from colStart: %d + " | ||
"blockIdx.y: %d\n", | ||
col * nbins + i, col, nbins, i, colStart, blockIdx.y); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #3118 for a proposed abstraction for automated bounds checking.
I've verified that this PR indeed fixes the issue #3107. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you John for catching this!
@@ -394,6 +394,7 @@ struct ClsTraits { | |||
dim3 grid(b.n_blks_for_rows, colBlks, batchSize); | |||
size_t smemSize = sizeof(int) * binSize + sizeof(DataT) * nbins; | |||
smemSize += sizeof(int); | |||
smemSize += 3 * sizeof(DataT*); // Room for alignment in worst case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of this, it is better to do smem size computation as:
constexpr size_t kMaxSize = std::max(sizeof(int), sizeof(DataT));
size_t smemSize = alignTo(sizeof(int) * binSize, kMaxSize);
smemSize += alignTo(sizeof(DataT) * nbins, kMaxSize);
smemSize += alignTo(sizeof(int), kMaxSize);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will also need to cover the MAE-only alignments for another 4 items here... (previous comment was incorrect but following still holds) I think this is getting to be a lot of code duplication for memory size calculation, and I'd be curious about whether we can just live with an overestimate
dim3 grid(b.n_blks_for_rows, n_col_blks, batchSize); | ||
auto nbins = b.params.n_bins; | ||
size_t smemSize = 7 * nbins * sizeof(DataT) + nbins * sizeof(int); | ||
smemSize += sizeof(int); | ||
smemSize += 7 * sizeof(DataT*); // Room for alignment in worst case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar comment as mentioned above for classification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there is any shortcut that would help here... right now this looks like we'll have to basically duplicate much of the allocation logic from kernels.cuh here and keep the files in sync forever with no tests in place to ensure we haven't messed up. Are we at risk of running out of shared memory here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cleaned this up a bit in my latest push but didn't go all the way to the precise solution due to concerns above... let's discuss further if you think the memory overhead is an issue.
Codecov Report
@@ Coverage Diff @@
## branch-0.17 #3117 +/- ##
============================================
Coverage 69.38% 69.38%
============================================
Files 193 193
Lines 14704 14704
============================================
Hits 10203 10203
Misses 4501 4501 Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks John for the fix. Changes LGTM.
rerun tests |
1 similar comment
rerun tests |
@teju85 are you ok if I go ahead and merge this? Happy to discuss or update more if you prefer |
Changes LGTM John. We just need an approval from python side. (We can make the smem size calculation stricter, if needed, later) |
Can we merge this soon? My fix #3132 is blocked by this change. |
Relates to #3107. We're exceeding the array bounds of the column ids array and getting gibberish,
which eventually leads to memory errors. With this patch, the two test cases no longer crash (though
they still fail due to accuracy as expected). Other tests at the Python layer work smoothly too.
Closes #3107