
Bugfix for 2767 - fix rf path trying to sample 0 columns #2788

Merged 2 commits into rapidsai:branch-0.16 on Sep 3, 2020

Conversation

drobison00 (Contributor)

In some situations, the number of columns multiplied by max_features rounded down to 0, which in turn caused a memory block of size zero to be allocated and cudaMemsetAsync to throw an invalid-argument exception.

This sets a floor of 1 for ncols_sampled, which resolves the issue.

Closes #2767
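
For illustration, here is a minimal standalone sketch of the rounding behavior described above and the floor-of-1 fix. The variable names (n_cols, max_features, ncols_sampled) are assumptions chosen for the example, not the exact identifiers used in cuML:

```cpp
#include <algorithm>
#include <cstdio>

int main() {
  // Hypothetical values chosen so that n_cols * max_features rounds down to zero.
  const int n_cols = 10;
  const float max_features = 0.05f;

  const int ncols_sampled = static_cast<int>(n_cols * max_features);  // truncates to 0
  const int ncols_sampled_floored =
      std::max(1, static_cast<int>(n_cols * max_features));           // floored to 1

  std::printf("without floor: %d, with floor: %d\n",
              ncols_sampled, ncols_sampled_floored);
  return 0;
}
```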

@drobison00 added labels "3 - Ready for Review" and "CUDA / C++" on Sep 2, 2020
@drobison00 requested a review from a team as a code owner on September 2, 2020 19:08
@GPUtester (Contributor)

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@drobison00 requested a review from JohnZed on September 2, 2020 19:14
@beckernick (Member) left a comment


Glad to see this is so clean!

Just curious why this only came up with the RF regressor rather than with both the RF classifier and the regressor. Does the classifier path have a guard against this? It looks like this value is used downstream in both the //regression and //classification sections.

@drobison00 (Contributor, Author)

@beckernick
I tested a bit with the equivalent classification paths. They are affected, in that histcount (defined in memory.cuh:218) will be set to zero, all of the subsequent (h/d)_hist_xxx buffer allocations become zero-sized, and the tree data structure ends up with 0x00 data pointers. I'm not familiar enough with the code paths, but it seems unlikely that everything was working as expected with the classifier when ncols * max_features < 1 on that path.
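
To make the failure mode concrete, here is a hedged CUDA sketch of what a zero-sized histogram buffer looks like downstream. histcount and d_hist are illustrative names based on the description above, not the exact memory.cuh code, and the PR description is the source for the invalid-argument report; the exact runtime behavior may vary with CUDA version:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  // When ncols_sampled is 0, size expressions such as histcount end up as 0,
  // the corresponding device buffers are never really allocated, and their
  // data pointers stay null.
  const size_t histcount = 0;        // e.g. ncols_sampled * nbins * n_unique_labels
  unsigned int* d_hist = nullptr;    // zero-byte allocation -> null device pointer

  // The PR reports cudaMemsetAsync rejecting this situation with an
  // invalid-argument error.
  cudaError_t err = cudaMemsetAsync(d_hist, 0, histcount * sizeof(unsigned int), 0);
  std::printf("cudaMemsetAsync on a null, zero-sized buffer: %s\n",
              cudaGetErrorString(err));
  return 0;
}
```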

@beckernick (Member)

> @beckernick
> I tested a bit with the equivalent classification paths. They are affected, in that histcount (defined in memory.cuh:218) will be set to zero, all of the subsequent (h/d)_hist_xxx buffer allocations become zero-sized, and the tree data structure ends up with 0x00 data pointers. I'm not familiar enough with the code paths, but it seems unlikely that everything was working as expected with the classifier when ncols * max_features < 1 on that path.

Got it, makes sense. A silent failure is just as devious.

Looks like a FIL test is now failing, but it seems unrelated to this code path. It also appears to be failing in another PR (#2789), which further suggests it's unrelated.

FAILED cuml/test/test_fil.py::test_lightgbm - assert False

cc @dantegd is this possibly an expected failure?

@dantegd (Member) commented Sep 3, 2020

rerun tests

@dantegd (Member) commented Sep 3, 2020

@beckernick PR #2787 disabled that test temporarily because there seems to be a small issue with that test and LightGBM 3.0.

@dantegd (Member) left a comment


Change lgtm

@beckernick (Member)

Ah, perfect. Thanks for the quick explanation 👍

@beckernick merged commit 212813d into rapidsai:branch-0.16 on Sep 3, 2020
Labels
3 - Ready for Review, CUDA / C++
Development

Successfully merging this pull request may close these issues.

[BUG] RandomForestRegressor fit causes segfault when max_features * n_features is less than one
4 participants