-
Notifications
You must be signed in to change notification settings - Fork 548
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Confusing behavior of predict() / predict_proba() of RF when output_class is specified #3347
Comments
This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d. |
#3609 will fix this issue. @jameslamb Would you like to try out the updated versions of |
Closes #3347. Make the `predict()` and `predict_proba()` functions of RF to match those in the scikit-learn RF. * Eliminate the parameter `output_class`. Instead, `predict()` will always produce class prediction, and `predict_proba()` will always produce probability prediction. (This applies to binary and multi-class classifiers. Regressors will only have `predict()`.) * Remove the `threshold` parameter from `predict_proba()`. Authors: - Philip Hyunsu Cho (@hcho3) Approvers: - John Zedlewski (@JohnZed) URL: #3609
Sorry I wasn't able to get back to you before this was merged. I'd be happy to look this week and see if there are other differences with the
Related to this, I looked tonight and saw that this project doesn't currently run the For an example of how those checks are used, you can see
Would you be open to a PR that tries to add those checks for |
@jameslamb Yes, a PR would be great. Thanks! |
Describe the bug
The
predict()
andpredict_proba()
methods ofRandomForestClassifier
behave in a confusing way whenoutput_class
parameter is specified.Code to reproduce bug
Binary classification
output_class=True
has no effect onpredict_proba()
. Thepredict_proba()
function still produces probability prediction, when the presence ofoutput_class=True
suggests a class prediction.output_class=False
causespredict_proba()
to crash:Multi-class classification
output_class=True
has no effect onpredict_proba()
. Thepredict_proba()
function still produces probability prediction, when the presence ofoutput_class=True
suggests a class prediction.output_class=False
causespredict()
to crash:output_class=False
causespredict_proba()
to crash:Suggested fix
This behavior makes cuML RF highly inconsistent with sklearn RF. I suggest the following list of modification:
output_class
parameter entirely. Instead, thepredict()
method should aways produce class prediction, and thepredict_proba()
method should always produce probability prediction.threshold
parameter frompredict_proba()
method. Thepredict()
method can keep it.predict_proba()
method should always return an array of dimension(n_samples, n_classes)
, no matter whethern_classes
is 2 (binary classification) or more than 2 (multi-class classification). Right now, thepredict_proba()
method in cuML RF returns an array of size(n_samples,)
for binary classification.Consult the scikit-learn RF's documentation. Screenshot:
Environment details
Conda environment
The text was updated successfully, but these errors were encountered: