[dask] Include support for raw_score in predict (fixes #3793) #4024
Conversation
Thanks for taking this on! Please see my first round of comments
```diff
@@ -1255,3 +1255,59 @@ def test_sklearn_integration(estimator, check, client):
 def test_parameters_default_constructible(estimator):
     name, Estimator = estimator.__class__.__name__, estimator.__class__
     sklearn_checks.check_parameters_default_constructible(name, Estimator)
+
+
+@pytest.mark.parametrize('task', ['binary_classification', 'multi-class_classification', 'regression'])
```
Can you please add ranking tests as well?
Actually, this change also makes me realize that the other tests using `@pytest.mark.parametrize('task', tasks)` are not currently testing multi-class classification. Are you interested in submitting a separate PR to add `multi-class-classification` to `tasks`?

```python
tasks = ['classification', 'regression', 'ranking']
```

Doing that would improve our test coverage AND reduce the risk of mistakes like this, where a task is forgotten.
Sure. Should this modify the `_create_data` function to add an objective of `multi-class-classification`?
Yes, please! Based on this discussion: #4024 (comment)

When you do that, I think it would make sense to remove the `centers` argument from that function and just use the `objective` to set `centers=2` or `centers=3`. We don't have any tests right now that use more than 3 classes, or that really care about the exact numeric values of the cluster centers coming out of `make_blobs()`.
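A rough sketch of how `_create_data` could derive `centers` from the objective; the signature and objective names here are guesses for illustration, not the test suite's actual helper:

```python
from sklearn.datasets import make_blobs, make_regression

def _create_data(objective, n_samples=100):
    """Simplified sketch: derive `centers` from the objective instead of
    taking it as an argument (ranking is omitted for brevity)."""
    if objective == 'binary-classification':
        return make_blobs(n_samples=n_samples, centers=2, random_state=42)
    elif objective == 'multi-class-classification':
        return make_blobs(n_samples=n_samples, centers=3, random_state=42)
    elif objective == 'regression':
        return make_regression(n_samples=n_samples, random_state=42)
    raise ValueError(f"Unknown objective: {objective}")
```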
Hi James. I'm having trouble with the tests for ranking. The check for the expected mean is there because if you have 25 samples in one leaf with value 0 and 75 samples in another leaf with value 1, you expect your mean to be (25 * 0 + 75 * 1) / (25 + 75), right? Which I guess would be roughly equal to checking that there are 25 predictions with the value 0 and 75 with the value 1. But for ranking I'm seeing that these don't match.
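As a toy version of that mean check, using the numbers from the comment (not the failing ranking output, which isn't shown here):

```python
import numpy as np

# 25 predictions of 0 and 75 predictions of 1, as in the example above.
preds = np.concatenate([np.zeros(25), np.ones(75)])
expected_mean = (25 * 0 + 75 * 1) / (25 + 75)  # 0.75
assert np.isclose(preds.mean(), expected_mean)
```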
Or I could directly compare the count column against the unique counts of the predictions. It's not very clear to me why we don't use that here; why are we using the weights?
You know what, now that I think about it... it would be enough to compare the unique values of the leaf nodes to the unique values of the predictions. If those are the same, you know you got the raw scores.
So the tests would check `set(raw_predictions) == set(trees_df['value'])`?
yes
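A rough standalone (non-Dask) sketch of that check; the dataset, single-tree setup, and leaf filtering here are assumptions for illustration, not the PR's actual test:

```python
import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_blobs

# Train a single tree so each raw prediction is exactly one leaf value.
X, y = make_blobs(n_samples=100, centers=2, random_state=42)
model = lgb.LGBMClassifier(n_estimators=1, random_state=42).fit(X, y)

raw_predictions = model.predict(X, raw_score=True)
trees_df = model.booster_.trees_to_dataframe()
# Leaf rows have no children in the trees_to_dataframe() output.
leaf_values = trees_df.loc[trees_df['left_child'].isnull(), 'value']

# Every distinct raw prediction should show up as a leaf value, and vice versa.
np.testing.assert_allclose(
    np.sort(np.unique(raw_predictions)),
    np.sort(leaf_values.unique()),
)
```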
Done. Please let me know what you think.
Looks great to me, thanks very much! I'm going to wait to merge this until after LightGBM 3.2.0 is released (#3872), assuming that happens in the next day or two.
Sounds good. What would be a good next contribution? I'd like to work on the dataset, but I'm happy to contribute something else if you consider it more important.
Thanks very much! Could you try adding tests for voting parallel learning? I don't want to start on […].
Wow, I hadn't seen the voting parallel learning; it looks awesome. I will definitely work on that and hope that it doesn't suffer from what I've seen in #4026.
@jmoralez Could you please sync with the latest `master`?
This pull request has been automatically locked since there has not been any recent activity after it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
This attempts to solve #3793 by testing the `predict` method for `DaskLGBMRegressor` and `DaskLGBMClassifier` with `raw_score=True`. The tests for `DaskLGBMClassifier` are performed with binary and multi-class classification, since `predict_proba=True` and `raw_score=True` for binary classification return a 1-d array/pandas Series, as opposed to the multi-class variants (which return the same number of columns as classes).

I found that using `drop_axis=1` for `dask.array.map_blocks` works every time and avoids having many `if` statements testing the conditions described above, but I'm happy to discuss the best way to do this. I also replaced the decision to return a `pd.Series` or a `pd.DataFrame` with a check on the shape of the input array, which I think should be discussed as well.
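As a minimal illustration of the `drop_axis=1` pattern described above, with a stand-in scoring function rather than a real LightGBM model:

```python
import dask.array as da
import numpy as np

# Stand-in for the per-chunk prediction step: maps an (n_rows, n_features)
# block to a 1-D array with one score per row.
def _predict_part(part):
    return part.sum(axis=1)

X = da.random.random((1_000, 10), chunks=(100, 10))

# drop_axis=1 tells Dask that the feature axis disappears, so the result
# is a 1-D array whose chunks line up with X's chunks along axis 0.
preds = X.map_blocks(_predict_part, drop_axis=1, dtype=np.float64)

assert preds.shape == (1_000,)
assert preds.compute().shape == (1_000,)
```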