[REVIEW] Change `predict`, `transform`, `predict_proba` to infer metadata by default for ParallelPostFit #862
Conversation
Thanks for the detailed write-up in #868. Would you be willing to add a test here as well? I think the Pandas/dtype example you wrote up would be good.
Does this end up calling If necessary, we can add a new keyword like
Thanks a lot for the review, added a test for the same here. See dask-ml/tests/test_parallel_post_fit.py, lines 80 to 91 in f29dd19.
Thanks a lot for the review. I agree that we should handle value-dependent predictions. I changed my implementation to fall back to a NumPy array as an option if the prediction fails. I hope this is an acceptable solution. I would like users to be as agnostic as possible of behavior discrepancies between backend ML libraries. FWIW, we catch Exception across the Dask codebase, so we have precedent for this behavior (see concat, e.g.).
I think that idea might add a lot more friction for users, since setting _meta is a burden, especially for users working across different ML libraries. Hopefully the above change is acceptable and gives a better user experience.
Last question: right now we have `meta=None` warn, saying that the default will be to infer in the future. How does a user get that infer behavior today without a warning? I don't think they can (short of silencing the warning), since Dask uses `None` to mean infer.

So I think we need the actual default to be a sentinel:

```python
_no_default = object()

class ParallelPostFit:
    def __init__(..., meta=_no_default):
        if meta is _no_default:
            warnings.warn(...)
        self.meta = meta
```

so that we can distinguish a user explicitly opting into inference from a user who just hasn't specified anything.
But do we even care about that case? I'm not sure, since I think the transform will essentially always fail on a length-0 input (sklearn does length checks), but I think it's best to make this change.
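The sentinel pattern discussed above can be made into a runnable toy. The class and warning text here are illustrative only, not dask-ml's real code:

```python
import warnings

_no_default = object()  # module-level sentinel, distinct from None


class ParallelPostFitSketch:
    # Toy sketch of the proposed default handling for `meta`:
    #   _no_default   -> user said nothing: warn, keep old behavior for now
    #   None          -> user explicitly opted into Dask's inference
    #   anything else -> use it as the output meta directly
    def __init__(self, meta=_no_default):
        if meta is _no_default:
            warnings.warn(
                "The default value of `meta` will change to inference "
                "in a future release; pass meta=None to opt in now.",
                FutureWarning,
            )
            meta = "legacy-ndarray"  # placeholder for the old default
        self.meta = meta
```

Calling `ParallelPostFitSketch()` warns, while `ParallelPostFitSketch(meta=None)` opts into inference silently, which is exactly the distinction `meta=None` as a default cannot express.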
@TomAugspurger, I addressed some of the reviews you requested but did not make the changes around raising the warning. It would be amazing if you could reply with your thoughts on both.
Sorry, I think I confused myself with this `meta` stuff.

Maybe we just offer `meta` as an option to those that need it (cuml users)? That would mean we don't even need a warning. We just have a default of `None`, which means "use ndarray, but override if you need to".

Can you share an example of how cuml users will use this? Something like `ParallelPostFit(..., meta=cupy.array((), dtype=float))`?
This is how it would look. For cuDF input:

```python
cuml_model = cuml.linear_model.LinearRegression().fit(df, y)
dask_model = ParallelPostFit(estimator=cuml_model, meta=cudf.Series([1]))
dask_model.predict(dask_cudf_df).compute()
```

For CuPy input:

```python
cuml_model = cuml.linear_model.LinearRegression().fit(df.to_cupy(), y)
# None works here too:
# dask_model = ParallelPostFit(estimator=cuml_model, meta=None)
dask_model = ParallelPostFit(estimator=cuml_model, meta=cp.array([1]))
dask_model.predict(dask_cudf_df).compute()
```
I still think we should, eventually. More than happy to push on that once we give users enough time to update their calls.
The thing I'm not sure about is how to handle dask-ml's code that helps with inference, as opposed to Dask's regular map_blocks / map_partitions inference. Because (last I checked) most sklearn estimators raise when there are zero samples, Dask's typical inference on empty inputs won't work. I don't know how that translates back to the user API though. Maybe it's not actually an issue. Any thoughts @quasiben or @VibhuJawa?
Can confirm that this is still true today: most estimators indeed fail on zero-shaped arrays.
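The zero-sample failure mode can be illustrated without sklearn itself. The validator below is a simplified stand-in for the minimum-sample check that sklearn's `check_array` performs (the function name is ours, not sklearn's):

```python
import numpy as np


def check_min_samples(X, ensure_min_samples=1):
    # Simplified stand-in for sklearn's check_array length check:
    # estimators that validate input this way reject the zero-length
    # arrays Dask would normally use for meta inference.
    if X.shape[0] < ensure_min_samples:
        raise ValueError(
            f"Found array with {X.shape[0]} sample(s) while a minimum "
            f"of {ensure_min_samples} is required."
        )
    return X


# A (0, 3) array, like Dask's usual empty probe input, fails the check:
try:
    check_min_samples(np.empty((0, 3)))
    failed = False
except ValueError:
    failed = True
```

This is why Dask's default "call the function on a length-0 input" meta inference cannot be relied on here, and why a tiny non-empty probe (or an explicit meta) is needed.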
Eventually, I think the following will be a good end behavior for users.

For `predict` / `predict_proba`:
- Default: infer the output metadata.
- Value-dependent encoders: for encoders whose output depends on the values, users can override this with an explicit meta.

For `transform`:
- Default: we can do what we do for `predict`.
- Value-dependent encoders: same as above, users can override this with an explicit meta.

Proposed next steps: with that issue we give users one release of notice about the upcoming change in behavior.
What do you think about creating (either in dask-ml or in dask-core) a small non-empty "ones" container for arrays? In practice it would look something like:

```python
In [10]: arr = cupy.arange(11)

In [11]: d_c = da.from_array(arr, chunks=(4,), asarray=False)

In [12]: type(np.asarray(1, like=d_c._meta))
Out[12]: cupy._core.core.ndarray
```

This would rely on NEP-35, first introduced in NumPy 1.20.1. So we could push on this to help with a default behavior of inferring the type while still passing the same small bit of data.
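The same NEP-35 `like=` mechanism can be exercised with NumPy alone (no GPU required); with a plain ndarray reference the dispatch simply returns an ndarray, whereas a CuPy reference, as in the session above, would yield a CuPy array:

```python
import numpy as np

# NEP 35: np.asarray(..., like=ref) dispatches array creation to the
# array type of `ref` via __array_function__ (NumPy >= 1.20).
ref = np.arange(11)
one = np.asarray(1, like=ref)
print(type(one))  # <class 'numpy.ndarray'>
```

Creating the tiny probe with `like=` keyed off a chunk's `_meta` is what would let one code path produce a NumPy probe for NumPy-backed collections and a CuPy probe for CuPy-backed ones.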
Thanks Ben, that's helpful. I do think whatever small temporary arrays we create should use that approach. OK, I've finally had a chance to sit down, install cudf, and play with this PR. I have some thoughts. Apologies for the repeated iterations on this @VibhuJawa, but I think we're close. I don't think that a single `meta` keyword covers everything.
Again, apologies for going back and forth on this @VibhuJawa, but I think it's worth getting things right. I'll have some time tomorrow night to work on this and will push some commits if you don't beat me to it.
No worries on the back and forth. I agree with the idea of getting things right. Thanks a lot for working through this with me and for the detailed suggestions. I agree with the approach you suggested as well as all the suggestions. I am going to push some changes to this PR in accordance with your suggestions, and we can hopefully get those through. :-D
@TomAugspurger, the PR should be ready for review. I have tried my best to follow all the directions in this comment. Please let me know of any additional changes.
Thanks @VibhuJawa, looks great.
Thanks for helping review Tom! 😄
This PR fixes #868