
[REVIEW] Change predict, transform, predict_proba to infer metadata by default for ParallelPostFit #862

Merged: 21 commits into dask:main on Nov 17, 2021

Conversation

@VibhuJawa (Collaborator) commented Oct 15, 2021

This PR fixes #868

@VibhuJawa VibhuJawa changed the title [WIP] Fix ParallelPostFit Predict for dask-cudf dataframe backend [WIP] Remove meta hard-coding in ParallelPostFit Oct 19, 2021
@VibhuJawa VibhuJawa changed the title [WIP] Remove meta hard-coding in ParallelPostFit [REVIEW] Remove meta hard-coding in ParallelPostFit Oct 19, 2021
@VibhuJawa VibhuJawa marked this pull request as ready for review October 19, 2021 23:14
@quasiben (Member)

Thanks for the detailed write-up in #868. Would you be willing to add a test here as well? I think the Pandas/dtype example you wrote up would be good.

@TomAugspurger (Member)

Does this end up calling _predict on self._meta or self._meta_nonempty to infer the output type? I'd like to avoid that if possible, since it can fail if the predict function has value-dependent behavior (e.g. some kind of encoder that only knows how to handle specific values).

If necessary, we can add a new keyword like meta or output_meta for users to control this. That would get set in __init__ and passed here.
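
As a brief illustration of the failure mode described here (a hedged sketch, not dask-ml code; the estimator and array are arbitrary): scikit-learn estimators validate that the input has at least one sample, so calling predict on an empty, _meta-like chunk raises before any output type can be inferred.

import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a small throwaway model on random data.
est = LinearRegression().fit(np.random.rand(20, 4), np.random.rand(20))

empty = np.empty((0, 4))  # shaped like a dask array chunk's empty _meta
try:
    est.predict(empty)  # scikit-learn's input validation requires at least 1 sample
except ValueError as exc:
    print("meta inference on an empty chunk fails:", exc)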

@VibhuJawa (Collaborator, Author)

Thanks for the detailed write-up in #868. Would you be willing to add a test here as well? I think the Pandas/dtype example you wrote up would be good.

Thanks a lot for the review; I added a test for this here.

See

# Imports added here for context; the estimator is scikit-learn's LinearRegression.
import dask.dataframe as dd
from sklearn.linear_model import LinearRegression
from dask_ml.datasets import make_classification
from dask_ml.wrappers import ParallelPostFit

def test_predict_meta_correctness():
    X, y = make_classification(chunks=100)
    X_ddf = dd.from_dask_array(X)
    base = LinearRegression(n_jobs=1)
    base.fit(X, y)
    wrap = ParallelPostFit(base)
    base_output_dtype = base.predict(X).dtype
    wrap_output_dtype = wrap.predict(X_ddf).dtype
    # The wrapped estimator should report the same output dtype as the base estimator.
    assert base_output_dtype == wrap_output_dtype

@VibhuJawa (Collaborator, Author)

I'd like to avoid that if possible, since it can fail if the predict function has value-dependent behavior (e.g. some kind of encoder that only knows how to handle specific values).

Thanks a lot for the review. I agree that we should handle value-dependent predictions.

I changed my implementation to fall back to a NumPy array if the prediction fails on _meta_nonempty (sketched below). I also added a test for this here.

I hope this is an acceptable solution. I would like users to be as agnostic as possible of behavior discrepancies between backend ML libraries.

FWIW, we catch Exception across the Dask codebase, so there is precedent for this kind of behavior (see concat, e.g.).
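
A rough sketch of the fallback described above; the helper name _infer_predict_meta and the float dtype are illustrative, not the actual dask-ml implementation.

import numpy as np

def _infer_predict_meta(estimator, ddf):
    """Infer predict's output meta from a small non-empty sample of ddf,
    falling back to a plain NumPy array if the estimator rejects it."""
    try:
        return estimator.predict(ddf._meta_nonempty)
    except Exception:
        # Broad catch, mirroring the precedent mentioned above (e.g. dask's concat):
        # if inference fails, assume an ndarray output.
        return np.empty(0, dtype="float64")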

If necessary, we can add a new keyword like meta or output_meta for users to control this. That would get set in __init__ and passed here.

I think this idea might add a lot more friction, since users would have to set meta themselves, especially when they work with different ML libraries. Hopefully the above change is acceptable and gives a better user experience.

dask_ml/wrappers.py (resolved review thread)
@VibhuJawa VibhuJawa changed the title [REVIEW] Remove meta hard-coding in ParallelPostFit [REVIEW] Add meta attribute to ParallelPostFit.predict Nov 4, 2021
Resolved review threads on: dask_ml/wrappers.py, docs/source/hyper-parameter-search.rst, tests/model_selection/test_hyperband.py, tests/test_parallel_post_fit.py
@VibhuJawa VibhuJawa changed the title [REVIEW] Add meta attribute to ParallelPostFit.predict [WIP] Add meta attribute to ParallelPostFit.predict Nov 5, 2021
@VibhuJawa VibhuJawa changed the title [WIP] Add meta attribute to ParallelPostFit.predict [WIP] Add meta attribute to ParallelPostFit Nov 8, 2021
@VibhuJawa VibhuJawa changed the title [WIP] Add meta attribute to ParallelPostFit [REVIEW] Add meta attribute to ParallelPostFit Nov 8, 2021
@VibhuJawa VibhuJawa changed the title [REVIEW] Add meta attribute to ParallelPostFit WIP] Add meta attribute to ParallelPostFit Nov 8, 2021
@VibhuJawa VibhuJawa changed the title WIP] Add meta attribute to ParallelPostFit [WIP] Add meta attribute to ParallelPostFit Nov 8, 2021
@TomAugspurger (Member) left a comment

Last question:

Right now we have meta=None warn, saying that the default will be to infer in the future. How does a user get that infer behavior today without a warning? I don't think they can (short of silencing the warning) since Dask uses None to mean infer.

So I think we need the actual default to be

_no_default = object()

class ParallelPostFit:
    def __init__(..., meta=_no_default):
        if meta is _no_default:
            warnings.warn(...)
        self.meta = meta

so that we can distinguish a user explicitly opting into inference from a user who just hasn't specified anything.

But do we even care about that case? I think the transform will essentially always fail on a length-0 input since sklearn does length checks... I'm not sure, but I think it's best to make this change.

Resolved review threads on dask_ml/wrappers.py
@VibhuJawa VibhuJawa changed the title [WIP] Add meta attribute to ParallelPostFit [REVIEW] Add meta attribute to ParallelPostFit Nov 11, 2021
@VibhuJawa (Collaborator, Author)

@TomAugspurger, I addressed some of the reviews you requested but did not make the changes around raising the warning in __init__ (see comment). I also did not add the _no_default = object() sentinel because of concerns raised in the comment here.

It would be amazing if you could reply with your thoughts on both.

@TomAugspurger (Member) left a comment

Sorry, I think I confused myself with this meta stuff.

Maybe we just offer meta as an option to those that need it (cuml users)? That would mean we don't even need a warning. We just have a default of None which means "use ndarray, but override if you need to".

Can you share an example of how cuml users will use this? Something like ParallelPostFit(..., meta=cupy.array((), dtype=float))

dask_ml/wrappers.py (resolved review thread)
@VibhuJawa (Collaborator, Author) commented Nov 11, 2021

Can you share an example of how cuml users will use this? Something like ParallelPostFit(..., meta=cupy.array((), dtype=float))

This is how cuML users will have to use it. cuML currently outputs a cudf.Series when trained on cuDF DataFrames and a CuPy array when trained on CuPy arrays.

For a model trained on a cuDF DataFrame:

cuml_model = cuml.linear_model.LinearRegression().fit(df, y)
dask_model = ParallelPostFit(estimator=cuml_model, meta=cudf.Series([1]))
dask_model.predict(dask_cudf_df).compute()
0    0.982425
1    1.043937
2    0.117165
0    0.946690
1   -0.090217
dtype: float64

For a model trained on CuPy arrays (the default and an explicit CuPy array meta both work):

cuml_model = cuml.linear_model.LinearRegression().fit(df.to_cupy(), y)
### None works here too. 
### dask_model = ParallelPostFit(estimator=cuml_model, meta=None)

dask_model = ParallelPostFit(estimator=cuml_model, meta=cp.array([1]))
dask_model.predict(dask_cudf_df).compute()
array([ 0.98242531,  1.04393673,  0.11716462,  0.9466901 , -0.09021675])

Maybe we just offer meta as an option to those that need it (cuml users)? That would mean we don't even need a warning. We just have a default of None which means "use ndarray, but override if you need to".

I still think we should eventually default to infer rather than ndarray for predict, like we do for transform. This will streamline behavior across functions.

More than happy to push on that once we give users enough time to update their calls.

@TomAugspurger (Member)

  1. I agree that we should be consistent between predict and transform.
  2. Our long-term goal should be to infer, with the option to override.

The thing I'm not sure about is how to handle dask-ml's code that helps with inference, as opposed to Dask's regular map_blocks / map_partitions inference. Because (last I checked) most sklearn estimators raise when there are zero samples, Dask's typical map_blocks inference fails. So we allocate an array with one sample at https://github.com/dask/dask-ml/pull/862/files#diff-20a89984412f6d21bb7a3f836c75a6f17df88ea034a8d886792dd5f102b58517R217 and infer meta on that.

I don't know how that translates back to user API though. Maybe it's not actually an issue. Any thoughts @quasiben or @VibhuJawa?
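
A condensed sketch of the one-sample inference referenced above (the helper name and zero-filled sample are illustrative; the actual logic lives in the linked diff):

import numpy as np

def _infer_meta_from_one_sample(estimator, dask_array, method="predict"):
    # Dask's default inference would call the estimator on a zero-length chunk,
    # which most scikit-learn estimators reject, so build a single-row array of
    # the right width and dtype and infer the output type from that instead.
    sample = np.zeros((1, dask_array.shape[1]), dtype=dask_array.dtype)
    return getattr(estimator, method)(sample)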

dask_ml/wrappers.py (resolved review thread)
@VibhuJawa (Collaborator, Author) commented Nov 12, 2021

The thing I'm not sure about is how to handle dask-ml's code that helps with inference, as opposed to Dask's regular map_blocks / map_partition's inference. Because (last I checked) most sklearn estimators raise when there are zero samples, Dask's typical map_blocks inference fails.

I can confirm that this is still true today. Most estimators indeed fail on zero-shaped arrays.

I don't know how that translates back to user API though. Maybe it's not actually an issue. Any thoughts @quasiben or @VibhuJawa?

Eventually, I think the following will be a good end behavior for users:

For dask DataFrames:

    Default:
    Default meta to dd.core.no_default, because that gives us the ability to fall back to _meta_nonempty, so we don't run into the zero-shaped issue that dask arrays run into.

    Reasoning:
    sk_model.predict(dask_df._meta_nonempty) works but sk_model.predict(dask_df._meta) fails. Since dask.dataframe's map_partitions falls back on _meta_nonempty, we are fine (see the sketch after this list).

    Value-dependent encoders:
    For encoders that are value-dependent, users can override this with ParallelPostFit(..., meta=...).

For dask arrays:

    Default:
    We can do what we do for transform (see code) and run meta inference on a 1-sample array for predict too.

    Reasoning:
    sk_model.predict(dask_ar._meta) fails, and we don't have the _meta_nonempty fallback for dask arrays.

    Value-dependent encoders:
    For encoders that are value-dependent, users can override this with ParallelPostFit(..., meta=...).

Proposed next steps:
With this PR we add a meta attribute to ParallelPostFit, and when we tackle removing the np.array fallback in #875 we change the behavior of predict to reflect the above.

With that issue we give users one release of notice about the upcoming change in behavior.
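
A small sketch of the _meta / _meta_nonempty distinction this proposal relies on (the DataFrame contents are arbitrary):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
ddf = dd.from_pandas(pdf, npartitions=2)

print(len(ddf._meta))           # 0 rows: a zero-sample predict would fail here
print(len(ddf._meta_nonempty))  # a few placeholder rows: predict can succeed here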

@quasiben (Member) commented Nov 12, 2021

What do you think about creating (either in dask-ml or in dask core) a small helper that builds a non-empty, one-element container for arrays? This would be something like:

np.asarray(1, like=arr._meta)

Or in practice:

In [10]: arr = cupy.arange(11)
In [11]: d_c = da.from_array(arr, chunks=(4,), asarray=False)
In [12]: type(np.asarray(1, like=d_c._meta))
Out[12]: cupy._core.core.ndarray

This would rely on NEP-35, first introduced in NumPy 1.20. So we could push this to help with a default behavior of inferring the type, and pass the same bit of data to the _postfit_estimator meta arg.

meta = np.asarray(1, like=d_c._meta) or meta = np.zeros((1,d_c.shape[1]), like=d_c._meta)

@TomAugspurger (Member)

Thanks Ben, that's helpful. I do think whatever small temporary arrays we create should use like=.

OK, I've finally had a chance to sit down, install cudf, and play with this PR. I have some thoughts. Apologies for the repeated iterations on this @VibhuJawa, but I think we're close.

I don't think that a single meta= argument to ParallelPostFit.__init__ makes sense. At a minimum, we would need something like a predict_meta=..., predict_proba_meta=..., transform_meta=.... So how about this:

  1. Adopt a policy that ParallelPostFit.predict and ParallelPostFit.transform will infer the dtype by default. I'm OK with doing this as a bugfix. The old version was just plain wrong in assuming that it was always a specific dtype / array type.
  2. Wherever possible, use the default behavior of map_blocks / map_partitions, applying the callable to an empty array / dataframe.
  3. Wherever necessary construct a nonempty array of the right type and pass that to the subestimator. Maybe cuml already accepts NumPy arrays / pandas DataFrames in predict / transform, so using like=... isn't mandatory yet?
  4. Add a predict_meta, predict_proba_meta, and transform_meta to ParallelPostFit.__init__. All default to None (which means infer). I think we don't bother with warnings. But this gives users the option to override inference if necessary.
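
Roughly how the per-method override in item 4 could look for the cuML case from the earlier example; treat this as a usage sketch, with the specific meta values (cudf.Series / cupy arrays) as placeholders rather than required types.

import cudf
import cupy as cp
from dask_ml.wrappers import ParallelPostFit

# cuml_model and dask_cudf_df are assumed to come from the earlier cuDF example.
# Leaving the meta arguments as None would mean "infer"; the explicit values
# below show how a user can override inference when it is unsafe or fails.
dask_model = ParallelPostFit(
    estimator=cuml_model,
    predict_meta=cudf.Series([1.0]),
    predict_proba_meta=cp.array([[1.0]]),
    transform_meta=cp.array([[1.0]]),
)
predictions = dask_model.predict(dask_cudf_df)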

Again, apologies for going back and forth on this @VibhuJawa, but I think it's worth getting things right. I'll have some time tomorrow night to work on this and will push some commits if you don't beat me to it.

@VibhuJawa (Collaborator, Author)

apologies for going back and forth on this @VibhuJawa, but I think it's worth getting things right. I'll have some time tomorrow night to work on this and will push some commits if you don't beat me to it.

No worries on the back and forth. I agree with the idea of getting things right. Thanks a lot for working through this with me and for the detailed suggestions.

I agree with the approach you suggested as well as all the individual suggestions. I am going to push some changes to this PR in accordance with them, and we can hopefully get those through. :-D

@VibhuJawa VibhuJawa changed the title [REVIEW] Add meta attribute to ParallelPostFit [PREDICT] Add meta attribute to ParallelPostFit Nov 16, 2021
@VibhuJawa VibhuJawa changed the title [PREDICT] Add meta attribute to ParallelPostFit [WIP] Add meta attribute to ParallelPostFit Nov 16, 2021
@jrbourbeau jrbourbeau mentioned this pull request Nov 16, 2021
@VibhuJawa VibhuJawa changed the title [WIP] Add meta attribute to ParallelPostFit [REVIEW] Add meta attribute to ParallelPostFit Nov 16, 2021
@VibhuJawa (Collaborator, Author) commented Nov 16, 2021

@TomAugspurger, the PR should be ready for review. I have tried my best to follow all the directions in this comment. Please let me know of any additional changes.

@VibhuJawa VibhuJawa changed the title [REVIEW] Add meta attribute to ParallelPostFit [REVIEW] Change predict, transform, predict_proba to infer metadata by default for ParallelPostFit Nov 16, 2021
@TomAugspurger (Member)

Thanks @VibhuJawa, looks great.

@TomAugspurger TomAugspurger merged commit f752e29 into dask:main Nov 17, 2021
@TomAugspurger TomAugspurger mentioned this pull request Nov 17, 2021
@jakirkham (Member)

Thanks for helping review Tom! 😄


Successfully merging this pull request may close these issues: [BUG] ParallelPostFit fails with cuML models