
[REVIEW] Change predict, transform, predict_proba to infer metadata by default for ParallelPostFit #862

Merged: 21 commits into dask:main on Nov 17, 2021

Conversation

@VibhuJawa (Collaborator) commented Oct 15, 2021

This PR fixes #868

@VibhuJawa VibhuJawa changed the title [WIP] Fix ParallelPostFit Predict for dask-cudf dataframe backend [WIP] Remove meta hard-coding in ParallelPostFit Oct 19, 2021
@VibhuJawa VibhuJawa changed the title [WIP] Remove meta hard-coding in ParallelPostFit [REVIEW] Remove meta hard-coding in ParallelPostFit Oct 19, 2021
@VibhuJawa VibhuJawa marked this pull request as ready for review October 19, 2021 23:14
@quasiben (Member)

Thanks for the detailed write-up in #868. Would you be willing to add a test here as well? I think the Pandas/dtype example you wrote up would be good.

@TomAugspurger (Member)

Does this end up calling _predict on self._meta or self._meta_nonempty to infer the output type? I'd like to avoid that if possible, since it can fail if the predict function has value-dependent behavior (e.g. some kind of encoder that only knows how to handle specific values).

If necessary, we can add a new keyword like meta or output_meta for users to control this. That would get set in __init__ and passed here.
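
As a brief illustration of the failure mode described here (a hedged sketch, not dask-ml code; the estimator and array are arbitrary): scikit-learn estimators validate that the input has at least one sample, so calling predict on an empty, _meta-like chunk raises before any output type can be inferred.

import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a small throwaway model on random data.
est = LinearRegression().fit(np.random.rand(20, 4), np.random.rand(20))

empty = np.empty((0, 4))  # shaped like a dask array chunk's empty _meta
try:
    est.predict(empty)  # scikit-learn's input validation requires at least 1 sample
except ValueError as exc:
    print("meta inference on an empty chunk fails:", exc)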

@VibhuJawa (Collaborator, Author)

Thanks for the detailed write-up in #868. Would you be willing to add a test here as well? I think the Pandas/dtype example you wrote up would be good.

Thanks a lot for the review; I added a test for this here.

See

# Imports added here for context; the estimator is scikit-learn's LinearRegression.
import dask.dataframe as dd
from sklearn.linear_model import LinearRegression
from dask_ml.datasets import make_classification
from dask_ml.wrappers import ParallelPostFit

def test_predict_meta_correctness():
    X, y = make_classification(chunks=100)
    X_ddf = dd.from_dask_array(X)
    base = LinearRegression(n_jobs=1)
    base.fit(X, y)
    wrap = ParallelPostFit(base)
    base_output_dtype = base.predict(X).dtype
    wrap_output_dtype = wrap.predict(X_ddf).dtype
    # The wrapped estimator should report the same output dtype as the base estimator.
    assert base_output_dtype == wrap_output_dtype

@VibhuJawa (Collaborator, Author)

I'd like to avoid that if possible, since it can fail if the predict function has value-dependent behavior (e.g. some kind of encoder that only knows how to handle specific values).

Thanks a lot for the review. I agree that we should handle value-dependent predictions.

I changed my implementation to fall back to a NumPy array if the prediction fails on _meta_nonempty (sketched below). I also added a test for this here.

I hope this is an acceptable solution. I would like users to be as agnostic as possible of behavior discrepancies between backend ML libraries.

FWIW, we catch Exception across the Dask codebase, so there is precedent for this kind of behavior (see concat, e.g.).
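
A rough sketch of the fallback described above; the helper name _infer_predict_meta and the float dtype are illustrative, not the actual dask-ml implementation.

import numpy as np

def _infer_predict_meta(estimator, ddf):
    """Infer predict's output meta from a small non-empty sample of ddf,
    falling back to a plain NumPy array if the estimator rejects it."""
    try:
        return estimator.predict(ddf._meta_nonempty)
    except Exception:
        # Broad catch, mirroring the precedent mentioned above (e.g. dask's concat):
        # if inference fails, assume an ndarray output.
        return np.empty(0, dtype="float64")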

If necessary, we can add a new keyword like meta or output_meta for users to control this. That would get set in __init__ and passed here.

I think this idea might add a lot more friction, since users would have to set meta themselves, especially when they work with different ML libraries. Hopefully the above change is acceptable and gives a better user experience.

dask_ml/wrappers.py (resolved review thread)
@VibhuJawa VibhuJawa changed the title [REVIEW] Remove meta hard-coding in ParallelPostFit [REVIEW] Add meta attribute to ParallelPostFit.predict Nov 4, 2021
Resolved review threads on: dask_ml/wrappers.py, docs/source/hyper-parameter-search.rst, tests/model_selection/test_hyperband.py, tests/test_parallel_post_fit.py
@VibhuJawa VibhuJawa changed the title [REVIEW] Add meta attribute to ParallelPostFit.predict [WIP] Add meta attribute to ParallelPostFit.predict Nov 5, 2021
@VibhuJawa VibhuJawa changed the title [WIP] Add meta attribute to ParallelPostFit.predict [WIP] Add meta attribute to ParallelPostFit Nov 8, 2021
@VibhuJawa VibhuJawa changed the title [WIP] Add meta attribute to ParallelPostFit [REVIEW] Add meta attribute to ParallelPostFit Nov 8, 2021
@VibhuJawa VibhuJawa changed the title [REVIEW] Add meta attribute to ParallelPostFit WIP] Add meta attribute to ParallelPostFit Nov 8, 2021
@VibhuJawa VibhuJawa changed the title WIP] Add meta attribute to ParallelPostFit [WIP] Add meta attribute to ParallelPostFit Nov 8, 2021
@TomAugspurger (Member) left a comment

Last question:

Right now we have meta=None warn, saying that the default will be to infer in the future. How does a user get that infer behavior today without a warning? I don't think they can (short of silencing the warning) since Dask uses None to mean infer.

So I think we need the actual default to be

_no_default = object()

class ParallelPostFit:
    def __init__(..., meta=_no_default):
        if meta is _no_default:
            warnings.warn(...)
        self.meta = meta

so that we can distinguish a user explicitly opting into inference from a user who just hasn't specified anything.

But do we even care about that case? I think the transform will essentially always fail on a length-0 input since sklearn does length checks... I'm not sure, but I think it's best to make this change.

Resolved review threads on dask_ml/wrappers.py
@VibhuJawa VibhuJawa changed the title [WIP] Add meta attribute to ParallelPostFit [REVIEW] Add meta attribute to ParallelPostFit Nov 11, 2021
@VibhuJawa (Collaborator, Author)

@TomAugspurger, I addressed some of the reviews you requested but did not make the changes around raising the warning in __init__ (see comment). I also did not add the _no_default = object() sentinel because of concerns raised in the comment here.

It would be amazing if you could reply with your thoughts on both.

@TomAugspurger (Member) left a comment

Sorry, I think I confused myself with this meta stuff.

Maybe we just offer meta as an option to those that need it (cuml users)? That would mean we don't even need a warning. We just have a default of None which means "use ndarray, but override if you need to".

Can you share an example of how cuml users will use this? Something like ParallelPostFit(..., meta=cupy.array((), dtype=float))

dask_ml/wrappers.py (resolved review thread)
@VibhuJawa (Collaborator, Author) commented Nov 11, 2021

Can you share an example of how cuml users will use this? Something like ParallelPostFit(..., meta=cupy.array((), dtype=float))

This is how cuML users will have to use it. cuML currently outputs a cudf.Series when trained on cuDF DataFrames and a CuPy array when trained on CuPy arrays.

For a model trained on a cuDF DataFrame:

cuml_model = cuml.linear_model.LinearRegression().fit(df, y)
dask_model = ParallelPostFit(estimator=cuml_model, meta=cudf.Series([1]))
dask_model.predict(dask_cudf_df).compute()
0    0.982425
1    1.043937
2    0.117165
0    0.946690
1   -0.090217
dtype: float64

For a model trained on CuPy arrays (the default and an explicit CuPy array meta both work):

cuml_model = cuml.linear_model.LinearRegression().fit(df.to_cupy(), y)
### None works here too. 
### dask_model = ParallelPostFit(estimator=cuml_model, meta=None)

dask_model = ParallelPostFit(estimator=cuml_model, meta=cp.array([1]))
dask_model.predict(dask_cudf_df).compute()
array([ 0.98242531,  1.04393673,  0.11716462,  0.9466901 , -0.09021675])

Maybe we just offer meta as an option to those that need it (cuml users)? That would mean we don't even need a warning. We just have a default of None which means "use ndarray, but override if you need to".

I still think we should eventually default to infer rather than ndarray for predict, like we do for transform. This will streamline behavior across functions.

More than happy to push on that once we give users enough time to update their calls.

@TomAugspurger (Member)

  1. I agree that we should be consistent between predict and transform.
  2. Our long-term goal should be to infer, with the option to override.

The thing I'm not sure about is how to handle dask-ml's code that helps with inference, as opposed to Dask's regular map_blocks / map_partitions inference. Because (last I checked) most sklearn estimators raise when there are zero samples, Dask's typical map_blocks inference fails. So we allocate an array with one sample at https://github.com/dask/dask-ml/pull/862/files#diff-20a89984412f6d21bb7a3f836c75a6f17df88ea034a8d886792dd5f102b58517R217 and infer meta on that.

I don't know how that translates back to user API though. Maybe it's not actually an issue. Any thoughts @quasiben or @VibhuJawa?
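
A condensed sketch of the one-sample inference referenced above (the helper name and zero-filled sample are illustrative; the actual logic lives in the linked diff):

import numpy as np

def _infer_meta_from_one_sample(estimator, dask_array, method="predict"):
    # Dask's default inference would call the estimator on a zero-length chunk,
    # which most scikit-learn estimators reject, so build a single-row array of
    # the right width and dtype and infer the output type from that instead.
    sample = np.zeros((1, dask_array.shape[1]), dtype=dask_array.dtype)
    return getattr(estimator, method)(sample)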

dask_ml/wrappers.py (resolved review thread)
@VibhuJawa (Collaborator, Author) commented Nov 12, 2021

The thing I'm not sure about is how to handle dask-ml's code that helps with inference, as opposed to Dask's regular map_blocks / map_partition's inference. Because (last I checked) most sklearn estimators raise when there are zero samples, Dask's typical map_blocks inference fails.

I can confirm that this is still true today. Most estimators indeed fail on zero-shaped arrays.

I don't know how that translates back to user API though. Maybe it's not actually an issue. Any thoughts @quasiben or @VibhuJawa?

Eventually, I think the following will be a good end behavior for users:

For dask DataFrames:

    Default:
    Default meta to dd.core.no_default, because that gives us the ability to fall back to _meta_nonempty, so we don't run into the zero-shaped issue that dask arrays run into.

    Reasoning:
    sk_model.predict(dask_df._meta_nonempty) works but sk_model.predict(dask_df._meta) fails. Since dask.dataframe's map_partitions falls back on _meta_nonempty, we are fine (see the sketch after this list).

    Value-dependent encoders:
    For encoders that are value-dependent, users can override this with ParallelPostFit(..., meta=...).

For dask arrays:

    Default:
    We can do what we do for transform (see code) and run meta inference on a 1-sample array for predict too.

    Reasoning:
    sk_model.predict(dask_ar._meta) fails, and we don't have the _meta_nonempty fallback for dask arrays.

    Value-dependent encoders:
    For encoders that are value-dependent, users can override this with ParallelPostFit(..., meta=...).

Proposed next steps:
With this PR we add a meta attribute to ParallelPostFit, and when we tackle removing the np.array fallback in #875 we change the behavior of predict to reflect the above.

With that issue we give users one release of notice about the upcoming change in behavior.
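
A small sketch of the _meta / _meta_nonempty distinction this proposal relies on (the DataFrame contents are arbitrary):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
ddf = dd.from_pandas(pdf, npartitions=2)

print(len(ddf._meta))           # 0 rows: a zero-sample predict would fail here
print(len(ddf._meta_nonempty))  # a few placeholder rows: predict can succeed here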

@quasiben (Member) commented Nov 12, 2021

What do you think about creating (either in dask-ml or in dask core) a small helper that builds a non-empty, one-element container for arrays? This would be something like:

np.asarray(1, like=arr._meta)

Or in practice:

In [10]: arr = cupy.arange(11)
In [11]: d_c = da.from_array(arr, chunks=(4,), asarray=False)
In [12]: type(np.asarray(1, like=d_c._meta))
Out[12]: cupy._core.core.ndarray

This would rely on NEP-35, first introduced in NumPy 1.20. So we could push this to help with a default behavior of inferring the type, and pass the same bit of data to the _postfit_estimator meta arg.

meta = np.asarray(1, like=d_c._meta) or meta = np.zeros((1,d_c.shape[1]), like=d_c._meta)

@TomAugspurger (Member)

Thanks Ben, that's helpful. I do think whatever small temporary arrays we create should use like=.

OK, I've finally had a chance to sit down, install cudf, and play with this PR. I have some thoughts. Apologies for the repeated iterations on this @VibhuJawa, but I think we're close.

I don't think that a single meta= argument to ParallelPostFit.__init__ makes sense. At a minimum, we would need something like a predict_meta=..., predict_proba_meta=..., transform_meta=.... So how about this:

  1. Adopt a policy that ParallelPostFit.predict and ParallelPostFit.transform will infer the dtype by default. I'm OK with doing this as a bugfix. The old version was just plain wrong in assuming that it was always a specific dtype / array type.
  2. Wherever possible, use the default behavior of map_blocks / map_partitions, applying the callable to an empty array / dataframe.
  3. Wherever necessary construct a nonempty array of the right type and pass that to the subestimator. Maybe cuml already accepts NumPy arrays / pandas DataFrames in predict / transform, so using like=... isn't mandatory yet?
  4. Add a predict_meta, predict_proba_meta, and transform_meta to ParallelPostFit.__init__. All default to None (which means infer). I think we don't bother with warnings. But this gives users the option to override inference if necessary.
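
Roughly how the per-method override in item 4 could look for the cuML case from the earlier example; treat this as a usage sketch, with the specific meta values (cudf.Series / cupy arrays) as placeholders rather than required types.

import cudf
import cupy as cp
from dask_ml.wrappers import ParallelPostFit

# cuml_model and dask_cudf_df are assumed to come from the earlier cuDF example.
# Leaving the meta arguments as None would mean "infer"; the explicit values
# below show how a user can override inference when it is unsafe or fails.
dask_model = ParallelPostFit(
    estimator=cuml_model,
    predict_meta=cudf.Series([1.0]),
    predict_proba_meta=cp.array([[1.0]]),
    transform_meta=cp.array([[1.0]]),
)
predictions = dask_model.predict(dask_cudf_df)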

Again, apologies for going back and forth on this @VibhuJawa, but I think it's worth getting things right. I'll have some time tomorrow night to work on this and will push some commits if you don't beat me to it.

@VibhuJawa (Collaborator, Author)

apologies for going back and forth on this @VibhuJawa, but I think it's worth getting things right. I'll have some time tomorrow night to work on this and will push some commits if you don't beat me to it.

No worries on the back and forth. I agree with the idea of getting things right. Thanks a lot for working through this with me and for the detailed suggestions.

I agree with the approach you suggested as well as all the individual suggestions. I am going to push some changes to this PR in accordance with them, and we can hopefully get those through. :-D

@VibhuJawa VibhuJawa changed the title [REVIEW] Add meta attribute to ParallelPostFit [PREDICT] Add meta attribute to ParallelPostFit Nov 16, 2021
@VibhuJawa VibhuJawa changed the title [PREDICT] Add meta attribute to ParallelPostFit [WIP] Add meta attribute to ParallelPostFit Nov 16, 2021
@jrbourbeau jrbourbeau mentioned this pull request Nov 16, 2021
@VibhuJawa VibhuJawa changed the title [WIP] Add meta attribute to ParallelPostFit [REVIEW] Add meta attribute to ParallelPostFit Nov 16, 2021
@VibhuJawa (Collaborator, Author) commented Nov 16, 2021

@TomAugspurger, the PR should be ready for review. I have tried my best to follow all the directions in this comment. Please let me know of any additional changes.

@VibhuJawa VibhuJawa changed the title [REVIEW] Add meta attribute to ParallelPostFit [REVIEW] Change predict, transform, predict_proba to infer metadata by default for ParallelPostFit Nov 16, 2021
@TomAugspurger (Member)

Thanks @VibhuJawa, looks great.

@TomAugspurger TomAugspurger merged commit f752e29 into dask:main Nov 17, 2021
@TomAugspurger TomAugspurger mentioned this pull request Nov 17, 2021
@jakirkham (Member)

Thanks for helping review Tom! 😄


Successfully merging this pull request may close these issues: [BUG] ParallelPostFit fails with cuML models