[WIP] Fix empty partition prediction with ParallelPostFit #912

VibhuJawa · 2022-03-25T02:10:39Z

This PR fixes #911

TomAugspurger · 2022-03-25T14:57:04Z

Thanks @VibhuJawa. When looking at the traceback in #911, I see that sklearn's check_array takes an ensure_min_samples parameter. If we pass ensure_min_samples=False there, does stuff go through properly?
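To make the suggestion concrete, here is a minimal sketch of how check_array behaves on a zero-sample array when called directly. It assumes scikit-learn and NumPy are installed; the array shape and variable names are illustrative only. It says nothing about whether a given estimator's predict() exposes this parameter.

```python
import numpy as np
from sklearn.utils import check_array

# Illustrative input: zero samples, three features.
X_empty = np.empty((0, 3))

# By default check_array requires at least one sample, so this raises.
try:
    check_array(X_empty)
    default_rejects_empty = False
except ValueError:
    default_rejects_empty = True

# ensure_min_samples=0 disables the minimum-sample check
# (False compares equal to 0, so it has the same effect).
X_checked = check_array(X_empty, ensure_min_samples=0)
```

In this sketch the default call is rejected while the relaxed call returns the empty array unchanged in shape.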

VibhuJawa · 2022-03-28T18:49:26Z

Thanks for reviewing the issue, Tom.

> When looking at the traceback in #911, I see that sklearn's check_array takes an ensure_min_samples parameter. If we pass ensure_min_samples=False there, does stuff go through properly?

So I tried to find a clean way to expose the self._validate_data parameter but came up with nothing. I'm open to any ideas on that front.

I think the problem is that each family of models in sklearn calls it with different parameters (see below), and I am also not sure whether all models will just work even if we can somehow coerce it to accept them (see related discussion here).

Please let me know if the approach I am taking in this PR is not feasible. I will try to explore other ways to solve this problem.

mmccarty · 2022-09-20T19:25:38Z

@TomAugspurger - Any further feedback or guidance here? I don't see a way to expose check_array's kwargs without changes to sklearn. IMO, it seems reasonable to handle this case in dask-ml.

cc @jrbourbeau @betatim for vis

betatim · 2022-09-21T08:54:13Z

I think what is happening is that predict() (and co) are being called with an empty input that contains zero samples. It seems sensible for the scikit-learn estimators to consider that an error. So I agree with Mike that this is something that dask-ml should handle, probably by not calling predict(), transform(), etc. when a partition is empty and instead returning whatever is the expected result of making a prediction on an empty array (I'm not sure what this should be: None, or also an empty array?).
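A minimal sketch of that idea, assuming dense NumPy partitions: skip predict() on an empty partition and return an empty array with the expected dtype and trailing dimensions. MeanRegressor and predict_partition are hypothetical names used only for illustration; they are not part of dask-ml or scikit-learn.

```python
import numpy as np


class MeanRegressor:
    """Hypothetical stand-in estimator that, like scikit-learn, rejects empty input."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        if X.shape[0] == 0:
            raise ValueError("Found array with 0 sample(s)")
        return np.full(X.shape[0], self.mean_)


def predict_partition(estimator, part, output_meta):
    """Call predict() on non-empty partitions; otherwise return an empty result.

    output_meta is a small example of the expected output, used only for
    its dtype and its trailing dimensions.
    """
    if part.shape[0] == 0:
        return np.zeros((0,) + output_meta.shape[1:], dtype=output_meta.dtype)
    return estimator.predict(part)


model = MeanRegressor().fit(np.ones((4, 2)), np.array([1.0, 2.0, 3.0, 4.0]))
meta = model.predict(np.ones((1, 2)))  # one-row probe to learn the output type

empty_out = predict_partition(model, np.empty((0, 2)), meta)
full_out = predict_partition(model, np.ones((3, 2)), meta)
```

The one-row probe is one way to obtain the output metadata; it costs a single extra predict() call per estimator rather than per partition.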

VibhuJawa · 2022-09-21T09:12:30Z

> instead returning whatever is the expected result of making a prediction on an empty array (I'm not sure what this should be: None, or also an empty array?).

In this PR:

If the output is supposed to be an array, I return an empty array (for both sparse and dense arrays).
If the output is supposed to be a dataframe (or dataframe-like object), I return an empty dataframe-like object.

dask-ml/dask_ml/wrappers.py

Lines 661 to 688 in 28b97e0

    
    if hasattr(output_meta, "__array_function__"):
        # Dense array-likes (NumPy, CuPy): build an empty array of the same kind.
        if len(output_meta.shape) == 1:
            shape = 0
        else:
            shape = list(output_meta.shape)
            shape[0] = 0
        ar = np.zeros(
            shape=shape,
            dtype=output_meta.dtype,
            like=output_meta,
        )
        return ar
    elif "scipy.sparse" in type(output_meta).__module__:
        # Sparse matrices don't support `like` because
        # __array_function__ is not implemented.
        # Refer https://github.com/scipy/scipy/issues/10362
        # Note: the code below works for both cupy and scipy sparse matrices.
        # TODO: remove code duplication
        if len(output_meta.shape) == 1:
            shape = 0
        else:
            shape = list(output_meta.shape)
            shape[0] = 0
        ar = type(output_meta)(shape, dtype=output_meta.dtype)
        return ar
    elif hasattr(output_meta, "iloc"):
        # Dataframe-like objects: keep the columns and dtypes, drop all rows.
        return output_meta.iloc[:0, :]

This also matches what cuML does (it returns an empty Series). See below:

>>> type(reg)
<class 'cuml.linear_model.logistic_regression.LogisticRegression'>
>>> reg.predict(X_new.iloc[:0])
Series([], dtype: float32)

first pass at fixing empty partition failures

28b97e0

VibhuJawa mentioned this pull request Mar 25, 2022

[BUG]Prediction with empty partitions fails on sklearn dask-ml models dask-contrib/dask-sql#414

Open

sarahyurick mentioned this pull request Sep 21, 2022

[Re-opened elsewhere] Handle nullable types and empty partitions before Dask-ML predict dask-contrib/dask-sql#783

Closed

VibhuJawa mentioned this pull request Oct 5, 2022

Handle nullable types and empty partitions before Dask-ML predict dask-contrib/dask-sql#799

Closed

sarahyurick mentioned this pull request Oct 5, 2022

Replace dask_ml.wrappers.ParallelPostFit with custom ParallelPostFit class dask-contrib/dask-sql#832

Merged
