[Fea] Data imputation limited by null conversion #2966

wphicks · 2020-10-13T15:40:32Z

Is your feature request related to a problem? Please describe.
In sklearn, a fairly common data imputation workflow might look something like this:

import numpy as np
import pandas
from sklearn.impute import SimpleImputer

df = pandas.DataFrame(data=[[7, 2, 3], [4, None, 6], [10, 5, 9]])
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit_transform(df)

The RAPIDS equivalent would look something like:

import cupy as cp
import cudf
from cuml.experimental.preprocessing import SimpleImputer

df = cudf.DataFrame(data=[[7, 2, 3], [4, None, 6], [10, 5, 9]])
imp = SimpleImputer(missing_values=cp.nan, strategy='mean')
# Fails: the null in the DataFrame cannot be converted to a cupy array
imp.fit_transform(df)

Under the hood, we try to convert the cudf DataFrame to a cupy array, which fails because of null values in the DataFrame. This severely limits the usefulness of our data imputation methods.

Describe the solution you'd like
We can fix this either in cuml through special handling of DataFrame input or in cudf by providing some infrastructure for dealing with null values when we convert to cupy, though that may also require cupy changes (possibly related: cudf/5754).

Describe alternatives you've considered
For floating-point data, we can call fillna(cp.nan) on the DataFrame before running data imputation. For integers, we would have to either know of an integer value that cannot appear in the data or generate one.
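To make the integer case concrete, here is a rough sketch of how such a sentinel could be generated. It uses plain NumPy with an explicit validity mask as a stand-in for a nullable cudf column, and the helper names pick_sentinel and fill_nulls are hypothetical, not existing cuml or cudf functions.

```python
import numpy as np


def pick_sentinel(values, valid):
    """Return an integer that is absent from the non-null entries.

    values: integer array; valid: boolean mask, True where a slot is non-null.
    Hypothetical helper for illustration, not part of cuml or cudf.
    """
    present = values[valid]
    info = np.iinfo(values.dtype)
    if present.size == 0:
        return int(info.min)
    lo, hi = int(present.min()), int(present.max())
    # A value just outside the observed range cannot collide with real data.
    if lo > info.min:
        return lo - 1
    if hi < info.max:
        return hi + 1
    raise ValueError("observed data spans the full dtype range; widen the dtype first")


def fill_nulls(values, valid):
    """Overwrite null slots with a sentinel and return (filled, sentinel)."""
    sentinel = pick_sentinel(values, valid)
    filled = values.copy()
    filled[~valid] = sentinel
    return filled, sentinel


# Column 1 of the example DataFrame: [2, None, 5].
# The value stored in the null slot is arbitrary (0 here).
values = np.array([2, 0, 5], dtype=np.int64)
valid = np.array([True, False, True])
filled, sentinel = fill_nulls(values, valid)
```

The sentinel could then be passed as missing_values to the imputer, assuming the imputer accepts an integer there (sklearn's SimpleImputer does). Each column may need its own sentinel, which is part of why this is awkward as a general workaround.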

wphicks · 2020-10-13T15:41:12Z

@viclafargue: Added this issue to follow up from our external discussion.

viclafargue · 2020-12-01T13:03:11Z

Solved with #3194

wphicks added the "feature request" and "? - Needs Triage" labels Oct 13, 2020

viclafargue added the "bug" and "Cython / Python" labels and removed the "? - Needs Triage" label Oct 13, 2020

viclafargue self-assigned this Oct 13, 2020

viclafargue mentioned this issue Nov 26, 2020

[REVIEW] Fix cuDF to cuPy conversion (missing value) #3194

Merged

viclafargue closed this as completed Dec 1, 2020
