Add min_impute and nan warning #423

grromrell · 2017-11-02T22:01:48Z

Fixing #421

This adds some runtime warnings to the DataFrameImputer. I also included a possible change to the X.fillna call that uses the limit keyword to cap how much imputing is done; I thought it was worth considering as an option. I'm also not married to the default value for max_impute, so thoughts there would be helpful.

Aylr

@grromrell you've got some really nice work here and I left a few suggestions to tighten it up. Happy to discuss more. Welcome aboard! (feel free to hit me up on slack - find an invite on healthcare.ai)

Aylr · 2017-11-03T21:11:07Z

healthcareai/common/transformers.py

@@ -20,11 +21,12 @@ class DataFrameImputer(TransformerMixin):
 Columns of other types (assumed continuous) are imputed with mean of column.
 """

- def __init__(self, impute=True, verbose=True):
+ def __init__(self, impute=True, verbose=True, max_impute=.5):


Great idea. Would you mind keeping the verbose arg last?

At this point I think that this method needs a docstring. We prefer Google format, and if you're using an editor like PyCharm it's trivial to get that formatting right. No need to go overboard, just note what the arguments mean.
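For reference, a Google-style docstring on the constructor might look roughly like the sketch below. The class is a standalone stand-in (it does not inherit from TransformerMixin), the argument descriptions are assumptions inferred from this thread rather than the wording that was eventually committed, and verbose is kept last as requested above.

```python
class DataFrameImputerSketch:
    """Stand-in for DataFrameImputer, used only to illustrate the docstring layout."""

    def __init__(self, impute=True, max_impute=0.5, verbose=True):
        """Configure the imputer.

        Args:
            impute (bool): If False, transform() returns the data unchanged.
            max_impute (float): Fraction of missing values in a column (0 to 1)
                above which a RuntimeWarning is issued. Assumed meaning.
            verbose (bool): Whether to print progress messages. Assumed meaning.
        """
        self.impute = impute
        self.max_impute = max_impute
        self.verbose = verbose
```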

Aylr · 2017-11-03T21:14:26Z

healthcareai/common/transformers.py

+ for c in X:
+ pct_impute = X[c].isnull().sum() / len(X)
+ if pct_impute > self.max_impute:
+ warnings.warn("{0:.2f}% of data for column '{1}' is missing. Imputed "


This might read a little better if you state the column name first, then the % missing. I totally love this, and I'm going to steal your idea of using warnings for a few other things!
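A minimal sketch of the suggested ordering, with the column name first and the share missing second. The function name and message wording are hypothetical, not the text that was committed. Note that pct_impute in the diff appears to be a fraction between 0 and 1, so formatting it with `{0:.2f}%` would show a value such as 0.50% for a half-empty column; the `%` format type used here multiplies by 100 instead.

```python
def format_missing_warning(column, fraction_missing):
    """Build a warning message that names the column first, then the share missing."""
    # The '%' format type multiplies the fraction by 100 and appends a percent sign.
    return "Column '{0}' is missing {1:.2%} of its values.".format(column, fraction_missing)


print(format_missing_warning("age", 0.625))
# prints: Column 'age' is missing 62.50% of its values.
```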

Aylr · 2017-11-03T21:15:16Z

healthcareai/common/transformers.py

+ RuntimeWarning)
+
+ #Alternative fill, only fill maximum number of values based on max_impute
+ #result = X.fillna(self.fill, limit=len(X)//(1/self.max_impute))


I'm not sure how this will affect downstream algorithms. Do you have any sense for this? If not, I'd say create another issue and remove this commented code from this PR.
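One detail that may be worth carrying into any follow-up issue: the pandas documentation describes limit as an int, while the commented-out expression `len(X)//(1/self.max_impute)` evaluates to a float. The sketch below only illustrates that arithmetic with plain numbers; max_fill_count is a hypothetical helper, and it says nothing about how a capped fill would affect downstream algorithms.

```python
import math


def max_fill_count(n_rows, max_impute):
    """Return the largest whole number of rows that max_impute allows to be filled."""
    return math.floor(n_rows * max_impute)


# Floor-dividing an int by a float yields a float, not an int.
n_rows, max_impute = 10, 0.5
print(n_rows // (1 / max_impute))          # 5.0
print(max_fill_count(n_rows, max_impute))  # 5
```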

Aylr · 2017-11-03T21:15:55Z

healthcareai/common/transformers.py

@@ -54,7 +56,17 @@ def transform(self, X, y=None):
 # Return if not imputing
 if self.impute is False:
 return X
-
+
+ #Warn users if %nan is too high


I'd be thrilled if this were factored out to a separate function with tests (and I'm happy to guide you through that if you'd like).
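As a rough idea of what that could look like, here is a sketch of a standalone helper with a small test. The function name, its signature, and the message wording are all hypothetical. To keep the example free of dependencies it takes pre-computed missing fractions per column; with pandas these could presumably be obtained from something like X.isnull().mean().

```python
import warnings


def warn_on_high_missing(missing_fractions, max_impute=0.5):
    """Warn for each column whose missing fraction exceeds max_impute.

    Args:
        missing_fractions (dict): Column name -> fraction missing (0 to 1).
        max_impute (float): Threshold above which a warning is issued.

    Returns:
        list: Names of the columns that triggered a warning.
    """
    flagged = []
    for column, fraction in missing_fractions.items():
        if fraction > max_impute:
            warnings.warn(
                "Column '{0}' is missing {1:.2%} of its values.".format(column, fraction),
                RuntimeWarning,
            )
            flagged.append(column)
    return flagged


def test_warn_on_high_missing():
    # Record warnings instead of printing them so they can be asserted on.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        flagged = warn_on_high_missing({"age": 0.75, "height": 0.1}, max_impute=0.5)
    assert flagged == ["age"]
    assert len(caught) == 1
    assert issubclass(caught[0].category, RuntimeWarning)


test_warn_on_high_missing()
```

Returning the flagged column names is a design choice made here so the test can check behavior without parsing the message text.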

grromrell · 2017-11-21T17:16:48Z

I added some docs, changed the way the warning prints and removed the alternative fill method. I did not separate the warn_nan into its own function as it seems like this is the only place it would get used and adding a unit test didn't seem worth it for a simple warning.

Add min_impute and nan warning

ed0fdb6

Aylr suggested changes Nov 3, 2017

Aylr added this to the Sprint 35 milestone Nov 6, 2017

Add documentation, reword documentation and remove alternative fill m…

ee92de0

…ethod

Aylr mentioned this pull request Dec 11, 2017

Warn user about imputing more than n% of rows #421

Open

levithatcher modified the milestones: Sprint 35, Sprint 38 Jan 4, 2018

levithatcher self-requested a review January 4, 2018 19:03

Aylr removed this from the Sprint 38 milestone Jan 22, 2018
