Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add min_impute and nan warning #423

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Add min_impute and nan warning #423

wants to merge 2 commits into from

Conversation

grromrell
Copy link

Fixing #421

Just adds some runtime warnings to the DataFrameImputer. I also added a potential change to the X.fillna that could be used to limit the amount of imputing done by using the limit keyword. I thought it was worth considering as an option. Also not married to the value for max_impute, thoughts there are helpful.

Copy link
Contributor

@Aylr Aylr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@grromrell you've got some really nice work here and I left a few suggestions to tighten it up. Happy to discuss more. Welcome aboard! (feel free to hit me up on slack - find an invite on healthcare.ai)

@@ -20,11 +21,12 @@ class DataFrameImputer(TransformerMixin):
Columns of other types (assumed continuous) are imputed with mean of column.
"""

def __init__(self, impute=True, verbose=True):
def __init__(self, impute=True, verbose=True, max_impute=.5):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Great idea. Would you mind keeping the verbose arg last?
  • At this point I think that this method needs a docstring. We prefer google format, and if you're using an editor like pycharm it's trivial to get that formatting right. No need to go overboard, just note what the arguments mean.

for c in X:
pct_impute = X[c].isnull().sum() / len(X)
if pct_impute > self.max_impute:
warnings.warn("{0:.2f}% of data for column '{1}' is missing. Imputed "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might read a little better if you state the column name first, then the %missing. I totally love this, and I'm going to steal your idea of using warnings for a few other thigns!

RuntimeWarning)

#Alternative fill, only fill maximum number of values based on max_impute
#result = X.fillna(self.fill, limit=len(X)//(1/self.max_impute))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how this will affect downstream algorithms. Do you have any sense for this? If not, I'd say create another issue and remove this commented code from this PR.

@@ -54,7 +56,17 @@ def transform(self, X, y=None):
# Return if not imputing
if self.impute is False:
return X


#Warn users if %nan is too high
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be thrilled if this were factored out to a separate function with tests (and I'm happy to guide you through that if you'd like).

@Aylr Aylr added this to the Sprint 35 milestone Nov 6, 2017
@grromrell
Copy link
Author

I added some docs, changed the way the warning prints and removed the alternative fill method. I did not separate the warn_nan into its own function as it seems like this is the only place it would get used and adding a unit test didn't seem worth it for a simple warning.

@levithatcher levithatcher modified the milestones: Sprint 35, Sprint 38 Jan 4, 2018
@levithatcher levithatcher self-requested a review January 4, 2018 19:03
@Aylr Aylr removed this from the Sprint 38 milestone Jan 22, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants