Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for dropping/raising on/ignoring NA data. #6

Merged
merged 1 commit into from
Jul 12, 2020

Conversation

matthewwardrop
Copy link
Owner

@matthewwardrop matthewwardrop commented Jul 12, 2020

This patch adds support for dealing with NA data in the dataset, adding some marginal overhead to performance when enabled. To use it, pass na_action to the FormulaMaterializer subclass of interest, or via the higher-level convenience wrappers. For example:

from formulaic import Formula
df = ...
Formula('a + b').get_model_matrix(df, na_action="drop")

Valid options for na_action are: "drop" (default), "raise" or "ignore". The R names for these are "omit", "error" and "pass", which I am tossing up using instead These names will be kept. Current choices are because "drop" is more in line with pandas, "raise" is more Pythonic, and "ignore" was what I originally thought about. [Patsy also appears to use "drop" and "raise"; but doesn't seem to support "ignore"?]

Null checks are performed after evaluation of columns, so custom python functions can be used to impute nulls local to a column (soon I will add an impute stateful transformation that makes this more natural).

Not included in this patch, but a related effort, is support for performing stateful transformations on the entire dataset before performing the usual model matrix materialisation. This will allow for more sophisticated imputations, such as those involving PCA methods, which require access to the entire original dataset. The resulting state can then be reused in the generation of new model matrices, which is why this approach is a value-add over just expecting users to do the imputation upstream.

@codecov
Copy link

codecov bot commented Jul 12, 2020

Codecov Report

Merging #6 into master will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##            master        #6   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           27        28    +1     
  Lines         1064      1093   +29     
=========================================
+ Hits          1064      1093   +29     
Flag Coverage Δ
#unittests 100.00% <100.00%> (ø)
Impacted Files Coverage Δ
formulaic/materializers/arrow.py 100.00% <100.00%> (ø)
formulaic/materializers/base.py 100.00% <100.00%> (ø)
formulaic/materializers/pandas.py 100.00% <100.00%> (ø)
formulaic/materializers/types/enums.py 100.00% <100.00%> (ø)
formulaic/model_spec.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ba2bac8...fbe2b88. Read the comment docs.

@matthewwardrop
Copy link
Owner Author

@bashtage I believe this covers most of the issues raised in #3, with the addition of a data preparation stateful transform hook for more involved imputations on its way in a separate body of work. I'd love to get your take on this!

@bashtage
Copy link
Contributor

Great. I'll give it a spin this week.

@bashtage
Copy link
Contributor

I had some thoughts about impute. What are the inputs to a function that could be used with impute?

The example you previously discussed would work with a function that accepted a column argument as in y ~ impute(x, mean). I could imagine one may wish to impute using other information, for example, the other data in the row, or both row and column information (e.g., some method of Spatial imputation or in panel data settings).

@matthewwardrop matthewwardrop self-assigned this Jul 12, 2020
@matthewwardrop matthewwardrop added the enhancement New feature or request label Jul 12, 2020
@matthewwardrop
Copy link
Owner Author

matthewwardrop commented Jul 12, 2020

@bashtage I was imagining the prepare hook would allow for this spatial imputation. Formulaic does support functions returning "exploded" data, so, for example, a function myfunc could return a dictionary of {'col1': ..., 'col2': ..., ...}, and these will appear in the resulting model matrix as myfunc(...)[col1], myfunc(...)[col2], ... . If I added to the context a variable representing the entire data structure then this could be used to do the kind of imputation you are talking about; but I wonder if it is not simpler and cleaner to add the stateful data preparation hook as I suggested?

@matthewwardrop matthewwardrop merged commit 88b985a into master Jul 12, 2020
@matthewwardrop matthewwardrop deleted the add_na_action branch July 12, 2020 21:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants