Add support for dropping/raising on/ignoring NA data. #6

matthewwardrop · 2020-07-12T05:08:06Z

This patch adds support for dealing with NA data in the dataset, adding some marginal overhead to performance when enabled. To use it, pass na_action to the FormulaMaterializer subclass of interest, or via the higher-level convenience wrappers. For example:

from formulaic import Formula
df = ...
Formula('a + b').get_model_matrix(df, na_action="drop")

Valid options for na_action are: "drop" (default), "raise" or "ignore". The R names for these are "omit", "error" and "pass", ~~which I am tossing up using instead~~ These names will be kept. Current choices are because "drop" is more in line with pandas, "raise" is more Pythonic, and "ignore" was what I originally thought about. [Patsy also appears to use "drop" and "raise"; but doesn't seem to support "ignore"?]

Null checks are performed after evaluation of columns, so custom python functions can be used to impute nulls local to a column (soon I will add an impute stateful transformation that makes this more natural).

Not included in this patch, but a related effort, is support for performing stateful transformations on the entire dataset before performing the usual model matrix materialisation. This will allow for more sophisticated imputations, such as those involving PCA methods, which require access to the entire original dataset. The resulting state can then be reused in the generation of new model matrices, which is why this approach is a value-add over just expecting users to do the imputation upstream.

codecov · 2020-07-12T05:11:14Z

Codecov Report

Merging #6 into master will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master        #6   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           27        28    +1     
  Lines         1064      1093   +29     
=========================================
+ Hits          1064      1093   +29

Flag	Coverage Δ
#unittests	`100.00% <100.00%> (ø)`

Impacted Files	Coverage Δ
formulaic/materializers/arrow.py	`100.00% <100.00%> (ø)`
formulaic/materializers/base.py	`100.00% <100.00%> (ø)`
formulaic/materializers/pandas.py	`100.00% <100.00%> (ø)`
formulaic/materializers/types/enums.py	`100.00% <100.00%> (ø)`
formulaic/model_spec.py	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ba2bac8...fbe2b88. Read the comment docs.

matthewwardrop · 2020-07-12T05:12:38Z

@bashtage I believe this covers most of the issues raised in #3, with the addition of a data preparation stateful transform hook for more involved imputations on its way in a separate body of work. I'd love to get your take on this!

bashtage · 2020-07-12T06:54:17Z

Great. I'll give it a spin this week.

bashtage · 2020-07-12T06:59:34Z

I had some thoughts about impute. What are the inputs to a function that could be used with impute?

The example you previously discussed would work with a function that accepted a column argument as in y ~ impute(x, mean). I could imagine one may wish to impute using other information, for example, the other data in the row, or both row and column information (e.g., some method of Spatial imputation or in panel data settings).

matthewwardrop · 2020-07-12T20:38:09Z

@bashtage I was imagining the prepare hook would allow for this spatial imputation. Formulaic does support functions returning "exploded" data, so, for example, a function myfunc could return a dictionary of {'col1': ..., 'col2': ..., ...}, and these will appear in the resulting model matrix as myfunc(...)[col1], myfunc(...)[col2], ... . If I added to the context a variable representing the entire data structure then this could be used to do the kind of imputation you are talking about; but I wonder if it is not simpler and cleaner to add the stateful data preparation hook as I suggested?

Add support for dropping/raising on/ignoring NA data.

fbe2b88

matthewwardrop force-pushed the add_na_action branch from 25d6608 to fbe2b88 Compare July 12, 2020 05:09

matthewwardrop self-assigned this Jul 12, 2020

matthewwardrop added the enhancement New feature or request label Jul 12, 2020

matthewwardrop merged commit 88b985a into master Jul 12, 2020

matthewwardrop deleted the add_na_action branch July 12, 2020 21:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for dropping/raising on/ignoring NA data. #6

Add support for dropping/raising on/ignoring NA data. #6

matthewwardrop commented Jul 12, 2020 •

edited

Loading

codecov bot commented Jul 12, 2020 •

edited

Loading

matthewwardrop commented Jul 12, 2020

bashtage commented Jul 12, 2020

bashtage commented Jul 12, 2020

matthewwardrop commented Jul 12, 2020 •

edited

Loading

Add support for dropping/raising on/ignoring NA data. #6

Add support for dropping/raising on/ignoring NA data. #6

Conversation

matthewwardrop commented Jul 12, 2020 • edited Loading

codecov bot commented Jul 12, 2020 • edited Loading

Codecov Report

matthewwardrop commented Jul 12, 2020

bashtage commented Jul 12, 2020

bashtage commented Jul 12, 2020

matthewwardrop commented Jul 12, 2020 • edited Loading

matthewwardrop commented Jul 12, 2020 •

edited

Loading

codecov bot commented Jul 12, 2020 •

edited

Loading

matthewwardrop commented Jul 12, 2020 •

edited

Loading