Add support for dropping/raising on/ignoring NA data. #6
This patch adds support for dealing with NA data in the dataset, at the cost of some marginal performance overhead when enabled. To use it, pass `na_action` to the `FormulaMaterializer` subclass of interest, or via the higher-level convenience wrappers.

Valid options for `na_action` are: "drop" (default), "raise" or "ignore". The R names for these are "omit", "error" and "pass". I was tossing up using those instead, but the current names will be kept: "drop" is more in line with pandas, "raise" is more Pythonic, and "ignore" was what I originally thought about. [Patsy also appears to use "drop" and "raise", but doesn't seem to support "ignore"?]

Null checks are performed after evaluation of columns, so custom Python functions can be used to impute nulls local to a column (soon I will add an `impute` stateful transformation that makes this more natural).

Not included in this patch, but a related effort, is support for performing stateful transformations on the entire dataset before performing the usual model matrix materialisation. This will allow for more sophisticated imputations, such as those involving PCA methods, which require access to the entire original dataset. The resulting state can then be reused in the generation of new model matrices, which is why this approach is a value-add over just expecting users to do the imputation upstream.
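
To make the three `na_action` options and the "null checks happen after column evaluation" ordering concrete, here is a minimal sketch in plain pandas. It does not use this library's actual code: `apply_na_action` is a hypothetical helper written only for illustration, and the real materializer may differ in error type, messages and internals.

```python
import numpy as np
import pandas as pd


def apply_na_action(columns, na_action="drop"):
    """Hypothetical helper: apply an NA policy to already-evaluated columns."""
    frame = pd.DataFrame(columns)
    if na_action == "ignore":
        # Pass nulls through untouched.
        return frame
    has_na = frame.isnull().any(axis=1)
    if na_action == "raise":
        if has_na.any():
            raise ValueError("Evaluated columns contain null values.")
        return frame
    if na_action == "drop":
        # Keep only rows with no nulls in any evaluated column.
        return frame.loc[~has_na]
    raise ValueError(f"Unknown na_action: {na_action!r}")


data = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [4.0, 5.0, np.nan]})

# Columns are evaluated first, so a column-local imputation (here a mean
# fill on "x") runs before the null check sees the result.
evaluated = {
    "x": data["x"].fillna(data["x"].mean()),
    "y": data["y"],
}

dropped = apply_na_action(evaluated, na_action="drop")
ignored = apply_na_action(evaluated, na_action="ignore")
```

In this sketch the imputed null in `x` no longer triggers the policy, while the remaining null in `y` causes its row to be dropped under "drop", to be kept under "ignore", and to raise an error under "raise".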