dealing with rows #202

Open
kleinschmidt opened this issue Dec 18, 2020 · 2 comments

@kleinschmidt
Member

The current major blocker for getting rid of the TableStatisticalModel wrappers (#32) and the ModelFrame/ModelMatrix structs is that they also keep a mask of the rows that are actually included from the underlying table. That mask has to be kept around if (possibly among other things) you want to predict back into the original table, since you need to know where the missing values were. So just keeping the FormulaTerm around isn't enough to get a completely stand-alone table-to-matrix transformation that can replace ModelFrame.
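
To make the role of the mask concrete, here is a minimal sketch in plain Julia (no StatsModels calls). The data and the `predict_complete` function are hypothetical stand-ins for a table column and a fitted model; the point is only that scattering predictions back to the original rows needs the row mask.

```julia
# Hypothetical predictor column with one missing entry.
x = [1.0, missing, 3.0, 4.0]

# Boolean mask of the rows that were actually usable (no missing value).
mask = .!ismissing.(x)

# Hypothetical stand-in for a fitted model: it only ever sees complete rows.
predict_complete(xs) = 2 .* xs
preds = predict_complete(collect(skipmissing(x)))

# Scatter the predictions back to their original row positions;
# rows that were dropped stay `missing`.
full = Vector{Union{Missing, Float64}}(missing, length(x))
full[mask] = preds
```

Without `mask`, the three predictions could not be aligned with the four original rows.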

A related issue is that missing values in the output are now also introduced by terms themselves, because of the lead/lag support.
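
As a small illustration of how a term can introduce missings on its own, here is a sketch of a one-step lag in plain Julia. The `lag1` helper is hypothetical and is not the implementation StatsModels uses; it only shows that the output has a missing even though the input has none.

```julia
# Hypothetical one-step lag: shift values down by one row and
# fill the first row with `missing`.
function lag1(v::AbstractVector{Float64})
    out = Vector{Union{Missing, Float64}}(missing, length(v))
    out[2:end] = v[1:end-1]
    return out
end

v = [1.0, 2.0, 3.0]   # no missing values in the input
lagged = lag1(v)      # first row of the output is `missing`
```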

So my current thinking is that we ought to give up on trying to never generate missing values in the output (consistent with #153). Instead, set a row to missing wherever a missing value is encountered in any column of the table, and provide some kind of functionality to compute the missings mask if necessary so that consumers can decide what to do. That leaves the consumer with a bit more fiddly book-keeping, and it also means that the terms are not fully stand-alone, but I'm not sure I see an alternative at this point (except for some kind of extremely lazy architecture where the terms actually hold views of the underlying table or something like that; I think this came up in discussion with @nalimilan at ZiF...)

@kleinschmidt added this to the 1.0 milestone on Dec 18, 2020
@palday
Member

palday commented Dec 18, 2020

This would actually fix a number of current issues with handling missings, especially when expanding categorical variables into contrasts.

@kleinschmidt
Member Author

Just as a way of organizing thoughts about how to approach this: there are (at
least) two approaches to dealing with rows.

  1. modelcols generates each column independently, possibly generating missings
    when they're encountered in the underlying table. The resulting Arrays will
    have a "swiss cheese" structure where there are "holes" (missings)
    throughout, but other values in the same row might still be present. This is
    maybe more straightforward since each term is handled independently, but it
    could mess up stuff like schema extraction/contrast coding (if, e.g., there's
    a categorical variable with a level that is only present in rows where
    another variable is missing and hence may never appear in the generated data;
    IIRC we have tests for exactly this case...).
  2. modelcols attempts to avoid rows in the data where there is at least one
    missing, setting the entire row to missing or omitting it altogether. This
    avoids the problem of possibly pathological schemas but it also closes the
    door on e.g. creating tools to do missing value imputation on the output. It
    might also mess with uses of lead/lag where you want to lead/lag based on
    the original rows, not the complete rows.
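
The difference between the two approaches can be sketched on a toy column table in plain Julia. Everything here (the table, the variable names) is hypothetical and nothing calls into StatsModels; it just shows the schema problem from approach 1 and the row-wise masking or dropping from approach 2.

```julia
# Hypothetical column table: level "y" of `b` only occurs in a row where `a` is missing.
tbl = (a = [1.0, missing, 3.0, 4.0],
       b = ["x", "y", "x", missing])

# Row-wise mask: true where no column has a missing value in that row.
complete = .!ismissing.(tbl.a) .& .!ismissing.(tbl.b)

# Approach 1: each column on its own. `b` appears to have two levels,
# but "y" never shows up in a row that survives into a fitted model.
levels_per_column = sort(unique(skipmissing(tbl.b)))

# Approach 2: only complete rows count, so the pathological level disappears.
levels_complete = sort(unique(tbl.b[complete]))

# Approach 2, variant (a): set the entire row to missing.
a_masked = ifelse.(complete, tbl.a, missing)

# Approach 2, variant (b): omit the row altogether.
a_dropped = tbl.a[complete]
```

Note that `a_dropped` no longer lines up with the original row indices, which is the lead/lag concern raised in point 2.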

So it might be necessary to be able to switch between these modes somehow; maybe
by adding a completecases flag to the schema and modelcols functions? Or at
the very least by providing a boolean mask of complete cases (or a convenient way to
create one).
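
A rough sketch of what such a switch could look like, in plain Julia on a NamedTuple of columns. Both `complete_mask` and `sketch_cols` are hypothetical names invented for illustration; neither exists in StatsModels, and the keyword is just one possible spelling of the flag floated above.

```julia
# Hypothetical helper: row-wise complete-cases mask over equal-length columns.
complete_mask(tbl::NamedTuple) =
    reduce((m, col) -> m .& .!ismissing.(col), values(tbl);
           init = trues(length(first(tbl))))

# Hypothetical stand-in for a modelcols-like function with a `completecases` keyword.
# It always returns the mask so that consumers can do their own book-keeping.
function sketch_cols(tbl::NamedTuple; completecases::Bool = false)
    mask = complete_mask(tbl)
    cols = completecases ? map(col -> col[mask], tbl) : tbl
    return cols, mask
end

tbl = (a = [1.0, missing, 3.0], b = [10.0, 20.0, missing])

cols_all, mask = sketch_cols(tbl)                    # holes left in place
cols_cc, _ = sketch_cols(tbl; completecases = true)  # only complete rows kept
```

Returning the mask in both modes keeps the "at the very least" option available even when the flag is off.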
