dealing with rows #202

kleinschmidt · 2020-12-18T16:30:59Z

The current major blocker for getting rid of TableStatisticalModel wrappers (#32) and the ModelFrame/ModelMatrix structs is that they also keep a mask for the rows that are actually included from the underlying table. It's necessary to keep these around if (possibly among other things) you want to predict back into the original table, since you need to know where the missing values were. So just keeping the FormulaTerm around isn't enough to have a completely stand-alone table-to-matrix transformation which can replace ModelFrame.
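To make the role of the row mask concrete, here is a minimal sketch in plain Julia (no StatsModels calls). The data, the `predict_complete` stand-in, and the variable names are all made up for illustration; the point is only that predictions computed on the retained rows cannot be placed back into the original table without the mask.

```julia
# Hypothetical data: one predictor column with a missing entry.
x = [1.0, missing, 3.0, 4.0]

# Row mask of the kind described above: true where the row was actually used.
mask = .!ismissing.(x)

# Stand-in for a fitted model's predictions on the retained rows only
# (an arbitrary linear function, purely for illustration).
predict_complete(xs) = 2.0 .* xs .+ 1.0
yhat_complete = predict_complete(Float64.(x[mask]))

# Scatter the predictions back to the shape of the original table,
# leaving missing where the input row was dropped.
yhat = Vector{Union{Missing, Float64}}(missing, length(x))
yhat[mask] = yhat_complete
```

Without `mask`, `yhat_complete` has three entries and there is no way to tell which of the four original rows they belong to.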

A related issue is that missing values in the output are now also introduced by terms themselves, because of the lead/lag support.

So my current thinking is that we ought to give up on trying to never generate missing values in the output (consistent with #153). Instead, we would set rows to missing wherever a missing value is encountered in any column of the table, and provide some kind of functionality to compute the missings mask if necessary so that consumers can decide what to do. That leaves the consumer with a bit more fiddly book-keeping, and it also means that the terms are not fully stand-alone, but I'm not sure I see an alternative at this point (except for some kind of extremely lazy architecture where the terms actually hold views of the underlying table or something like that; I think @nalimilan was talking about this at ZiF...).
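A rough sketch of what "set rows to missing and expose a mask" could look like on the output side, using only Base Julia. The matrix and the name `rowmissing` are hypothetical and this is not an existing StatsModels API; it only illustrates the book-keeping a consumer would take on.

```julia
# Hypothetical model matrix with scattered missings
# (e.g. coming from the table itself or from a lag term).
X = Union{Missing, Float64}[1.0 0.5;
                            1.0 missing;
                            missing 1.5;
                            1.0 2.0]

# Mask of rows that contain at least one missing value.
rowmissing = vec(any(ismissing, X; dims=2))

# Set whole rows to missing, leaving the row count unchanged.
X[rowmissing, :] .= missing

# A consumer that wants complete cases drops those rows and narrows the eltype.
Xcomplete = Float64.(X[.!rowmissing, :])
```

Because `X` keeps one row per row of the input table, `rowmissing` is all a consumer needs to map results back to the original rows.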

palday · 2020-12-18T16:34:27Z

This would actually fix a number of current issues with handling missings, especially when expanding categorical variables into contrasts.

kleinschmidt · 2021-01-20T20:38:31Z

Just as a way of organizing thoughts about how to approach this: there are (at
least) two approaches to dealing with rows.

1. modelcols generates each column independently, possibly generating missings
   when they're encountered in the underlying table. The resulting Arrays will
   have a "swiss cheese" structure where there are "holes" (missings)
   throughout, but other values in the same row might still be present. This is
   maybe more straightforward since each term is handled independently, but it
   could mess up stuff like schema extraction/contrast coding (if, e.g., there's
   a categorical variable with a level that is only present in rows where
   another variable is missing and hence may never appear in the generated data;
   IIRC we have tests for exactly this case...).
2. modelcols attempts to avoid rows in the data where there is at least one
   missing, setting the entire row to missing or omitting it altogether. This
   avoids the problem of possibly pathological schemas but it also closes the
   door on e.g. creating tools to do missing value imputation on the output. It
   might also mess with uses of lead/lag where you want to lead/lag based on
   the original rows, not the complete rows.
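
Both concerns can be seen in a toy example. This is a sketch in plain Julia with made-up data; `lag1` is a hand-written one-step lag for illustration, not the StatsModels lag term.

```julia
# Hypothetical data: level "c" of g occurs only in a row where x is missing.
x = [1.0, 2.0, missing, 4.0]
g = ["a", "b", "c", "a"]

complete = .!ismissing.(x)

# Levels seen when each column is handled independently vs. in complete rows only.
levels_all      = sort(unique(g))
levels_complete = sort(unique(g[complete]))

# Simple one-step lag written by hand (illustrative only).
lag1(v) = [missing; v[1:end-1]]

lag_original = lag1(x)[complete]   # lag over the original rows, then drop rows
lag_complete = lag1(x[complete])   # drop rows first, then lag over what remains
```

Here `levels_all` contains `"c"` but `levels_complete` does not, so a schema extracted column by column would code a level that never appears in the complete rows. The two lag results also differ: for the last row, `lag_original` is missing (the previous original row had no `x`), while `lag_complete` is `2.0`.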

So it might be necessary to be able to switch between these modes somehow; maybe
adding a completecases flag on the schema and modelcols functions? Or at
the very least provide a boolean mask of complete cases (or a convenient way to
create one).
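
As a sketch of what such a switch might look like from the caller's side: the function `columns_with_mask` and its `completecases` keyword below are hypothetical, written over a plain NamedTuple of columns, and do not correspond to any existing StatsModels function or keyword.

```julia
# Hypothetical stand-in for a modelcols-like step. `completecases` is the flag
# floated above, not an existing StatsModels keyword.
function columns_with_mask(data::NamedTuple; completecases::Bool=false)
    # true for rows where no column has a missing value
    complete = trues(length(first(data)))
    for col in data
        complete .&= .!ismissing.(col)
    end
    # Either keep the "swiss cheese" columns or restrict to complete rows.
    cols = completecases ? map(col -> col[complete], data) : data
    return cols, complete
end

data = (x = [1.0, missing, 3.0], z = [missing, 2.0, 6.0])

cols_raw, complete = columns_with_mask(data)
cols_cc, _ = columns_with_mask(data; completecases=true)
```

Returning the mask in both modes means a consumer can start from the raw columns (e.g. for imputation) and still recover the complete-case rows later.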

kleinschmidt added this to the 1.0 milestone Dec 18, 2020

kleinschmidt mentioned this issue Oct 5, 2022

lift missings in response, fitted, residuals? JuliaStats/MixedModels.jl#650

Open

nalimilan mentioned this issue Aug 24, 2023

Propagate missing values from data when returning vectors JuliaStats/GLM.jl#546

Open
