Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in handling of missing for functions in @formula call #366

Open
pdeffebach opened this issue Mar 31, 2020 · 12 comments
Open

Bug in handling of missing for functions in @formula call #366

pdeffebach opened this issue Mar 31, 2020 · 12 comments

Comments

@pdeffebach
Copy link
Contributor

julia> using DataFrames, GLM

julia> df = DataFrame(y = rand(100), x = rand(100));

julia> lm(@formula(y ~ x + lag(x)), df)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Float64

Constructing a lagged variable via df.lag_wage = lag(df.wage) and then running a regression works fine.

It seems like GLM (or StatsModels) has not caught up to the new ability to specify transformations in the @formula call.

@kleinschmidt
Copy link
Member

yes, that's because the missing is generated by modelcols, which happens after missings are dropped from the underlying table.

@pdeffebach
Copy link
Contributor Author

This would imply we need another check to drop missings, right?

@kleinschmidt
Copy link
Member

yes, and this is one of the things that's held up a revamp of the model fitting API in StatsModels.jl (e.g. why we still have ModelFrame/ModelMatrix).

one possiblity that's been discussed is to just not worry about dropping missings on the input side, and leave it up to the consumer to handle missings (e.g., allow StatsModels Terms to generate modelcols with missings). Then you'd just need one missing removal pass.

@kleinschmidt
Copy link
Member

The underlying problem is that some terms can introduce missings so you need to do at least one pass after generating the model cols. it seems kind of wasteful to generate a full matrix that's potentially Union{Missing,T} and then require consumers to convert that to a T. There's also the problem of keeping track of which rows of your input table the predictions/fitted values correspond to. I think that's the deeper issue. Right now that's handled by ModelFrame (there's a mask vector that tells you which rows of the input table made it into the model matrix).

@pdeffebach
Copy link
Contributor Author

one possiblity that's been discussed is to just not worry about dropping missings on the input side, and leave it up to the consumer to handle missings (e.g., allow StatsModels Terms to generate modelcols with missings).

I'm generally empathetic to having the user (or package like GLM) handle missings on their own. But StatsModels dropping missings for everyone that uses it seems like a first-best outcome. There is just too much missing data out there and having each package handle it differently would be really hard. Centralization is good.

@pdeffebach
Copy link
Contributor Author

Is there any update on this? How this been fixed upstream?

@kleinschmidt
Copy link
Member

I'm afraid not. It's related to JuliaStats/StatsModels.jl#153 and neither I nor @palday have had much extra bandwidth for that recently.

@kleinschmidt
Copy link
Member

But if you wanted to make a PR to StatsModels with a proposal for how and where to handle the dropping of missing values created by modelcols that would be a huge help and a good start!

@pdeffebach
Copy link
Contributor Author

@kleinschmidt I ran into this again today.

Are some of the internal fixes to StatsModels going to be able to resolve this? I know there is work to allow more customized handling of this stuff.

@palday
Copy link
Member

palday commented Nov 18, 2021

A temporary kludge would be dropping missings from the resultant X in the lm(::FormulaTerm,...) methods before LinearModel(X, ....) is called. That's something that should be a straightforward PR against GLM.jl.

@pdeffebach
Copy link
Contributor Author

Yeah, that could work. But then we have to allocate that matrix twice. Would be nice to handle this upstream.

@i-thot-u-wanted2dance
Copy link

Is there just no plan to fix this? The inability to use lag terms in formulas is extremely damaging to the usefulness of this package.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants