Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow missings in the modelmatrix, necessary to use lead/lag in @formula #109

Merged
merged 14 commits into from
Feb 26, 2021

Conversation

greimel
Copy link
Contributor

@greimel greimel commented Jul 10, 2020

With lead/lag in a @formula the model matrix contains missings. I adjusted the code so that the output of modelmatrix is checked for missings and and observations with missings are removed.

MWE (can be added as test)

using DataFrames, FixedEffectModels

df = DataFrame(y = randn(10), x = randn(10))
f0 = @formula(y ~ x + lag(x, 1))
reg(df, f0) # errors without this PR

This becomes more useful when defining groupedlag for panel data, as I did here: JuliaStats/StatsModels.jl#182

This is an unsolved issue also for GLM.jl: JuliaStats/GLM.jl#366

@matthieugomez
Copy link
Member

Arg, thanks for spotting the issue. I remember raising this issue a while ago (https://discourse.julialang.org/t/psa-breaking-changes-in-statsmodels-v0-6-0-terms-2-0-son-of-terms/25626/4?u=matthieu).
The one thing I do not like with your solution is that it requires to create temporary matrices, which can be very wasteful for big datasets. I think I would prefer to wait until there is a better way to handle this in StatsModels.

@matthieugomez
Copy link
Member

In the meantime, it would be awesome if you could do a pull request that returns a better error explaining the problem.

@greimel
Copy link
Contributor Author

greimel commented Jul 13, 2020

EDIT: I see now, that my point was wrong. One can first filter by complete cases of weights and cluster and pass a view to apply_schema.

@greimel
Copy link
Contributor Author

greimel commented Jul 13, 2020

Wouldn't you think that this is a better temporary solution than a better error message?

It would be nice to be able to use lags with your package.

@matthieugomez
Copy link
Member

I don't think it is worth it if it makes all regressions that do not use lag slower (or unfeasible). Moreover, lag does not even work with timed variable, or within groups (for now), so I'm not sure it's that useful for the kind of things people do with the package.

@greimel
Copy link
Contributor Author

greimel commented Jul 13, 2020

I added a commit so that regressions without lags should not be affected.

It's true that the lead and lag functions in StatsModels are not very powerful yet. I extended the lag function for groups in JuliaStats/StatsModels.jl#182, where it was suggested to put it in a separate package for now: https://github.com/greimel/GroupedTemporalTerms.jl

Does this address all of your concerns, or are there more?

src/fit.jl Outdated
esample2 = trues(size(y))

if size(Xexo, 2) > 0
esample2 .&= completecases(DataFrame(Xexo))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to do

for c in eachcol(X)
          esample2 .&= .!ismissing.(c)
 end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in new commit

src/fit.jl Outdated
all(isfinite, Xexo) || throw("Some observations for the exogeneous variables are infinite")

Xexo = modelmatrix(formula_schema, subdf)
esample2 = trues(size(y))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add a comment to the issue number

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in new commit, please let me know if the comments are overkill now

@matthieugomez
Copy link
Member

If you want it merged, could you answer the comments and add tests?

@greimel greimel marked this pull request as draft July 15, 2020 10:53
@greimel
Copy link
Contributor Author

greimel commented Jul 15, 2020

This version passes a test comparing hand-lagged to auto-lagged regression (with fixed effects, but no weights, no clustering, no iv) on private data.

I will push a comprehensive test soon.

Would you accept a test dependency on https://github.com/greimel/GroupedTemporalTerms, or should I use non-grouped lags?

@greimel
Copy link
Contributor Author

greimel commented Jul 15, 2020

I added tests, tests pass locally

This is ready for review. If desired I can force push three logical commits. I needed this for debugging.

@greimel greimel marked this pull request as ready for review July 15, 2020 13:21
@greimel
Copy link
Contributor Author

greimel commented Jul 15, 2020

I realized that I missed one case. The case where the df contains missings and the formula produces missings cannot be handled yet. This because the missings in the df are removed before the the lag is applied.

I made the reg function throw an ArgumentError in that case.

@greimel
Copy link
Contributor Author

greimel commented Jul 20, 2020

case where the df contains missings and the formula produces missings cannot be handled yet

I actually need this case for my research. So I am gonna try to handle this case as well. You might want to wait with merging this until then if you wish.

I am about to push some commits that make the analogous changes to the partial_out.

@eloualiche eloualiche merged commit d7514eb into FixedEffects:master Feb 26, 2021
@greimel greimel deleted the missings branch February 26, 2021 09:00
@greimel
Copy link
Contributor Author

greimel commented Feb 26, 2021

Thanks!

@greimel greimel restored the missings branch March 4, 2021 09:55
eloualiche pushed a commit that referenced this pull request Mar 11, 2021
@eloualiche
Copy link
Member

Reverted this for the time being out of concern with memory allocation. Sorry for prematurely approving the PR.

greimel added a commit to greimel/FixedEffectModels.jl that referenced this pull request May 12, 2021
greimel pushed a commit to greimel/FixedEffectModels.jl that referenced this pull request May 12, 2021
greimel added a commit to greimel/FixedEffectModels.jl that referenced this pull request May 12, 2021
greimel added a commit to greimel/FixedEffectModels.jl that referenced this pull request Jun 29, 2022
… `lead`/`lag` in `@formula` (FixedEffects#109)""

This reverts commit f68e456.
merged
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants