Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inference using fixest's aggregate command #33

Open
jtorcasso opened this issue Apr 25, 2023 · 7 comments
Open

Inference using fixest's aggregate command #33

jtorcasso opened this issue Apr 25, 2023 · 7 comments

Comments

@jtorcasso
Copy link

Fixest's aggregate command may be an efficient alternative for inference in cases when marginaleffects takes a long time. Consider the MWE below:

library(etwfe)

data("mpdta", package="did")

mod = etwfe(
    fml = lemp ~ lpop,
    tvar  = year,
    gvar = first.treat,
    data = mpdta,
    vcov = ~countyreal
)

emfx(mod)

aggregate(mod, c("ATT"="^.Dtreat:first.treat::[0-9]{4}:year::[0-9]{4}$"))

emfx returns an estimate of -0.05062703 with std. error of 0.01249979, and aggregate returns the same exact values in this case.

@grantmcdermott
Copy link
Owner

This is great, thanks Jake.

The problem with aggregate is that it doesn't necessarily work well for non-linear families.

library(etwfe)

data("mpdta", package="did")
mpdta$emp = exp(mpdta$lemp)

pmod = etwfe(
    fml = emp ~ lpop,
    tvar  = year,
    gvar = first.treat,
    data = mpdta,
    vcov = ~countyreal,
    family = "poisson"
)
#> The variables '.Dtreat:first.treat::2006:year::2004', '.Dtreat:first.treat::2006:year::2005' and eight others have been removed because of collinearity (see $collin.var).

emfx(pmod)
#> 
#>     Term                 Contrast .Dtreat Estimate Std. Error     z Pr(>|z|)
#>  .Dtreat mean(TRUE) - mean(FALSE)    TRUE    -28.6       18.4 -1.55     0.12
#>  2.5 % 97.5 %
#>  -64.6   7.49
#> 
#> Columns: term, contrast, .Dtreat, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo

aggregate(pmod, c("ATT"="^.Dtreat:first.treat::[0-9]{4}:year::[0-9]{4}$"))
#>        Estimate Std. Error   t value     Pr(>|t|)
#> ATT -0.05247523 0.01206096 -4.350834 1.356206e-05

Created on 2023-04-26 with reprex v2.0.2

But it's definitely an attractive option for big linear models. Let me think about some internal ode logic.

@s3alfisc
Copy link

s3alfisc commented Jun 26, 2023

Hi both - one other big advantage of moving to fixest::aggregate() is that it will make it very straightforward for me to implement wild cluster bootstrap support, as I basically copied it into boot_aggregate to support fixest::sunab() =)

@s3alfisc
Copy link

s3alfisc commented Jul 8, 2023

Hi @jtorcasso, here's a first attempt of wild cluster bootstrap and etwfe interoperatibility via fwildclusterboot::boot_aggregate(), using @kylebutts fixest clone that introduces the sparse_model_matrix method. It's very buggy, so using other arguments of boot_aggregate than used in the example / non-fixest objects might fail.

library(devtools)
install_github("https://github.com/s3alfisc/fwildclusterboot/tree/etwfe-support")
# this should install kyle's fork of fixest, if not, do it manually
#install_github("https://github.com/kylebutts/fixest/tree/sparse-matrix")

library(etwfe)
library(fwildclusterboot)
data("mpdta", package="did")

mod = etwfe(
  fml = lemp ~ lpop,
  tvar  = year,
  gvar = first.treat,
  data = mpdta,
  vcov = ~countyreal, 
  ssc = fixest::ssc(adj = FALSE, cluster.adj = FALSE)
)

emfx(mod)
# Term                 Contrast .Dtreat Estimate Std. Error     z Pr(>|z|)    S  2.5 %
# .Dtreat mean(TRUE) - mean(FALSE)    TRUE  -0.0506     0.0124 -4.08   <0.001 14.4 -0.075
# 97.5 %
# -0.0263


aggregate(
  mod, c("ATT"="^.Dtreat:first.treat::[0-9]{4}:year::[0-9]{4}$")
)
# Estimate Std. Error   t value     Pr(>|t|)
# ATT -0.05062703  0.0124121 -4.078845 5.267553e-05


boot_aggregate(
  mod,
  B = 999, 
  agg = c("ATT"="^.Dtreat:first.treat::[0-9]{4}:year::[0-9]{4}$"), 
  clustid = ~countyreal, 
  ssc = boot_ssc(adj = FALSE, cluster.adj = FALSE)
)
# Run the wild bootstrap: this might take some time...(but 
# hopefully not too much time =) ). 
# |======================================================| 100%     
#        Estimate   t value    Pr(>|t|)     [0.025%     0.975%]
# [1,] -0.05062703 -4.078845 0.001001001 -0.07420141 -0.02634166
# Warning message:
#   Matrix inversion failure: Using a generalized inverse instead. Check the produced
# t-statistic, does it match the one of your regression package (under the same
# small sample correction)? If yes, this is likely not something to worry about.

Note that the non-bootstrapped t-statistics from fwildclusterboot and fixest match exactly, as it should be =) Also, note that using solve() fails, but using a generalized inverse seems to get the job done.

Best, Alex

@poetlarsen
Copy link

Hi, I am following up on this thread to see if there were any updates or recommendations to aggregate estimates for event studies more efficiently. I'm working with several datasets, one of which is nearly 50 million observations and 35 periods, and the current marginaleffects setup with emfx takes so long its impossible to generate meaningful summary results. (This is driven by the SE estimation). I need SEs for plots, so turning off vcov isn't really an option. Any help or suggestions would be greatly appreciated!

@kylebutts
Copy link

@poetlarsen Bootstrap is your friend on this one with emfx(mod, vcov = FALSE)

@s3alfisc
Copy link

Or just using "aggregate" as described above should also work, unless you run a Poisson regression?

@grantmcdermott
Copy link
Owner

grantmcdermott commented Feb 26, 2024

+1 to @kylebutts and @s3alfisc's comments.

OTOH—and apart from the non-linear family issue—the other tricky thing with the aggregate.fixest approach is that there isn't a straightforward way to recover an event study path using the function's regex logic. The issue being that regex doesn't understand ordering in a way that allows it to easily group relative timing among the relevant cohorts. At least, not with the the standard etwfe specification, since we don't specify relative timing up front as part of the model formula. (If anyone has an idea of how to get around this, I'm all ears!)

Summarising, I think we can use aggregate.fixest to quickly recover the "simple" ATT, as well as the "group" and "calendar" effects. But not the "event" type. Here's some sample code that generalizes @jtorcasso's original example that should work on any linear model, regardless of the input data and variable specifics.

## estimate your model
mod = etwfe(...)

## Simple ATT

# emfx(mod) ## for comparison
aggregate(mod, c("ATT" = "^\\.Dtreat(?:(?!_dm$).)*$"))

# For more complex ATTs we'll need the input gvar and tvar

gvar = attr(mod, "etwfe")[["gvar"]]
tvar = attr(mod, "etwfe")[["tvar"]]

## Group ATTs

# emfx(mod, "group") ## for comparison
aggregate(
  mod,
  paste0("(", gvar, "::[[:digit:]]+)(?:(?!_dm$).)*$")
)

## Calendar ATTs

# emfx(mod, "calendar") ## for comparison
aggregate(
  mod,
  paste0("(", tvar, "::[[:digit:]]+)((?:(?!_dm$).)*$)") 
)


## Event ATTs (??)

# emfx(mod, "event")) ## For comparison
## ??
# aggregate(
#   mod,
#   paste0("(", gvar, ".*", tvar, ")((?:(?!_dm$).)*$)")
# )

Considering all of this, @poetlarsen it might be worth exploring whether another of the "modern" estimators isn't better suited to your use case. The fastest option for that many observations is almost certainly going to be sunab, although I suspect that did2s will give a good account itself too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants