Inference using fixest's aggregate command #33

jtorcasso · 2023-04-25T22:38:13Z

Fixest's aggregate command may be an efficient alternative for inference in cases when marginaleffects takes a long time. Consider the MWE below:

library(etwfe)

data("mpdta", package="did")

mod = etwfe(
    fml = lemp ~ lpop,
    tvar  = year,
    gvar = first.treat,
    data = mpdta,
    vcov = ~countyreal
)

emfx(mod)

aggregate(mod, c("ATT"="^.Dtreat:first.treat::[0-9]{4}:year::[0-9]{4}$"))

emfx returns an estimate of -0.05062703 with std. error of 0.01249979, and aggregate returns the same exact values in this case.

The text was updated successfully, but these errors were encountered:

grantmcdermott · 2023-04-26T15:43:32Z

This is great, thanks Jake.

The problem with aggregate is that it doesn't necessarily work well for non-linear families.

library(etwfe)

data("mpdta", package="did")
mpdta$emp = exp(mpdta$lemp)

pmod = etwfe(
    fml = emp ~ lpop,
    tvar  = year,
    gvar = first.treat,
    data = mpdta,
    vcov = ~countyreal,
    family = "poisson"
)
#> The variables '.Dtreat:first.treat::2006:year::2004', '.Dtreat:first.treat::2006:year::2005' and eight others have been removed because of collinearity (see $collin.var).

emfx(pmod)
#> 
#>     Term                 Contrast .Dtreat Estimate Std. Error     z Pr(>|z|)
#>  .Dtreat mean(TRUE) - mean(FALSE)    TRUE    -28.6       18.4 -1.55     0.12
#>  2.5 % 97.5 %
#>  -64.6   7.49
#> 
#> Columns: term, contrast, .Dtreat, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo

aggregate(pmod, c("ATT"="^.Dtreat:first.treat::[0-9]{4}:year::[0-9]{4}$"))
#>        Estimate Std. Error   t value     Pr(>|t|)
#> ATT -0.05247523 0.01206096 -4.350834 1.356206e-05

^{Created on 2023-04-26 with reprex v2.0.2}

But it's definitely an attractive option for big linear models. Let me think about some internal ode logic.

s3alfisc · 2023-06-26T22:15:45Z

Hi both - one other big advantage of moving to fixest::aggregate() is that it will make it very straightforward for me to implement wild cluster bootstrap support, as I basically copied it into boot_aggregate to support fixest::sunab() =)

s3alfisc · 2023-07-08T17:52:04Z

Hi @jtorcasso, here's a first attempt of wild cluster bootstrap and etwfe interoperatibility via fwildclusterboot::boot_aggregate(), using @kylebutts fixest clone that introduces the sparse_model_matrix method. It's very buggy, so using other arguments of boot_aggregate than used in the example / non-fixest objects might fail.

library(devtools)
install_github("https://github.com/s3alfisc/fwildclusterboot/tree/etwfe-support")
# this should install kyle's fork of fixest, if not, do it manually
#install_github("https://github.com/kylebutts/fixest/tree/sparse-matrix")

library(etwfe)
library(fwildclusterboot)
data("mpdta", package="did")

mod = etwfe(
  fml = lemp ~ lpop,
  tvar  = year,
  gvar = first.treat,
  data = mpdta,
  vcov = ~countyreal, 
  ssc = fixest::ssc(adj = FALSE, cluster.adj = FALSE)
)

emfx(mod)
# Term                 Contrast .Dtreat Estimate Std. Error     z Pr(>|z|)    S  2.5 %
# .Dtreat mean(TRUE) - mean(FALSE)    TRUE  -0.0506     0.0124 -4.08   <0.001 14.4 -0.075
# 97.5 %
# -0.0263


aggregate(
  mod, c("ATT"="^.Dtreat:first.treat::[0-9]{4}:year::[0-9]{4}$")
)
# Estimate Std. Error   t value     Pr(>|t|)
# ATT -0.05062703  0.0124121 -4.078845 5.267553e-05


boot_aggregate(
  mod,
  B = 999, 
  agg = c("ATT"="^.Dtreat:first.treat::[0-9]{4}:year::[0-9]{4}$"), 
  clustid = ~countyreal, 
  ssc = boot_ssc(adj = FALSE, cluster.adj = FALSE)
)
# Run the wild bootstrap: this might take some time...(but 
# hopefully not too much time =) ). 
# |======================================================| 100%     
#        Estimate   t value    Pr(>|t|)     [0.025%     0.975%]
# [1,] -0.05062703 -4.078845 0.001001001 -0.07420141 -0.02634166
# Warning message:
#   Matrix inversion failure: Using a generalized inverse instead. Check the produced
# t-statistic, does it match the one of your regression package (under the same
# small sample correction)? If yes, this is likely not something to worry about.

Note that the non-bootstrapped t-statistics from fwildclusterboot and fixest match exactly, as it should be =) Also, note that using solve() fails, but using a generalized inverse seems to get the job done.

Best, Alex

poetlarsen · 2024-02-23T02:29:09Z

Hi, I am following up on this thread to see if there were any updates or recommendations to aggregate estimates for event studies more efficiently. I'm working with several datasets, one of which is nearly 50 million observations and 35 periods, and the current marginaleffects setup with emfx takes so long its impossible to generate meaningful summary results. (This is driven by the SE estimation). I need SEs for plots, so turning off vcov isn't really an option. Any help or suggestions would be greatly appreciated!

kylebutts · 2024-02-23T14:09:53Z

@poetlarsen Bootstrap is your friend on this one with emfx(mod, vcov = FALSE)

s3alfisc · 2024-02-23T19:13:57Z

Or just using "aggregate" as described above should also work, unless you run a Poisson regression?

grantmcdermott · 2024-02-26T18:32:54Z

+1 to @kylebutts and @s3alfisc's comments.

OTOH—and apart from the non-linear family issue—the other tricky thing with the aggregate.fixest approach is that there isn't a straightforward way to recover an event study path using the function's regex logic. The issue being that regex doesn't understand ordering in a way that allows it to easily group relative timing among the relevant cohorts. At least, not with the the standard etwfe specification, since we don't specify relative timing up front as part of the model formula. (If anyone has an idea of how to get around this, I'm all ears!)

Summarising, I think we can use aggregate.fixest to quickly recover the "simple" ATT, as well as the "group" and "calendar" effects. But not the "event" type. Here's some sample code that generalizes @jtorcasso's original example that should work on any linear model, regardless of the input data and variable specifics.

## estimate your model
mod = etwfe(...)

## Simple ATT

# emfx(mod) ## for comparison
aggregate(mod, c("ATT" = "^\\.Dtreat(?:(?!_dm$).)*$"))

# For more complex ATTs we'll need the input gvar and tvar

gvar = attr(mod, "etwfe")[["gvar"]]
tvar = attr(mod, "etwfe")[["tvar"]]

## Group ATTs

# emfx(mod, "group") ## for comparison
aggregate(
  mod,
  paste0("(", gvar, "::[[:digit:]]+)(?:(?!_dm$).)*$")
)

## Calendar ATTs

# emfx(mod, "calendar") ## for comparison
aggregate(
  mod,
  paste0("(", tvar, "::[[:digit:]]+)((?:(?!_dm$).)*$)") 
)


## Event ATTs (??)

# emfx(mod, "event")) ## For comparison
## ??
# aggregate(
#   mod,
#   paste0("(", gvar, ".*", tvar, ")((?:(?!_dm$).)*$)")
# )

Considering all of this, @poetlarsen it might be worth exploring whether another of the "modern" estimators isn't better suited to your use case. The fastest option for that many observations is almost certainly going to be sunab, although I suspect that did2s will give a good account itself too.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inference using fixest's aggregate command #33

Inference using fixest's aggregate command #33

jtorcasso commented Apr 25, 2023

grantmcdermott commented Apr 26, 2023

s3alfisc commented Jun 26, 2023 •

edited

Loading

s3alfisc commented Jul 8, 2023 •

edited

Loading

poetlarsen commented Feb 23, 2024

kylebutts commented Feb 23, 2024

s3alfisc commented Feb 23, 2024

grantmcdermott commented Feb 26, 2024 •

edited

Loading

Inference using fixest's aggregate command #33

Inference using fixest's aggregate command #33

Comments

jtorcasso commented Apr 25, 2023

grantmcdermott commented Apr 26, 2023

s3alfisc commented Jun 26, 2023 • edited Loading

s3alfisc commented Jul 8, 2023 • edited Loading

poetlarsen commented Feb 23, 2024

kylebutts commented Feb 23, 2024

s3alfisc commented Feb 23, 2024

grantmcdermott commented Feb 26, 2024 • edited Loading

s3alfisc commented Jun 26, 2023 •

edited

Loading

s3alfisc commented Jul 8, 2023 •

edited

Loading

grantmcdermott commented Feb 26, 2024 •

edited

Loading