Commit 686db71
vignette
grantmcdermott committed Dec 16, 2024
1 parent 8cfe5aa commit 686db71
Showing 1 changed file with 16 additions and 14 deletions.
30 changes: 16 additions & 14 deletions vignettes/etwfe.Rmd
@@ -22,6 +22,7 @@ knitr::opts_chunk$set(
options(width = 100)
options(rmarkdown.html_vignette.check_title = FALSE)
options(marginaleffects_safe = FALSE)
config_modelsummary(startup_message = FALSE)
fixest::setFixest_notes(FALSE)
```
@@ -449,27 +450,28 @@ For its part, the second `emfx()` stage also tends to be pretty performant. If
your data has less than 100k rows, it's unlikely that you'll have to wait more
than a few seconds to obtain results. However, `emfx`'s computation time does
tend to scale non-linearly with the size of the original data, as well as the
number of interactions from the underlying `etwfe` model object. Without getting
number of interactions from the underlying `etwfe` model object.^[Without getting
too deep into the weeds, we are relying on the numerical delta method from the
(excellent) **marginaleffects** package under the hood to recover the ATTs
of interest. This method requires estimating two prediction models for *each*
coefficient in the model and then computing their standard errors. So it's a
potentially expensive operation that can push the computation time for large
datasets (> 1m rows) up to several minutes or longer.
datasets (> 1m rows) up to several minutes or longer.]
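
To get a feel for this cost on your own data, a quick timing check is
straightforward. The snippet below is only a sketch (not run here): it reuses
the `mod` object from the earlier examples, previews the `vcov = FALSE`
shortcut discussed just below, and the actual numbers will depend on your
machine and data.

```{r, eval = FALSE}
# Rough timing sketch: how much of the emfx() runtime is spent on the
# delta-method standard errors?
system.time(emfx(mod, type = "event"))               # point estimates + SEs
system.time(emfx(mod, type = "event", vcov = FALSE)) # point estimates only
```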

Fortunately, there are two complementary strategies that you can use to speed
things up. The first is to turn off the most expensive part of the whole
procedure---standard error calculation---by calling `emfx(..., vcov = FALSE)`.
Doing so should bring the estimation time back down to a few seconds or less,
This should bring the estimation time back down to a few seconds or less,
even for datasets in excess of a million rows. Of course, the loss of standard
errors might not be an acceptable trade-off for projects where statistical
inference is critical. But the good news is this first strategy can still be
combined our second strategy: it turns out that collapsing the data by groups
prior to estimating the marginal effects can yield substantial speed gains on
its own. Users can do this by invoking the `emfx(..., collapse = TRUE)`
inference is critical. But the good news is that we can combine turning off
standard errors with a second strategy. Specifically, it turns out that compressing
the data by groups prior to estimation can yield substantial speed gains on
its own; see Wong _et al._ ([2021](https://doi.org/10.48550/arXiv.2102.11297))
on this. Users can do this by invoking the `emfx(..., compress = TRUE)`
argument. While the effect here is not as dramatic as the first strategy,
compressing the data does have the virtue of retaining information about the
standard errors. The trade-off this time, however, is that collapsing our data
standard errors. The trade-off this time, however, is that compressing our data
does lead to a loss in accuracy for our estimated parameters. On the other hand,
testing suggests that this loss in accuracy tends to be relatively minor, with
results equivalent up to the 1st or 2nd significant decimal place (or even
@@ -480,10 +482,10 @@ about the estimation time for large datasets and models:

0. Estimate `mod = etwfe(...)` as per usual.
1. Run `emfx(mod, vcov = FALSE, ...)`.
2. Run `emfx(mod, vcov = FALSE, collapse = TRUE, ...)`.
2. Run `emfx(mod, vcov = FALSE, compress = TRUE, ...)`.
3. Compare the point estimates from steps 1 and 2. If they are similar
enough to your satisfaction, get the approximate standard errors by running
`emfx(mod, collapse = TRUE, ...)`.
`emfx(mod, compress = TRUE, ...)`.

It's a bit of performance art, since all of the examples in this vignette
complete very quickly anyway. But here is a reworking of our earlier event study
@@ -496,18 +498,18 @@ example to demonstrate this performance-conscious workflow.
```{r}
# Step 1
emfx(mod, type = "event", vcov = FALSE)
# Step 2
emfx(mod, type = "event", vcov = FALSE, collapse = TRUE)
emfx(mod, type = "event", vcov = FALSE, compress = TRUE)
# Step 3: Results from 1 and 2 are similar enough, so get approx. SEs
mod_es2 = emfx(mod, type = "event", collapse = TRUE)
mod_es_compressed = emfx(mod, type = "event", compress = TRUE)
```
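
If you'd rather not eyeball the printed output when doing the step 3
comparison, a quick programmatic check of the point estimates works too. The
snippet below is only a sketch: the object names and the tolerance are
illustrative choices (not part of the package API), and it assumes the usual
**marginaleffects** convention of storing point estimates in an `estimate`
column.

```{r, eval = FALSE}
# Not run: compare the step 1 and step 2 point estimates programmatically
es_fast  = emfx(mod, type = "event", vcov = FALSE)
es_compr = emfx(mod, type = "event", vcov = FALSE, compress = TRUE)
all.equal(es_fast$estimate, es_compr$estimate, tolerance = 0.01)
```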

To put a fine point on it, we can compare our original event study with the
collapsed estimates and see that the results are indeed very similar.
compressed estimates and see that the results are indeed very similar.

```{r}
modelsummary(
list("Original" = mod_es, "Collapsed" = mod_es2),
list("Original" = mod_es, "Compressed" = mod_es_compressed),
shape = term:event:statistic ~ model,
coef_rename = rename_fn,
gof_omit = "Adj|Within|IC|RMSE",