Speed up emfx #19

frederickluser · 2023-01-30T10:06:54Z

Hey Grant,

I added now the option collapse_data to emfx, which will significantly improve running time for large data sets.

If collapse_data = T, then the data will be collapsed using data.table before computing marginal effects and then weighted by group-cohort sizes in the marginaleffects command itself. The results are (almost) identical to the case without collapsing. If ivar is not NULL in etwfe, collapsing is not possible and a warning message appears.

I updated the tinytests and the checks produced no errors. Do you think this would be a sensible merge? You're welcome to request changes.

Best, Frederic

speeds up etfx considerably, while results remain the same

emfx becomes way faster for large data sets if we first collapse the data

grantmcdermott · 2023-01-31T06:20:30Z

Super, thanks Frederic. I've been too busy to give this my full attention, but I'm hoping to get around to it later this week. Speeding up the emfx calculations is something I'm definitely interested in adding. Two quick thoughts that come to mind (without have looked at the code!)

Does this aggregation trick work with nonlinear models (e.g., Poisson)? It would be good to check that we aren't smuggling in some linearity assumption.
We could think about adding some internal logic so that we automatically default to the aggregating shortcut for datasets larger than, say, 100k rows. Users should be notified ofc and could select out of this functionality. In this case, the default arg would be something like collapse = "auto".

Excited to take a proper look.

frederickluser · 2023-02-01T16:07:51Z

Dear Grant,
the aggregation trick works also with non-linear models, I tried a poisson.
I also added an collapse = "auto" option and tested it, but my fork is now at conflict with the marginaleffects changes.

Best, Frederic

grantmcdermott · 2023-02-01T16:11:25Z

@frederickluser We've got some conflicts that need to be resolved due to the compatibility updates with marginaleffects 0.9.0 (#20). Feel free to have a crack at these, otherwise I'll try as soon as I get a chance. Goal is to submit to CRAN by the end of the day (Pacific Time).

grantmcdermott · 2023-02-02T05:41:48Z

Hi @frederickluser I've resolved most of the conflicts, but there are bunch of test failures on the test_emfx_xvar.R file that I'm unable to resolve.

Consider the simple_known object and what I get for the (nominally equivalent) function call of emfx(x1, xvar = "xvar"). I'm focusing on the estimate values to abstract from some of the other things that need to be fixed (e.g., ordering by xvar and some minor printing changes that stem from the marginaleffects update).

> simple_known[, c("estimate", "xvar")]
      estimate xvar
1 -0.004358929    1
2 -0.077394771    2
3 -0.121365807    3

> as.data.frame(emfx(x3, xvar = "xvar"))[, c("estimate", "xvar")] |> dplyr::arrange(xvar)
     estimate xvar
1 -0.01221530    1
2 -0.06043789    2
3 -0.09087307    3

Do you know why we're getting different estimates here? I'm using the same seed as the original file (set.seed(123)), so I don't think it can be that.

Can you please check against some known output on your side. I'm worried that the test results are changing through these different PRs...

frederickluser · 2023-02-02T08:41:08Z

Hey Grant. I messed up the check somehow, sorry.

The problem was that the collapse_data="auto" now switched to collapse_data=FALSE for this small sample data. But the simple_known was ran with collapse_data=T.
I checked now both cases and the results are the same as before. For the simplest solution for the test_emfx_xvar.R function, I just overwrote the default to collapse_data = T.

The difference you've shown before is just the marginal difference between collapsing and using the full data.
I hope that solves the problem.

grantmcdermott · 2023-02-02T15:16:54Z

Excellent. Thanks for fixing and for this nice pull request!

grantmcdermott · 2023-02-02T20:33:56Z

Heads-up that I've tweaked these functionality a bit (e.g. in ea3bb3e). Key user-level changes are:

The argument is now called collapse instead of collapse_data.
I've bumped the default "auto" = TRUE threshold up to 500k rows.

Frédéric Kluser added 2 commits January 25, 2023 18:15

call marginaleffects with collapsed data

b5672c1

speeds up etfx considerably, while results remain the same

implemented fast emfx

51406e5

emfx becomes way faster for large data sets if we first collapse the data

automate the collapsing

708ee70

Frédéric Kluser and others added 3 commits February 1, 2023 17:14

Update emfx.Rd

f64c2f1

Merge branch 'main' into main

d6ed676

Update collapse_data arg explanation

0e51a48

tests

534d26b

grantmcdermott merged commit d2110ce into grantmcdermott:main Feb 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up emfx #19

Speed up emfx #19

frederickluser commented Jan 30, 2023

grantmcdermott commented Jan 31, 2023

frederickluser commented Feb 1, 2023

grantmcdermott commented Feb 1, 2023

grantmcdermott commented Feb 2, 2023

frederickluser commented Feb 2, 2023

grantmcdermott commented Feb 2, 2023

grantmcdermott commented Feb 2, 2023

Speed up emfx #19

Speed up emfx #19

Conversation

frederickluser commented Jan 30, 2023

grantmcdermott commented Jan 31, 2023

frederickluser commented Feb 1, 2023

grantmcdermott commented Feb 1, 2023

grantmcdermott commented Feb 2, 2023

frederickluser commented Feb 2, 2023

grantmcdermott commented Feb 2, 2023

grantmcdermott commented Feb 2, 2023