Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up emfx #19

Merged
merged 7 commits into from
Feb 2, 2023
Merged

Speed up emfx #19

merged 7 commits into from
Feb 2, 2023

Conversation

frederickluser
Copy link
Contributor

Hey Grant,

I added now the option collapse_data to emfx, which will significantly improve running time for large data sets.

If collapse_data = T, then the data will be collapsed using data.table before computing marginal effects and then weighted by group-cohort sizes in the marginaleffects command itself. The results are (almost) identical to the case without collapsing. If ivar is not NULL in etwfe, collapsing is not possible and a warning message appears.

I updated the tinytests and the checks produced no errors. Do you think this would be a sensible merge? You're welcome to request changes.

Best, Frederic

Frédéric Kluser added 2 commits January 25, 2023 18:15
speeds up etfx considerably, while results remain the same
emfx becomes way faster for large data sets if we first collapse the data
@grantmcdermott
Copy link
Owner

Super, thanks Frederic. I've been too busy to give this my full attention, but I'm hoping to get around to it later this week. Speeding up the emfx calculations is something I'm definitely interested in adding. Two quick thoughts that come to mind (without have looked at the code!)

  1. Does this aggregation trick work with nonlinear models (e.g., Poisson)? It would be good to check that we aren't smuggling in some linearity assumption.
  2. We could think about adding some internal logic so that we automatically default to the aggregating shortcut for datasets larger than, say, 100k rows. Users should be notified ofc and could select out of this functionality. In this case, the default arg would be something like collapse = "auto".

Excited to take a proper look.

@frederickluser
Copy link
Contributor Author

Dear Grant,
the aggregation trick works also with non-linear models, I tried a poisson.
I also added an collapse = "auto" option and tested it, but my fork is now at conflict with the marginaleffects changes.

Best, Frederic

@grantmcdermott
Copy link
Owner

@frederickluser We've got some conflicts that need to be resolved due to the compatibility updates with marginaleffects 0.9.0 (#20). Feel free to have a crack at these, otherwise I'll try as soon as I get a chance. Goal is to submit to CRAN by the end of the day (Pacific Time).

@grantmcdermott
Copy link
Owner

Hi @frederickluser I've resolved most of the conflicts, but there are bunch of test failures on the test_emfx_xvar.R file that I'm unable to resolve.

Consider the simple_known object and what I get for the (nominally equivalent) function call of emfx(x1, xvar = "xvar"). I'm focusing on the estimate values to abstract from some of the other things that need to be fixed (e.g., ordering by xvar and some minor printing changes that stem from the marginaleffects update).

> simple_known[, c("estimate", "xvar")]
      estimate xvar
1 -0.004358929    1
2 -0.077394771    2
3 -0.121365807    3

> as.data.frame(emfx(x3, xvar = "xvar"))[, c("estimate", "xvar")] |> dplyr::arrange(xvar)
     estimate xvar
1 -0.01221530    1
2 -0.06043789    2
3 -0.09087307    3

Do you know why we're getting different estimates here? I'm using the same seed as the original file (set.seed(123)), so I don't think it can be that.

Can you please check against some known output on your side. I'm worried that the test results are changing through these different PRs...

@frederickluser
Copy link
Contributor Author

Hey Grant. I messed up the check somehow, sorry.

The problem was that the collapse_data="auto" now switched to collapse_data=FALSE for this small sample data. But the simple_known was ran with collapse_data=T.
I checked now both cases and the results are the same as before. For the simplest solution for the test_emfx_xvar.R function, I just overwrote the default to collapse_data = T.

The difference you've shown before is just the marginal difference between collapsing and using the full data.
I hope that solves the problem.

@grantmcdermott
Copy link
Owner

Excellent. Thanks for fixing and for this nice pull request!

@grantmcdermott grantmcdermott merged commit d2110ce into grantmcdermott:main Feb 2, 2023
@grantmcdermott
Copy link
Owner

Heads-up that I've tweaked these functionality a bit (e.g. in ea3bb3e). Key user-level changes are:

  • The argument is now called collapse instead of collapse_data.
  • I've bumped the default "auto" = TRUE threshold up to 500k rows.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants