Performance regression in 2.27.0 RC #2493

Closed · rok-cesnovar opened this issue May 20, 2021 · 58 comments · Fixed by stan-dev/stan#3043

Comments

@rok-cesnovar (Member) commented May 20, 2021

@wds15 reported a performance regression that we need to get to the bottom of: https://discourse.mc-stan.org/t/cmdstan-2-27-0-release-candidate/22602/2

Stan model and data: regression.zip

I can reproduce and will investigate and post any progress here.

The first instinct was that it's related to the range checks, but it seems that may not be the case.

@rok-cesnovar (Member Author)

Reverting stan-dev/stanc3#849 (which added the range checks) makes no difference in this case, so it seems to be something else.

@wds15 (Contributor) commented May 20, 2021

I added profiling to the stan program:

blrm_exnex-profile_stan.txt

Output:

> fit_26$time()$total
[1] 21.89892
> fit_27b$time()$total
[1] 25.5831
> fit_26$profiles()
[[1]]
       name   thread_id total_time forward_time reverse_time chain_stack
1   log-lik 0x10c17bdc0   14.21690     12.57950     1.637460   226782179
2     prior 0x10c17bdc0    1.70180      1.59811     0.103685    13408400
3 transform 0x10c17bdc0    1.75127      1.13223     0.619044    21120372
  no_chain_stack autodiff_calls no_autodiff_calls
1      102915172         167605                 1
2        5698570         167605                 1
3       10057320         167605              5001

> fit_27b$profiles()
[[1]]
       name   thread_id total_time forward_time reverse_time chain_stack
1   log-lik 0x10ed04dc0   16.56900     14.49760     2.071480   225845240
2     prior 0x10ed04dc0    1.96636      1.83771     0.128652    13353360
3 transform 0x10ed04dc0    2.40248      1.72354     0.678942     9014166
  no_chain_stack autodiff_calls no_autodiff_calls
1      102489998         166917                 1
2        5007510         166917                 1
3       24037776         166917              5001

Not sure... The problem seems to be smeared out somewhat.

Ah... I also disabled generated quantities.

@wds15 (Contributor) commented May 20, 2021

I just reran this with STAN_CPP_OPTIMS=TRUE for 2.26.0 and 2.27.0. No change:

> fit_26$time()$total
[1] 21.64465
> fit_27b$time()$total
[1] 25.10066
> fit_26$profiles()
[[1]]
       name   thread_id total_time forward_time reverse_time chain_stack
1   log-lik 0x109891dc0   13.10350     11.51520     1.588270   226782179
2     prior 0x109891dc0    1.69421      1.58724     0.106974    13408400
3 transform 0x109891dc0    1.64914      1.06284     0.586304    21120372
  no_chain_stack autodiff_calls no_autodiff_calls
1      102915172         167605                 1
2        5698570         167605                 1
3       10057320         167605              5001

> fit_27b$profiles()
[[1]]
       name   thread_id total_time forward_time reverse_time chain_stack
1   log-lik 0x111094dc0   15.68390     13.62370     2.060180   225845240
2     prior 0x111094dc0    1.96505      1.84542     0.119629    13353360
3 transform 0x111094dc0    2.29094      1.66794     0.622994     9014166
  no_chain_stack autodiff_calls no_autodiff_calls
1      102489998         166917                 1
2        5007510         166917                 1
3       24037776         166917              5001

> 

@wds15 (Contributor) commented May 20, 2021

Given that the slowdown is so smeared out, I thought that maybe a change in the compiler is to blame. The usual way of testing this does not work here. So I tried to use the stanc3 from 2.26.0 and then compile the generated C++ with 2.27.0, but that fails at the compilation stage.

@wds15 (Contributor) commented May 20, 2021

Some progress: what I can do is use cmdstan 2.26.0 and point it at the math library from 2.27.0. With this mixed setup I get:

> fit_mix$time()$total
[1] 19.24794
> fit_mix$profiles()
[[1]]
       name   thread_id total_time forward_time reverse_time chain_stack
1   log-lik 0x1110fbdc0   12.26030     10.31270     1.947580   226056690
2     prior 0x1110fbdc0    1.43655      1.33204     0.104508    13365760
3 transform 0x1110fbdc0    1.72902      1.15965     0.569375     9022590
  no_chain_stack autodiff_calls no_autodiff_calls
1      102585954         167072                 1
2        5012160         167072                 1
3       24060240         167072              5001

Since this is now even somewhat faster than before, I do think this suggests that the issue is upstream: either Stan or the stanc3 compiler has some problem causing the performance issue, I would think.

@rok-cesnovar (Member Author)

Yeah, one cause is stan-dev/stanc3#842, but it does not account for the whole regression.

@SteveBronder (Collaborator)

What in stan-dev/stanc3#842 would cause the regression? The local statement thing? That shouldn't have a big effect

@rok-cesnovar (Member Author) commented May 20, 2021

Scratch that, I had a bad testing setup when I tested that. Sorry for the false alarm.

It seems to be the stan-dev/stanc3#856 PR:

  • cmdstan on develop and stanc3 at git hash 036d9307055ba40ea5b2ca556f2d0014e3d8288d (the one prior to 856): I time the execution at around 21s

  • cmdstan on develop and stanc3 at git hash 282439ea73cd348b22b09be78bb78875b03a0e64 (856 merged): I time it at around 16s (binaries for that hash are being built here: https://github.com/stan-dev/stanc3/actions/runs/861509419)

In both cases the only flag was STAN_NO_RANGE_CHECKS; g++ 9.3.0 on Ubuntu 20.04.

@SteveBronder First we should see if anyone can reproduce my findings. But if we can, the question is: is that PR critical right now (for this release), or is it required for varmat and tuples? I don't know enough about that PR to revert it on master (too many conflicts).

@SteveBronder (Collaborator)

I'm running the below and actually seeing 2.27 is faster locally for me

https://discourse.mc-stan.org/t/cmdstan-2-27-0-release-candidate/22602/14?u=stevebronder

@SteveBronder (Collaborator)

Before we talk about reverting anything I'd like someone to run the scheme I have in the above link. It just seems very odd to have such large differences in timings. I have a semi-fancy machine, but nothing that is going to see a 20%+ timing difference. Maybe it's clang? I can try that out.

@wds15 (Contributor) commented May 20, 2021

Nope... I am on clang and @rok-cesnovar used g++ on Ubuntu

@rok-cesnovar (Member Author) commented May 20, 2021

And to avoid confusion, the model is in the top comment and has no profile statements.

Timing results are exactly the same with cmdstan or cmdstanr.

@rok-cesnovar (Member Author)

I, however, do not have -march=native on.

@wds15 (Contributor) commented May 20, 2021

Here is the time result:

time -p for run in {1..10}; do ./blrm_exnex-profile-26 sample num_warmup=10000 num_samples=10000 data file="./test_24/blrm_exnex-combo3.data.R" random seed=1234 init=0; done

 Elapsed Time: 19.25 seconds (Warm-up)
               19.463 seconds (Sampling)
               38.713 seconds (Total)

real 386.28
user 381.35
sys 2.29


time -p for run in {1..10}; do ./blrm_exnex-profile-27 sample num_warmup=10000 num_samples=10000 data file="./test_24/blrm_exnex-combo3.data.R" random seed=1234 init=0; done

 Elapsed Time: 22.888 seconds (Warm-up)
               22.985 seconds (Sampling)
               45.873 seconds (Total)

real 459.85
user 455.00
sys 2.86

@wds15 (Contributor) commented May 20, 2021

BTW... do I see this right, that the performance tests are not run on all PRs for stanc3? Maybe the performance test suite would have detected this issue already? It should have, but we can sort that out another day...

@SteveBronder (Collaborator)

@wds15 is blrm_exnex-profile-26 the one with profiling? I'm using the same model as the one you posted on discourse here (and the same model as @rok-cesnovar has in regression.zip in the top comment).

I updated my original comment with instructions on exactly what I'm running to reproduce my timings

https://discourse.mc-stan.org/t/cmdstan-2-27-0-release-candidate/22602/17?u=stevebronder

I just reran the above with clang-11 instead of gcc-10, and I'm getting that on average the old version is 0.6 seconds faster for each model than the RC (time results / 10 runs):

# Results over 10 runs
Old:
real 380.42
user 379.65
sys 0.59

New:
real 386.49
user 385.30
sys 0.90

@wds15 (Contributor) commented May 20, 2021

Are you on Intel or AMD, @SteveBronder?

Mine is an Intel.

@SteveBronder (Collaborator)

I'm on an AMD Ryzen Threadripper 2950X. It has a lot of L1 cache, but even then I don't think that would be enough to not notice a 20%+ performance difference.

I think we all need to be running the exact same code to make sure we are actually comparing the same thing. If we run it from cmdstanr, R could just suddenly decide to run garbage collection, which is going to add time. Timing things is hard! I don't see why we would complicate it more. Is there anything in the comment below that isn't clear on reproducing what I'm seeing?

https://discourse.mc-stan.org/t/cmdstan-2-27-0-release-candidate/22602/2?u=stevebronder

@wds15 (Contributor) commented May 20, 2021

What's now wrong with the timing above? The profiling stuff really does not matter here.

@rok-cesnovar what are you on? AMD or Intel?

@SteveBronder (Collaborator)

What's now wrong with the timing above? The profiling stuff really does not matter here.

It absolutely does! If we are not using the same code I don't see how we are going to compare timings. And I really don't see why we would care about how slow profiled code is; if someone is instrumenting their model, that's not performance-critical. Profiling also has extra overhead and causes extra indirection in the model.

@rok-cesnovar (Member Author)

Intel i7, but fairly old (5+ years). I will spin up an AWS instance tomorrow so I can share it with Steve if he is unable to reproduce.

I get the exact same timing results running cmdstan models directly in the terminal. Same regression. I omitted -march=native as I think it's not relevant here; vanilla is what most users will use.

And bisecting back in cmdstan and stanc3, I have run this at least 30-40 times today. It's evident when the regression is there.

@rok-cesnovar (Member Author) commented May 20, 2021

I am not using the model with profiling. I am using the one posted in the top comment with a fixed seed and 5000/5000 iterations.

It's easily reproducible in bash or cmdstanr.

@wds15 (Contributor) commented May 20, 2021

@SteveBronder can you run the model with profiling just once (whatever Stan version)? I have really seen the same patterns with and without profiling.

@rok-cesnovar great.

To me it sounds like a processor thing by now. If @SteveBronder can get access to an Intel machine then that's great.

@SteveBronder (Collaborator)

It's easily reproducible in bash or cmdstanr.

I've tried reproducing it on two separate compilers and totally removing and re-downloading both versions twice as well. I can try to write a script to reproduce what I did in my discourse comment and can try on my laptop (Intel).

I will spin up an AWS instance tomorrow so I can share it with Steve if he is unable to reproduce.

I don't think there's any need to go through all that effort; IMO I just want us to make sure we are all running the same thing before we talk about timings.

@wds15 (Contributor) commented May 20, 2021

Maybe this convinces you, @SteveBronder... here are results for the model from the zip file above (I downloaded it again to make sure I don't mess things up):

time -p for run in {1..10}; do ./blrm-exnex-26 sample num_warmup=10000 num_samples=10000 data file="./combo3.data.R" random seed=1234 init=0; done

 Elapsed Time: 18.803 seconds (Warm-up)
               19.753 seconds (Sampling)
               38.556 seconds (Total)

real 390.55
user 385.44
sys 2.34

time -p for run in {1..10}; do ./blrm-exnex-27 sample num_warmup=10000 num_samples=10000 data file="./combo3.data.R" random seed=1234 init=0; done

 Elapsed Time: 22.305 seconds (Warm-up)
               23.251 seconds (Sampling)
               45.556 seconds (Total)

real 461.09
user 455.40
sys 2.57

You can see that profiling is really not an issue here, and the performance regression is there.

[22:56:41][weberse2@C02XK2AGJHD2:~/.cmdstanr]$ cat cmdstan-2.26.1/make/local
CXXFLAGS+=-DSTAN_NO_RANGE_CHECKS -O3 -march=native -mtune=native
STAN_NO_RANGE_CHECKS=true
STAN_CPP_OPTIMS=true
[22:56:48][weberse2@C02XK2AGJHD2:~/.cmdstanr]$ cat cmdstan-2.27.0-rc1/make/local
CXXFLAGS+=-DSTAN_NO_RANGE_CHECKS -O3 -march=native -mtune=native
STAN_NO_RANGE_CHECKS=true
STAN_CPP_OPTIMS=true

@SteveBronder (Collaborator)

Okay, I wrote a script for this, and though it still doesn't show up on my desktop, I am seeing it on my Chromebook (Intel i7).

https://gist.github.com/SteveBronder/0bed3d2e04f92f62fd19cb0b691f11f8

*********************************
2.26 Times
real 188.44
user 187.46
sys 0.89
*********************************

*********************************
2.27 RC Times
real 217.91
user 216.80
sys 0.99
*********************************

If it's the deserializer PR that's causing it, I would bet it's the capacity checks in the deserializer that we didn't have in the reader. I think what I can do there is open up a PR that uses the STAN_NO_RANGE_CHECKS macro to turn off those checks.
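
For reference, here is a minimal sketch of the kind of compile-time guard being proposed. This is my own illustration with a hypothetical read_vector function, not the actual deserializer code; it only shows how a capacity check can be compiled out when STAN_NO_RANGE_CHECKS is defined.

#include <sstream>
#include <stdexcept>
#include <vector>

// Hypothetical example: read n values starting at pos, with the capacity
// check removed entirely when STAN_NO_RANGE_CHECKS is defined.
template <typename T>
std::vector<T> read_vector(const std::vector<T>& buffer, std::size_t pos, std::size_t n) {
#ifndef STAN_NO_RANGE_CHECKS
  if (pos + n > buffer.size()) {
    std::stringstream msg;
    msg << "read_vector: requested " << n << " values at position " << pos
        << " but the buffer only holds " << buffer.size();
    throw std::out_of_range(msg.str());
  }
#endif
  return std::vector<T>(buffer.begin() + pos, buffer.begin() + pos + n);
}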

@SteveBronder (Collaborator)

Maybe this convinces you,

To be clear, it's not that I literally didn't believe you were reading one set of numbers that was bigger than another set of numbers; it's that I quite literally could not get a version that was slower on my desktop. I can't really figure out what's going on if I can't replicate it locally. I've been duped many times, mostly by myself, when benchmarking things. Even for a few seconds on something that takes half a minute, it's very hard not only to benchmark two versions against one another correctly but also to then make inferences about what's happening and why.

@SteveBronder (Collaborator)

Can one of yinz try out this stan branch with 2.27?

https://github.com/stan-dev/stan/tree/fix/deserializer-capacity-checks

That seems to fix the issue when I run ./test_blrm_mod_vers.sh 24 10000 10 (I just deleted the downloaded stuff from the script and then copy/pasted that stan branch into 2.27).

@wds15 (Contributor) commented May 21, 2021

I totally get that it's a very confusing thing to see reports of a performance regression around 20% large and not be able to reproduce it on your own.

I tried what you suggested, but it's not working for me:

 Elapsed Time: 22.411 seconds (Warm-up)
               23.237 seconds (Sampling)
               45.648 seconds (Total)

real 464.07
user 458.53
sys 2.69

I reconfigured my local cmdstan 2.27.0 using a make/local as

STAN=/Users/weberse2/work/stan/
CXXFLAGS+=-DSTAN_NO_RANGE_CHECKS -O3 -march=native -mtune=native
STAN_NO_RANGE_CHECKS=true
STAN_CPP_OPTIMS=true

and the above stan repo is checked out at your proposed fix. The above runtime is essentially the same as before.

We should probably switch to a Docker container for these tests, if that is an option.

@rok-cesnovar (Member Author)

No noticeable change for me either.

@SteveBronder (Collaborator)

I think we should also start cutting up that model to see if it's a particular piece that's slower or something systematic

@wds15 (Contributor) commented May 21, 2021

That's what the profiling run is telling you already, I think. The slowdown is smeared out through the entire model.

@rok-cesnovar (Member Author)

This whole thing seemed very strange to me so I took a bit more time.

What I found was that the execution time for this problem is very seed-dependent. With some seeds it gives the impression that the RC is slower, but I would say that is just randomness and a different sampler path due to slight differences in numerics. I guess the code-gen changes could have caused those.

This might be confirmation bias, but it would explain why I saw false positives when going back through the git history of stanc3 and then could not repeat them because I was on a different cmdstan git hash. This would also explain why the regression is spread equally over the entire model; if it were the readers, that would probably not be the case.

I added seed to the script (last arg) and with 1234 I get

./test.sh 4 5000 10 1234

*********************************
2.26 Times
real 160,65
user 159,75
sys 0,38
*********************************
*********************************
2.27 RC Times
real 161,00
user 159,92
sys 0,41
*********************************

but

./test.sh 4 5000 10 123

*********************************
2.26 Times
real 160,65
user 159,75
sys 0,38
*********************************
*********************************
2.27 RC Times
real 210,34
user 209,66
sys 0,41
*********************************

The changed script is here: https://gist.githubusercontent.com/rok-cesnovar/a1e17cb1ef93cc7c7d82f5e2bfe402da/raw/9796284b79bd2bb57ffb7462e53134e989c4fdb7/regression%2520test%2520script

With this it's looking like there is no regression on my end now with the RC. Tried with g++ and clang.

I am going to leave a longer run (with 2000 iterations and 500 runs) without specifying a seed, and that should come out about the same execution-time-wise. Will report tomorrow.

@wds15 (Contributor) commented May 23, 2021

Oh dear… that could indeed be the case. I also noticed that more messages about parameters being inf have been emitted with the new RC version. I will try a few seeds to verify.

@mike-lawrence (Contributor) commented May 23, 2021

My results on linux running @rok-cesnovar's script (but tweaked for num_warmup=1e3 num_samples=1e3, seed=$run & 1e2 runs):

*********************************
2.26 Times
real 354.58
user 353.03
sys 1.16
*********************************
*********************************
2.27 RC Times
real 377.01
user 375.39
sys 1.16
*********************************

So, I'd say this is still a reproducible regression.

@wds15 (Contributor) commented May 23, 2021

Maybe we should warm up one chain and then use the warmup info and a seed to run the model many times for different seeds, but from the same warmup starting point (and only sample).

@rok-cesnovar (Member Author)

@mike-lawrence Thanks!

My longer runs with different seeds also revealed there is still something there...
I am a bit baffled, though. For some seeds it's the same speed, for some the RC is slower; 2.26, however, is never slower.

These are 1000 runs with different seeds, so this difference is too big to be noise:

*********************************
2.26 Times
real 3458,03
user 3435,65
sys 9,81
*********************************
*********************************
2.27 RC Times
real 3768,89
user 3745,60
sys 9,88
*********************************

More digging...

@SteveBronder (Collaborator)

Okay, not sure what happened, but I can now replicate this on my desktop. I'm trying to run perf over both versions of the model right now, but it's taking forever to generate the report for some reason. Also, @rok-cesnovar, I tried my earlier deserializer fix and didn't see much of a difference. I'll report back once I get perf up and running.

@rok-cesnovar (Member Author)

It's those god damn gamma rays, I tell you :)

@SteveBronder (Collaborator)

Alright so after running the script I posted above I'm adding -gdwarf -fno-omit-frame-pointer to CXXFLAGS, cleaning and then rebuilding, and then running

perf record -g --freq=2800 --call-graph dwarf -d -o blrm_old.data ./cmdstan-2.26.1/examples/blrm/blrm sample num_warmup=5000 num_samples=5000 data file="./cmdstan-2.26.1/examples/blrm/blrm.data.R" random seed=1234
perf record -g --freq=2800 --call-graph dwarf -d -o blrm_rc.data ./cmdstan/examples/blrm/blrm sample num_warmup=5000 num_samples=5000 data file="./cmdstan/examples/blrm/blrm.data.R" random seed=1234

Then in two terminal windows I have

perf report --call-graph --stdio -G -i ./blrm_old.data
perf report --call-graph --stdio -G -i ./blrm_rc.data

Nothing is sticking out to me yet; in fact, it looks like we spend less time in log_prob on the RC than on the last version. But there's still a lot I have to untangle here.

One thing I'm thinking about: this model gives a good few error throws like

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: Exception: Exception: binomial_logit_lpmf: Probability parameter[1] is nan, but must be finite! (in 'examples/blrm/blrm.stan', line 460, column 4 to column 14) (in 'examples/blrm/blrm.stan', line 460, column 4 to column 14) (in 'examples/blrm/blrm.stan', line 460, column 4 to column 14)
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

We do now have those throws happening on the cold path in elementwise_check(). I wonder if that error is thrown enough times that it hits the cold path enough to cause the slowdown? Seems unlikely, but @wds15 could you change the model so that it corrects for this before going to binomial_logit_lpmf? I think something like that is supposed to happen in blrm_logit_fast but I'm having a hard time parsing out the logic

@SteveBronder (Collaborator)

Because if a model without these error messages doesn't have this slowdown, then we know that this is slowing down because we are forcing the model to go along the cold path for these runtime errors. That would also kind of explain why some seeds are good and some seeds are bad, since some seeds wouldn't run into this. It would also help explain how CPUs could affect this, as it may be that Intel optimizes harder against going down the cold path while AMD does not.
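
To make the cold-path idea concrete, here is a rough sketch of the pattern being discussed; this is my own illustration, not Stan's actual elementwise_check() code. The per-element check stays in the hot loop, while building the message and throwing is pushed into a separate non-inlined function that the compiler can keep out of the hot path.

#include <cmath>
#include <sstream>
#include <stdexcept>
#include <vector>

// Cold path: message construction and the throw live in a separate,
// non-inlined function so the loop below stays small.
__attribute__((noinline, cold, noreturn))
void throw_not_finite(const char* name, std::size_t i, double value) {
  std::stringstream msg;
  msg << name << "[" << i + 1 << "] is " << value << ", but must be finite!";
  throw std::domain_error(msg.str());
}

// Hot path: a cheap check per element; only the call to the cold
// function appears here.
void check_finite(const char* name, const std::vector<double>& x) {
  for (std::size_t i = 0; i < x.size(); ++i) {
    if (!std::isfinite(x[i]))
      throw_not_finite(name, i, x[i]);
  }
}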

@wds15 (Contributor) commented May 25, 2021

I can do that tomorrow. No time earlier. Sorry.

@wds15 (Contributor) commented May 25, 2021

If you want to get to this earlier, you can try the “combo2” example model from the OncoBayes2 package. That is a simpler version of this model in comparison, but it does sample quite fast.

@SteveBronder (Collaborator)

So with combo2, for 20K iterations (10K warmup and 10K samples) and with the warnings fixed (I'll post something reproducible in a bit), I'm still seeing the slowdown.

2.27:
 Elapsed Time: 7.024 seconds (Warm-up)
               10.864 seconds (Sampling)
               17.888 seconds (Total)

2.26:
 Elapsed Time: 5.81 seconds (Warm-up)
               8.649 seconds (Sampling)
               14.459 seconds (Total)

I'll try to use perf to see what's going on but it might be hard to find where the 3 seconds are coming from / going to here

@wds15 (Contributor) commented May 26, 2021

@SteveBronder I prepped an example for you which stresses Stan even more but also avoids the numerical issues. I coded up a 4-component example model and also tightened the prior, which should lead to less numerical trouble. It looks as if that works in the sense of avoiding the numerical issues, but the slowdown is still there. I only went for 50 warmup and 50 sampling iterations:

> fit_26 <- model_26$sample(
+                        data = "blrm_exnex-combo4.data.R",
+                        seed = 123,
+                        chains = 1,
+                        init=0,
+                        parallel_chains = 1,
+                        iter_warmup=iter,
+                        iter_sampling=iter,
+                        refresh = 100
+                    )
Running MCMC with 1 chain...

Chain 1 Number of groups: 3 
Chain 1 Number of strata: 2 
Chain 1 EXNEX enabled for compounds 4/4:    [1,1,1,1] 
Chain 1 EXNEX enabled for interactions 7/7: [1,1,1,1,1,1,1] 
Chain 1 EXNEX mixture dimensionality 11 leads to 2048 combinations. 
Chain 1 Observation => group assignment: 
Chain 1 Group 1: [1,2,3] 
Chain 1 Group 2: [4,5,6,7,8,9,10] 
Chain 1 Group 3: [11,12,13,14,15,16,17,18] 
Chain 1 Group => stratum assignment: 
Chain 1 1 => 1 
Chain 1 2 => 2 
Chain 1 3 => 2 
Chain 1 Prior distribution on tau parameters: 
Chain 1 Log-Normal 
Chain 1 WARNING: There aren't enough warmup iterations to fit the 
Chain 1          three stages of adaptation as currently configured. 
Chain 1          Reducing each adaptation stage to 15%/75%/10% of 
Chain 1          the given number of warmup iterations: 
Chain 1            init_buffer = 7 
Chain 1            adapt_window = 38 
Chain 1            term_buffer = 5 
Chain 1 Iteration:  1 / 100 [  1%]  (Warmup) 
Chain 1 Iteration: 51 / 100 [ 51%]  (Sampling) 
Chain 1 Iteration: 100 / 100 [100%]  (Sampling) 
Chain 1 finished in 89.4 seconds.
> fit_27 <- model_27$sample(
+                        data = "blrm_exnex-combo4.data.R",
+                        seed = 123,
+                        chains = 1,
+                        init=0,
+                        parallel_chains = 1,
+                        iter_warmup=iter,
+                        iter_sampling=iter,
+                        refresh = 100
+                    )
Running MCMC with 1 chain...

Chain 1 Number of groups: 3 
Chain 1 Number of strata: 2 
Chain 1 EXNEX enabled for compounds 4/4:    [1,1,1,1] 
Chain 1 EXNEX enabled for interactions 7/7: [1,1,1,1,1,1,1] 
Chain 1 EXNEX mixture dimensionality 11 leads to 2048 combinations. 
Chain 1 Observation => group assignment: 
Chain 1 Group 1: [1,2,3] 
Chain 1 Group 2: [4,5,6,7,8,9,10] 
Chain 1 Group 3: [11,12,13,14,15,16,17,18] 
Chain 1 Group => stratum assignment: 
Chain 1 1 => 1 
Chain 1 2 => 2 
Chain 1 3 => 2 
Chain 1 Prior distribution on tau parameters: 
Chain 1 Log-Normal 
Chain 1 WARNING: There aren't enough warmup iterations to fit the 
Chain 1          three stages of adaptation as currently configured. 
Chain 1          Reducing each adaptation stage to 15%/75%/10% of 
Chain 1          the given number of warmup iterations: 
Chain 1            init_buffer = 7 
Chain 1            adapt_window = 38 
Chain 1            term_buffer = 5 
Chain 1 Iteration:  1 / 100 [  1%]  (Warmup) 
Chain 1 Iteration: 51 / 100 [ 51%]  (Sampling) 
Chain 1 Iteration: 100 / 100 [100%]  (Sampling) 
Chain 1 finished in 102.1 seconds.

Here is the data file you need:
blrm_exnex-combo4_data_R.txt

@SteveBronder (Collaborator)

Ty! Do I just plug this data into the combo2 model?

@wds15 (Contributor) commented May 27, 2021

Yep. The Stan model does not change …all you need is the data and the usual Stan file.

@SteveBronder (Collaborator)

Ack, still having a hard time finding out what's up. Code and data for the combo model are below. I have to be reading the perf output wrong, because there's definitely a slowdown with this model and I'm having trouble finding where it is.

https://gist.github.com/SteveBronder/24086f561e38e033bd1df5ea3e164a13

@wds15 (Contributor) commented May 28, 2021

How about a brute-force bisection search of the commit history?

@rok-cesnovar (Member Author) commented May 28, 2021

This is what I am running, yes, though it's going a bit slowly, as there are two repos to track and not all combinations of the two repos produce compilable C++.

@SteveBronder (Collaborator)

Alright, I think I got it! @rok-cesnovar @wds15 can yinz try this stan branch out?

https://github.com/stan-dev/stan/tree/fix/rvalue-and-vector-scalar-deserializer

I think the problem was that rvalue() was making copies in the case of std::vector<Eigen::Matrix> inputs when doing single-index access (see the sketch at the end of this comment). Looking at the perf output, the biggest time difference seemed to be there, and plugging that branch into 2.27 with the range checks off I'm seeing:

2.27:

 Elapsed Time: 244.365 seconds (Warm-up)
               165.278 seconds (Sampling)
               409.643 seconds (Total)

2.26:

 Elapsed Time: 237.406 seconds (Warm-up)
               167.639 seconds (Sampling)
               405.045 seconds (Total)

Can yinz try out that branch? To replicate this, I ran the script I posted above to get 2.27 and 2.26, then copy/pasted the Stan branch above into 2.27, then added the combo2 model and combo4 data to the examples folder, and built everything with

CXXFLAGS+=-DSTAN_NO_RANGE_CHECKS -O3 -march=native -mtune=native
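
To illustrate the kind of copy in question, here is a generic sketch with made-up function names; it is not the actual stan::model::rvalue() implementation. Returning an element of a std::vector<Eigen::MatrixXd> by value copies the whole matrix on every single-index access, while returning a const reference does not, and inside a hot log-density evaluation those per-access copies add up.

#include <Eigen/Dense>
#include <vector>

// Returning by value: every single-index access copies the whole matrix.
Eigen::MatrixXd get_by_value(const std::vector<Eigen::MatrixXd>& v, std::size_t i) {
  return v[i];
}

// Returning a const reference: no copy; the caller reads the stored matrix
// directly (valid as long as v outlives the reference).
const Eigen::MatrixXd& get_by_ref(const std::vector<Eigen::MatrixXd>& v, std::size_t i) {
  return v[i];
}

double sum_first_rows(const std::vector<Eigen::MatrixXd>& v) {
  double total = 0;
  for (std::size_t i = 0; i < v.size(); ++i)
    total += get_by_ref(v, i).row(0).sum();  // get_by_value here would copy each matrix
  return total;
}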

@rok-cesnovar (Member Author) commented May 28, 2021

Yay! This seems to fix it on my end too. Will run an overnight test to triple check it. But it looks good!

thanks!!

@SteveBronder (Collaborator)

Awesome! @wds15 give it a whirl as well when you get a chance, then I'll open up a PR if all is good.

@wds15 (Contributor) commented May 29, 2021

This looks good to me:

> fit_26_3$time()$total ## 3 comp, 10k
[1] 18.81891
> fit_27_3$time()$total ## 3 comp, 10k
[1] 19.7206
> fit_26$time()$total ## 4 comp, 100
[1] 86.04946
> fit_27$time()$total ## 4 comp, 100
[1] 82.75075
> 

The totals above are each for just one run, with the number of components and total iterations as indicated.

Great catch! To me this now looks all good. Thanks a lot for digging so deep... but I think it was worth it, as this type of access is used by brms as well, if I am not mistaken.

@rok-cesnovar (Member Author)

Thanks Steve!

@SteveBronder (Collaborator)

np! Just opened up the PR

@bob-carpenter (Member)

Nice spelunking!
