Performance regression in 2.27.0 RC #2493
Reverting stan-dev/stanc3#849 (added range checks) makes no difference in this case, so it seems to be something else.
I added profiling to the Stan program. Output:

```
> fit_26$time()$total
[1] 21.89892
> fit_27b$time()$total
[1] 25.5831
> fit_26$profiles()
[[1]]
name thread_id total_time forward_time reverse_time chain_stack
1 log-lik 0x10c17bdc0 14.21690 12.57950 1.637460 226782179
2 prior 0x10c17bdc0 1.70180 1.59811 0.103685 13408400
3 transform 0x10c17bdc0 1.75127 1.13223 0.619044 21120372
no_chain_stack autodiff_calls no_autodiff_calls
1 102915172 167605 1
2 5698570 167605 1
3 10057320 167605 5001
> fit_27b$profiles()
[[1]]
name thread_id total_time forward_time reverse_time chain_stack
1 log-lik 0x10ed04dc0 16.56900 14.49760 2.071480 225845240
2 prior 0x10ed04dc0 1.96636 1.83771 0.128652 13353360
3 transform 0x10ed04dc0 2.40248 1.72354 0.678942 9014166
no_chain_stack autodiff_calls no_autodiff_calls
1 102489998 166917 1
2 5007510 166917 1
3 24037776 166917 5001
```

Not sure... the problem seems to be smeared out somewhat. Ah, I also disabled generated quantities.
I just reran this:

```
> fit_26$time()$total
[1] 21.64465
> fit_27b$time()$total
[1] 25.10066
> fit_26$profiles()
[[1]]
name thread_id total_time forward_time reverse_time chain_stack
1 log-lik 0x109891dc0 13.10350 11.51520 1.588270 226782179
2 prior 0x109891dc0 1.69421 1.58724 0.106974 13408400
3 transform 0x109891dc0 1.64914 1.06284 0.586304 21120372
no_chain_stack autodiff_calls no_autodiff_calls
1 102915172 167605 1
2 5698570 167605 1
3 10057320 167605 5001
> fit_27b$profiles()
[[1]]
name thread_id total_time forward_time reverse_time chain_stack
1 log-lik 0x111094dc0 15.68390 13.62370 2.060180 225845240
2 prior 0x111094dc0 1.96505 1.84542 0.119629 13353360
3 transform 0x111094dc0 2.29094 1.66794 0.622994 9014166
no_chain_stack autodiff_calls no_autodiff_calls
1 102489998 166917 1
2 5007510 166917 1
3 24037776 166917 5001
```
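For reference, a minimal cmdstanr sketch of how a side-by-side comparison like the one above can be set up; the CmdStan paths and the model/data file names below are placeholders, not the exact files from the zip.

```r
library(cmdstanr)

# Fit the same model/data against a given CmdStan install and return the fit.
run_with <- function(cmdstan_home, seed = 1234) {
  set_cmdstan_path(cmdstan_home)                  # switch the CmdStan install
  model <- cmdstan_model("blrm_exnex.stan",       # placeholder file name
                         force_recompile = TRUE)  # make sure it rebuilds against this install
  model$sample(
    data = "blrm_exnex.data.R",                   # placeholder data file
    seed = seed, chains = 1,
    iter_warmup = 5000, iter_sampling = 5000,
    refresh = 0
  )
}

fit_26  <- run_with("~/cmdstan-2.26.1")           # placeholder paths
fit_27b <- run_with("~/cmdstan-2.27.0-rc1")

fit_26$time()$total
fit_27b$time()$total
fit_26$profiles()    # only populated if the model has profile("...") blocks
fit_27b$profiles()
```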
Given that the slowdown is so smeared out, I thought that maybe a change in the compiler is to blame. The usual way of testing this does not work here, so I tried to use stanc3 from 2.26.0 and then compile with 2.27.0, but that fails in the compilation stage.
Some progress: what I can do is use CmdStan 2.26.0 and point it at math 2.27.0. With this mixed setup I get:

```
> fit_mix$time()$total
[1] 19.24794
> fit_mix$profiles()
[[1]]
name thread_id total_time forward_time reverse_time chain_stack
1 log-lik 0x1110fbdc0 12.26030 10.31270 1.947580 226056690
2 prior 0x1110fbdc0 1.43655 1.33204 0.104508 13365760
3 transform 0x1110fbdc0 1.72902 1.15965 0.569375 9022590
no_chain_stack autodiff_calls no_autodiff_calls
1 102585954 167072 1
2 5012160 167072 1
3 24060240 167072 5001
```

Since this is now even somewhat faster than before, this suggests that the issue is upstream: either Stan or the compiler has some problem causing these performance issues, I would think.
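A rough sketch of how such a mixed setup can be driven from R, assuming CmdStan's make/local honours a MATH variable pointing at a separately checked-out math library (paths and file names are placeholders):

```r
library(cmdstanr)

# Use CmdStan 2.26.0 but point its build at a 2.27.0 checkout of stan-math.
set_cmdstan_path("~/cmdstan-2.26.0")                             # placeholder path
cmdstan_make_local(cpp_options = list(MATH = "~/math-2.27.0/"),  # assumed make variable; trailing slash
                   append = TRUE)
rebuild_cmdstan()                                 # rebuild the CmdStan tools against the swapped math

model_mix <- cmdstan_model("blrm_exnex.stan", force_recompile = TRUE)
fit_mix <- model_mix$sample(data = "blrm_exnex.data.R", seed = 1234,
                            chains = 1, iter_warmup = 5000, iter_sampling = 5000)
fit_mix$time()$total
```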
Yeah, one cause is stan-dev/stanc3#842, but not for the whole regression.
What in stan-dev/stanc3#842 would cause the regression? The local statement thing? That shouldn't have a big effect.
Scratch that, I had a bad testing setup when I tested that. Sorry for the false alarm. It seems to be the stan-dev/stanc3#856 PR.
In both cases, the only flag was STAN_NO_RANGE_CHECKS, with g++ 9.3.0 on Ubuntu 20.04. @SteveBronder First we should see if anyone can reproduce my findings. But if someone can, the question is whether that PR is critical right now (for this release) or whether it is required for varmat and tuples. I don't know enough about that PR to revert it on master (too many conflicts).
I'm running the setup below and am actually seeing that 2.27 is faster locally for me: https://discourse.mc-stan.org/t/cmdstan-2-27-0-release-candidate/22602/14?u=stevebronder
Before we talk about reverting anything, I'd like someone to run the scheme I have in the above link. It just seems very odd to have such large differences in timings. I have a semi-fancy machine, but nothing that is going to see a 20%+ timing difference. Maybe it's clang? I can try that out.
Nope... I am on clang, and @rok-cesnovar used g++ on Ubuntu.
And to avoid confusion, the model is in the top comment and has no profile statements. Timing results are exactly the same with cmdstan or cmdstanr.
I do not, however, have -march=native on.
Here is the time result:
BTW, do I see this right that the performance tests are not run on all PRs for stanc3? Maybe the performance test suite would have already detected this issue? It should have, but we can sort that out another day...
@wds15 I updated my original comment with instructions on exactly what I'm running to reproduce my timings: https://discourse.mc-stan.org/t/cmdstan-2-27-0-release-candidate/22602/17?u=stevebronder I just reran the above with clang-11 instead of gcc-10, and I'm getting that on average the old version is 0.6 seconds faster for each model than the RC (time results / 10 runs).
Are you on Intel or AMD, @SteveBronder? Mine is Intel.
I'm on an AMD Ryzen Threadripper 2950X. It has a lot of L1 cache, but even then I don't think that would be enough to not notice a 20%+ performance difference. I think we all need to be running the exact same code to make sure we are actually comparing the same thing. For example, running it from cmdstanr, R could just suddenly decide to run garbage collection, which is going to add time. Timing things is hard! I don't see why we would complicate it more. Is there anything in the comment below that isn't clear on reproducing what I'm seeing? https://discourse.mc-stan.org/t/cmdstan-2-27-0-release-candidate/22602/2?u=stevebronder
What's wrong with the timing above now? The profiling stuff really does not matter here. @rok-cesnovar what are you on? AMD or Intel?
It absolutely does! If we are not using the same code, I don't see how we are going to compare timings. And I really don't see why we would care about how slow profiled code is; if someone is instrumenting their model, that's not performance critical. Profiling also has extra overhead and causes extra indirection in the model.
Intel i7, but fairly old (5+ years). I will spin up an AWS instance tomorrow so I can share it with Steve if he is unable to reproduce. I get the exact same timing results running cmdstan models directly in the terminal. Same regression. I omitted -march=native as I think it's not relevant here; vanilla is what most users will use. And bisecting back through the cmdstan and stanc3 history, I have run this at least 30-40 times today. It's evident when the regression is there.
I am not using the model with profiling. I am using the one posted in the top comment with a fixed seed and 5000/5000 iterations. It's easily reproducible in bash or cmdstanr.
@SteveBronder can you run the model with profiling just once (whatever Stan version)? I have really seen the same patterns with and without profiling. @rok-cesnovar great. To me it sounds like a processor thing by now. If @SteveBronder can get access to an Intel machine, that's great.
I've tried reproducing it on two separate compilers and have totally removed and re-downloaded both versions twice as well. I can try to write a script to reproduce what I did in my discourse comment and can try it on my laptop (Intel).
I don't think there's any need to go through all that effort; I just want us to make sure we are all running the same thing before we talk about timings.
Maybe this convinces you, @SteveBronder... here are results for the model from the zip file above (so I downloaded it again to make sure I don't mess things up):

You can see that profiling is really not an issue here and the performance regression is there.
Okay, I wrote a script for this, and though it still doesn't show on my desktop, I am seeing it on my chromebook (Intel i7): https://gist.github.com/SteveBronder/0bed3d2e04f92f62fd19cb0b691f11f8
If it's the deserializer PR that's causing it, I would bet it's the capacity checks in …
To be clear, it's not that I literally didn't believe you were reading one set of numbers that were bigger than another set of numbers; it's that I quite literally could not get a version that was slower on my desktop. I can't really figure out what's going on if I can't replicate it locally. I've been duped many times, mostly by myself, when benchmarking things. Even for seconds on something that takes half a minute, it's very hard to not only benchmark two versions against one another correctly but also to then make inferences about what's happening and why.
Can one of yinz try out this stan branch with 2.27? https://github.com/stan-dev/stan/tree/fix/deserializer-capacity-checks That seems to fix the issue when I run …
I totally get that it's very confusing to see performance regression reports of around 20% and not be able to reproduce them on your own. I tried what you suggested, but it's not working for me:

I reconfigured my local cmdstan 2.27.0 using a make/local as …

and the above stan repo is checked out at your proposed fix. The above runtime is essentially the same as before. If it is an option, we should probably switch to a Docker container for these tests.
No noticeable change for me either.
I think we should also start cutting up that model to see if it's a particular piece that's slower or something systematic.
That's what the profiling run is already telling you, I think. The slowdown is smeared out through the entire model.
This whole thing seemed very strange to me, so I took a bit more time. What I found was that the execution time for this problem is very seed dependent. With some seeds it gives the impression that the RC is slower, but I would say that is just randomness and a different sampler path due to slight differences in numerics. I guess the code-gen changes could have caused those. This might be confirmation bias, but it would explain why I saw false positives when working back through the git history of stanc3 and then could not repeat them because I was on a different cmdstan git hash. This would also explain why the regression is equally spread over the entire model? If it were the readers, that would probably not be the case. I added a seed to the script (last arg), and with 1234 I get
but
The changed script is here: https://gist.githubusercontent.com/rok-cesnovar/a1e17cb1ef93cc7c7d82f5e2bfe402da/raw/9796284b79bd2bb57ffb7462e53134e989c4fdb7/regression%2520test%2520script With this it's looking like there is no regression on my end now with the RC. Tried with g++ and clang. I am going to leave a longer run (2000 iterations and 500 runs) without specifying a seed, and that should come out about the same execution-time-wise. Will report tomorrow.
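To separate a genuine regression from seed-to-seed noise, a paired sweep over seeds could be used; a sketch only, where model_26 / model_27 are assumed to be the same model compiled against the two CmdStan versions and the data file name is a placeholder.

```r
time_once <- function(model, seed) {
  fit <- model$sample(
    data = "blrm_exnex.data.R",       # placeholder data file
    seed = seed, chains = 1,
    iter_warmup = 1000, iter_sampling = 1000,
    refresh = 0, show_messages = FALSE
  )
  fit$time()$total
}

seeds <- 1:50
t26 <- vapply(seeds, function(s) time_once(model_26, s), numeric(1))
t27 <- vapply(seeds, function(s) time_once(model_27, s), numeric(1))

summary(t27 - t26)                  # per-seed paired differences
t.test(t27, t26, paired = TRUE)     # is the mean difference distinguishable from zero?
```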
Oh dear… that could indeed be the case. I also noticed that more messages about parameters being inf were emitted with the new RC version. I will try a few seeds to verify.
My results on Linux running @rok-cesnovar's script (but tweaked for num_warmup=1e3, num_samples=1e3, seed=$run and 1e2 runs):
So, I'd say this is still a reproducible regression.
Maybe we should warm up one chain and then use the warmup info and a seed to run the model many times for different seeds, but with the same warmup starting point (and only sample).
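A sketch of that idea with cmdstanr: adapt once, then reuse the tuned step size and inverse metric for sampling-only runs across seeds. The method and argument names below are my reading of the cmdstanr interface, so treat the details as assumptions; file names are placeholders.

```r
# One adaptation run to get a step size and inverse metric.
warm <- model_26$sample(
  data = "blrm_exnex.data.R",        # placeholder data file
  seed = 1, chains = 1,
  iter_warmup = 1000, iter_sampling = 1, refresh = 0
)
inv_metric <- warm$inv_metric(matrix = FALSE)[[1]]          # diagonal metric from chain 1
step_size  <- warm$metadata()$step_size_adaptation[1]       # adapted step size from chain 1

# Sampling-only runs that all start from the same tuning.
sample_only <- function(model, seed) {
  fit <- model$sample(
    data = "blrm_exnex.data.R",
    seed = seed, chains = 1,
    iter_warmup = 0, iter_sampling = 1000,
    adapt_engaged = FALSE,
    step_size = step_size, inv_metric = inv_metric,
    refresh = 0
  )
  fit$time()$total
}
```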
@mike-lawrence Thanks! My longer runs with different seeds also revealed there is still something there... These are 1000 runs with different seeds, so this difference is too big:
More digging...
Okay, not sure what happened, but I can now replicate this on my desktop. I'm trying to run perf over both versions of the model right now, but it's taking forever to generate the report for some reason. Also @rok-cesnovar I tried my fix from earlier with the deserializer and didn't see much of a difference. I'll report back once I get perf up and running.
It's those god damn gamma rays, I tell you :)
Alright, so after running the script I posted above, I'm adding …
Then in two terminal windows I have …
Nothing is sticking out to me yet; in fact it looks like we spend less time in …
One thing I'm thinking about: this model gives a good few error throws for …
We do now have those throws happening on the cold path in …
Because if a model that doesn't throw these error messages doesn't show this slowdown, then we know the slowdown comes from forcing the model down the cold path for these runtime errors. That would also kind of explain why some seeds are good and some seeds are bad, since some seeds wouldn't run into this. It would also help explain how CPUs could affect this, as it may be that Intel optimizes harder against going down the cold path while AMD does not.
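One way to probe that hypothesis from R: across seeds, record both the runtime and how many exception/informational messages the chain emitted, and check whether the slow seeds are the ones hitting the error path. This assumes fit$output() returns the per-chain console logs as character vectors; the data file name is a placeholder.

```r
check_seed <- function(model, seed) {
  fit <- model$sample(
    data = "blrm_exnex.data.R",       # placeholder data file
    seed = seed, chains = 1,
    iter_warmup = 1000, iter_sampling = 1000, refresh = 0
  )
  log <- fit$output()[[1]]            # assumed: console log for chain 1
  c(time   = fit$time()$total,
    throws = sum(grepl("Exception|Informational Message", log)))
}

res <- t(sapply(1:30, function(s) check_seed(model_27, s)))
cor(res[, "time"], res[, "throws"])   # do slow seeds line up with more thrown evaluations?
```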
I can do that tomorrow; no time earlier. Sorry.
If you want to get to this earlier, you can try the “combo2” example model from the OncoBayes2 package. That is a simpler version of this model, but it does sample quite fast.
So, running combo2 for 20K iterations (10K warmup and 10K samples) with the warnings fixed (I'll post something reproducible in a bit), I'm still seeing the slowdown.
I'll try to use perf to see what's going on, but it might be hard to find where the 3 seconds are coming from / going to here.
@SteveBronder I prepped an example for you which stresses Stan even more but also avoids the numerical issues. I coded up a 4-component example model and also tightened the prior, which should lead to less numerical trouble. It looks as if that works in the sense of avoiding the numerical issues, but the slowdown is still there. I only went for 50 warmup & sampling iterations:

```
> fit_26 <- model_26$sample(
+ data = "blrm_exnex-combo4.data.R",
+ seed = 123,
+ chains = 1,
+ init=0,
+ parallel_chains = 1,
+ iter_warmup=iter,
+ iter_sampling=iter,
+ refresh = 100
+ )
Running MCMC with 1 chain...
Chain 1 Number of groups: 3
Chain 1 Number of strata: 2
Chain 1 EXNEX enabled for compounds 4/4: [1,1,1,1]
Chain 1 EXNEX enabled for interactions 7/7: [1,1,1,1,1,1,1]
Chain 1 EXNEX mixture dimensionality 11 leads to 2048 combinations.
Chain 1 Observation => group assignment:
Chain 1 Group 1: [1,2,3]
Chain 1 Group 2: [4,5,6,7,8,9,10]
Chain 1 Group 3: [11,12,13,14,15,16,17,18]
Chain 1 Group => stratum assignment:
Chain 1 1 => 1
Chain 1 2 => 2
Chain 1 3 => 2
Chain 1 Prior distribution on tau parameters:
Chain 1 Log-Normal
Chain 1 WARNING: There aren't enough warmup iterations to fit the
Chain 1 three stages of adaptation as currently configured.
Chain 1 Reducing each adaptation stage to 15%/75%/10% of
Chain 1 the given number of warmup iterations:
Chain 1 init_buffer = 7
Chain 1 adapt_window = 38
Chain 1 term_buffer = 5
Chain 1 Iteration: 1 / 100 [ 1%] (Warmup)
Chain 1 Iteration: 51 / 100 [ 51%] (Sampling)
Chain 1 Iteration: 100 / 100 [100%] (Sampling)
Chain 1 finished in 89.4 seconds.
> fit_27 <- model_27$sample(
+ data = "blrm_exnex-combo4.data.R",
+ seed = 123,
+ chains = 1,
+ init=0,
+ parallel_chains = 1,
+ iter_warmup=iter,
+ iter_sampling=iter,
+ refresh = 100
+ )
Running MCMC with 1 chain...
Chain 1 Number of groups: 3
Chain 1 Number of strata: 2
Chain 1 EXNEX enabled for compounds 4/4: [1,1,1,1]
Chain 1 EXNEX enabled for interactions 7/7: [1,1,1,1,1,1,1]
Chain 1 EXNEX mixture dimensionality 11 leads to 2048 combinations.
Chain 1 Observation => group assignment:
Chain 1 Group 1: [1,2,3]
Chain 1 Group 2: [4,5,6,7,8,9,10]
Chain 1 Group 3: [11,12,13,14,15,16,17,18]
Chain 1 Group => stratum assignment:
Chain 1 1 => 1
Chain 1 2 => 2
Chain 1 3 => 2
Chain 1 Prior distribution on tau parameters:
Chain 1 Log-Normal
Chain 1 WARNING: There aren't enough warmup iterations to fit the
Chain 1 three stages of adaptation as currently configured.
Chain 1 Reducing each adaptation stage to 15%/75%/10% of
Chain 1 the given number of warmup iterations:
Chain 1 init_buffer = 7
Chain 1 adapt_window = 38
Chain 1 term_buffer = 5
Chain 1 Iteration: 1 / 100 [ 1%] (Warmup)
Chain 1 Iteration: 51 / 100 [ 51%] (Sampling)
Chain 1 Iteration: 100 / 100 [100%] (Sampling)
Chain 1 finished in 102.1 seconds.
```

Here is the data file you need:
Ty! Do I just plug this data into the combo2 model?
Yep. The Stan model does not change… all you need is the data and the usual Stan file.
Ack, still having a hard time figuring out what's up. Code and data for the combo model are below. I have to be reading the perf output wrong, because there's definitely a slowdown with this model and I'm having trouble finding where it is: https://gist.github.com/SteveBronder/24086f561e38e033bd1df5ea3e164a13
How about a brute-force bisection search of the commit history?
This is what I am running, yes, though it's going a bit slowly as there are two repos to track and not all combinations of the two repos produce compilable C++.
Alright, I think I got it! @rok-cesnovar @wds15 can yinz try this stan branch out? https://github.com/stan-dev/stan/tree/fix/rvalue-and-vector-scalar-deserializer I think the problem was that …
Can yinz try out that branch? To replicate this, I ran the script I posted above to get 2.27 and 2.26, then just copy/pasted the Stan branch above into 2.27, added the combo2 model and combo4 data to the examples folder, and built everything with …
Yay! This seems to fix it on my end too. Will run an overnight test to triple-check it. But it looks good! Thanks!!
Awesome! @wds15 give it a whirl as well when you get a chance, then I'll open up a PR if all is good.
This looks good to me:

```
> fit_26_3$time()$total ## 3 comp, 10k
[1] 18.81891
> fit_27_3$time()$total ## 3 comp, 10k
[1] 19.7206
> fit_26$time()$total ## 4 comp, 100
[1] 86.04946
> fit_27$time()$total ## 4 comp, 100
[1] 82.75075
```

The totals above are for just one run with the number of components and (total) iterations as indicated. Great catch! To me this now looks all good. Thanks a lot for digging so deep... but I think it was worth it, as this type of access is used by brms as well, if I am not mistaken.
Thanks Steve!
np! Just opened up the PR.
Nice spelunking!
@wds15 reported a performance regression that we need to get to the bottom of: https://discourse.mc-stan.org/t/cmdstan-2-27-0-release-candidate/22602/2
Stan model and data: regression.zip
I can reproduce and will investigate and post any progress here.
The first instinct was that it's related to the range checks, but it seems that may not be the case.