
adds accumulators for var and fwd #2535

Merged · 35 commits into develop · Jul 27, 2021
Conversation

SteveBronder
Collaborator

@SteveBronder SteveBronder commented Jul 13, 2021

Summary

This adds specializations for var and fwd for Stan's accumulator class.

For stan-dev/stanc3#898 I needed to add the ability for the accumulator to accept var<Matrix> type inputs. But then I was looking at the impl and I didn't really understand why we kept a whole std::vector<T> when this is used in the compiler to just accumulate on the joint log probability.

While I was doing this I found that I was getting some really weird include errors where the compiler was not using stan::math::sum() for var and fvar types. I think this was a bug where, during argument-dependent lookup (ADL), the compiler was not finding definitions of stan::math::sum() for containers of fvar and var types before the definition of accumulator<T>. You can replicate these errors by deleting the specializations for accumulator this PR created in fwd and rev.

The major change for this PR is eagerly summing matrices and vectors and then adding their sum to the internal buffer. From the benchmark here this seems to be good. Below are the results from this PR and develop. @t4c1 I also added the stuff for matrix_cl<> as we will need that as well in the compiler.
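The eager-summing idea can be sketched with plain standard-library containers — this is a hypothetical illustration, not Stan's actual accumulator (the real class specializes on var/fvar types and Eigen matrices): containers are reduced to their sum as soon as they are added, so the internal buffer only ever holds scalars rather than every element.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical sketch of eager summing in an accumulator. A scalar is stored
// as-is; a container is collapsed to a single scalar before being stored, so
// buf_ stays small regardless of how large the added containers are.
template <typename T>
class accumulator_sketch {
  std::vector<T> buf_;

 public:
  // scalar: store directly
  void add(const T& x) { buf_.push_back(x); }

  // container: sum eagerly, store only the result
  void add(const std::vector<T>& v) {
    buf_.push_back(std::accumulate(v.begin(), v.end(), T{0}));
  }

  // final reduction over the (short) buffer
  T sum() const { return std::accumulate(buf_.begin(), buf_.end(), T{0}); }
};
```

With this shape, adding an N-element matrix costs one buffer slot instead of N, which is where the matrix-bench speedup below comes from.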

The first bench for each is for adding vars individually and calling .sum() at the end. The second is for adding an eigen matrix of vars all at once.

This PR

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
acc_var_bench/2/manual_time            74.9 ns          135 ns      9601247
acc_var_bench/4/manual_time             101 ns          170 ns      6736194
acc_var_bench/8/manual_time             130 ns          211 ns      5458332
acc_var_bench/16/manual_time            178 ns          289 ns      3904865
acc_var_bench/32/manual_time            246 ns          421 ns      2843663
acc_var_bench/64/manual_time            391 ns          667 ns      1773817
acc_var_bench/128/manual_time           584 ns         1136 ns      1096269
acc_var_bench/256/manual_time          1000 ns         2076 ns       699461
acc_var_bench/512/manual_time          1830 ns         3933 ns       381838
acc_var_bench/1024/manual_time         3449 ns         7566 ns       202477
acc_var_bench/2048/manual_time         6379 ns        14549 ns       109720
acc_var_bench/4096/manual_time        12901 ns        28956 ns        54126
acc_eigen_bench/2/manual_time          49.0 ns          102 ns     14266056
acc_eigen_bench/4/manual_time          51.6 ns          110 ns     13408167
acc_eigen_bench/8/manual_time          57.9 ns          129 ns     12023346
acc_eigen_bench/16/manual_time         67.9 ns          167 ns     10457033
acc_eigen_bench/32/manual_time         96.1 ns          264 ns      7278352
acc_eigen_bench/64/manual_time          157 ns          440 ns      4469364
acc_eigen_bench/128/manual_time         264 ns          813 ns      2641925
acc_eigen_bench/256/manual_time         473 ns         1518 ns      1479699
acc_eigen_bench/512/manual_time         896 ns         2923 ns       780638
acc_eigen_bench/1024/manual_time       1776 ns         5764 ns       398264
acc_eigen_bench/2048/manual_time       3565 ns        11483 ns       196032
acc_eigen_bench/4096/manual_time       7150 ns        23090 ns        97784

Develop

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
toss_me<end_val2>                1422446791 ns   1422238034 ns            1
acc_var_bench/2/manual_time            72.2 ns          137 ns      9530407
acc_var_bench/4/manual_time            95.9 ns          164 ns      7082408
acc_var_bench/8/manual_time             123 ns          204 ns      5663455
acc_var_bench/16/manual_time            169 ns          278 ns      4199877
acc_var_bench/32/manual_time            244 ns          421 ns      2864907
acc_var_bench/64/manual_time            411 ns          709 ns      1712402
acc_var_bench/128/manual_time           631 ns         1188 ns      1095495
acc_var_bench/256/manual_time          1068 ns         2113 ns       648345
acc_var_bench/512/manual_time          1934 ns         3956 ns       362269
acc_var_bench/1024/manual_time         3578 ns         7525 ns       194104
acc_var_bench/2048/manual_time         6644 ns        14459 ns       105428
acc_var_bench/4096/manual_time        13110 ns        28688 ns        52293
acc_eigen_bench/2/manual_time          74.6 ns          138 ns      9348348
acc_eigen_bench/4/manual_time           101 ns          167 ns      6947154
acc_eigen_bench/8/manual_time           133 ns          212 ns      5207769
acc_eigen_bench/16/manual_time          174 ns          282 ns      3960223
acc_eigen_bench/32/manual_time          247 ns          424 ns      2887478
acc_eigen_bench/64/manual_time          397 ns          699 ns      1763989
acc_eigen_bench/128/manual_time         633 ns         1193 ns      1094997
acc_eigen_bench/256/manual_time        1065 ns         2108 ns       653280
acc_eigen_bench/512/manual_time        1933 ns         3960 ns       362200
acc_eigen_bench/1024/manual_time       3605 ns         7535 ns       193721
acc_eigen_bench/2048/manual_time       6709 ns        14475 ns       103306
acc_eigen_bench/4096/manual_time      13262 ns        28763 ns        52301

So no change for accumulating a bunch of vars and a good speedup for accumulating matrices. Seems good!

Tests

Same tests as before, with additional tests for Eigen::Matrix<var>, var_value<Matrix>, var_value<matrix_cl<double>>, std::vector<var_value<matrix_cl>>, and std::vector types.

Side Effects

This is used in every single Stan program so we should be rather careful with it.

Release notes

Allow accumulator to accept var<Matrix> matrix types

Checklist

  • Math issue How to add static matrix? #1805

  • Copyright holder: Steve Bronder

    The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
    - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
    - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

  • the basic tests are passing

    • unit tests pass (to run, use: ./runTests.py test/unit)
    • header checks pass, (make test-headers)
    • dependencies checks pass, (make test-math-dependencies)
    • docs build, (make doxygen)
    • code passes the built in C++ standards checks (make cpplint)
  • the code is written in idiomatic C++ and changes are documented in the doxygen

  • the new changes are tested

@stan-buildbot
Contributor


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.05 3.33 0.92 -9.19% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.98 -2.13% slower
eight_schools/eight_schools.stan 0.12 0.11 1.09 8.52% faster
gp_regr/gp_regr.stan 0.16 0.16 0.98 -2.17% slower
irt_2pl/irt_2pl.stan 5.86 5.83 1.01 0.57% faster
performance.compilation 89.12 86.63 1.03 2.79% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.74 8.74 1.0 0.02% faster
pkpd/one_comp_mm_elim_abs.stan 29.09 29.84 0.97 -2.58% slower
sir/sir.stan 130.9 126.11 1.04 3.66% faster
gp_regr/gen_gp_data.stan 0.03 0.03 1.0 -0.01% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 3.08 2.99 1.03 2.74% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.39 0.37 1.06 6.02% faster
arK/arK.stan 2.56 1.84 1.39 27.9% faster
arma/arma.stan 0.65 0.9 0.73 -36.99% slower
garch/garch.stan 0.64 0.53 1.2 16.91% faster
Mean result: 1.0284601075

Jenkins Console Log
Blue Ocean
Commit hash: 9417de4


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@stan-buildbot
Contributor


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.08 3.03 1.02 1.52% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.98 -1.67% slower
eight_schools/eight_schools.stan 0.12 0.11 1.04 3.54% faster
gp_regr/gp_regr.stan 0.16 0.16 0.99 -0.79% slower
irt_2pl/irt_2pl.stan 5.9 5.87 1.0 0.49% faster
performance.compilation 88.84 86.68 1.02 2.43% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.75 8.56 1.02 2.16% faster
pkpd/one_comp_mm_elim_abs.stan 30.4 30.08 1.01 1.07% faster
sir/sir.stan 131.96 126.11 1.05 4.43% faster
gp_regr/gen_gp_data.stan 0.04 0.03 1.04 4.13% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan 3.09 3.05 1.02 1.48% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.4 0.41 0.99 -0.71% slower
arK/arK.stan 2.54 1.91 1.33 24.98% faster
arma/arma.stan 0.65 0.92 0.7 -43.1% slower
garch/garch.stan 0.64 0.61 1.06 5.59% faster
Mean result: 1.01859347969

Jenkins Console Log
Blue Ocean
Commit hash: 85058a2


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@t4c1
Contributor

t4c1 commented Jul 15, 2021

But then I was looking at the impl and I didn't really understand why we kept a whole std::vector when this is used in the compiler to just accumulate on the joint log probability.

It says so in the doxygen. It is faster for reverse mode, since executing the reverse pass of one sum vari is much faster than many add varis. I don't think it matters for prim, fwd, or containers (as your benchmarks show, directly summing is faster).

It might also be worth checking how much it actually matters for rev scalars. If it does not, I would suggest completely removing the accumulator class. If it does, I suggest a general implementation that stores just a scalar and adds everything to it, and a rev specialization that keeps the same vector it is using now but sums any containers first.
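The reverse-pass trade-off described above can be illustrated with a toy tape — hypothetical code, not Stan's vari machinery: chaining pairwise adds puts one node on the tape per addition, each with its own reverse-pass callback, while a single sum node covers all inputs at once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a reverse-mode tape: we only count how many nodes each
// strategy records, since that is what drives reverse-pass cost.
struct ToyTape {
  std::size_t nodes = 0;

  // a + b as a reverse-mode op: one tape node per pairwise add
  double add(double a, double b) {
    ++nodes;
    return a + b;
  }

  // sum(v): one tape node regardless of v.size()
  double sum(const std::vector<double>& v) {
    ++nodes;
    double s = 0;
    for (double x : v) s += x;
    return s;
  }
};
```

Both strategies compute the same value, but accumulating N terms via add() records N nodes versus one node for sum() — which is why the accumulator buffers values and sums once.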

I think this was a bug where in the process of ADL the compiler was not finding definitions for stan::math::sum() for containers of fvar and var types before the definition of accumulate. You can replicate these by deleting the specializations for accumulate this PR created in fwd and rev.

I would say just copying the code around is not the best solution. We should find the root cause (probably some includes missing or some cyclical includes) and fix that.

@hsbadr
Member

hsbadr commented Jul 15, 2021

@SteveBronder Sorry for nagging you with errors for WIP; let me know if I need to stop 😄

Is the following error related to this PR? It occurs when building EpiNow2 package.

stan/math/prim/fun/accumulator.hpp:46:10: error: request for member ‘push_back’ in ‘((stan::math::accumulator<double>*)this)->stan::math::accumulator<double>::buf_’, which is of non-class type ‘double’
   46 |     buf_.push_back(x);
      |     ~~~~~^~~~~~~~~

In file included from /usr/lib/R/library/StanHeaders/include/stan/math/prim/fun.hpp:5,
                 from /usr/lib/R/library/StanHeaders/include/stan/math/prim.hpp:14,
                 from /usr/lib/R/library/StanHeaders/include/src/stan/io/dump.hpp:7,
                 from /usr/lib/R/library/rstan/include/rstan/stan_fit.hpp:43,
                 from /usr/lib/R/library/rstan/include/rstan/rstaninc.hpp:4,
                 from stanExports_gamma.h:20,
                 from stanExports_gamma.cc:5:
                 
/usr/lib/R/library/StanHeaders/include/stan/math/prim/fun/accumulator.hpp: In instantiation of ‘void stan::math::accumulator<T>::add(S) [with S = double; <template-parameter-2-2> = void; T = double]’:
stanExports_exp.h:182:23:   required from ‘stan::scalar_type_t<T2> model_exp_namespace::model_exp::log_prob_impl(VecR&, VecI&, std::ostream*) const [with bool propto__ = false; bool jacobian__ = false; VecR = std::vector<double>; VecI = std::vector<int>; stan::require_vector_like_t<VecR>* <anonymous> = 0; stan::require_vector_like_vt<std::is_integral, VecI>* <anonymous> = 0; stan::scalar_type_t<T2> = double; std::ostream = std::basic_ostream<char>]’
stanExports_exp.h:378:49:   required from ‘T__ model_exp_namespace::model_exp::log_prob(std::vector<T_l>&, std::vector<int>&, std::ostream*) const [with bool propto__ = false; bool jacobian__ = false; T__ = double; std::ostream = std::basic_ostream<char>]’
/usr/lib/R/library/StanHeaders/include/src/stan/services/optimize/newton.hpp:55:47:   required from ‘int stan::services::optimize::newton(Model&, const stan::io::var_context&, unsigned int, unsigned int, double, int, bool, stan::callbacks::interrupt&, stan::callbacks::logger&, stan::callbacks::writer&, stan::callbacks::writer&) [with Model = model_exp_namespace::model_exp]’
/usr/lib/R/library/rstan/include/rstan/stan_fit.hpp:502:41:   required from ‘int rstan::{anonymous}::command(rstan::stan_args&, Model&, Rcpp::List&, const std::vector<long unsigned int>&, const std::vector<std::__cxx11::basic_string<char> >&, RNG_t&) [with Model = model_exp_namespace::model_exp; RNG_t = boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014, 0, 2147483563>, boost::random::linear_congruential_engine<unsigned int, 40692, 0, 2147483399> >; Rcpp::List = Rcpp::Vector<19>]’
/usr/lib/R/library/rstan/include/rstan/stan_fit.hpp:1215:18:   required from ‘SEXPREC* rstan::stan_fit<Model, RNG_t>::call_sampler(SEXP) [with Model = model_exp_namespace::model_exp; RNG_t = boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014, 0, 2147483563>, boost::random::linear_congruential_engine<unsigned int, 40692, 0, 2147483399> >; SEXP = SEXPREC*]’
stanExports_exp.cc:15:87:   required from here

@SteveBronder
Collaborator Author

@t4c1 I think I sorted out the includes. But we do need accumulator.hpp files in rev and fwd so the class can see what sum() is doing. Do we need to include all of prim in rev.hpp? We don't do that for fwd.hpp.

@hsbadr no worries! Yeah that should be fixed in the last PR

@SteveBronder
Collaborator Author

SteveBronder commented Jul 16, 2021

@t4c1 I failed miserably to sort out the includes issue. I went back to the specialization, but to sweeten the pot I added an actual var specialization that is a hair more than a duplicate, along with a custom sum function. The results seem nice at small N, but any optimization here gets dominated by other costs at large N.

This PR

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
toss_me<end_val2>                1487954812 ns   1487575457 ns            1
acc_var_bench/2/manual_time            42.1 ns         94.1 ns     16655419
acc_var_bench/4/manual_time            48.6 ns          106 ns     14408267
acc_var_bench/8/manual_time            60.4 ns          129 ns     11047393
acc_var_bench/16/manual_time           85.4 ns          179 ns      8207120
acc_var_bench/32/manual_time            150 ns          312 ns      4691824
acc_var_bench/64/manual_time            247 ns          526 ns      2823901
acc_var_bench/128/manual_time           461 ns         1004 ns      1514423
acc_var_bench/256/manual_time           842 ns         1865 ns       832958
acc_var_bench/512/manual_time          1625 ns         3635 ns       431515
acc_var_bench/1024/manual_time         3155 ns         7089 ns       222256
acc_var_bench/2048/manual_time         6307 ns        14116 ns       110949
acc_var_bench/4096/manual_time        12632 ns        28209 ns        55723
acc_eigen_bench/2/manual_time          44.6 ns         95.5 ns     15960111
acc_eigen_bench/4/manual_time          44.9 ns         99.0 ns     15279944
acc_eigen_bench/8/manual_time          52.1 ns          121 ns     14030183
acc_eigen_bench/16/manual_time         67.7 ns          169 ns     10395467
acc_eigen_bench/32/manual_time          100 ns          264 ns      7015727
acc_eigen_bench/64/manual_time          159 ns          438 ns      4567182
acc_eigen_bench/128/manual_time         270 ns          810 ns      2575293
acc_eigen_bench/256/manual_time         483 ns         1511 ns      1450463
acc_eigen_bench/512/manual_time         943 ns         2944 ns       740283
acc_eigen_bench/1024/manual_time       1884 ns         5802 ns       369808
acc_eigen_bench/2048/manual_time       3767 ns        11534 ns       186112
acc_eigen_bench/4096/manual_time       7387 ns        22947 ns        94357

Develop

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
toss_me<end_val2>                1445162611 ns   1445059034 ns            1
acc_var_bench/2/manual_time            68.8 ns          129 ns     10290601
acc_var_bench/4/manual_time            93.8 ns          159 ns      7484019
acc_var_bench/8/manual_time             126 ns          206 ns      5533023
acc_var_bench/16/manual_time            166 ns          278 ns      4191120
acc_var_bench/32/manual_time            240 ns          414 ns      2927313
acc_var_bench/64/manual_time            402 ns          695 ns      1740759
acc_var_bench/128/manual_time           624 ns         1181 ns      1134512
acc_var_bench/256/manual_time          1044 ns         2088 ns       666959
acc_var_bench/512/manual_time          1898 ns         3926 ns       369688
acc_var_bench/1024/manual_time         3573 ns         7574 ns       196242
acc_var_bench/2048/manual_time         6651 ns        14570 ns       106540
acc_var_bench/4096/manual_time        13296 ns        29081 ns        53815
acc_eigen_bench/2/manual_time          73.3 ns          134 ns      9303031
acc_eigen_bench/4/manual_time           104 ns          172 ns      6819813
acc_eigen_bench/8/manual_time           135 ns          216 ns      5236040
acc_eigen_bench/16/manual_time          178 ns          286 ns      3899934
acc_eigen_bench/32/manual_time          249 ns          423 ns      2832374
acc_eigen_bench/64/manual_time          398 ns          683 ns      1720897
acc_eigen_bench/128/manual_time         646 ns         1201 ns      1105065
acc_eigen_bench/256/manual_time        1078 ns         2126 ns       646179
acc_eigen_bench/512/manual_time        1945 ns         3975 ns       357639
acc_eigen_bench/1024/manual_time       3623 ns         7594 ns       194075
acc_eigen_bench/2048/manual_time       6694 ns        14565 ns       104872
acc_eigen_bench/4096/manual_time      13313 ns        28993 ns        53636

stan/math/rev/fun.hpp Show resolved Hide resolved
stan/math/rev/fun/accumulator.hpp Show resolved Hide resolved
stan/math/rev/fun/sum.hpp Show resolved Hide resolved
* @tparam T Type of scalar added
*/
template <typename T>
class accumulator<fvar<T>> {
Contributor

I still think we don't need a fwd specialization for accumulator. At worst we might need a .hpp file for it that just includes /fwd/fun/sum.hpp and prim/fun/accumulator.hpp.

Collaborator Author

So I tried removing this and it failed to compile. This is definitely an include issue, but a pretty big one that I think should be tackled in a separate PR, since it's going to take some time to sort out. I'm going to make an issue about it today or Sunday.

Contributor

I would agree with this reasoning if the issue had been present in develop before this PR. But this PR introduces the issue of duplicated code, so I don't think we should merge it until that issue is resolved.

I am fine with resolving includes in a separate PR, but I would wait with merging of this PR until the includes are resolved.

Collaborator Author

So I'm pretty sure this issue actually did exist before this PR, but we didn't notice. The fundamental problem is in the godbolt below:

https://godbolt.org/z/je89G573e

Before the class definition of accumulator we need to have the definition of sum() available in order to compile with the var-specialized sum. It was working before because one of the definitions of sum() in prim is

template <typename T>
inline T sum(const std::vector<T>& m) {
  return std::accumulate(m.begin(), m.end(), T{0});
}

Which will work for var, double, fvar, etc. We can have the include order any way we want when we are just including functions, but when we include classes we need to make sure that the definitions the class uses are available before the class is declared. We don't see this error very often because most things in Stan math are just functions (though this error did just pop up again with apply_scalar_unary, where the definition for apply_scalar_unary<var> did not show up before prim's log() function).
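The lookup rule at issue can be shown with a minimal self-contained example (hypothetical names, not Stan's headers): inside a template, an unqualified call like sum(x) is resolved from overloads visible at the template's definition point, plus ADL at instantiation — and ADL only searches the argument type's own namespace, so an overload declared after the template is never found for types that live elsewhere (std::vector, Eigen types, ...).

```cpp
#include <cassert>
#include <vector>

namespace math {

// Overload visible *before* the template: found by definition-point lookup.
inline double sum(const std::vector<double>& v) {
  double s = 0;
  for (double x : v) s += x;
  return s;
}

template <typename T>
double accumulate_all(const T& container) {
  // Only overloads declared above this line, or found via ADL in the
  // argument's own namespace, can be chosen for this call.
  return sum(container);
}

// A sum() overload declared down here would be too late for the call above
// whenever the argument type is not in namespace math -- which is the
// include-order failure described in the comment above.

}  // namespace math
```

Moving the first overload below accumulate_all makes the call fail to compile for std::vector arguments, mirroring the error seen when the rev/fwd sum() headers weren't included before the accumulator class.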

I can take a shot at fixing it today, but it's going to require some fancy footwork and a good headache to make sure we get the includes correct. I'm worried this is going to take me a while to fix, and this is stopping the new matrix stuff from working in the compiler, so if you're okay with it I'd prefer merging this PR and then tackling this stuff in a separate issue.

Contributor

We did have multiple issues with include order before, but we never duplicated classes just because of that. So far these issues could be solved by changing the order of includes.

Duplicating the code is a new problem introduced in this PR and I would like it resolved before this is merged.

Maybe I will be able to take some time to look at it today.

Collaborator Author

^ if you can I'd totally appreciate it. I took another stab at the includes today and couldn't find an answer that wasn't funky.

@SteveBronder
Collaborator Author

@t4c1 with your changes stuff looks good! I moved the buffer size up to 128, as in benchmarking that seemed to help a bit, though I don't think we want to be too greedy.

------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations
------------------------------------------------------------------------------
acc_var_bench<100>/2/manual_time          40.8 ns         89.0 ns     17280640
acc_var_bench<120>/2/manual_time          40.4 ns         90.1 ns     17374213
acc_var_bench<128>/2/manual_time          40.7 ns         90.5 ns     17251757

acc_var_bench<100>/4/manual_time          44.7 ns         96.5 ns     15511693
acc_var_bench<120>/4/manual_time          45.0 ns         98.3 ns     15430375
acc_var_bench<128>/4/manual_time          44.3 ns         96.1 ns     15785844

acc_var_bench<100>/8/manual_time          55.5 ns          121 ns     13047579
acc_var_bench<120>/8/manual_time          55.2 ns          121 ns     12570005
acc_var_bench<128>/8/manual_time          56.2 ns          122 ns     12417876

acc_var_bench<100>/16/manual_time         79.0 ns          164 ns      8740883
acc_var_bench<120>/16/manual_time         78.0 ns          170 ns      8970188
acc_var_bench<128>/16/manual_time         79.6 ns          172 ns      8933042

acc_var_bench<120>/32/manual_time          120 ns          250 ns      5799881
acc_var_bench<128>/32/manual_time          124 ns          259 ns      5624234
acc_var_bench<100>/32/manual_time          122 ns          254 ns      5728747

acc_var_bench<100>/64/manual_time          221 ns          437 ns      3137435
acc_var_bench<120>/64/manual_time          235 ns          453 ns      3002865
acc_var_bench<128>/64/manual_time          254 ns          482 ns      2804482

acc_var_bench<100>/128/manual_time         456 ns          876 ns      1526690
acc_var_bench<120>/128/manual_time         451 ns          885 ns      1551612
acc_var_bench<128>/128/manual_time         459 ns          892 ns      1531022

acc_var_bench<100>/256/manual_time         887 ns         1635 ns       792193
acc_var_bench<120>/256/manual_time         869 ns         1643 ns       816929
acc_var_bench<128>/256/manual_time         880 ns         1649 ns       786979

acc_var_bench<100>/512/manual_time        1730 ns         3116 ns       406622
acc_var_bench<120>/512/manual_time        1719 ns         3159 ns       403906
acc_var_bench<128>/512/manual_time        1706 ns         3143 ns       411588

acc_var_bench<100>/1024/manual_time       3652 ns         6485 ns       192958
acc_var_bench<120>/1024/manual_time       3385 ns         6173 ns       206805
acc_var_bench<128>/1024/manual_time       3454 ns         6293 ns       203522

acc_var_bench<100>/2048/manual_time       7074 ns        12538 ns        96779
acc_var_bench<120>/2048/manual_time       6790 ns        12317 ns       100413
acc_var_bench<128>/2048/manual_time       6722 ns        12215 ns       103835

acc_var_bench<100>/4096/manual_time      14554 ns        25698 ns        49463
acc_var_bench<120>/4096/manual_time      13787 ns        24965 ns        51024
acc_var_bench<128>/4096/manual_time      13710 ns        24890 ns        50741

Develop

-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
acc_var_bench/2/manual_time          69.7 ns          129 ns     10065519
acc_var_bench/4/manual_time          98.5 ns          163 ns      6982089
acc_var_bench/8/manual_time           124 ns          201 ns      5576763
acc_var_bench/16/manual_time          166 ns          267 ns      4149516
acc_var_bench/32/manual_time          236 ns          392 ns      2987745
acc_var_bench/64/manual_time          394 ns          655 ns      1771915
acc_var_bench/128/manual_time         640 ns         1203 ns      1107580
acc_var_bench/256/manual_time        1106 ns         2158 ns       626029
acc_var_bench/512/manual_time        1979 ns         4019 ns       351968
acc_var_bench/1024/manual_time       3659 ns         7666 ns       188863
acc_var_bench/2048/manual_time       6919 ns        14853 ns       101546
acc_var_bench/4096/manual_time      13460 ns        29253 ns        51913
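The fixed-size-buffer behavior being tuned in the benchmarks above can be sketched as follows — hypothetical code, not Stan's implementation: values are staged in a small fixed array, and once it fills (128 entries here, matching the tuned size) it is collapsed into a single running partial sum, so the backing storage never grows with N.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>

// Hypothetical accumulator with a bounded staging buffer. flush() folds the
// staged entries into a single partial sum, keeping memory use constant.
template <typename T, std::size_t BufSize = 128>
class partial_sum_accumulator {
  std::array<T, BufSize> buf_{};
  std::size_t n_ = 0;
  T partial_{0};

  void flush() {
    partial_ += std::accumulate(buf_.begin(), buf_.begin() + n_, T{0});
    n_ = 0;
  }

 public:
  void add(const T& x) {
    if (n_ == BufSize) flush();  // collapse before the buffer overflows
    buf_[n_++] = x;
  }

  T sum() {
    flush();  // fold in whatever is still staged
    return partial_;
  }
};
```

Larger BufSize amortizes the flush over more adds at the cost of a bigger resident buffer, which is the greediness trade-off mentioned above.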

@SteveBronder
Collaborator Author

@t4c1 I'm still seeing the error on Jenkins when it's doing the jumbo tests. I'm going to add back the fvar specialization

@SteveBronder
Collaborator Author

@t4c1 I also had to put back the buffers for fwd and prim. I'm not sure why, but at -O3 those were giving me really weird errors when we just accumulated directly.

@SteveBronder
Collaborator Author

@serban-nicusor-toptal is one of the servers down? I'm seeing Remote call on i-019563a37b28ab26e failed on Jenkins

@serban-nicusor-toptal
Contributor

Hey @SteveBronder, there's a small chance we lose a machine when it gets outbid, so I think that may have happened there. I've restarted the job and I'm going to watch it until it's green.

@SteveBronder
Collaborator Author

Awesome thanks!

@stan-buildbot
Contributor


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.06 3.02 1.01 1.21% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 1.0 -0.48% slower
eight_schools/eight_schools.stan 0.12 0.11 1.08 7.19% faster
gp_regr/gp_regr.stan 0.16 0.16 1.03 3.13% faster
irt_2pl/irt_2pl.stan 5.87 5.83 1.01 0.66% faster
performance.compilation 88.81 87.0 1.02 2.04% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.56 8.64 0.99 -0.86% slower
pkpd/one_comp_mm_elim_abs.stan 31.04 29.57 1.05 4.75% faster
sir/sir.stan 126.54 134.32 0.94 -6.15% slower
gp_regr/gen_gp_data.stan 0.04 0.04 1.0 -0.41% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 3.04 3.01 1.01 1.09% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.4 0.38 1.06 5.76% faster
arK/arK.stan 2.55 2.53 1.01 1.13% faster
arma/arma.stan 0.65 0.83 0.77 -29.35% slower
garch/garch.stan 0.65 0.56 1.15 13.19% faster
Mean result: 1.00884186101

Jenkins Console Log
Blue Ocean
Commit hash: 027ce0e


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@hsbadr
Member

hsbadr commented Jul 23, 2021

@SteveBronder I'm still getting the following error when compiling a model:

stan/math/prim/fun/accumulator.hpp:41:10: error: request for member ‘push_back’ in ‘((stan::math::accumulator<double>*)this)->stan::math::accumulator<double>::buf_’, which is of non-class type ‘double’
   41 |     buf_.push_back(x);
      |     ~~~~~^~~~~~~~~

@SteveBronder
Collaborator Author

@hsbadr can you make sure you are on this PR's most recent version? This PR uses std::vector<T> as the internal buffer, which has a push_back() method (see below)

https://github.com/stan-dev/math/pull/2535/files#diff-2548a5f18c39b22415d34677bc70827e37b0288c674f17fbca37cb73356ad48bL26

@hsbadr
Member

hsbadr commented Jul 23, 2021

@hsbadr can you make sure you are on this PR's most recent version? This PR uses std::vector<T> as the internal buffer, which has a push_back() method (see below)

https://github.com/stan-dev/math/pull/2535/files#diff-2548a5f18c39b22415d34677bc70827e37b0288c674f17fbca37cb73356ad48bL26

My bad! Just wanted to comment before merging, after it's been approved. Updating my sources has fixed the error. Thanks!

@SteveBronder
Collaborator Author

@t4c1 it looks like this is failing upstream on the stan opencl tests? Seems unrelated to this PR


@t4c1
Contributor

t4c1 commented Jul 26, 2021

Yeah, it does not seem connected. I tried restarting the test and it consistently fails. However, I cannot reproduce the error locally.

@stan-buildbot
Contributor


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.1 2.98 1.04 4.1% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.98 -2.2% slower
eight_schools/eight_schools.stan 0.11 0.1 1.09 8.04% faster
gp_regr/gp_regr.stan 0.16 0.16 1.02 2.35% faster
irt_2pl/irt_2pl.stan 5.86 5.89 1.0 -0.48% slower
performance.compilation 87.81 87.42 1.0 0.45% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.76 8.59 1.02 1.9% faster
pkpd/one_comp_mm_elim_abs.stan 30.08 30.41 0.99 -1.08% slower
sir/sir.stan 127.21 127.89 0.99 -0.54% slower
gp_regr/gen_gp_data.stan 0.03 0.04 0.96 -3.68% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 3.28 2.98 1.1 9.17% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.42 0.39 1.08 7.15% faster
arK/arK.stan 1.89 1.87 1.01 1.18% faster
arma/arma.stan 0.93 0.83 1.13 11.66% faster
garch/garch.stan 0.64 0.53 1.2 16.94% faster
Mean result: 1.04174255293

Jenkins Console Log
Blue Ocean
Commit hash: 5b1992e


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@SteveBronder SteveBronder merged commit 76383e8 into develop Jul 27, 2021
@SteveBronder SteveBronder deleted the feature/varmat-accumulator branch August 17, 2021 14:28