-
-
Notifications
You must be signed in to change notification settings - Fork 190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sampling speed improvements #1382
Comments
For
to
see stan-dev/stanc3#1232 for details |
And one more thing: Current partial log link functions as used with the cmdstanr backend do prevent the // compute partial sums of the log-likelihood
real partial_log_lik_lpmf(int[] seq, int start, int end, data vector Y, data matrix Xc, vector b, real Intercept, real sigma, data int[] J_1, data vector Z_1_1, vector r_1_1) {
real ptarget = 0;
int N = end - start + 1;
// initialize linear predictor term
//vector[N] mu; = Intercept + rep_vector(0.0, N);
vector[N] mu = rep_vector(Intercept, N);
/*
for (n in 1:N) {
// add more terms to the linear predictor
int nn = n + start - 1;
mu[n] += r_1_1[J_1[nn]] * Z_1_1[nn];
}
*/
mu += r_1_1[J_1[start:end]] .* Z_1_1[start:end];
ptarget += normal_id_glm_lpdf(Y[start:end] | Xc[start:end], mu, b, sigma);
return ptarget;
} Note that this still needs speed testing (SoA & AoS compiled versions). The goal is to get We will have to let the stanc folks weigh in here, but I think that Here is a small R code to generate the full program as well as the full slightly modified brms program: |
@avehtari would the following work to prevent the
Edit: Figured out how to check that myself. |
@wds15 About the vectorized index expressions for multilevel terms, they still seem to be slower with stan 2.30 than the currently used loops. Here is an example for a varying intercept, varying slope model. Old Stan code (with loops over r_* variables):
New Stan code (without loops over r_* variables):
Corresponding R code: library(cmdstanr)
library(lme4)
sleepstudy_long <- Reduce("rbind", replicate(10, sleepstudy, simplify = FALSE))
sdata <- make_standata(Reaction ~ Days + (Days|Subject), sleepstudy_long)
mod_old <- cmdstan_model("test_old.stan")
mod_old$check_syntax(stanc_options = list("debug-mem-patterns", "O1"))
fit_old <- mod_old$sample(data = sdata)
mod_new <- cmdstan_model("test_new.stan")
mod_new$check_syntax(stanc_options = list("debug-mem-patterns", "O1"))
fit_new <- mod_new$sample(data = sdata) |
@avehtari both the problem with The extended use of GLM functions is a bit more complicated issue (due to the interaction of multiple model parts), and I cannot tackle it now, but will work on it hopefully soon. |
I assume 2.30.1?
Having SoA alone is not sufficient for the speedup. Even if everything else in Stan generated C++ would be perfect, It is possible that if the index is jumping around, and the memory access is still slow. The blog post https://viralinstruction.com/posts/hardware/ illustrates nicely the big differences in speed depending on whether the memory is accessed in order or not. |
Yes. 2.30.1. Thank you for your additional insights. |
Thanks @paul-buerkner for trying out the array index thing. I am surprised to slows things down... will look into it once more myself. I know that Steve worked on it to make these type of indexing fast (having brms in mind), so let's hope it can be resolved. In any case - getting SoA for What I will try to write myself is a Great to see O1 compatibility getting into |
Two potential speed-ups for spline and HSGP models (without random effects)
1. use _glm more often.
E.g. formula
y ~ s(x)
generates code withthis can changed to use _glm (in the cases when there is _glm for the family used)
This provided 80% drop in sampling time. This should be applicable for example for models of the form
y ~ x1 + x2 + s(x1) + s(x2)
2. make code compatible with --O1 SoA (DONE)
See stanc --O1 optimization discourse post https://discourse.mc-stan.org/t/30-40-drop-in-sampling-time-using-stanc-o1-optimizations/28347
Even if not changing to use _glm, the code can be changed to benefit from stanc optimization --O1 and structure-of-arrays memory mapping, e.g. changing
to
drops the sampling time 40% when using --O1. In the first code, use of SoA is blocked by adding data vector
rep_vector(0.0, N)
and by for loop over sigma. This speedup would work also with families that don't have _glm function in Stan.The text was updated successfully, but these errors were encountered: