uses Eigen's internal multiindexing for multi indexing on vectors and matrices #3250
Conversation
A few questions - might be easier to explain next week over a board.
Have you fed this into your benchmarking framework to make sure it has the desired effect?
@SteveBronder did you get a chance to try this in your stan-benchmark playground?
I'm waiting until I have the math PR for this up as well, so that I can benchmark this and the new matrix type all at once.
Jenkins Console Log — Machine information: Ubuntu 20.04.3 LTS (focal).
@SteveBronder and I discussed some benchmarking he did (which hopefully he can post a nice graph of), which showed that this change did not immediately lead to the improvements we thought it might; that will require some specializations in the arena matrix representation.
But, it also didn't make things any worse, so I'm approving this on its own now.
Running some benchmarks, I don't think we should merge this PR yet. The benchmark can be found here. I took the code that brms generates and compared the new and current multi indexing to a loop, as well as to a raw call from Eigen that does the same thing. The plot is faceted by the number of groups (M) and plots vectors of size N against the average time for each combination of M and N. The indices are set up so that each inner index does the worst-case scenario of random access to elements in the vectors. Each line represents one of the implementations being compared.

The results were very surprising to me. Looping is very consistent with linear time, but it looks like our multi indexing has a kink in it once we cross 50K observations or so, irrespective of the number of groups.

The main reason the new and current multi indexing is slower is the checks; the loop does not have to do any checks. For multi indexing we have to check that each index is valid within the range of the size of the vector. The current multi index does that check as we are creating the index:

```cpp
return plain_type_t<EigVec>::NullaryExpr(
    idx.ns_.size(), [name, &idx, &v_ref](Eigen::Index i) {
      math::check_range("vector[multi] indexing", name, v_ref.size(),
                        idx.ns_[i]);
      return v_ref.coeff(idx.ns_[i] - 1);
    });
```

but in the new code we do that check outside of the loop:

```cpp
for (auto idx_i : idx.ns_) {
  math::check_range("vector[multi] indexing", name, v_size, idx_i);
}
```

I think iterating over the index for the check is causing the main slowdown. The kink in the graph after 50K most likely happens because the index and vectors no longer fit in the lower levels of cache. Having that check inside or outside of the index creation most likely explains why this PR's multi indexing is slower than the current multi indexing.

The nocheck benchmark does a few things that I think we could do in Stan, from most to least performance gain:

I think items 1 to 3 are very feasible and the speed gain seems very worth it.
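To make the check-hoisting idea above concrete, here is a minimal stand-alone sketch in plain C++ (not Stan's actual code; `gather` is a hypothetical helper name): the whole index list is validated once up front, so the gather loop itself runs with no per-element checks, which is the difference the benchmarks attribute most of the slowdown to.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sketch of multi indexing with the range check hoisted out of
// the per-element access. Indices are 1-based, as in Stan.
std::vector<double> gather(const std::vector<double>& v,
                           const std::vector<int>& idx_one_based) {
  // One up-front pass of range checks over the whole index list.
  for (int i : idx_one_based) {
    if (i < 1 || i > static_cast<int>(v.size())) {
      throw std::out_of_range("vector[multi] indexing: " + std::to_string(i));
    }
  }
  // Check-free gather loop.
  std::vector<double> out;
  out.reserve(idx_one_based.size());
  for (int i : idx_one_based) {
    out.push_back(v[i - 1]);
  }
  return out;
}
```

The point of the sketch is only the structure: validation cost is paid once per index list, and the hot loop does nothing but loads and stores.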
But my end recommendation is that we close this PR without merging. I can write up a separate PR that does 1 and 2 from the list above. We can add those signatures and then add the necessary components from the compiler. The good news is we have a reasonable path forward that should give an order-of-magnitude speedup to any brms model that does mixed effects!
I am still not sure the benchmark you posted represents @bob-carpenter's original case he was concerned with, since it doesn't ever end up calling an lpdf. The concern as I understand it was that when these matrices are constructed for use in a distribution, one way of writing it ends up leading to an un-vectorized lpdf, which goes against the other advice we give.
Doing the multi index inside an lpdf would give the same effects.
Actually @WardBrian I want to reopen this. Running the benchmarks on my mac, just turning on |
The plots make it look like the Eigen approach is the best. Am I missing something?

Also, what specifically are you evaluating? I see the label on one of the plots involves multi-indexing with an array and then elementwise multiplication? Usually we just do the indexing for random effects, so I'm not sure where the elementwise multiply is coming from. Normally it'd just be…

I had thought part of the problem with multi-indexing was creating an intermediate, but it's now returning a template expression created by binding:

```cpp
template <typename EigVec, require_eigen_vector_t<EigVec>* = nullptr>
inline auto rvalue(EigVec&& v, const char* name, const index_multi& idx) {
  return stan::math::make_holder(
      [name, &idx](auto& v_ref) {
        return plain_type_t<EigVec>::NullaryExpr(
            idx.ns_.size(), [name, &idx, &v_ref](Eigen::Index i) {
              math::check_range("vector[multi] indexing", name, v_ref.size(),
                                idx.ns_[i]);
              return v_ref.coeff(idx.ns_[i] - 1);
            });
      },
      stan::math::to_ref(v));
}
```

Instead, it looks like the issue is creating the nested… With the Eigen approach, how do you avoid the…?
The code for all of the benchmarks is here. The Eigen benchmark is just doing…

which is a baseline for: if we remove all of the Stan overhead (checks, indexing from 1, expression safety, etc.), what is the best possible performance we could get?
I have all the benchmark code in the link above, and the comment above (comment here) gives all the info for each benchmark. But tl;dr, each line is for…
The elementwise multiplication and multi index is what brms generates for models that do random effects like…
The Nullary expression is only created once and then returned, so it should not be expensive. After all these benchmarks I'm really convinced the slowdown is from the range check. In the graph above, removing the range check gives about an order of magnitude of performance speedup. You can see in the graph in the comment above that all of the Stan versions that do the range check sit near one another, but removing that range check with…
It just uses a zero-based index instead of Stan's 1-based index, since it's a baseline of how fast this operation is minus any Stan changes.
Thanks for all the clarification. That makes a lot more sense now.
Aha! A varying slope model. I just couldn't think of where we'd want to do a multi-index and then an elementwise multiply. In any case, it sounds like the Eigen baseline isn't something that'd be workable in Stan due to the indexing-from-1 problem. Sorry, but no insight about how to fix it easily.

It does seem like the current Stan approach is allocating a new class for each index. Isn't that causing a lot of overhead, or is the overhead all compiled away somehow? I would've thought you could get away with just one level of indirection here.
Np!
Totally agree. I just have it there as a sort of "if we could do the most performant thing, what would that look like?" It's not a feasible answer, but more so there for comparison.
It does make a new class, but that class is an expression, so there won't be any actual allocation of values. There's a small amount of overhead, but most of it is compiled away.

I think we should merge this PR, and if people want speed they can define the macro to remove the range checks. I don't think there's any other way for us to make multi indexing fast.
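To illustrate the point above about expression classes carrying no real allocation cost, here is a small plain-C++ sketch (these are hypothetical types, not Stan's actual ones): the "expression" only stores references to the source vector and the index list, so constructing it copies no values; allocation happens only if and when the expression is materialized.

```cpp
#include <cstddef>
#include <vector>

// Illustrative multi-index "expression": holds references only, so
// construction allocates nothing and copies no values.
struct MultiIndexExpr {
  const std::vector<double>& v;
  const std::vector<int>& idx;  // 1-based, as in Stan
  std::size_t size() const { return idx.size(); }
  // Elements are computed on demand, mirroring Eigen's lazy evaluation.
  double operator[](std::size_t i) const { return v[idx[i] - 1]; }
};

// Values are only allocated when the expression is materialized.
std::vector<double> materialize(const MultiIndexExpr& e) {
  std::vector<double> out(e.size());
  for (std::size_t i = 0; i < e.size(); ++i) {
    out[i] = e[i];
  }
  return out;
}
```

With optimizations on, a wrapper like this typically inlines away entirely, which is why the per-index class itself adds little overhead compared to the range checks.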
Submission Checklist
- `./runTests.py src/test/unit`
- `make cpplint`
Summary
Uses Eigen 3.4's new non-contiguous indexing expressions so that the `rvalue()` functions that use `multi_index` return an expression of the multi indexing instead of eagerly creating a new matrix from the indices.

In the Stan language, doing a loop like the code in (a) is faster than doing a multi index like in (b), because we have to make a giant matrix in (b) for the multi index.
(a)
(b)
Eigen 3.4 introduced an expression for non-contiguous indexing, which we can utilize to avoid creating the big matrix in (b).
Intended Effect
Make vectorized multi indexing as fast as the loop
How to Verify
The current `rvalue` tests pass with no changes.
Side Effects
Nope
Documentation
Copyright and Licensing
Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Simons Foundation
By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses: