Cholesky decomposition rewrite: initial work #150
Conversation
These initial numbers are very promising! I should have some time to give #142 a final review tomorrow and hopefully this PR too. Thanks!
Take whatever time you need :) This PR is far from done, by the way. I've only wrapped the decomposition itself, but I still plan to provide e.g.
#142 has been merged, which I think means you need to rebase.
Benchmarks show significant performance gains.
Rebasing took more effort than I expected, but at least I learned something new in the process. What's remaining is mostly:
Don't know when I'll actually have time to do this. I might find myself with almost no time or ample time in some cases. We'll see :)
The changes so far look good to me!
I think the only thing pending is the old cholesky function deprecation?
@@ -0,0 +1,52 @@
use matrix::BaseMatrixMut;
Maybe this module would be better placed in the matrix folder, so that it is src/matrix/utils.rs?
src/matrix/decomposition/cholesky.rs
Outdated
/// with the factor itself. Instead, we may wish to
/// solve linear systems or compute the determinant or the
/// inverse of a symmetric positive definite matrix.
/// In this case, see the next subsections.///
Stray /// here.
let k_row = a.row(k).raw_slice();
dot(&k_row[0 .. j], &j_row[0 .. j])
};
a[[k, j]] = a[[k, j]] - kj_dot;
We might want to check whether using get_unchecked here improves vectorization (I doubt it).
Please see my other comment. The vast majority of the time should be spent in the kj_dot computation, so I think there's no vectorization to speak of for this single expression (?). However, dot allows for vectorization, and indeed this is where the speed gains demonstrated by the benchmarks come from :)
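To illustrate why the contiguous dot dominates: a slice dot product like the one referenced above boils down to a single loop over two contiguous slices, which the compiler can auto-vectorize. This is only a minimal sketch for illustration, not rulinalg's actual implementation:

```rust
// Minimal sketch of a slice dot product (illustrative, not the library's code).
// Iterating two contiguous &[f64] slices in lockstep gives the compiler a
// straightforward loop to auto-vectorize.
fn dot(xs: &[f64], ys: &[f64]) -> f64 {
    xs.iter().zip(ys.iter()).map(|(x, y)| x * y).sum()
}

fn main() {
    // 1*4 + 2*5 + 3*6 = 32
    let v = dot(&[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0]);
    assert_eq!(v, 32.0);
    println!("{}", v);
}
```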
I had misunderstood what was going on when I made this comment :)
Can't blame you, this algorithm is not so intuitive because it doesn't follow what you would usually do by hand when doing Gaussian elimination, which basically corresponds to the algorithm based on outer products. However, according to Golub & Van Loan, the GAXPY-rich version should only involve about half the memory traffic of the outer product version.
In the future, however, we can make it significantly more cache-efficient by using an algorithm based on (blocked) level 3 BLAS routines. This is a strong motivation for #155, because once we have some optimized and efficient BLAS-3 routines, we can significantly accelerate the various decompositions :)
src/matrix/decomposition/cholesky.rs
Outdated
let mut a = matrix;

// Resolve each submatrix (j .. n, j .. n)
for j in 0 .. n {
Inside this loop I think we might be able to tidy things up in places by using a MatrixSliceMut and get_col_unchecked calls. In theory this should help with vectorization too, but I might be wrong, as columnar slice iteration isn't the most efficient.
Another thought: here we are iterating over columns on the outside and rows on the inside. This means all of our inner loops have more cache misses (the a[[k, j]] access pattern is worse than a[[j, k]] here). Is it possible to transpose the algorithm so that each inner operation is over column access?
These are just thoughts from someone who has spent 2 minutes reading the code. Please ignore these if not relevant.
Hmm, I just realized that the comment on line 128 is somewhat misleading. This is a left-looking algorithm based on GAXPY, instead of the usual right-looking algorithm based on outer products, which is probably how most people think of the LU decomposition. In this case it's not quite right to talk about the submatrix (j .. n, j .. n). I will remove this comment to avoid confusion.
If you look closely, you'll see that for each k, we compute the dot product of parts of the kth and jth rows. This is where the majority of the time of the algorithm is spent. Since we're computing the dot products of two contiguous slices, this should be fairly cache friendly. Now, you'll also notice that once kj_dot has been computed, the last item in the kth row that was accessed in the dot product is the item with index j - 1, which means that the next item would be j, which corresponds exactly to a[[k, j]]. In other words, at the end of computing kj_dot, a[[k, j]] is probably already in cache due to prefetching.
My understanding is thus that this whole operation should be very cache friendly. It is of course possible that I've misunderstood some parts :)
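For readers following along, the access pattern described above can be sketched roughly as follows. This is a simplified stand-in operating on a flat row-major Vec<f64>, not the PR's actual code; the function name and error handling are made up for illustration:

```rust
// Sketch of a left-looking (GAXPY-style) Cholesky factorization, after
// Golub & Van Loan. Overwrites the lower triangle of `a` (row-major,
// n x n) with L such that A = L * L^T. Illustrative only, not rulinalg's API.
fn cholesky_gaxpy(a: &mut [f64], n: usize) -> Result<(), &'static str> {
    for j in 0..n {
        // Update column j: a[k][j] -= dot(a[k][0..j], a[j][0..j]).
        // Both operands are contiguous row prefixes, so this is the
        // cache-friendly dot product discussed above.
        for k in j..n {
            let mut kj_dot = 0.0;
            for p in 0..j {
                kj_dot += a[k * n + p] * a[j * n + p];
            }
            a[k * n + j] -= kj_dot;
        }
        let d = a[j * n + j];
        if d <= 0.0 {
            return Err("matrix is not positive definite");
        }
        // Scale the remainder of column j by the new diagonal entry.
        let r = d.sqrt();
        a[j * n + j] = r;
        for k in (j + 1)..n {
            a[k * n + j] /= r;
        }
    }
    Ok(())
}

fn main() {
    // A = [[4, 2], [2, 3]] has Cholesky factor L = [[2, 0], [1, sqrt(2)]].
    let mut a = vec![4.0, 2.0, 2.0, 3.0];
    cholesky_gaxpy(&mut a, 2).unwrap();
    assert!((a[0] - 2.0).abs() < 1e-12);
    assert!((a[2] - 1.0).abs() < 1e-12);
    assert!((a[3] - 2.0_f64.sqrt()).abs() < 1e-12);
    println!("L = [[{}, 0], [{}, {}]]", a[0], a[2], a[3]);
}
```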
Also, I'm not entirely sure I understand your comment about MatrixSliceMut. Could you perhaps elaborate?
I think the most sensible way to make the code cleaner is to rewrite rows 135 to 143 (the k loop) as a call to a gaxpy function. In fact, this is part of my motivation for #155, because this could be optimized individually and reused in several parts of the library. However, a difficulty here is that the "vector" y in the gaxpy expression y <- y + Ax is not a contiguous slice, but rather a column in a matrix.
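To make the strided-y difficulty concrete, a hypothetical gaxpy kernel might take an explicit stride so that y can be a column of a row-major matrix. The name and signature here are invented for illustration and are not part of rulinalg or #155:

```rust
// Hypothetical kernel for y <- y + A * x where y is strided (e.g. a
// column of a row-major matrix). Illustrative sketch only.
fn gaxpy_strided(y: &mut [f64], stride: usize, a: &[f64], rows: usize, cols: usize, x: &[f64]) {
    for i in 0..rows {
        // Row i of A is contiguous, so this inner dot can vectorize;
        // only the final accumulation into y is strided.
        let mut acc = 0.0;
        for j in 0..cols {
            acc += a[i * cols + j] * x[j];
        }
        y[i * stride] += acc;
    }
}

fn main() {
    // A = [[1, 2], [3, 4]], x = [1, 1]; y is column 0 of a 2x2 matrix
    // stored row-major, so its elements sit at indices 0 and 2 (stride 2).
    let a = vec![1.0, 2.0, 3.0, 4.0];
    let mut m = vec![10.0, 0.0, 20.0, 0.0];
    gaxpy_strided(&mut m, 2, &a, 2, 2, &[1.0, 1.0]);
    assert_eq!(m[0], 13.0); // 10 + (1 + 2)
    assert_eq!(m[2], 27.0); // 20 + (3 + 4)
    println!("{:?}", m);
}
```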
Thanks for explaining - I agree that the current algorithm is better than a transposed version.
What I meant with the MatrixSliceMut comment is not applicable now that I understand what's going on. I was trying to make the case that if we slice from [j.. , j..] we can get slightly nicer code by doing things like:
*a.col_mut(k) -= kj_dot;
But given that we need to take the whole row of a at lines 137-138, I think this would cause borrowing issues. And it only really saves us on that one line.
Ah, yes. That would be a good idea in that case. However, I think it's still possible to do something like that using split_at_mut(), though perhaps the code doesn't get clearer from it, as you'd have to spend additional lines of code appeasing the borrow checker.
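For reference, the split_at_mut() trick for borrowing two distinct rows mutably at once could look like this sketch on a flat row-major buffer. The helper name is hypothetical and the extra plumbing illustrates the "additional lines of code" cost mentioned above:

```rust
// Hypothetical helper: obtain mutable borrows of rows j and k (j < k)
// of a row-major n-column matrix simultaneously. split_at_mut divides
// the buffer at the start of row k, so the two returned slices are
// provably disjoint and the borrow checker is satisfied.
fn two_rows_mut(data: &mut [f64], n: usize, j: usize, k: usize) -> (&mut [f64], &mut [f64]) {
    assert!(j < k, "rows must be distinct and ordered");
    let (head, tail) = data.split_at_mut(k * n);
    (&mut head[j * n..(j + 1) * n], &mut tail[..n])
}

fn main() {
    let mut data: Vec<f64> = (0..6).map(|v| v as f64).collect(); // 2 x 3 matrix
    let (row0, row1) = two_rows_mut(&mut data, 3, 0, 1);
    assert_eq!(row0, &[0.0, 1.0, 2.0]);
    assert_eq!(row1, &[3.0, 4.0, 5.0]);
    // Both rows are mutable at the same time:
    row0[0] += row1[0];
    assert_eq!(data[0], 3.0);
    println!("{:?}", data);
}
```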
use libnum::Float;
use libnum::{Zero, Float};

/// Cholesky decomposition.
Really great documentation here!
About the I actually think In any case, I'm not entirely sure what to do about it. Do you have any suggestions, @AtheMathmo? I think I'm otherwise done with the rest! :)
Sorry for the delay with this. I agree with what you said above. We can't really move the internal_utils module. I also think that making all of For now I think we should keep things as you have them in this PR - evaluating whether users might need the I'll try to find time to give this PR one more look through either later today or tomorrow. From what I remember it should be ready to merge :)
No worries! I just realized I forgot to mention something I wanted to discuss, that I think impacts the merge. I actually made the
I agree that using a
All right, I'll take care of it as soon as I have time. It might not be until after next week; I'm in the last leg before thesis submission, so I'm deliberately trying not to get too distracted by other endeavours at the moment! 😛
No worries. I've been swallowed by school work for the last couple weeks, so you have my sympathy. Good luck with the submission!
OK, I finally found the time to make the changes to
One small point. Once we've resolved that, I am happy to merge :)
src/matrix/decomposition/cholesky.rs
Outdated
}

/// Computes the determinant of the decomposed matrix.
pub fn det(&self) -> T {
What would happen if we tried to cholesky decompose an empty matrix here? From what I can see, we would get an empty matrix within self and this function would return T::one(). I don't think that this is desired behaviour?
I see what you mean. Still, what would you have the determinant of an empty matrix A be? We could define it to be 0, but this would imply that it's singular, and since rank(A) == A.rows() == A.cols(), it is also invertible by the invertible matrix theorem, so we have a rather strange contradiction here. I think the best approach is to define the determinant to be any non-zero number, and so we might as well make it 1 for simplicity.
Oh, and due to https://en.wikipedia.org/wiki/Empty_product and the usual definition of the determinant, one could argue that det(A) == 1 in any case :-)
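The empty-product argument can be made concrete with a small sketch. For a Cholesky factorization A = L L^T, det(A) is the squared product of L's diagonal entries, and Rust's iterator product over an empty diagonal is exactly the empty product, i.e. 1. This is an illustration of the convention, not the actual det implementation:

```rust
// det(A) = det(L)^2 = (product of diag(L))^2 for A = L * L^T.
// Iterator::product over an empty iterator yields 1.0, the empty product,
// so a 0x0 matrix naturally gets determinant 1. Illustrative sketch only.
fn det_from_diag(diag: &[f64]) -> f64 {
    let l_det: f64 = diag.iter().product();
    l_det * l_det
}

fn main() {
    // Empty matrix: the empty product gives det == 1.
    assert_eq!(det_from_diag(&[]), 1.0);
    // A = [[4, 2], [2, 3]] has L-diagonal [2, sqrt(2)] and det(A) = 8.
    let d = det_from_diag(&[2.0, 2.0_f64.sqrt()]);
    assert!((d - 8.0).abs() < 1e-9);
    println!("det = {}", d);
}
```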
Yes, I think this is a good argument.
Do you think it is worth adding a comment to state our choice here? It seems fairly niche but maybe it will save one person a headache in the future...
Yes, I agree! I will do this shortly.
Please could you add the same comment to the FullPivLU that we have just merged?
Certainly!
I think perhaps you may want to delay merging this PR until #160 is in, so that @Prismatica doesn't have to make more tiny adjustments due to potential merge conflicts in
OK, I believe the outstanding issues should have been taken care of now!
EDIT: Note that @Prismatica had already added a note to the docs of
Awesome, thank you!
I started some work on rewriting the Cholesky decomposition in the same style as LU. Along the way, I completely rewrote it using an algorithm from Golub and Van Loan which describes the decomposition in terms of GAXPY operations. This means I can leverage your efficient dot implementation in the utils module. The preliminary benchmarks show a significant improvement (cholesky_nxn is the existing cholesky() implementation).

I just thought I'd put this up here, because I think some of this knowledge can also be applied to LU, although the situation is more complicated because of the additional work on U as well as the fact that one has to pivot.

EDIT: Note that this PR includes the entirety of #142! (will rebase after #142 eventually gets merged)