[WIP] Speed up dense-sparse matmul #38876
Conversation
Yes, please do a nanosoldier run.
This is now carefully checked for the quadratic cases. From the pure type point of view, this is as good as it gets, I believe. But maybe nanosoldier will tell us that we should also include size considerations, let's see. I don't think I can kick-start it, can anybody help out here, please?
@nanosoldier runbenchmarks(ALL, vs=":master")
Seems like it didn't work. Let me try:
I'm approving this since I had reviewed the original PR.
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @christopher-dG
I think my local benchmarks were flawed. I'm now learning how to do it professionally with BaseBenchmarks.jl, and there seem to be some real regressions currently. Sorry for the noise. I'll need some time to try out a few things, rebuild and run the benchmarks again, so marked as WIP.
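(For reference, a hedged sketch of running the relevant suite locally with BaseBenchmarks.jl; the group names "sparse" and "matmul" are assumptions about the suite layout, see the BaseBenchmarks.jl README for the actual groups:)

```julia
using BaseBenchmarks, BenchmarkTools

# Load only the sparse benchmark group into the global SUITE
# (assumed group name).
BaseBenchmarks.load!("sparse")

# Run the sparse matmul sub-suite and print the timings.
results = run(BaseBenchmarks.SUITE[["sparse", "matmul"]]; verbose = true)
```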
Can always convert to a draft PR.
I wonder if we have an update here, and if it might be ready to make it into 1.6.
Sorry, I confused/fooled myself a little bit, but now I think I see more clearly. The benchmarks show it pretty clearly: the regression was probably due to my own modifications (I thought that rearranging the loops so that you walk through the dense factor in memory-optimal order would be beneficial, but it turns out that it's not). So I reverted that part. Let's have another (restricted) benchmark run. By the way, this PR does not include the silent
@nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @christopher-dG
This looks pretty good!!! The remaining few regressions are related to methods that haven't been changed functionally, unless I'm missing something.
That is actually a bit surprising. I would have thought it would at least be as good.
I pointed out the two sources of improvements. The other methods are just polished, but shouldn't change behavior at all.
@inbounds for col in 1:size(A, 2), k in nzrange(A, col)
    Aiα = $t(nzv[k]) * α
    rvk = rv[k]
    @simd for multivec_col in 1:mX
Loop order is the same as before, but the two lines have been hoisted out of the innermost loop, and the innermost loop is @simd'ed. According to nanosoldier, this results in a significant improvement.
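For illustration, here is a self-contained sketch of the pattern this hunk follows (a hypothetical helper, not the actual Base method; the name `densesparse_adj_accum!` and the `C .+= X * A' * α` semantics are my reading of the excerpt):

```julia
using SparseArrays, LinearAlgebra

# Sketch of the hoisting + @simd pattern: accumulate C .+= X * A' * α
# for a sparse A in CSC format.
function densesparse_adj_accum!(C::Matrix, X::Matrix, A::SparseMatrixCSC, α::Number)
    nzv = nonzeros(A)
    rv = rowvals(A)
    mX = size(X, 1)
    @inbounds for col in 1:size(A, 2), k in nzrange(A, col)
        Aiα = adjoint(nzv[k]) * α   # hoisted: one multiply per stored entry
        rvk = rv[k]                 # hoisted: one index lookup per entry
        @simd for multivec_col in 1:mX
            C[multivec_col, rvk] += X[multivec_col, col] * Aiα
        end
    end
    return C
end

# Quick sanity check:
X = rand(4, 3); A = sprand(5, 3, 0.5); C = zeros(4, 5)
densesparse_adj_accum!(C, X, A, 1.0) ≈ X * A'  # should hold
```

Without the hoisting, `adjoint(nzv[k]) * α` and `rv[k]` would be recomputed on every iteration of the innermost loop instead of once per stored entry.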
if X isa StridedOrTriangularMatrix
    @inbounds for col in 1:size(A, 2), k in nzrange(A, col)
        Aiα = nzv[k] * α
        rvk = rv[k]
        @simd for multivec_row in 1:mX
            C[multivec_row, col] += X[multivec_row, rvk] * Aiα
        end
    end
This branch is new, and the loop order is rearranged for an optimal access pattern in both X and C, plus the @simd in the innermost loop. According to nanosoldier, this results in a significant improvement.
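The rationale behind the rearrangement: Julia arrays are column-major, so making the innermost index run down a column of C and X touches contiguous memory. A self-contained toy illustration of the effect (not from the PR):

```julia
using BenchmarkTools

M = rand(2000, 2000)

# Innermost loop walks down a column: contiguous, SIMD-friendly access.
function sum_colmajor(M)
    s = 0.0
    @inbounds for j in axes(M, 2)
        @simd for i in axes(M, 1)
            s += M[i, j]
        end
    end
    return s
end

# Innermost loop walks across a row: strided, cache-unfriendly access.
function sum_rowmajor(M)
    s = 0.0
    @inbounds for i in axes(M, 1)
        for j in axes(M, 2)
            s += M[i, j]
        end
    end
    return s
end

@btime sum_colmajor($M)   # typically several times faster
@btime sum_rowmajor($M)
```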
I'll do a last nanosoldier run. Absent surprises, this is good to go. @nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @christopher-dG
OK, nanosoldier, I want it all green; one last time, and then I'll let you go. @nanosoldier
@nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @christopher-dG
OK, how to proceed? Just merge? Are we going to include this in v1.6? I know it's very late in the release cycle, but this is purely addressing performance: no new features, no change in behavior.
I think it is good to merge. @KristofferC Is it ok to backport? I'm tagging it for now.
* Speed up dense-sparse matmul
* add one at-simd, minor edits
* improve A_mul_Bq for dense-sparse
* revert ineffective changes
* shift at-inbounds annotation

(cherry picked from commit a3369df)
This is a revival of the infamous #24045. Most multiplication codes were already in good shape. Two of the features tested out in that PR were @simd-ing the innermost loop and using muladd. In a few cases, I found an unfortunate memory access pattern, and in one or two cases multiplication by alpha from the wrong side (which matters only for non-commutative number types). Here's a quick benchmarking script; the first respective timing is nightly, the second is this PR.

@Sacha0's "old" tests seemed to indicate that, at the time, the muladd was not helpful, and perhaps the @simd wasn't either. I believe, however, that the big boosts seen here in A*B are due to both the reordering of loops and @simd. Shall we have a nanosoldier run?

EDIT: updated timings. I see variations of about +/- 2 ms, so the improvements here are due to reordering loops and a new @simd annotation in one case. muladd doesn't seem to play any role here.
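(The script and its timing tables did not survive extraction of this page. Below is a hedged sketch of the kind of comparison described; matrix sizes, density, and the use of BenchmarkTools are assumptions:)

```julia
using SparseArrays, LinearAlgebra, BenchmarkTools

n = 1_000
X = rand(n, n)            # dense factor
A = sprand(n, n, 0.01)    # sparse factor, ~1% density
C = zeros(n, n)

@btime mul!($C, $X, $A);      # dense * sparse
@btime mul!($C, $X, $(A'));   # dense * adjoint(sparse)
@btime mul!($C, $A, $X);      # sparse * dense
```

Running such a script once on nightly and once on this branch would give the paired timings referred to above.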