-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add generalized dot product #32739
add generalized dot product #32739
Conversation
I think we also want specializations for For example |
This wouldn't hurt, but I'm not clear on how useful it would be. For any |
@stevengj I had true inner products in mind when I had the square assumption earlier, but I guess we are implementing here the machinery for ternary multiplication in general, in which there is no square and no symmetry assumption? |
Sure, the |
Probably should also have a specialized implementation for |
Given the proposed function signature, this should probably use function dot(x::AbstractVector, A::AbstractMatrix, y::AbstractVector)
(axes(x)..., axes(y)...) == axes(A) || throw(DimensionMismatch())
T = promote_type(promote_type(eltype(x), eltype(A)), eltype(y))
s = zero(T)
@inbounds for j in eachindex(y)
yj = y[j]
temp = zero(T)
@simd for i in eachindex(x)
temp += adjoint(x[i]) * A[i,j]
end
s += temp * yj
end
return s
end (The above also removes some of the early-termination calls; they don't really seem warranted to me) Irregardless of the above, I noticed that this is not doing better than ordinary x′ = rand(N); A′ = rand(N,M); y′ = rand(M) Then, for @btime adjoint($x′)*$A′*$y′ # 193.201 μs (1 allocation: 7.94 KiB)
@btime dot($x′,$A′,$y′) # 312.699 μs (0 allocations: 0 bytes) Though it does better for More timings ...For @btime adjoint($x′)*$A′*$y′ # 1.340 ms (1 allocation: 15.75 KiB)
@btime dot($x′,$A′,$y′) # 1.615 ms (0 allocations: 0 bytes) For @btime adjoint($x′)*$A′*$y′ # 6.418 ms (2 allocations: 31.33 KiB)
@btime dot($x′,$A′,$y′) # 7.083 ms (0 allocations: 0 bytes) |
@thchr Thanks for your comments. You're right, the generic one will need careful optimization, because it competes with BLAS multiplications, which has some overhead due to the allocation of the intermediate result. For very large problems, as you show, this does not outweigh the performance advantage of BLAS multiplication. So, we will need to make sure we catch up with the pure Julia multiplication. |
I got started with some tests for the generic case. More tests will follow, and maybe the structured versions also need a little update. I was including the above comments into the generic version. This is how it looks like now: function dot(x::AbstractVector, A::AbstractMatrix, y::AbstractVector)
(axes(x)..., axes(y)...) == axes(A) || throw(DimensionMismatch())
T = typeof(dot(first(x), first(A), first(y)))
zeroxA = zero(adjoint(first(x))*first(A))
s = zero(T)
@inbounds for j in eachindex(y)
yj = y[j]
if !iszero(yj)
temp = zeroxA
@simd for i in eachindex(x)
temp += adjoint(x[i]) * A[i,j]
end
s += temp * yj
end
end
return s
end There is a little issue with this. I wrote it to make it work for both number and matrix eltypes. In the block matrix case, the generic 3-arg dot allocates each time it adds something to |
Tests are passing! I won't have much/any time the upcoming weeks (due to exams, holidays 🏖 etc), so I'd like to suggest to keep this PR at the current level (modulo minor improvements). The tests cover the number eltype cases and pass. I think I wrote the code such that it should work with block matrices/vectors. However, I'd like to keep testing (and potentially required adjustments/fixes) for an independent PR. Compared to my initial commits, I have also taken care of minimizing memory access whenever possible. Regarding the fully dense version and the comparison to plain multiplication, I'm fine with putting If there is no hurry, I'm also fine with developing the mentioned features in this PR further as time permits. |
I think if the user calls |
stdlib/LinearAlgebra/src/generic.jl
Outdated
function dot(x::AbstractVector, A::AbstractMatrix, y::AbstractVector) | ||
(axes(x)..., axes(y)...) == axes(A) || throw(DimensionMismatch()) | ||
T = typeof(dot(first(x), first(A), first(y))) | ||
zeroxA = zero(adjoint(first(x))*first(A)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like
T = promote_op(dot, eltype(x), eltype(A), eltype(y))
and similar for zeroxA
might be better since that gets inferred (the current formulation isn't, at least on v1.1.1; appears to incur some array-size testing so far as I can tell from @code_warntype
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I know. These two lines are written such that the code can handle block matrices/vectors, so eltypes like Vector
or Matrix
. The issue is that calling one
or zero
on them fails, because the type does not contain any information on the size. And also, the same "first
" approach is taken in the
function dot(x::AbstractArray, y::AbstractArray)
method a few lines above. Of course, this approach still breaks down when we have blocks of varying lengths, so this may not be a good idea either. To handle that, we would need to initialize temp
in every loop separately. I'll push a suggestion for how to fix that.
The one failing test "wasn't me". So this seems to be in rather good shape, I'd say. Is it realistic to include this into Julia v1.3? I think it would be nice to have that feature (if it is complete), and optimize it later if necessary. I have consistently interpreted I have covered all special matrices (bidiag, tridiags, diagonal, uniformscaling, symmetric/hermitian, triangular), what is missing is |
I'm close to pulling the plug for a while, so don't be surprised to see no response. I'd be very happy to find the feature in Julia v1.3, of which feature freeze is in one week, or so? |
Correct: feature freeze is slated for Aug 15th. |
Not sure this is a correct assumption; If I see that there's a specialized implementation for this, I'm going to assume that it's faster than the standard ones, or people wouldn't have bothered writing a specialized implementation. Especially in the common case where x and y are vectors and A is a dense matrix, the allocation of a temporary vector doesn't matter much. Also BLAS might use threading and therefore be (potentially much) faster. At least if this method has the potential to be slower than |
I have tried to improve/fix a couple of things quickly via the web interface; thanks @stevengj for the review. I have also created a NEWS entry. Unfortunately, there is a little merge conflict which I don’t know how to resolve via the web interface. If someone could rebase, that would be much appreciated. |
This should be only relevant to cases like `dot(x, J, y)`, where `x` and `y` are vectors of quaternion vectors, and `J` is a quaternion `UniformScaling`.
Man, who knew that was going to take so long? Thanks for sticking it out, @dkarrasch! |
This request originally came from @cortner's work to add preconditioners to Optim.jl! Can't believe that was in 2016. |
Well, thanks, and thanks also to the reviewers for their patience and care, especially @stevengj! |
I went ahead and added a generalized dot product
dot(x, A, y)
, as discussed in #16573. I shamelessly copied (and slightly adjusted) @simonbyrne's code for the completely sparse case #16573 (comment) and the dense case #16573 (comment), and added methods for structured matrices and sparse matrices/generic vectors. There are presumably a couple of things to do:add tests
NEWS entry (I'm not sure, but I guess we would want to announce this feature)
further optimizations (edit: I tried to reduce memory access similar to the corresponding multiplication functions, such that I hope that avoiding allocating the intermediate multiplication result will pay off)
I'm not an expert in
@simd
and these kind of enhancements, so I'm happy to learn whether one would expect improvements by using such features.This is written with the option of block matrices in mind, and that should be tested obviously.
Since we needzero
of eltypes for type promotion, how would one do that? We don't have StaticArrays to our disposal.If I'm not mistaken, then this would close #16573.
I don't see how one would leverage symmetry ofA
, or the fact thatx === y
.