dense-matrix mul!(C, A, B, alpha, beta) allocates #46865
Isn't that because you're accessing non-`const` globals?
[EDIT: It probably allocates more often than not, but not always, so not deterministic, but if I recall not data-dependent. BLAS threads, see below?] No, it allocates N times; the call goes through `julia/stdlib/LinearAlgebra/src/matmul.jl`, line 639 (at commit 99225ab).
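A minimal sketch to reproduce the per-call allocation, measured inside a function so that non-`const` globals cannot confound the result (the helper name `bytes_per_mul` is made up for illustration):

```julia
using LinearAlgebra

# alpha and beta arrive as runtime arguments, i.e. the non-constant case.
function bytes_per_mul(C, A, B, alpha, beta)
    mul!(C, A, B, alpha, beta)                   # warm up: compile first
    return @allocated mul!(C, A, B, alpha, beta)
end

A, B = randn(16, 16), randn(16, 16)
C = zero(A)
bytes_per_mul(C, A, B, 1.0, 0.5)  # nonzero on affected versions such as 1.8.1
```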
It's strange; it's defined like this (approximately, the `BlasFloat` method in `matmul.jl`):
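```julia
@inline function mul!(C::StridedMatrix{T}, A::StridedVecOrMat{T}, B::StridedVecOrMat{T},
                      alpha::Number, beta::Number) where {T<:BlasFloat}
    return gemm_wrapper!(C, 'N', 'N', A, B, MulAddMul(alpha, beta))
end
```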
but I ruled out `MulAddMul` itself (running/timing it separately), and `gemm_wrapper!` doesn't seem to be the problem either (though maybe not for all of `BlasFloat`; I cloned `mul!` to a `mul2!` to test, but substituted `Float64`).
I did (do still?) suspect threading, because I actually DID get 3*N allocations (on 1.7.3, and likely I'm remembering right for 1.8.1 too), but then that went away, down to 0 [N] allocations, and then I got N allocations again. I pasted it verbatim to show I'm not wrong (and the function definitions should be dummy redefinitions).
I ran again on 1.7.3 and got 3 allocations. Maybe I ruled out regular threads, but not BLAS threads.
Running on 1.8.1 (trying to rule out BLAS threads, though I'm not sure `OMP_NUM_THREADS` does anything in my version) and then running the same on 1.7.3, I got three times the allocations on 1.7.3 (likely not always).
I think I ruled out (BLAS) threads as the issue below. In that terminal, however, I was running `julia`, not `julia -t1`.
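To separate the two kinds of threading explicitly, a small sketch (`OMP_NUM_THREADS` only matters for some BLAS builds):

```julia
using LinearAlgebra

Threads.nthreads()       # Julia threads; set at startup, e.g. `julia -t1`
BLAS.get_num_threads()   # BLAS threads, independent of Julia's thread count
BLAS.set_num_threads(1)  # pin BLAS to one thread to rule it out
```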
I added version info to the report above. On my system the allocation behavior (post compilation) seems to be deterministic, and it must be something in the `mul!` implementation itself.
This version also does not allocate:

```julia
function manymul(N, out, op, in, alpha, beta)
    _add = LinearAlgebra.MulAddMul{false, false, typeof(alpha), typeof(beta)}(alpha, beta)
    for _ in 1:N
        LinearAlgebra.gemm_wrapper!(out, 'N', 'N', op, in, _add)
        out, in = in, out
    end
    out
end
```

The difference is that a fully-concrete `MulAddMul` is constructed explicitly. Even if constant propagation were working to eliminate this here, I guess this would not fix the non-constant case, where e.g. alpha and beta are values taken from a vector of Float64? In my actual code, this is what happens, so I care about the non-constant case. I would call this a bug; using the explicit `MulAddMul{false, false, …}` construction, as above, works around it.
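A minimal sketch of why the explicit construction helps: compare what type inference concludes for the two constructors.

```julia
using LinearAlgebra

# The convenience constructor branches on isone(alpha)/iszero(beta), so its
# inferred return type is a non-concrete MulAddMul (one of four concrete types):
Base.return_types(LinearAlgebra.MulAddMul, (Float64, Float64))

# The fully parameterized constructor is concrete by construction:
Base.return_types(LinearAlgebra.MulAddMul{false, false, Float64, Float64},
                  (Float64, Float64))
```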
Now that I'm back at the computer, I'm more and more convinced the only thing here is that you're accessing non-`const` globals:

```julia
julia> using LinearAlgebra

julia> function manymul(N, C, A, B, alpha, beta)
           for i in 1:N
               mul!(C, A, B, alpha, beta)
               C, A = A, C
           end
           C
       end
manymul (generic function with 1 method)

julia> const D = 16;

julia> const A = randn(D, D);

julia> const B = randn(D, D);

julia> const C = zero(A);

julia> const N = 100000;

julia> @time manymul(N, C, A, B, 1.0, 0.5);
  0.075428 seconds
```

So I don't see what's the issue here.
Yes. I guess why it still allocates with local variables (unlike with `const` globals) might be a clue (they should work the same?!), and the differing numbers of allocations for the same code also puzzle me.
Interesting! I also see 0 allocations in @giordano's example. However:

```julia
D = 16
const A = randn(D, D)
const B = randn(D, D)
const C = zero(A)
const N = 100000
const alphas = [1.0]

@time manymul(N, C, A, B, alphas[1], 0.5)  # allocates N times
```

Considering also @PallHaraldsson's latest evidence, it seems we rely on constant propagation of alpha and beta to avoid allocation, and it is quite fragile. Personally, I don't think we should be relying on constant prop. here, at least not in the BLAS case (I don't think reading alpha or beta from a vector is a particularly crazy thing to do), but also the cases where alpha and beta really are constants seem to be fragile in ways I wouldn't have expected. @PallHaraldsson's example is particularly odd! Actually, it's also very surprising to me that having non-const globals for A, B, and C makes a difference vs. @giordano's example. I might have expected some constant dispatch overhead when calling `manymul`, but not N extra allocations.
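For completeness, the non-`const`-global variant that was reported to allocate, assuming the `manymul` definition above (a sketch):

```julia
using LinearAlgebra

D = 16                    # same setup as @giordano's example, minus const
A = randn(D, D); B = randn(D, D); C = zero(A)
N = 100000

@time manymul(N, C, A, B, 1.0, 0.5)  # reported to allocate N times on 1.8.1
```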
You mean non-deterministic. I would have thought the compiler compiles the same code the same way each time for the same inputs; is that not to be relied upon, and why might that be?
I didn't observe any non-determinism. Constant propagation depends on the compiler figuring out that certain values are constant, so that they can be folded into the inferred types as compile-time constants.
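One way to check whether that happened is to inspect inference directly; a sketch:

```julia
using LinearAlgebra

alpha, beta = 1.0, 0.5

# With runtime (non-constant) alpha/beta, the constructor's return type is
# reported as a non-concrete MulAddMul, which is what ends up allocated:
@code_warntype LinearAlgebra.MulAddMul(alpha, beta)
```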
I mean non-deterministic (allocations, then none, then back on), as I showed there. Yes, I did "compile", but with the same code in-between. I doubt I can help more; I started debugging this thinking it might be simple, but it seems above my (zero) pay-grade. Not that I don't want to help.
Huh, that is indeed puzzling. Btw, I took a peek at […]. At least in […].
Just want to add that these allocations, although small, can be disastrous for scaling with threaded parallelism due to GC contention. And yes, we have seen this in real code. It's also easy to make a simple example that shows this:

```julia
# start julia with nonzero thread count!
function manymul_threads(N, Cs, A, B, alpha, beta)
    Threads.@threads for C in Cs
        manymul(N ÷ length(Cs), C, A, B, alpha, beta)
    end
    Cs
end

BLAS.set_num_threads(1)
Cs = [zero(A) for _ in 1:10]

@time manymul(N, C, A, B, 1.0, 0.5)          # as previous examples, for comparison
@time manymul_threads(N, Cs, A, B, 1.0, 0.5) # same number of allocations as above, but *slower* at 4 threads

using SparseArrays
Asp = sparse(A)
@time manymul(N, C, Asp, B, 1.0, 0.5)        # slower than dense, but *does not allocate*
@time manymul_threads(N, Cs, Asp, B, 1.0, 0.5) # actually faster than the single-threaded version
```
Oh man, this issue has history. JuliaLang/LinearAlgebra.jl#684, #29634 |
Worth noting that the 2x2 and 3x3 versions of the tests above also allocate on 1.8.1 (any propagated constants are not making it as far as `matmul2x2!`/`matmul3x3!`).
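A quick check of the small-size paths (sketch; the `allocs` helper is made up, and alpha/beta stay non-constant from the callee's perspective):

```julia
using LinearAlgebra

allocs(C, A, B, alpha, beta) = @allocated mul!(C, A, B, alpha, beta)

A2, B2 = randn(2, 2), randn(2, 2)
C2 = zero(A2)
allocs(C2, A2, B2, 1.0, 0.5)  # first call includes compilation
allocs(C2, A2, B2, 1.0, 0.5)  # still nonzero on 1.8.1: the 2x2 path allocates too
```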
I wrote a macro that pulls the value-dependent branching for `MulAddMul` construction out of the call:

```julia
macro mambranch(expr)
    expr.head == :call || throw(ArgumentError("Can only handle function calls."))
    for (i, e) in enumerate(expr.args)
        e isa Expr || continue
        if e.head == :call && e.args[1] == :(LinearAlgebra.MulAddMul)
            local asym = e.args[2]
            local bsym = e.args[3]
            local e_sub11 = copy(expr)
            e_sub11.args[i] = :(LinearAlgebra.MulAddMul{true, true, typeof($asym), typeof($bsym)}($asym, $bsym))
            local e_sub10 = copy(expr)
            e_sub10.args[i] = :(LinearAlgebra.MulAddMul{true, false, typeof($asym), typeof($bsym)}($asym, $bsym))
            local e_sub01 = copy(expr)
            e_sub01.args[i] = :(LinearAlgebra.MulAddMul{false, true, typeof($asym), typeof($bsym)}($asym, $bsym))
            local e_sub00 = copy(expr)
            e_sub00.args[i] = :(LinearAlgebra.MulAddMul{false, false, typeof($asym), typeof($bsym)}($asym, $bsym))
            local e_out = quote
                if isone($asym) && iszero($bsym)
                    $e_sub11
                elseif isone($asym)
                    $e_sub10
                elseif iszero($bsym)
                    $e_sub01
                else
                    $e_sub00
                end
            end
            return esc(e_out)
        end
    end
    throw(ArgumentError("No valid MulAddMul expression found."))
end
```

With the macro, you can write a fully inferable version:

```julia
function manymul(N, out, op, in, alpha, beta)
    for _ in 1:N
        @mambranch LinearAlgebra.gemm_wrapper!(out, 'N', 'N', op, in, LinearAlgebra.MulAddMul(alpha, beta))
        out, in = in, out
    end
    out
end
```

It gets transformed by the macro into approximately this:

```julia
function manymul(N, out, op, in, alpha, beta)
    for _ in 1:N
        ais1, bis0 = isone(alpha), iszero(beta)
        if ais1 && bis0
            LinearAlgebra.gemm_wrapper!(out, 'N', 'N', op, in, LinearAlgebra.MulAddMul{true, true, typeof(alpha), typeof(beta)}(alpha, beta))
        elseif ais1
            LinearAlgebra.gemm_wrapper!(out, 'N', 'N', op, in, LinearAlgebra.MulAddMul{true, false, typeof(alpha), typeof(beta)}(alpha, beta))
        elseif bis0
            LinearAlgebra.gemm_wrapper!(out, 'N', 'N', op, in, LinearAlgebra.MulAddMul{false, true, typeof(alpha), typeof(beta)}(alpha, beta))
        else
            LinearAlgebra.gemm_wrapper!(out, 'N', 'N', op, in, LinearAlgebra.MulAddMul{false, false, typeof(alpha), typeof(beta)}(alpha, beta))
        end
        out, in = in, out
    end
    out
end
```

This does not allocate for either const or variable alpha/beta! Adding something like this to the relevant `LinearAlgebra` call sites would address the issue. I don't see any disadvantages: this branching won't happen at runtime in the const-prop case, and was happening anyway inside the `MulAddMul` constructor.

Edit: See #47088 for a PR.
There ought to be a simpler solution that exploits union-splitting? I couldn't seem to make that happen, though.
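Such an attempt might look like the sketch below (the helper name `make_muladdmul` is made up; the idea is that the caller union-splits over the small `Union` of four concrete `MulAddMul` types, though as noted above this didn't seem to pan out in practice):

```julia
using LinearAlgebra

# Returns one of four concrete MulAddMul types; inference sees a small Union.
function make_muladdmul(alpha, beta)
    if isone(alpha) && iszero(beta)
        LinearAlgebra.MulAddMul{true, true, typeof(alpha), typeof(beta)}(alpha, beta)
    elseif isone(alpha)
        LinearAlgebra.MulAddMul{true, false, typeof(alpha), typeof(beta)}(alpha, beta)
    elseif iszero(beta)
        LinearAlgebra.MulAddMul{false, true, typeof(alpha), typeof(beta)}(alpha, beta)
    else
        LinearAlgebra.MulAddMul{false, false, typeof(alpha), typeof(beta)}(alpha, beta)
    end
end
```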
Any update on this issue? We've been using Static.jl to get around this for now.
Should be fixed in 1.12! #52439 |
On Julia 1.8.1 I noticed that dense-matrix `mul!(C, A, B, alpha, beta)` allocates on every call. Cthulhu suggests this is due to runtime dispatch related to `MulAddMul()`. This can impact performance of e.g. ODE solving involving `mul!()` for small matrix sizes. The example above takes around 10% longer with `mul!()` vs. `gemm!()`, according to BenchmarkTools (single-threaded BLAS). Is this known/intended?
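For reference, the `gemm!()` comparison means calling BLAS directly, which sidesteps the `MulAddMul` machinery entirely; a minimal sketch, assuming `Float64` matrices, no transposition, and plain scalar coefficients:

```julia
using LinearAlgebra

A, B = randn(16, 16), randn(16, 16)
C = zero(A)

# In-place C = 1.0*A*B + 0.5*C as a raw BLAS call, with no per-call allocation:
BLAS.gemm!('N', 'N', 1.0, A, B, 0.5, C)
```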
My `versioninfo()`: