Multithreading performance regression #32701
I'm hoping #32477 will help this at least a bit. |
Yes, it does help quite a bit — especially in the 1-thread case — but the multithreading case is still nowhere near 1.1.
|
CC @lmh91, who brought this to our attention. |
The second-most frequent class of threading questions seems to be simple loops with small computational load, often dominated by overhead, where this issue is a killer. To further quantify the matter, I benchmarked various versions. Here is v1.3.0-alpha with #32477 applied. Observations:

The old threading scheme was very competitive in the intermediate range. Here is my threaded implementation for completeness:

```julia
function axpy_thread(a, x, y)  # yes, I know I switched x and y
    n = length(x)
    if length(y) != n
        throw(ArgumentError("size mismatch"))
    end
    n8 = n >> 3
    @threads for j in 0:n8-1
        @inbounds begin
            @simd for i = 1:8
                x[8j+i] += a * y[8j+i]
            end
        end
    end
    @inbounds for j in 8n8+1:n
        x[j] += a * y[j]
    end
end
```

This is not optimized; hardware-dependent tuning can probably shave off something like half for "small" N. In Julia v1.1. |
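The hardware-dependent tuning mentioned above might, as one hypothetical option, take the form of a serial fallback below a size cutoff, so that small arrays never pay the scheduler startup cost at all. Everything here is a sketch: `axpy_tuned!` and the `AXPY_CUTOFF` value are assumptions, not part of the original benchmark, and the cutoff would need per-machine benchmarking.

```julia
using Base.Threads

# Assumed cutoff; the right value is machine-dependent and must be tuned.
const AXPY_CUTOFF = 4096

# Hypothetical sketch: serial @simd loop for small inputs, threaded loop
# for large ones, so threading overhead is only paid when it can pay off.
function axpy_tuned!(a, x, y)
    if length(x) < AXPY_CUTOFF
        @inbounds @simd for i in eachindex(x, y)
            x[i] += a * y[i]
        end
    else
        @threads for i in eachindex(x)
            @inbounds x[i] += a * y[i]
        end
    end
    return x
end
```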
Just to set expectations, while there are certainly performance improvements that will come before 1.3 final, it's likely that 1.3 will not quite match 1.2's limited but well-tuned threading. It will get there in time, however. If you don't need composable threading and you cannot take any kind of performance hit, it is possible that you will need to skip 1.3 entirely. |
Sure - we did encounter this issue with 1.2-rc2 though. :-) |
Stefan's numbering is off by one. The 1.1 release (or 1.0) should be used as the comparison. By 1.4 it should be better tuned. v1.2 should be skipped entirely, and you'll need to evaluate whether v1.3 is sufficiently advanced by release (or submit patches to enhance it and accelerate the work). |
Ah, right, sorry. I just wanted to confirm that the startup delay we see is the same on v1.2-rc2 and v1.3-alpha, in case that information is helpful - but I guess that was clear to you guys already.
We'll actually need v1.2 for some things, because of #31670 (thanks a lot for fixing that one, @vtjnash!). But I'm certainly not complaining - I do understand that the partr scheduler is a major change and that some foundations for it (please correct me if that's wrong) were already laid in the upcoming v1.2. I'm sure we can work around the increased scheduling overhead for now (e.g. by running threads longer and implementing a barrier to sync). I didn't mean to convey impatience - I'm very excited by the new possibilities, and I understand that it may take some time to perfect this.
I would if I could - but I'm not sure I can (accelerate that work) :-). From what @JeffBezanson told me at JuliaCon, the inner workings of this are ... tricky (which may be an understatement). |
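The workaround mentioned above (running threads longer rather than re-spawning them each iteration) could be sketched roughly as follows. This is only a sketch under stated assumptions: `split_range` and `axpy_longrunning!` are hypothetical names, and no barrier between iterations is needed here only because the chunks of `x` are disjoint and `y` is read-only.

```julia
using Base.Threads

# Hypothetical helper: split 1:n into nt contiguous chunks, return chunk t.
function split_range(n, nt, t)
    chunk = cld(n, nt)
    lo = (t - 1) * chunk + 1
    hi = min(t * chunk, n)
    return lo:hi
end

# Sketch: spawn one long-running task per thread and loop `iters` times
# inside each task, paying the scheduling cost once instead of once per
# outer iteration.
function axpy_longrunning!(a, x, y, iters)
    nt = nthreads()
    @sync for t in 1:nt
        Threads.@spawn begin
            r = split_range(length(x), nt, t)
            for _ in 1:iters
                @inbounds for i in r
                    x[i] += a * y[i]
                end
            end
        end
    end
    return x
end
```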
I'm not in a position to offer patches, but here's some data to help those who are: Profiling blames much of the overhead on this section:

```c
rn1 = cong(heap_p, cong_unbias, &ptls->rngseed);
rn2 = cong(heap_p, cong_unbias, &ptls->rngseed);
prio1 = jl_atomic_load(&heaps[rn1].prio);
prio2 = jl_atomic_load(&heaps[rn2].prio);
```

This is part of a block executed hundreds of times per task-startup[1]. Although the profiler points to [...], I think I've seen benchmarks showing that [...]. Edit: notes added for clarification: |
Looks like this issue has disappeared, or at least been substantially mitigated (at least in our application), in Julia v1.2.0 and v1.3.0-rc1. @lmh91 is running a few more tests to confirm. |
At least element-wise computations still don't perform ideally under >= 1.2, which I posted originally here:

```julia
using Base.Threads
using LinearAlgebra
using BenchmarkTools

const N = 100000
a = rand(N)
b = rand(N)

function foo(a, b)
    @threads for i ∈ eachindex(a)
        a[i] = (a[i] + b[i]) ^ (a[i] - b[i])
    end
end

@btime foo($a, $b)
```

And a simple test:

```
~/codes » /home/opt/julia-1.1.1/bin/julia test.jl
75.924 μs (1 allocation: 32 bytes)
~/codes » /home/opt/julia-1.2.0/bin/julia test.jl
114.931 μs (133 allocations: 13.53 KiB)
```
|
It seems there is substantial new overhead just with an empty `@threads` loop:

```julia
using BenchmarkTools

function foo(N)
    Threads.@threads for i in 1:N end
end

@btime foo(10000)
```

gives (the value of v1.1: [...])

It does matter a fair bit for functions with small amounts of work per thread. I'm seeing approximately a 20% performance hit for some tasks. Not a big deal, but I look forward to if/when it is resolved. If it's relevant, a lot of time on v1.2/v1.3-rc1 is spent in [...]. |
Have to correct myself, we still see a performance regression. We'll put together some benchmark numbers. |
Looks like v1.5 does reduce the scheduler overhead quite a bit. Here's a new test, measuring [...]

So the situation has definitely improved. We're still far from pre-partr times, and the scheduler overhead still scales strongly with the number of parallel tasks (though not quite linearly). Is that scaling behavior of the overhead inherent to the way partr works, or could that possibly change in the future? I do realize that this is highly non-trivial, and that the flexibility that partr offers (wouldn't want to miss it) can't come for free. Test code:

```julia
using Base.Threads
using BenchmarkTools

function foo!(A::AbstractArray, n::Integer)
    for i in 1:n
        @threads for j in eachindex(A)
            A[j] += A[j] |> abs |> log |> abs |> log |> abs |> log |> abs |> log
        end
    end
    A
end

function bar!(A::AbstractArray, n::Integer)
    for i in 1:n
        for j in eachindex(A)
            A[j] += A[j] |> abs |> log |> abs |> log |> abs |> log |> abs |> log
        end
    end
    A
end

A = rand(nthreads())

@time foo!(A, 10^2);
@time bar!(A, 10^2);

@benchmark foo!($A, 100)
@benchmark bar!($A, 100)
```
|
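As a hedged aside, one way to sidestep the per-iteration spawn cost in a test shaped like `foo!` above is to hoist `@threads` to the outer level, so tasks are spawned once rather than `n` times. `foo_hoisted!` is a hypothetical name, and the transformation is valid only because each `A[j]` is updated independently of the others, so the two loops can be swapped.

```julia
using Base.Threads

# Sketch (assumption: elements of A are fully independent, so the loop
# order can be exchanged): spawn tasks once and iterate n times inside
# each task, instead of spawning tasks on every outer iteration.
function foo_hoisted!(A::AbstractArray, n::Integer)
    @threads for j in eachindex(A)
        for i in 1:n
            A[j] += A[j] |> abs |> log |> abs |> log |> abs |> log |> abs |> log
        end
    end
    A
end
```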
The difference between v1.4 and v1.5 seems to be due to the different way [...]. Using the test below (with 64 threads on 64 cores), I get about 35 us per iteration (35 ms in total for n = 1000) in [...]. Interestingly, [...]

```julia
using Base.Threads
using BenchmarkTools

task_func(i::Integer) = let i = i
    () -> begin
    end
end

function do_work(n::Integer)
    for i in 1:n
        # ccall(:jl_enter_threaded_region, Cvoid, ())
        try
            tasks = [Task(task_func(i)) for i in 1:nthreads()]
            for i in eachindex(tasks)
                t = tasks[i]
                t.sticky = true
                ccall(:jl_set_task_tid, Cvoid, (Any, Cint), t, i-1)
                schedule(t)
            end
            for i in eachindex(tasks)
                wait(tasks[i])
            end
        finally
            # ccall(:jl_exit_threaded_region, Cvoid, ())
        end
    end
end

function do_work2(n::Integer)
    for i in 1:n
        @threads for i in 1:nthreads()
        end
    end
end

@time do_work(10^3)
@time do_work2(10^3)

@benchmark do_work(10^3)
@benchmark do_work2(10^3)
```
|
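For comparison, a version of the manual scheduling above written against only the public API might look like the sketch below. Hedged caveat: `do_work_spawn` is a hypothetical name, and `Threads.@spawn` does not pin tasks to particular threads, so this exercises the default scheduler rather than the sticky-task path that `t.sticky` / `jl_set_task_tid` forces.

```julia
using Base.Threads

# Sketch: same shape as do_work above, but using Threads.@spawn and @sync
# instead of manually constructed sticky Tasks and internal ccalls.
function do_work_spawn(n::Integer)
    for _ in 1:n
        @sync for i in 1:nthreads()
            Threads.@spawn i  # trivial task body, mirrors the empty closure
        end
    end
    return nothing
end
```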
The difference seems to be due to the [...] |
In my measurements, these seemed to be generally all faster now |
Can you post the plots? |
@oschulz identified a fairly major performance regression with multithreading within SolidStateDetectors; I confirmed the regression, further minimized it, and pushed it to https://github.com/mbauman/MultithreadPerf32701. There's still room for further minimization, but in short: [...]