Multithreading performance regression #32701

Closed
mbauman opened this issue Jul 26, 2019 · 18 comments
Labels
multithreading (Base.Threads and related functionality), performance (Must go faster), regression (Regression in behavior compared to a previous version)

Comments

@mbauman
Member

mbauman commented Jul 26, 2019

@oschulz identified a fairly major performance regression with multithreading within SolidStateDetectors; I confirmed the regression, further minimized it, and pushed it to https://github.com/mbauman/MultithreadPerf32701. There's still room for further minimization, but in short:

shell> env JULIA_NUM_THREADS=1 ~/julia/release-1.1/julia --project=. benchmark.jl
[ Info: 1.1.1-pre.2: 1 threads
BenchmarkTools.Trial: 
  memory estimate:  160 bytes
  allocs estimate:  6
  --------------
  minimum time:     15.537 μs (0.00% GC)
  median time:      15.618 μs (0.00% GC)
  mean time:        15.669 μs (0.00% GC)
  maximum time:     33.078 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
shell> env JULIA_NUM_THREADS=20 ~/julia/release-1.1/julia --project=. benchmark.jl
[ Info: 1.1.1-pre.2: 20 threads
BenchmarkTools.Trial: 
  memory estimate:  160 bytes
  allocs estimate:  6
  --------------
  minimum time:     3.963 μs (0.00% GC)
  median time:      4.445 μs (0.00% GC)
  mean time:        4.487 μs (0.00% GC)
  maximum time:     6.338 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7
shell> env JULIA_NUM_THREADS=1 julia --project=. benchmark.jl
[ Info: 1.3.0-alpha.10: 1 threads
BenchmarkTools.Trial: 
  memory estimate:  864 bytes
  allocs estimate:  9
  --------------
  minimum time:     20.437 μs (0.00% GC)
  median time:      20.601 μs (0.00% GC)
  mean time:        20.992 μs (0.00% GC)
  maximum time:     73.357 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
shell> env JULIA_NUM_THREADS=20 julia --project=. benchmark.jl
[ Info: 1.3.0-alpha.10: 20 threads
BenchmarkTools.Trial: 
  memory estimate:  14.94 KiB
  allocs estimate:  142
  --------------
  minimum time:     60.928 μs (0.00% GC)
  median time:      76.579 μs (0.00% GC)
  mean time:        78.658 μs (1.86% GC)
  maximum time:     7.549 ms (97.44% GC)
  --------------
  samples:          10000
  evals/sample:     1
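
The actual reproduction script lives in the repository linked above; purely to illustrate the shape of this kind of benchmark (the kernel and names below are a hypothetical stand-in, not the real benchmark.jl), it boils down to timing a cheap @threads loop under different JULIA_NUM_THREADS settings:

using Base.Threads
using BenchmarkTools

# Hypothetical stand-in kernel: a cheap elementwise update inside @threads,
# so the measurement is dominated by scheduling overhead rather than by work.
function kernel!(out::Vector{Float64}, x::Vector{Float64})
    @threads for i in eachindex(x)
        @inbounds out[i] = sqrt(abs(x[i])) + log1p(abs(x[i]))
    end
    return out
end

x = rand(10_000)
out = similar(x)

@info "$(VERSION): $(nthreads()) threads"
display(@benchmark kernel!($out, $x))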
@mbauman added the performance (Must go faster), regression (Regression in behavior compared to a previous version), and multithreading (Base.Threads and related functionality) labels Jul 26, 2019
@JeffBezanson
Member

I'm hoping #32477 will help this at least a bit.

@mbauman
Member Author

mbauman commented Jul 26, 2019

Yes, it does help quite a bit — especially in the 1-thread case — but the multithreading case is still nowhere near 1.1.

$ JULIA_NUM_THREADS=1 julia --project=. benchmark.jl
[ Info: 1.3.0-alpha.19: 1 threads
BenchmarkTools.Trial:
  memory estimate:  928 bytes
  allocs estimate:  9
  --------------
  minimum time:     16.000 μs (0.00% GC)
  median time:      16.115 μs (0.00% GC)
  mean time:        16.530 μs (0.00% GC)
  maximum time:     58.516 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
[master:?][arctic1:~/test_thread_perf]
1 $ JULIA_NUM_THREADS=20 julia --project=. benchmark.jl
[ Info: 1.3.0-alpha.19: 20 threads
BenchmarkTools.Trial:
  memory estimate:  14.97 KiB
  allocs estimate:  140
  --------------
  minimum time:     17.536 μs (0.00% GC)
  median time:      30.455 μs (0.00% GC)
  mean time:        33.053 μs (3.91% GC)
  maximum time:     5.826 ms (97.33% GC)
  --------------
  samples:          10000
  evals/sample:     1

@oschulz
Contributor

oschulz commented Jul 27, 2019

CC @lmh91, who brought this to our attention.

@RalphAS

RalphAS commented Jul 29, 2019

The second-most frequent class of threading questions seems to be simple loops with small computational load, often dominated by overhead, where this issue is a killer.

To further quantify the matter, I benchmarked various versions of axpy!, for Float64 vectors. (The practical application is obviously not axpy!, but a similar thing without optimized library support.) My baseline is OpenBLAS running 8 threads on 8 cores. "bcast" in the plots is the obvious unthreaded broadcast statement.

Here is v1.3.0-alpha with #32477 applied (N is vector length, nt is thread count):
[plot: v1_3_axpy_bench — axpy! timings vs. N for several thread counts]

Observations:

  • Unthreaded broadcast does as well as OpenBLAS for small N.
  • The current overhead of Tasks is crippling in the intermediate range.
  • On my system I need at least 4 threads to get nearly full memory bandwidth (the lower bound for large N), so broadcast falls behind there, but any multithreading scheme eventually benefits.

Here is v1.1.1:
[plot: v1_axpy_bench — axpy! timings vs. N for several thread counts]

The old threading scheme was very competitive in the intermediate range.

Here is my threaded implementation for completeness:

using Base.Threads  # for @threads

function axpy_thread(a, x, y) # yes, I know I switched x and y
    n = length(x)
    if length(y) != n
        throw(ArgumentError("size mismatch"))
    end
    n8 = n >> 3                    # number of complete 8-element chunks
    @threads for j in 0:n8-1       # one chunk per loop iteration
        @inbounds begin
            @simd for i = 1:8
                x[8j+i] += a * y[8j+i]
            end
        end
    end
    @inbounds for j in 8n8+1:n     # serial remainder
        x[j] += a * y[j]
    end
    return x
end

This is not optimized; hardware-dependent tuning can probably shave off something like half for "small" N in Julia v1.1.
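
For completeness, a rough sketch of how one might drive such a comparison at a single vector length (an assumed harness, not RalphAS's actual benchmark and plotting code; it reuses the axpy_thread function defined above):

using LinearAlgebra, BenchmarkTools

# Assumed driver comparing the three variants at one vector length N.
axpy_bcast!(a, x, y) = (x .+= a .* y; x)   # the "obvious unthreaded broadcast"

function compare_axpy(N)
    a = 2.0
    x = rand(N); y = rand(N)
    t_blas   = @belapsed BLAS.axpy!($a, $y, $x)     # OpenBLAS baseline (x += a*y)
    t_bcast  = @belapsed axpy_bcast!($a, $x, $y)
    t_thread = @belapsed axpy_thread($a, $x, $y)    # threaded version from above
    return (blas = t_blas, bcast = t_bcast, threads = t_thread)
end

compare_axpy(10_000)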

@StefanKarpinski
Member

Just to set expectations, while there are certainly performance improvements that will come before 1.3 final, it's likely that 1.3 will not quite match 1.2's limited but well-tuned threading. It will get there in time, however. If you don't need composable threading and you cannot take any kind of performance hit, it is possible that you will need to skip 1.3 entirely.

@oschulz
Contributor

oschulz commented Jul 30, 2019

Sure - we did encounter this issue with 1.2-rc2 though. :-)

@vtjnash
Member

vtjnash commented Jul 30, 2019

Stefan's numbering is off by one. The 1.1 release (or 1.0) should be used as the comparison. By 1.4 it should be better tuned. v1.2 should be skipped entirely, and you'll need to evaluate whether v1.3 is sufficiently advanced by release (or submit patches to enhance it and accelerate the work).

@oschulz
Contributor

oschulz commented Jul 30, 2019

Stefan's numbering is off by one.

Ah, right, sorry. I just wanted to confirm that the startup delay we see is the same on v1.2-rc2 and v1.3-alpha, in case that information is helpful - but I guess that was clear to you guys already.

v1.2 should be skipped entirely

We'll actually need v1.2 for some things, because of #31670 (thanks a lot for fixing that one, @vtjnash!). But I'm certainly not complaining - I do understand that the partr scheduler is a major change and that some foundations for it (please correct me if that's wrong) were already laid in the upcoming v1.2.

I'm sure we can work around the increased scheduling overhead for now (e.g. by running threads longer and implementing a barrier to sync). I didn't mean to convey impatience - I'm very excited by the new possibilities, and I understand that it may take some time to perfect this.

or submit patches to enhance it and accelerate the work

I would if I could - but I'm not sure I can (accelerate that work) :-). From what @JeffBezanson told me at JuliaCon, the inner workings of this are ... tricky (which may be an understatement).
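
For what it's worth, a hedged sketch of the workaround mentioned above (running tasks longer and syncing with a barrier): the chunked_map! helper below is hypothetical and assumes the per-element work is independent, but it shows the idea of spawning one longer-lived task per thread and paying the scheduling cost once per call rather than once per small threaded region.

using Base.Threads

# Hypothetical workaround: spawn one longer-lived task per thread and use the
# final waits as a barrier, instead of entering many short @threads regions.
function chunked_map!(f, out::AbstractVector, x::AbstractVector)
    n = length(x)
    nchunks = min(nthreads(), n)
    chunks = collect(Iterators.partition(1:n, cld(n, nchunks)))
    tasks = map(chunks) do idxs
        Threads.@spawn for i in idxs
            @inbounds out[i] = f(x[i])
        end
    end
    foreach(wait, tasks)   # barrier: all chunks finished before returning
    return out
end

# Usage: x = rand(10_000); chunked_map!(sin, similar(x), x)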

@RalphAS

RalphAS commented Aug 3, 2019

I'm not in a position to offer patches, but here's some data to help those who are:

Profiling blames much of the overhead on this section of multiq_deletemin in partr.c:

        rn1 = cong(heap_p, cong_unbias, &ptls->rngseed);
        rn2 = cong(heap_p, cong_unbias, &ptls->rngseed);
        prio1 = jl_atomic_load(&heaps[rn1].prio);
        prio2 = jl_atomic_load(&heaps[rn2].prio);

This is part of a block executed hundreds of times per task-startup[1]. Although the profiler points to cong (a simple random-number generator), it seems more plausible that the delay is actually in the atomic loads. In support of this claim, Linux perf reports more than 20% L1 cache misses (and more than 5% L2 misses) for the axpy example described above[2].

I think I've seen benchmarks showing that cilk_for is not so encumbered; IIUC it is also task-based. That makes me all the more hopeful (if impatient) that Julia's new threading model can be both versatile and efficient.


Edit: notes added for clarification:
[1] The "hundreds of times per task" comes from instrumenting partr.c with thread-friendly counters. From what I understand about the multiqueue algorithm, the large iteration count seems to indicate a (well-hidden) bug.
[2] The cache misses were measured on binaries without the counters.
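
For anyone who wants to reproduce this kind of measurement, here is a rough sketch of a profiling run that surfaces the C frames from partr.c (the kernel is a stand-in, and the exact hot frames will depend on the Julia build):

using Base.Threads, Profile

# Stand-in kernel with little work per @threads region, so scheduler frames dominate.
function kernel!(x, y, a)
    @threads for i in eachindex(x)
        @inbounds x[i] += a * y[i]
    end
end

x = rand(4096); y = rand(4096)
kernel!(x, y, 2.0)                 # compile before profiling
Profile.clear()
@profile for _ in 1:50_000
    kernel!(x, y, 2.0)
end
Profile.print(C = true)            # C = true keeps runtime frames such as multiq_deletemin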

@oschulz
Contributor

oschulz commented Aug 24, 2019

Looks like this issue has disappeared, or at least been substantially mitigated (at least in our application), in Julia v1.2.0 and v1.3.0-rc1. @lmh91 is running a few more tests to confirm.

@shipengcheng1230

At least for element-wise computations, performance is still not ideal under >= 1.2, which I originally posted about here:

using Base.Threads
using LinearAlgebra
using BenchmarkTools

const N = 100000
a = rand(N)
b = rand(N)

function foo(a, b)
    @threads for i in eachindex(a)
        a[i] = (a[i] + b[i]) ^ (a[i] - b[i])
    end
end

@btime foo($a, $b)

And a simple test:

~/codes » /home/opt/julia-1.1.1/bin/julia test.jl                                     pshi@discover
  75.924 μs (1 allocation: 32 bytes)
---------------------------------------------------------------------------------------------------------
~/codes » /home/opt/julia-1.2.0/bin/julia test.jl                                     pshi@discover
  114.931 μs (133 allocations: 13.53 KiB)

@awllee

awllee commented Sep 4, 2019

It seems there is substantial new overhead just with the @threads call.

function foo(N)
    Threads.@threads for i in 1:N
    end
end

using BenchmarkTools
@btime foo(10000)

gives (the value of N does not matter)

v1.1: 427.692 ns (1 allocation: 32 bytes)
v1.2: 18.264 μs (28 allocations: 2.25 KiB)
v1.3-rc1: 15.304 μs (30 allocations: 2.83 KiB)

It does matter a fair bit for functions with small amounts of work per thread. I'm seeing approximately a 20% performance hit for some tasks. Not a big deal, but I look forward to if/when it is resolved.

If it's relevant, a lot of time on v1.2/v1.3-rc1 is spent in /usr/lib/system/libsystem_kernel.dylib : __psynch_cvwait whereas practically no time is spent there on v1.1.

@oschulz
Contributor

oschulz commented Sep 9, 2019

Looks like this issue has disappeared, or at least been substantially mitigated (at least in our application)

Have to correct myself, we still see a performance regression. We'll put together some benchmark numbers.

@oschulz
Contributor

oschulz commented Jun 8, 2020

Looks like v1.5 does reduce the scheduler overhead quite a bit. Here's a new test, measuring @threads overhead on a 64-core CPU (AMD Epyc 7702P):

             v1.0      v1.4     v1.5.0-beta1.0
1 thread     0.07 μs   4 μs     2 μs
8 threads    3 μs      26 μs    11 μs
64 threads   5 μs      180 μs   50 μs

So the situation has definitely improved. We're still far from pre-partr times, and the scheduler overhead still scales strongly with the number of parallel tasks (though not quite linearly). Is that scaling behavior of the overhead inherent due to the way partr works, or could that possibly change in the future? I do realize that this is highly non-trivial, and that the flexibility that partr offers (wouldn't want to miss it) can't come for free.

Test code:

using Base.Threads
using BenchmarkTools

function foo!(A::AbstractArray, n::Integer)
    for i in 1:n
        @threads for j in eachindex(A)
            A[j] += A[j] |> abs |> log |> abs |> log |> abs |> log |> abs |> log
        end
    end

    A
end

function bar!(A::AbstractArray, n::Integer)
    for i in 1:n
        for j in eachindex(A)
            A[j] += A[j] |> abs |> log |> abs |> log |> abs |> log |> abs |> log
        end
    end

    A
end

A = rand(nthreads())

@time foo!(A, 10^2);
@time bar!(A, 10^2);

@benchmark foo!($A, 100)
@benchmark bar!($A, 100)

@oschulz
Contributor

oschulz commented Jun 9, 2020

The difference between v1.4 and v1.5 seems to be due to the different way @threads is implemented.

Using the test below (with 64 threads on 64 cores), I get about 35 us per iteration (35 ms in total for n = 1000) in do_work with both v1.4 and v1.5.0-beta1.0. However, I get 190 us per iteration in do_work2 with v1.4, compared to only 50 us with v1.5.0-beta1.0.

Interestingly, do_work is still clearly faster than do_work2 with v1.5.0-beta1.0 (35 us vs. 50 us per iteration), even though do_work is pretty much what the internal threading_run function does. So maybe there's still a little bit of room for optimization in @threads somewhere?

using Base.Threads
using BenchmarkTools

task_func(i::Integer) = let i = i
    () -> begin
    end
end

function do_work(n::Integer)
    for i in 1:n
        # ccall(:jl_enter_threaded_region, Cvoid, ())
        try
            tasks = [Task(task_func(i)) for i in 1:nthreads()]

            for i in eachindex(tasks)
                t = tasks[i]
                t.sticky = true
                ccall(:jl_set_task_tid, Cvoid, (Any, Cint), t, i-1)
                schedule(t)
            end

            for i in eachindex(tasks)
                wait(tasks[i])
            end
        finally
            # ccall(:jl_exit_threaded_region, Cvoid, ())
        end    
    end
end

function do_work2(n::Integer)
    for i in 1:n
        @threads for i in 1:nthreads()
        end
    end
end


@time do_work(10^3)
@time do_work2(10^3)

@benchmark do_work(10^3)
@benchmark do_work2(10^3)

@oschulz
Contributor

oschulz commented Jun 9, 2020

Interestingly, do_work is still clearly faster than do_work2 with v1.5.0-beta1.0 (35 us vs. 50 us per iteration), even though do_work is pretty much what the internal threading_run function does. So maybe there's still a little bit of room for optimization in @threads somewhere?

The difference seems to be due to the ccall(:jl_enter_threaded_region, Cvoid, ()) and ccall(:jl_exit_threaded_region, Cvoid, ()) in threading_run. Is that something I should use when using sticky tasks, or spawned tasks in general?
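
For reference, a sketch of the earlier do_work loop with those region calls enabled, which is roughly what the internal threading_run did at the time (a sketch only; the authoritative version is in Base and has since changed):

using Base.Threads

task_func(i::Integer) = () -> nothing   # trivial task body, standing in for the definition above

# do_work from above with the threaded-region calls enabled; approximate only.
function do_work_region(n::Integer)
    for _ in 1:n
        ccall(:jl_enter_threaded_region, Cvoid, ())
        try
            tasks = [Task(task_func(i)) for i in 1:nthreads()]
            for i in eachindex(tasks)
                t = tasks[i]
                t.sticky = true
                ccall(:jl_set_task_tid, Cvoid, (Any, Cint), t, i - 1)
                schedule(t)
            end
            foreach(wait, tasks)
        finally
            ccall(:jl_exit_threaded_region, Cvoid, ())
        end
    end
end

@time do_work_region(10^3)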

@vtjnash closed this as completed Aug 10, 2022
@vtjnash
Member

vtjnash commented Aug 10, 2022

In my measurements, these all seemed to be generally faster now.

@oscardssmith
Member

Can you post the plots?
