slowdown in threaded code from julia 1.2 to julia 1.4-DEV #121

Closed
gasagna opened this issue Oct 25, 2019 · 16 comments · Fixed by #151
gasagna commented Oct 25, 2019

I have a package where different threads perform different FFTs, and I observe a significant slowdown on the latest julia-1.4-DEV.

Here is a MWE:

using BenchmarkTools
using LinearAlgebra
using Base.Threads
using FFTW

function setup(N, M=4)
    A = [randn(N, N) for i = 1:M]
    Â = [rfft(A[i], [2, 1]) for i = 1:M]
    plan = plan_rfft(A[1], [2, 1])
    return A, Â, plan
end

function test(A, Â, plan)
    @threads for i = 1:length(A)
        LinearAlgebra.mul!(Â[i], plan, A[i])
    end
end

for N = [32, 64, 128, 256]
    args = setup(N, 4)
    @btime test($args...)
end

On my 2016 MacBook Pro I get these times

davide@vega - Desktop$: export JULIA_NUM_THREADS=1; julia mew.jl 
  19.083 μs (11 allocations: 816 bytes)
  63.617 μs (11 allocations: 816 bytes)
  371.126 μs (11 allocations: 816 bytes)
  2.355 ms (11 allocations: 816 bytes)
davide@vega - Desktop$: export JULIA_NUM_THREADS=1; julia-master mew.jl 
  19.733 μs (9 allocations: 784 bytes)
  64.619 μs (9 allocations: 784 bytes)
  372.343 μs (9 allocations: 784 bytes)
  2.351 ms (9 allocations: 784 bytes)
davide@vega - Desktop$: export JULIA_NUM_THREADS=2; julia mew.jl 
  17.831 μs (18 allocations: 1.48 KiB)
  41.078 μs (17 allocations: 1.47 KiB)
  195.599 μs (17 allocations: 1.47 KiB)
  1.197 ms (17 allocations: 1.47 KiB)
davide@vega - Desktop$: export JULIA_NUM_THREADS=2; julia-master mew.jl 
  115.478 μs (1069 allocations: 84.78 KiB)
  160.183 μs (1108 allocations: 87.83 KiB)
  264.680 μs (631 allocations: 51.31 KiB)
  1.289 ms (630 allocations: 51.30 KiB)
davide@vega - Desktop$: export JULIA_NUM_THREADS=4; julia mew.jl 
  19.946 μs (29 allocations: 2.27 KiB)
  31.612 μs (29 allocations: 2.28 KiB)
  113.469 μs (29 allocations: 2.31 KiB)
  654.074 μs (30 allocations: 2.31 KiB)
davide@vega - Desktop$: export JULIA_NUM_THREADS=4; julia-master mew.jl 
  85.846 μs (1651 allocations: 133.36 KiB)
  67.317 μs (1043 allocations: 86.73 KiB)
  235.660 μs (2230 allocations: 178.41 KiB)
  798.563 μs (2478 allocations: 197.66 KiB)

i.e. there is significant overhead with multiple threads at small array dimensions.

I am not sure whether this belongs in FFTW.jl or in Julia Base, but replacing the LinearAlgebra.mul!(Â[i], plan, A[i]) line with anything else (e.g. A[i] .+= A[i]) does not incur the same penalty, so I am submitting the issue here.

@baggepinnen

Possibly related JuliaDSP/DSP.jl#339

@stevengj
Member

In Julia 1.4, FFTW now uses partr threads (#105), which means that plan_rfft creates a plan that itself (potentially) uses threads. When you wrap the FFTW execution in a @threads for loop, partr then decides how to schedule the loop's tasks and the FFTW tasks among the physical CPU threads.

Unfortunately, spawning a partr thread has a fairly large overhead (less than a physical hardware thread, but much more than e.g. a cilk thread, and far more than a subroutine call), so this leads to a slowdown for a threaded loop of small transforms.
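
A rough way to see that per-task cost (a sketch, not a precise measurement): even spawning and fetching a no-op task takes on the order of microseconds, which is comparable to an entire small transform in the benchmarks above.

using BenchmarkTools, Base.Threads
# Each @spawn creates a partr task; the scheduling/fetch cost alone is paid
# once per task, so a loop of many tiny FFTs pays it over and over.
@btime fetch(Threads.@spawn nothing)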

cc @vtjnash

@baggepinnen

Could PARTR explain why there is a sudden jump in both execution time and memory allocations when Julia is started with 4 threads rather than 2?

# 2 threads
julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial: 
  memory estimate:  10.84 MiB
  allocs estimate:  8255
  --------------
  minimum time:     17.288 ms (0.00% GC)
  median time:      17.457 ms (0.00% GC)
  mean time:        17.661 ms (0.65% GC)
  maximum time:     21.014 ms (0.00% GC)
  --------------
  samples:          283
  evals/sample:     1

# 4 threads
julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial: 
  memory estimate:  118.42 MiB
  allocs estimate:  1308803
  --------------
  minimum time:     91.844 ms (0.00% GC)
  median time:      125.869 ms (25.65% GC)
  mean time:        128.205 ms (19.04% GC)
  maximum time:     251.423 ms (14.35% GC)
  --------------
  samples:          39
  evals/sample:     1

julia> FFTW.fftw_vendor
:fftw

MWE:

using Pkg; pkg"add DSP#master"
using DSP, BenchmarkTools
img = randn(1000,1000);
kernel = randn(35,35);
typeof(img)
typeof(kernel)
@benchmark DSP.conv($img, $kernel)

Julia version info

julia> versioninfo(verbose=true)
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
      Ubuntu 18.04.1 LTS
  uname: Linux 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 x86_64
  CPU: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz: 
              speed         user         nice          sys         idle          irq
       #1  2442 MHz     413888 s      38233 s      84143 s    3291817 s          0 s
       #2  2329 MHz     402000 s      31602 s      84474 s    1964856 s          0 s
       #3  2335 MHz     309327 s      11461 s     106453 s    2008096 s          0 s
       #4  2477 MHz     356923 s      36650 s      85812 s    1956186 s          0 s
       
  Memory: 15.564483642578125 GB (502.1171875 MB free)
  Uptime: 113146.0 sec
  Load Avg:  0.5146484375  0.63671875  0.728515625
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, sandybridge)
Environment:
  JULIA_EDITOR = atom
  JULIA_NUM_THREADS = 4
  MANDATORY_PATH = /usr/share/gconf/ubuntu.mandatory.path
  DEFAULTS_PATH = /usr/share/gconf/ubuntu.default.path
  HOME = /home/fredrikb
  WINDOWPATH = 2
  TERM = xterm-256color
  PATH = /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

@stevengj
Member

Note that if you are launching your own threads and want FFTW to execute its own plans serially, you can just do FFTW.set_num_threads(1) before creating your plans.
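
Applied to the MWE at the top of this issue, that would look roughly like this (a sketch, reusing the setup function from the opening comment):

using FFTW, LinearAlgebra, Base.Threads

FFTW.set_num_threads(1)         # make FFTW plans serial; call before plan creation
A, Â, plan = setup(256, 4)      # setup() as defined in the opening comment
@threads for i in 1:length(A)   # parallelism now comes only from the Julia loop
    LinearAlgebra.mul!(Â[i], plan, A[i])
end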

@baggepinnen

Thanks, setting the number of threads manually worked out wonderfully :)

@ViralBShah
Member

cc @JeffBezanson @Keno

@galenlynch
Contributor

galenlynch commented May 23, 2020

I think I just ran into a case where this issue causes convolution code in DSP.jl to run 43x slower and use 85x more memory, although I'm not yet certain that the root cause is FFTW. Since this is such a problem for DSP convolution, which does not itself use multithreading, I would like to better understand what the current recommendation from FFTW.jl is, and why this is such a problem. The affected function in DSP.jl, conv, performs a large number of small FFT transformations serially, and its performance drops off a cliff when JULIA_NUM_THREADS=4 is set before starting Julia. I'm a bit hesitant to make a change with global effect, like calling FFTW.set_num_threads(1), inside of conv. Perhaps there is another workaround in this situation?
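
One thing that might work (a hypothetical sketch, not something DSP.jl currently does) is to drop the thread count around plan creation and then restore FFTW.jl's load-time default of 4 * nthreads() (quoted from #105 later in this thread). Since set_num_threads is a global setting, this is only safe if no other task is planning FFTs concurrently:

using FFTW

# Hypothetical helper: plan with a single FFTW thread, then restore the
# default that FFTW.jl sets at load time (4 * nthreads() when nthreads() > 1).
function plan_rfft_serial(x)
    FFTW.set_num_threads(1)
    p = plan_rfft(x)
    FFTW.set_num_threads(Threads.nthreads() > 1 ? 4 * Threads.nthreads() : 1)
    return p
end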

@galenlynch
Contributor

Here is an MWE that captures what's happening in DSP's conv:

using LinearAlgebra, FFTW, BenchmarkTools
function foo!(input)
    s = size(input)
    fbuff = similar(input, Complex{eltype(input)}, (div(s[1], 2) + 1, s[2]))
    p = plan_rfft(input)
    ip = plan_brfft(fbuff, s[1])
    for i in 1:53000
        mul!(fbuff, p, input)
        mul!(input, ip, fbuff)
    end
    return input
end
A = rand(8, 8);
@benchmark foo!(A)

With four threads I get:

  memory estimate:  1.05 GiB
  allocs estimate:  10489157
  --------------
  minimum time:     1.099 s (5.75% GC)
  median time:      1.101 s (5.95% GC)
  mean time:        1.105 s (5.85% GC)
  maximum time:     1.122 s (5.63% GC)
  --------------
  samples:          5
  evals/sample:     1

With one thread I get:

  memory estimate:  8.59 KiB
  allocs estimate:  124
  --------------
  minimum time:     10.278 ms (0.00% GC)
  median time:      10.889 ms (0.00% GC)
  mean time:        11.350 ms (0.00% GC)
  maximum time:     25.835 ms (0.00% GC)
  --------------
  samples:          440
  evals/sample:     1

@giordano
Member

Julia v1.4.1:

julia> using LinearAlgebra, FFTW, BenchmarkTools, Base.Threads

julia> nthreads()
1

julia> function foo!(input)
           s = size(input)
           fbuff = similar(input, Complex{eltype(input)}, (div(s[1], 2) + 1, s[2]))
           p = plan_rfft(input)
           ip = plan_brfft(fbuff, s[1])
           for i in 1:53000
               mul!(fbuff, p, input)
               mul!(input, ip, fbuff)
           end
           return input
       end
foo! (generic function with 1 method)

julia> A = rand(8, 8);

julia> FFTW.set_num_threads(1)

julia> @btime foo!($A);
  12.559 ms (124 allocations: 8.59 KiB)

julia> FFTW.set_num_threads(4)

julia> @btime foo!($A);
  2.904 s (124 allocations: 8.59 KiB)

Julia v1.0.5:

julia> using LinearAlgebra, FFTW, BenchmarkTools, Base.Threads

julia> nthreads()
1

julia> function foo!(input)
           s = size(input)
           fbuff = similar(input, Complex{eltype(input)}, (div(s[1], 2) + 1, s[2]))
           p = plan_rfft(input)
           ip = plan_brfft(fbuff, s[1])
           for i in 1:53000
               mul!(fbuff, p, input)
               mul!(input, ip, fbuff)
           end
           return input
       end
foo! (generic function with 1 method)

julia> A = rand(8, 8);

julia> FFTW.set_num_threads(1)

julia> @btime foo!($A);
  12.586 ms (126 allocations: 8.75 KiB)

julia> FFTW.set_num_threads(4)

julia> @btime foo!($A);
  2.882 s (126 allocations: 8.75 KiB)

@galenlynch
Contributor

Oh sorry, my benchmarks were all on Julia 1.4.1, with JULIA_NUM_THREADS=4 and JULIA_NUM_THREADS=1.

@galenlynch
Contributor

Is there any way to get FFTW's current num_threads, so a function can reversibly alter it?

@giordano
Member

#117

@galenlynch
Contributor

Was FFTW's num_threads always set to the number of Julia threads, or has that changed recently?

@giordano
Member

It's 4 times the number of Julia threads, when they are more than 1:

set_num_threads(nthreads() * 4) # spawn more tasks than threads to help load-balancing
This was introduced in #105.

@galenlynch
Contributor

It seems like one workaround is making plans with the FFTW.PATIENT flag, which according to the FFTW3 docs will change the number of threads depending on the problem size. For the example I gave above, this rescues the performance without having to call set_num_threads.
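
Concretely, for the foo! example above that means passing the flag when the plans are created (a sketch; the flags keyword is forwarded to the FFTW planner):

using FFTW
# Inside foo!, replace the default plans with PATIENT ones; the planner may then
# choose a single-threaded plan for small sizes, at the cost of longer planning.
p  = plan_rfft(input; flags=FFTW.PATIENT)
ip = plan_brfft(fbuff, size(input, 1); flags=FFTW.PATIENT)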

#117 seems like it would be difficult to implement since I can't find an accessor-like function counterpart to fftw_plan_with_nthreads in FFTW3's library index. Perhaps arguments to FFTW.jl's set_num_threads could be cached by the module, allowing the creation of an accessor function to get the last number of threads passed to FFTW3.
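
A sketch of what that caching could look like (hypothetical; this is not FFTW.jl's current API):

# Hypothetical wrapper module that remembers the last value passed to
# FFTW.set_num_threads so callers can query and later restore it.
module FFTWThreadCount

using FFTW

const _nthreads = Ref(1)

function set_num_threads(n::Integer)
    _nthreads[] = n
    FFTW.set_num_threads(n)
end

get_num_threads() = _nthreads[]

end # module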

This seems to me like a very large performance regression when using FFTW.jl "out of the box." I understand that enabling threaded plans by default, without regard to the problem size, inherently involves a performance trade-off between small and large problems. On the one hand, a 240x slowdown of small problems might not be very noticeable when the runtime was so short to begin with, while a ~4x speedup (or however many cores a user has) for large problems might translate into seconds saved. However, if people are making plans, they are probably reusing them many times, and a two-orders-of-magnitude slowdown on small problems can really add up.

Is set_num_threads going to be part of the supported API of this package? I don't see it in the docs.

Some mention of this performance regression in the docs might be helpful. I'll try to throw a PR together sometime if that would be helpful.

galenlynch added a commit to galenlynch/DSP.jl that referenced this issue May 24, 2020
Convolutions in DSP currently rely on FFTW.jl, and a recent change in FFTW.jl
(JuliaMath/FFTW.jl#105) has introduced a large performance regression in `conv`
whenever Julia is started with more than one thread. Since v1 of FFTW.jl,
multi-threaded FFTW transformations are used by default whenever Julia has more
than one thread. This new default causes small FFT problems to run much more
slowly and use much more memory. Since the overlap-save method of `conv` in DSP
breaks a convolution into small convolutions, and therefore performs a large
number of small FFTW transformations, this change can make convolutions slower
by two orders of magnitude and use two orders of magnitude more memory. While
FFTW.jl does not provide an explicit way to set the number of threads used by
an FFTW plan without changing a global variable, generating the plans with the
planning flag set to `FFTW.PATIENT` (instead of the default `MEASURE`) allows
the planner to consider changing the number of threads. Adding this flag to the
plans generated by the overlap-save convolution method seems to rescue the
performance regression on multi-threaded instances of Julia.

Fixes JuliaDSP#399
Also see JuliaMath/FFTW.jl#121
@galenlynch
Contributor

Using the FFTW.PATIENT flag turns out not to be a great workaround, for the perhaps obvious reason that planning can take a very long time (e.g. JuliaDSP/DSP.jl#362).
