Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Will @batch effect the Threads.@threads? #82

Open
wly2014 opened this issue Jun 22, 2022 · 6 comments
Open

Will @batch effect the Threads.@threads? #82

wly2014 opened this issue Jun 22, 2022 · 6 comments

Comments

@wly2014
Copy link

wly2014 commented Jun 22, 2022

When I used @batch and Threads.@threads concurrently, I find a confusing problem.

using TimerOutputs
using LinearAlgebra
using Polyester

function testmul_thread(A,B,C)
    Threads.@threads for i in 1:96
        mul!(C[i],A,B)
    end
end
function testmul_batch(A,B,C)
    @batch per=core for i in 1:96
        mul!(C[i],A,B)
    end
end

@show Threads.nthreads() # 6

function testmuls()
    A = rand(100,100)
    B = rand(100,100)
    C = [rand(100,100) for i in 1:96]
    for i in 1:20
        @timeit "testmul_th" testmul_thread(A,B,C)
        @timeit "testmul_batch" testmul_batch(A,B,C)
    end
end

# first time call
testmuls();

reset_timer!()
testmuls()
show(TimerOutputs.get_defaulttimer());

The number of threads is 6. When I only call the function testmul_thread() in testmuls(), the time consumed only 94.6ms.

───────────────────────────────────────────────────────────────────────
                               Time                    Allocations      
                      ───────────────────────   ────────────────────────
   Tot / % measured:       106ms /  89.5%           7.55MiB /   0.8%

 Section      ncalls     time    %tot     avg     alloc    %tot      avg
 ───────────────────────────────────────────────────────────────────────
 testmul_th       20   94.6ms  100.0%  4.73ms   64.4KiB  100.0%  3.22KiB
 ───────────────────────────────────────────────────────────────────────

But, when I call concurrently testmul_thread() and testmul_batch(), the time consumed of testmul_thread() will increase to 417ms:

──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations      
                         ───────────────────────   ────────────────────────
    Tot / % measured:         518ms /  99.4%           7.56MiB /   0.9%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20    417ms   81.2%  20.9ms   64.9KiB   97.7%  3.25KiB
 testmul_batch       20   97.0ms   18.8%  4.85ms   1.56KiB    2.3%    80.0B
 ──────────────────────────────────────────────────────────────────────────

Why the run of function testmul_batch will effect testmul_thread? Dose @batch change something about global configuration of julia threads?

Thank you very much.

@wly2014
Copy link
Author

wly2014 commented Jun 22, 2022

By the way, if I call the two functions separately as follows, the increment of time consuming of testmul_thread will be very little:

for i in 1:20
    @timeit "testmul_th" testmul_thread(A,B,C)
end
for i in 1:20
    @timeit "testmul_batch" testmul_batch(A,B,C)
end

@chriselrod
Copy link
Member

chriselrod commented Jun 22, 2022

This is what ThreadingUtilites.sleep_all_tasks() is for, but it does not seem to help much.

julia> using Polyester, LinearAlgebra, TimerOutputs

julia> BLAS.set_num_threads(1)

julia> function testmul_thread(A,B,C)
           Threads.@threads for i in 1:96
               mul!(C[i],A,B)
           end
       end
testmul_thread (generic function with 1 method)

julia> function testmul_batch(A,B,C)
           @batch per=core for i in 1:96
               mul!(C[i],A,B)
           end
       end
testmul_batch (generic function with 1 method)

julia> @show Threads.nthreads() 
Threads.nthreads() = 10
10

julia> function testmuls()
           A = rand(100,100)
           B = rand(100,100)
           C = [rand(100,100) for i in 1:96]
           for i in 1:20
               @timeit "testmul_th" testmul_thread(A,B,C)
               @timeit "testmul_batch" testmul_batch(A,B,C)
           end
       end
testmuls (generic function with 1 method)

julia> testmuls()

julia> reset_timer!();

julia> testmuls()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:         16.4s /   0.4%           49.9MiB /   0.2%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20   52.2ms   86.1%  2.61ms    109KiB   98.6%  5.46KiB
 testmul_batch       20   8.41ms   13.9%   420μs   1.56KiB    1.4%    80.0B
 ──────────────────────────────────────────────────────────────────────────

julia> using PolyesterWeave

julia> using ThreadingUtilities

julia> function testmuls2()
           A = rand(100,100)
           B = rand(100,100)
           C = [rand(100,100) for i in 1:96]
           for i in 1:20
               @timeit "testmul_th" testmul_thread(A,B,C)
               @timeit "testmul_batch" testmul_batch(A,B,C)
               ThreadingUtilities.sleep_all_tasks()
           end
       end
testmuls2 (generic function with 1 method)

julia> testmuls2()

julia> reset_timer!();

julia> testmuls2()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:         3.36s /   2.9%           7.68MiB /   1.5%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_batch       20   53.3ms   53.8%  2.66ms   1.56KiB    1.3%    80.0B
 testmul_th          20   45.8ms   46.2%  2.29ms    116KiB   98.7%  5.78KiB
 ──────────────────────────────────────────────────────────────────────────

In this example, it does not seem to help much

julia> function testmulsthread()
           A = rand(100,100)
           B = rand(100,100)
           C = [rand(100,100) for i in 1:96]
           for i in 1:20
               @timeit "testmul_th" testmul_thread(A,B,C)
               #@timeit "testmul_batch" testmul_batch(A,B,C)
               #ThreadingUtilities.sleep_all_tasks()
           end
       end
testmulsthread (generic function with 1 method)

julia> testmulsthread()

julia> reset_timer!();

julia> testmulsthread()

julia> TimerOutputs.get_defaulttimer()
 ───────────────────────────────────────────────────────────────────────
                               Time                    Allocations
                      ───────────────────────   ────────────────────────
   Tot / % measured:       4.40s /   0.7%           7.68MiB /   1.5%

 Section      ncalls     time    %tot     avg     alloc    %tot      avg
 ───────────────────────────────────────────────────────────────────────
 testmul_th       20   30.2ms  100.0%  1.51ms    116KiB  100.0%  5.82KiB
 ───────────────────────────────────────────────────────────────────────

and it hurts the performance of @batch a lot more than it helps the performance of @threads.

@wly2014
Copy link
Author

wly2014 commented Jun 23, 2022

Thank you very much.
Actually, it seems to work well for me.
On my laptop,

julia> versioninfo()
Julia Version 1.6.5
Commit 9058264a69 (2021-12-19 12:30 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: AMD Ryzen 7 4800U with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, znver2)

When use 6 julia threads and 1 BLAS thread, the result of function:

function testmuls()
    A = rand(100,100)
    B = rand(100,100)
    C = [rand(100,100) for i in 1:96]
    for i in 1:20
        @timeit "testmul_th" testmul_thread(A,B,C)
        # @timeit "testmul_batch" testmul_batch(A,B,C)
        # ThreadingUtilities.sleep_all_tasks()
    end
end

is

 ───────────────────────────────────────────────────────────────────────
                               Time                    Allocations      
                      ───────────────────────   ────────────────────────
   Tot / % measured:      60.5ms /  82.3%           7.55MiB /   0.8%

 Section      ncalls     time    %tot     avg     alloc    %tot      avg
 ───────────────────────────────────────────────────────────────────────
 testmul_th       20   49.8ms  100.0%  2.49ms   63.8KiB  100.0%  3.19KiB
 ───────────────────────────────────────────────────────────────────────

and the result of function

function testmuls()
    A = rand(100,100)
    B = rand(100,100)
    C = [rand(100,100) for i in 1:96]
    for i in 1:20
        # @timeit "testmul_th" testmul_thread(A,B,C)
        @timeit "testmul_batch" testmul_batch(A,B,C)
        # ThreadingUtilities.sleep_all_tasks()
    end
end

is

 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations      
                         ───────────────────────   ────────────────────────
    Tot / % measured:        68.6ms /  64.2%           7.49MiB /   0.0%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_batch       20   44.0ms  100.0%  2.20ms   1.56KiB  100.0%    80.0B
 ──────────────────────────────────────────────────────────────────────────

Finally, the result of function

function testmuls()
    A = rand(100,100)
    B = rand(100,100)
    C = [rand(100,100) for i in 1:96]
    for i in 1:20
        @timeit "testmul_th" testmul_thread(A,B,C)
        @timeit "testmul_batch" testmul_batch(A,B,C)
        ThreadingUtilities.sleep_all_tasks()
    end
end

is about

 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations      
                         ───────────────────────   ────────────────────────
    Tot / % measured:        97.0ms /  96.7%           7.56MiB /   0.8%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_batch       20   49.2ms   52.5%  2.46ms   1.56KiB    2.4%    80.0B
 testmul_th          20   44.5ms   47.5%  2.23ms   63.3KiB   97.6%  3.17KiB
 ──────────────────────────────────────────────────────────────────────────

I wonder why your result of first time run of testmul_batch is only 8.41ms, which is very different to testmul_thread's 52.2ms, but the time of two function is similar on my laptop.
Thanks again.

@chriselrod
Copy link
Member

chriselrod commented Jun 23, 2022

I wonder why your result of first time run of testmul_batch is only 8.41ms, which is very different to testmul_thread's 52.2ms, but the time of two function is similar on my laptop.
Thanks again.

Hmm, I'm not sure what went wrong earlier.

julia> using Polyester, LinearAlgebra, TimerOutputs
[ Info: Precompiling Polyester [f517fe37-dbe3-4b94-8317-1923a5111588]
ROOT                      :  3.36 %   104349936
LOWERING                  :  5.43 %   168795352
PARSING                   :  4.35 %   135114722
INFERENCE                 :  0.67 %   20892370
CODEGEN                   :  0.66 %   20442216
METHOD_LOOKUP_SLOW        :  0.01 %   265948
METHOD_LOOKUP_FAST        :  0.08 %   2344030
LLVM_OPT                  :  0.84 %   26267292
LLVM_MODULE_FINISH        :  6.51 %   202523080
METHOD_MATCH              :  7.32 %   227515996
TYPE_CACHE_LOOKUP         :  0.46 %   14337602
TYPE_CACHE_INSERT         :  0.06 %   1908924
MACRO_INVOCATION          :  0.03 %   805162
AST_COMPRESS              :  0.11 %   3455434
AST_UNCOMPRESS            :  0.11 %   3338902
SYSIMG_LOAD               :  7.43 %   231040270
ADD_METHOD                : 51.56 %   1602959664
LOAD_MODULE               :  6.18 %   192237612
SAVE_MODULE               :  3.93 %   122197110
INIT_MODULE               :  0.91 %   28354874

julia> BLAS.set_num_threads(1)

julia> function testmul_thread(A,B,C)
           Threads.@threads for i in 1:96
               mul!(C[i],A,B)
           end
       end
testmul_thread (generic function with 1 method)

julia> function testmul_batch(A,B,C)
           @batch per=core for i in 1:96
               mul!(C[i],A,B)
           end
       end
testmul_batch (generic function with 1 method)

julia> @show Threads.nthreads()
Threads.nthreads() = 10
10

julia> function testmuls()
           A = rand(100,100)
           B = rand(100,100)
           C = [rand(100,100) for i in 1:96]
           for i in 1:20
               @timeit "testmul_th" testmul_thread(A,B,C)
               @timeit "testmul_batch" testmul_batch(A,B,C)
           end
       end
testmuls (generic function with 1 method)

julia> testmuls()

julia> reset_timer!();

julia> testmuls()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:         654ms /   9.0%           49.2MiB /   0.2%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20   52.2ms   89.0%  2.61ms    109KiB   98.6%  5.44KiB
 testmul_batch       20   6.47ms   11.0%   323μs   1.56KiB    1.4%    80.0B
 ──────────────────────────────────────────────────────────────────────────

julia> using PolyesterWeave

julia> using ThreadingUtilities

julia> function testmuls2()
           A = rand(100,100)
           B = rand(100,100)
           C = [rand(100,100) for i in 1:96]
           for i in 1:20
               @timeit "testmul_th" testmul_thread(A,B,C)
               @timeit "testmul_batch" testmul_batch(A,B,C)
               ThreadingUtilities.sleep_all_tasks()
           end
       end
testmuls2 (generic function with 1 method)

julia> testmuls2()

julia> reset_timer!();

julia> testmuls2()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:         4.52s /   0.4%           9.63MiB /   1.2%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20   8.14ms   50.1%   407μs    120KiB   98.7%  6.01KiB
 testmul_batch       20   8.10ms   49.9%   405μs   1.56KiB    1.3%    80.0B
 ──────────────────────────────────────────────────────────────────────────

julia> reset_timer!();

julia> testmuls()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:         10.0s /   0.6%           8.18MiB /   1.3%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20   50.3ms   83.4%  2.52ms    109KiB   98.6%  5.46KiB
 testmul_batch       20   10.0ms   16.6%   500μs   1.56KiB    1.4%    80.0B
 ──────────────────────────────────────────────────────────────────────────

julia> reset_timer!();

julia> testmuls2()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:         5.98s /   0.3%           11.3MiB /   1.1%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_batch       20   9.81ms   50.4%   490μs   1.56KiB    1.3%    80.0B
 testmul_th          20   9.64ms   49.6%   482μs    120KiB   98.7%  6.02KiB
 ──────────────────────────────────────────────────────────────────────────

julia> reset_timer!();

julia> testmuls()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:        71.6ms /  84.4%           7.63MiB /   1.4%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20   50.3ms   83.2%  2.51ms    109KiB   98.6%  5.46KiB
 testmul_batch       20   10.1ms   16.8%   507μs   1.56KiB    1.4%    80.0B
 ──────────────────────────────────────────────────────────────────────────

julia> reset_timer!();

julia> testmuls2()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:        23.3ms /  92.0%           7.64MiB /   1.5%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20   11.6ms   54.3%   581μs    119KiB   98.7%  5.95KiB
 testmul_batch       20   9.78ms   45.7%   489μs   1.56KiB    1.3%    80.0B
 ──────────────────────────────────────────────────────────────────────────

julia> reset_timer!();

julia> testmuls()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:        64.9ms /  92.7%           7.63MiB /   1.4%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20   50.0ms   83.1%  2.50ms    109KiB   98.6%  5.46KiB
 testmul_batch       20   10.2ms   16.9%   509μs   1.56KiB    1.4%    80.0B
 ──────────────────────────────────────────────────────────────────────────

julia> reset_timer!();

julia> testmuls2()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:        25.4ms /  83.7%           7.64MiB /   1.6%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20   11.5ms   54.3%   576μs    120KiB   98.7%  6.00KiB
 testmul_batch       20   9.71ms   45.7%   485μs   1.56KiB    1.3%    80.0B
 ──────────────────────────────────────────────────────────────────────────

julia> reset_timer!();

julia> testmuls()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:        65.3ms /  92.5%           7.63MiB /   1.4%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20   50.3ms   83.3%  2.51ms    109KiB   98.6%  5.46KiB
 testmul_batch       20   10.1ms   16.7%   504μs   1.56KiB    1.4%    80.0B
 ──────────────────────────────────────────────────────────────────────────

julia> reset_timer!();

julia> testmuls2()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:        25.6ms /  83.9%           7.64MiB /   1.6%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20   11.7ms   54.5%   585μs    120KiB   98.7%  6.01KiB
 testmul_batch       20   9.77ms   45.5%   489μs   1.56KiB    1.3%    80.0B
 ──────────────────────────────────────────────────────────────────────────

julia> reset_timer!();

julia> testmuls()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:        67.0ms /  90.4%           7.63MiB /   1.4%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20   50.3ms   83.1%  2.52ms    109KiB   98.6%  5.46KiB
 testmul_batch       20   10.2ms   16.9%   511μs   1.56KiB    1.4%    80.0B
 ──────────────────────────────────────────────────────────────────────────

julia> reset_timer!();

julia> testmuls2()

julia> TimerOutputs.get_defaulttimer()
 ──────────────────────────────────────────────────────────────────────────
                                  Time                    Allocations
                         ───────────────────────   ────────────────────────
    Tot / % measured:        23.7ms /  93.2%           7.64MiB /   1.6%

 Section         ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 testmul_th          20   12.4ms   56.0%   618μs    120KiB   98.7%  6.00KiB
 testmul_batch       20   9.70ms   44.0%   485μs   1.56KiB    1.3%    80.0B
 ──────────────────────────────────────────────────────────────────────────

julia> versioninfo()
Julia Version 1.9.0-DEV.824
Commit 5a0685a8a6* (2022-06-21 07:43 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 28 × Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.3 (ORCJIT, skylake-avx512)
  Threads: 10 on 28 virtual cores

Now I see a good improvement from sleep_all_threads(), without much of a penalty in tastmul_batch.

My times are much faster because my CPU has >10 physical cores and AVX512 (14 physical cores, but I only used 10 threads so that we get 2 iterations/thread).

@wly2014
Copy link
Author

wly2014 commented Jun 23, 2022

so that we get 2 iterations/thread

May be not 2 iterations/thread? Because the for iterations of function testmul_thread and testmul_batch are 96, so each thread will get about 96/10 iterations?

@chriselrod
Copy link
Member

so that we get 2 iterations/thread

May be not 2 iterations/thread? Because the for iterations of function testmul_thread and testmul_batch are 96, so each thread will get about 96/10 iterations?

Yes, 96. I should've looked at the code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants