`@batch` slows down other non-@batched code on x86 #110

efaulhaber · 2023-05-30T13:52:12Z

Consider the following functions.

using Polyester

function reset_x!(x)
    x .= 0
end

function without_batch(x)
    for i in 1:100_000
        if x[i] != 0
            # This is never executed, x contains only zeros
            sin(x[i])
        end
    end
end

function with_batch(x)
    @batch for i in 1:100_000
        if x[i] != 0
            # This is never executed, x contains only zeros
            sin(x[i])
        end
    end
end

Running with_batch between reset_x! calls seems to somehow slow down reset_x! significantly:

julia> x = zeros(Int, 1_000_000);

julia> using BenchmarkTools

julia> @benchmark reset_x!($x) setup=without_batch(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  92.173 μs … 426.870 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     94.362 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   94.776 μs ±   3.789 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▂▄▅▆▇██████▇▇▆▅▅▄▃▂▂▂ ▁▁▁  ▁     ▁▁▁▁▁▁▁                ▃
  ▄▄▄▆███████████████████████████▇██▇▇█▇████████████▇▆▆▇▇▆▆▆▆▆ █
  92.2 μs       Histogram: log(frequency) by time       101 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reset_x!($x) setup=with_batch(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  185.404 μs … 518.292 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     188.394 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   189.119 μs ±   4.086 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▅▇█▇▇▇▅▅▃▂                                             
  ▁▁▁▂▃▆▇█████████████▇▆▅▅▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  185 μs           Histogram: frequency by time          198 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 128 × AMD Ryzen Threadripper 3990X 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 8 on 128 virtual cores

I also tested and reproduced this on an Intel i9-10980XE, but the difference there was only ~10% on 8 threads.

CC: @sloede @ranocha @LasNikas.

The text was updated successfully, but these errors were encountered:

chriselrod · 2023-05-30T14:38:42Z

I see the same thing with Threads.@threads.

julia> using Polyester

julia> function reset_x!(x)
           x .= 0
       end
reset_x! (generic function with 1 method)

julia> function without_batch(x)
           for i in 1:100_000
               if x[i] != 0
                   # This is never executed, x contains only zeros
                   sin(x[i])
               end
           end
       end
without_batch (generic function with 1 method)

julia> function with_batch(x)
           @batch for i in 1:100_000
               if x[i] != 0
                   # This is never executed, x contains only zeros
                   sin(x[i])
               end
           end
       end
with_batch (generic function with 1 method)

julia> function with_thread(x)
           Threads.@threads for i in 1:100_000
               if x[i] != 0
                   # This is never executed, x contains only zeros
                   sin(x[i])
               end
           end
       end
with_thread (generic function with 1 method)

julia> 

julia> x = zeros(Int, 1_000_000);

julia> using BenchmarkTools

julia> @benchmark reset_x!($x) setup=without_batch(x)
BenchmarkTools.Trial: 2881 samples with 1 evaluation.
 Range (min … max):  272.081 μs … 319.111 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     273.518 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   273.863 μs ±   2.094 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▅▇▂▄▄██▄▅▅▇▅▁▃▃▂                                              
  ▆████████████████▆▆▇▆▇▇▆▅█▇▅▄▄▃▃▂▂▁▂▁▁▁▂▂▁▂▁▁▁▂▂▁▂▂▁▁▁▁▁▂▂▂▁▂ ▄
  272 μs           Histogram: frequency by time          280 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reset_x!($x) setup=with_batch(x)
BenchmarkTools.Trial: 2809 samples with 1 evaluation.
 Range (min … max):  312.843 μs … 350.793 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     316.288 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   316.484 μs ±   2.223 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁▂      ▂▄▄▆▅▅▄▅▅▅▅▇▇▅▇███▆█▆▄▁▁▃▁▂▂▁▂ ▂ ▁ ▁ ▁               
  ▃███▆▅▆▆▄▆██████████████████████████████████████▆█▇▄▄▃▃▃▂▃▄▄▂ ▆
  313 μs           Histogram: frequency by time          321 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reset_x!($x) setup=with_thread(x)
BenchmarkTools.Trial: 2063 samples with 1 evaluation.
 Range (min … max):  313.300 μs … 392.157 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     359.390 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   358.813 μs ±  10.205 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                    ▂▄▄▄█▄▅▄▇▅▅▄▃▂▂ ▁            
  ▂▃▁▁▂▂▂▂▁▃▂▁▂▂▂▂▃▂▂▃▂▃▃▃▄▃▃▄▄▅▅▇███████████████████▆▆▅▄▄▃▃▃▃▃ ▄
  313 μs           Histogram: frequency by time          381 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

efaulhaber · 2023-05-30T14:44:48Z

You're right. It's even worse for me:

julia> @benchmark reset_x!($x) setup=without_batch(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  91.472 μs … 414.889 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     92.852 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   93.118 μs ±   3.508 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁▃▃▅▅▆▆▇▇████▇▇▆▅▅▄▃▃▂▂▁▁                            ▁     ▃
  ▆████████████████████████████▇▆▆█▇█▇█▇▇█▆▆▆▇▆▆▆▆▇▇███████▇█▇ █
  91.5 μs       Histogram: log(frequency) by time      97.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reset_x!($x) setup=with_batch(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  191.675 μs … 533.093 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     194.245 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   194.754 μs ±   3.916 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▃▅▇███▅▄▂                                             
  ▁▁▁▁▂▃▄▅███████████▆▅▄▃▂▂▂▂▂▂▂▂▁▂▂▂▂▂▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  192 μs           Histogram: frequency by time          203 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reset_x!($x) setup=with_thread(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  201.894 μs … 557.973 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     205.595 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   206.066 μs ±   6.377 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▁▃▅▆▆█▇██▅▆▄▂                                      
  ▁▁▁▁▁▂▂▃▄▅▆███████████████▇▅▅▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▁▁▁▁▁ ▃
  202 μs           Histogram: frequency by time          213 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Where would I report this to then? JuliaLang/julia?

chriselrod · 2023-05-30T14:53:37Z

It may be interesting to uncover why.
Possible ideas:

Threads take a while to go to sleep; they can hog system resources until then.
Clock speeds. Could mess with bios settings; I used to use the same speeds independent of the number of cores/disable boosting to mitigate this, but I believe I reverted to stock settings.
Even though the code is not executed, perhaps prefetchers still fetch the memory, and this causes some synchronization overhead?

You could experiment by playing with https://github.com/JuliaSIMD/ThreadingUtilities.jl/blob/e7f2f4ba725cf8862f42cb34b83916e3561c15f8/src/threadtasks.jl#L24
for Polyester, or
https://docs.julialang.org/en/v1/manual/environment-variables/#JULIA_THREAD_SLEEP_THRESHOLD
for Threads.@threads.
I tried setting it to 1 or 0, but this did not help.

efaulhaber · 2023-05-30T14:56:30Z

I tried this as well.

julia> @benchmark reset_x!($x) setup=(with_batch(x); ThreadingUtilities.sleep_all_tasks())
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  197.295 μs … 540.873 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     200.054 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   200.487 μs ±   3.949 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▂▅▆███▅▃▁                                           
  ▁▁▁▁▂▂▃▄▅▇███████████▅▅▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  197 μs           Histogram: frequency by time          208 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Wouldn't that rule out your first idea?

chriselrod · 2023-05-30T14:57:21Z

Wouldn't that rule out your first idea?

Yes.

chriselrod · 2023-05-30T14:58:18Z

You could try counting cpu cycles with cpucycle.

efaulhaber · 2023-05-31T12:40:33Z

I'll come back to this when I have more time. For now, I opened an issue in JuliaLang/julia.

chriselrod · 2023-05-31T13:14:59Z

Note that by default, Polyester only uses one thread per physical core.
You can change this (@batch per=thread).
You could also get @threads to only use one per core, by starting Julia with only one per core.

Those experiments could tell you if the Polyester.@batch vs Threads.@threads difference is only a function of the number of threads they're using.

efaulhaber mentioned this issue May 31, 2023

@threads slows down other non-threaded code JuliaLang/julia#50012

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`@batch` slows down other non-@batched code on x86 #110

`@batch` slows down other non-@batched code on x86 #110

efaulhaber commented May 30, 2023 •

edited

Loading

chriselrod commented May 30, 2023

efaulhaber commented May 30, 2023

chriselrod commented May 30, 2023

efaulhaber commented May 30, 2023

chriselrod commented May 30, 2023

chriselrod commented May 30, 2023

efaulhaber commented May 31, 2023

chriselrod commented May 31, 2023

@batch slows down other non-@batched code on x86 #110

@batch slows down other non-@batched code on x86 #110

Comments

efaulhaber commented May 30, 2023 • edited Loading

chriselrod commented May 30, 2023

efaulhaber commented May 30, 2023

chriselrod commented May 30, 2023

efaulhaber commented May 30, 2023

chriselrod commented May 30, 2023

chriselrod commented May 30, 2023

efaulhaber commented May 31, 2023

chriselrod commented May 31, 2023

`@batch` slows down other non-@batched code on x86 #110

`@batch` slows down other non-@batched code on x86 #110

efaulhaber commented May 30, 2023 •

edited

Loading