Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

@batch slows down other non-@batched code on x86 #110

Open
efaulhaber opened this issue May 30, 2023 · 8 comments
Open

@batch slows down other non-@batched code on x86 #110

efaulhaber opened this issue May 30, 2023 · 8 comments

Comments

@efaulhaber
Copy link

efaulhaber commented May 30, 2023

Consider the following functions.

using Polyester

function reset_x!(x)
    x .= 0
end

function without_batch(x)
    for i in 1:100_000
        if x[i] != 0
            # This is never executed, x contains only zeros
            sin(x[i])
        end
    end
end

function with_batch(x)
    @batch for i in 1:100_000
        if x[i] != 0
            # This is never executed, x contains only zeros
            sin(x[i])
        end
    end
end

Running with_batch between reset_x! calls seems to somehow slow down reset_x! significantly:

julia> x = zeros(Int, 1_000_000);

julia> using BenchmarkTools

julia> @benchmark reset_x!($x) setup=without_batch(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  92.173 μs … 426.870 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     94.362 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   94.776 μs ±   3.789 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▂▄▅▆▇██████▇▇▆▅▅▄▃▂▂▂ ▁▁▁  ▁     ▁▁▁▁▁▁▁                ▃
  ▄▄▄▆███████████████████████████▇██▇▇█▇████████████▇▆▆▇▇▆▆▆▆▆ █
  92.2 μs       Histogram: log(frequency) by time       101 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reset_x!($x) setup=with_batch(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  185.404 μs … 518.292 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     188.394 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   189.119 μs ±   4.086 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▅▇█▇▇▇▅▅▃▂                                             
  ▁▁▁▂▃▆▇█████████████▇▆▅▅▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  185 μs           Histogram: frequency by time          198 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 128 × AMD Ryzen Threadripper 3990X 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 8 on 128 virtual cores

I also tested and reproduced this on an Intel i9-10980XE, but the difference there was only ~10% on 8 threads.

CC: @sloede @ranocha @LasNikas.

@chriselrod
Copy link
Member

I see the same thing with Threads.@threads.

julia> using Polyester

julia> function reset_x!(x)
           x .= 0
       end
reset_x! (generic function with 1 method)

julia> function without_batch(x)
           for i in 1:100_000
               if x[i] != 0
                   # This is never executed, x contains only zeros
                   sin(x[i])
               end
           end
       end
without_batch (generic function with 1 method)

julia> function with_batch(x)
           @batch for i in 1:100_000
               if x[i] != 0
                   # This is never executed, x contains only zeros
                   sin(x[i])
               end
           end
       end
with_batch (generic function with 1 method)

julia> function with_thread(x)
           Threads.@threads for i in 1:100_000
               if x[i] != 0
                   # This is never executed, x contains only zeros
                   sin(x[i])
               end
           end
       end
with_thread (generic function with 1 method)

julia> 

julia> x = zeros(Int, 1_000_000);

julia> using BenchmarkTools

julia> @benchmark reset_x!($x) setup=without_batch(x)
BenchmarkTools.Trial: 2881 samples with 1 evaluation.
 Range (min  max):  272.081 μs  319.111 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     273.518 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   273.863 μs ±   2.094 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▅▇▂▄▄██▄▅▅▇▅▁▃▃▂                                              
  ▆████████████████▆▆▇▆▇▇▆▅█▇▅▄▄▃▃▂▂▁▂▁▁▁▂▂▁▂▁▁▁▂▂▁▂▂▁▁▁▁▁▂▂▂▁▂ ▄
  272 μs           Histogram: frequency by time          280 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reset_x!($x) setup=with_batch(x)
BenchmarkTools.Trial: 2809 samples with 1 evaluation.
 Range (min  max):  312.843 μs  350.793 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     316.288 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   316.484 μs ±   2.223 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁▂      ▂▄▄▆▅▅▄▅▅▅▅▇▇▅▇███▆█▆▄▁▁▃▁▂▂▁▂ ▂ ▁ ▁ ▁               
  ▃███▆▅▆▆▄▆██████████████████████████████████████▆█▇▄▄▃▃▃▂▃▄▄▂ ▆
  313 μs           Histogram: frequency by time          321 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reset_x!($x) setup=with_thread(x)
BenchmarkTools.Trial: 2063 samples with 1 evaluation.
 Range (min  max):  313.300 μs  392.157 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     359.390 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   358.813 μs ±  10.205 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                    ▂▄▄▄█▄▅▄▇▅▅▄▃▂▂ ▁            
  ▂▃▁▁▂▂▂▂▁▃▂▁▂▂▂▂▃▂▂▃▂▃▃▃▄▃▃▄▄▅▅▇███████████████████▆▆▅▄▄▃▃▃▃▃ ▄
  313 μs           Histogram: frequency by time          381 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

@efaulhaber
Copy link
Author

You're right. It's even worse for me:

julia> @benchmark reset_x!($x) setup=without_batch(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  91.472 μs … 414.889 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     92.852 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   93.118 μs ±   3.508 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁▃▃▅▅▆▆▇▇████▇▇▆▅▅▄▃▃▂▂▁▁                            ▁     ▃
  ▆████████████████████████████▇▆▆█▇█▇█▇▇█▆▆▆▇▆▆▆▆▇▇███████▇█▇ █
  91.5 μs       Histogram: log(frequency) by time      97.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reset_x!($x) setup=with_batch(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  191.675 μs … 533.093 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     194.245 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   194.754 μs ±   3.916 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▃▅▇███▅▄▂                                             
  ▁▁▁▁▂▃▄▅███████████▆▅▄▃▂▂▂▂▂▂▂▂▁▂▂▂▂▂▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  192 μs           Histogram: frequency by time          203 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reset_x!($x) setup=with_thread(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  201.894 μs … 557.973 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     205.595 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   206.066 μs ±   6.377 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▁▃▅▆▆█▇██▅▆▄▂                                      
  ▁▁▁▁▁▂▂▃▄▅▆███████████████▇▅▅▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▁▁▁▁▁ ▃
  202 μs           Histogram: frequency by time          213 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Where would I report this to then? JuliaLang/julia?

@chriselrod
Copy link
Member

It may be interesting to uncover why.
Possible ideas:

  1. Threads take a while to go to sleep; they can hog system resources until then.
  2. Clock speeds. Could mess with bios settings; I used to use the same speeds independent of the number of cores/disable boosting to mitigate this, but I believe I reverted to stock settings.
  3. Even though the code is not executed, perhaps prefetchers still fetch the memory, and this causes some synchronization overhead?

You could experiment by playing with https://github.com/JuliaSIMD/ThreadingUtilities.jl/blob/e7f2f4ba725cf8862f42cb34b83916e3561c15f8/src/threadtasks.jl#L24
for Polyester, or
https://docs.julialang.org/en/v1/manual/environment-variables/#JULIA_THREAD_SLEEP_THRESHOLD
for Threads.@threads.
I tried setting it to 1 or 0, but this did not help.

@efaulhaber
Copy link
Author

I tried this as well.

julia> @benchmark reset_x!($x) setup=(with_batch(x); ThreadingUtilities.sleep_all_tasks())
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  197.295 μs … 540.873 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     200.054 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   200.487 μs ±   3.949 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▂▅▆███▅▃▁                                           
  ▁▁▁▁▂▂▃▄▅▇███████████▅▅▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  197 μs           Histogram: frequency by time          208 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Wouldn't that rule out your first idea?

@chriselrod
Copy link
Member

Wouldn't that rule out your first idea?

Yes.

@chriselrod
Copy link
Member

You could try counting cpu cycles with cpucycle.

@efaulhaber
Copy link
Author

I'll come back to this when I have more time. For now, I opened an issue in JuliaLang/julia.

@chriselrod
Copy link
Member

Note that by default, Polyester only uses one thread per physical core.
You can change this (@batch per=thread).
You could also get @threads to only use one per core, by starting Julia with only one per core.

Those experiments could tell you if the Polyester.@batch vs Threads.@threads difference is only a function of the number of threads they're using.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants