disable_polyester_threads #86

Merged 4 commits on Jul 20, 2022

Krastanov
Contributor

As mentioned on Discourse: https://discourse.julialang.org/t/nesting-threads-thread-and-polyester-batch-or-context-manager-to-limit-polyester-threads/84490/4

Benchmarks with and without it below:

using Base.Threads
using BenchmarkTools
using Revise
using Polyester

##

function inner(x,y,j)
    for i ∈ axes(x,1)
        y[i,j] = sin(x[i,j])
    end
end

function inner_polyester(x,y,j)
    @batch for i ∈ axes(x,1)
        y[i,j] = sin(x[i,j])
    end
end

function inner_thread(x,y,j)
    @threads for i ∈ axes(x,1)
        y[i,j] = sin(x[i,j])
    end
end

function sequential_sequential(x,y)
    for j ∈ axes(x,2)
        inner(x,y,j)
    end
end

function sequential_polyester(x,y)
    for j ∈ axes(x,2)
        inner_polyester(x,y,j)
    end
end

function sequential_thread(x,y)
    for j ∈ axes(x,2)
        inner_thread(x,y,j)
    end
end

function threads_of_polyester(x,y)
    @threads for j ∈ axes(x,2)
        inner_polyester(x,y,j)
    end
end

function threads_of_polyester_inner_disable(x,y)
    @threads for j ∈ axes(x,2)
        Polyester.disable_polyester_threads() do
            inner_polyester(x,y,j)
        end
    end
end

function threads_of_thread(x,y)
    @threads for j ∈ axes(x,2)
        inner_thread(x,y,j)
    end
end

function threads_of_sequential(x,y)
    @threads for j ∈ axes(x,2)
        inner(x,y,j)
    end
end

# Big inner problem, repeated only a few times

y = rand(10000000,4);
x = rand(size(y)...);

@btime inner($x,$y,1) # 73.319 ms (0 allocations: 0 bytes)
@btime inner_polyester($x,$y,1) # 8.936 ms (0 allocations: 0 bytes)
@btime inner_thread($x,$y,1) # 11.206 ms (49 allocations: 4.56 KiB)

@btime sequential_sequential($x,$y) # 274.926 ms (0 allocations: 0 bytes)
@btime sequential_polyester($x,$y) # 36.963 ms (0 allocations: 0 bytes)
@btime sequential_thread($x,$y) # 49.373 ms (196 allocations: 18.25 KiB)

@btime threads_of_polyester($x,$y) # 78.828 ms (58 allocations: 4.84 KiB)
@btime threads_of_polyester_inner_disable($x,$y) # 70.182 ms (47 allocations: 4.50 KiB)
@btime Polyester.disable_polyester_threads() do; threads_of_polyester($x,$y) end; # 71.141 ms (47 allocations: 4.50 KiB)
@btime threads_of_sequential($x,$y) # 70.857 ms (46 allocations: 4.47 KiB)
@btime threads_of_thread($x,$y) # 45.116 ms (219 allocations: 22.00 KiB)

# Small inner problem, repeated many times

y = rand(1000,1000);
x = rand(size(y)...);

@btime inner($x,$y,1) # 7.028 μs (0 allocations: 0 bytes)
@btime inner_polyester($x,$y,1) # 1.917 μs (0 allocations: 0 bytes)
@btime inner_thread($x,$y,1) # 7.544 μs (45 allocations: 4.44 KiB)

@btime sequential_sequential($x,$y) # 6.790 ms (0 allocations: 0 bytes)
@btime sequential_polyester($x,$y) # 2.070 ms (0 allocations: 0 bytes)
@btime sequential_thread($x,$y) # 9.296 ms (49002 allocations: 4.46 MiB)

@btime threads_of_polyester($x,$y) # 2.090 ms (42 allocations: 4.34 KiB)
@btime threads_of_polyester_inner_disable($x,$y) # 1.065 ms (42 allocations: 4.34 KiB)
@btime Polyester.disable_polyester_threads() do; threads_of_polyester($x,$y) end; # 997.918 μs (49 allocations: 4.56 KiB)
@btime threads_of_sequential($x,$y) # 1.057 ms (48 allocations: 4.53 KiB)
@btime threads_of_thread($x,$y) # 4.105 ms (42059 allocations: 4.25 MiB)


codecov bot commented Jul 20, 2022

Codecov Report

Merging #86 (36aacb8) into master (259c20f) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master      #86   +/-   ##
=======================================
  Coverage   88.46%   88.46%           
=======================================
  Files           2        2           
  Lines         416      416           
=======================================
  Hits          368      368           
  Misses         48       48           


src/utility.jl Outdated
    t, r = request_threads(num_threads())
    f()
    foreach(free_threads!, r)
end
Member

Perhaps this should be

    t, r = request_threads(num_threads())
    try
        f()
    finally
        foreach(free_threads!, r)
    end

Also, maybe this should be in PolyesterWeave itself?
It could still be re-exported here.

This should also document that it turns off threading for LoopVectorization.@tturbo and Octavian.matmul, even though neither depends on Polyester (both do depend on PolyesterWeave).

Do benchmark my suggestion though.
If it is slower, feel free to ignore it.

The advantage of my suggestion is that, if your code throws an error, it'll still free the threads. Otherwise, they'll be turned off until you manually call PolyesterWeave.reset_workers!().
Polyester.@batch doesn't use try/finally either, meaning it requires the manual calls after you get errors.
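
To illustrate what the try/finally buys, here is a minimal sketch that does not touch PolyesterWeave at all. The `pool` counter and both `reserve_all*` functions are hypothetical stand-ins for the real thread bookkeeping; only the control flow is the point.

```julia
# Hypothetical stand-in for a thread pool: a counter of free workers.
# None of these names come from Polyester or PolyesterWeave.

# Reserve every worker, run f, and hand the workers back no matter what.
function reserve_all(f, pool::Ref{Int})
    reserved = pool[]
    pool[] = 0
    try
        return f()
    finally
        pool[] += reserved   # runs on normal return and on error
    end
end

# Same bookkeeping without try/finally: an error in f skips the release.
function reserve_all_unsafe(f, pool::Ref{Int})
    reserved = pool[]
    pool[] = 0
    result = f()
    pool[] += reserved       # never reached if f throws
    return result
end

safe_pool = Ref(4)
try
    reserve_all(() -> error("boom"), safe_pool)
catch
end
println(safe_pool[])    # workers were released despite the error

leaky_pool = Ref(4)
try
    reserve_all_unsafe(() -> error("boom"), leaky_pool)
catch
end
println(leaky_pool[])   # workers are still reserved
```

In the unsafe variant the counter stays at zero after the error, which mirrors the situation described above where threads remain off until they are reset manually.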

Contributor Author

Will do:

  • documenting disabling threads in tturbo and matmul
  • checking whether a try block is much slower
  • moving to PolyesterWeave after the above is done

Member

After moving to PolyesterWeave, you can of course import and reexport in this PR.

Contributor Author

The benchmarks did not show any slowdown with a try block (especially and obviously if the disable command is executed once in the outermost scope).

Comment on lines +544 to +548
        Polyester.disable_polyester_threads() do
            inner_polyester(x,y,j)
        end
    end
Member

Suggested change
-    @threads for j ∈ axes(x,2)
-        Polyester.disable_polyester_threads() do
-            inner_polyester(x,y,j)
-        end
-    end
+    Polyester.disable_polyester_threads() do
+        @threads for j ∈ axes(x,2)
+            inner_polyester(x,y,j)
+        end
+    end

May as well hoist the disabling and re-enabling out of the loop, to avoid doing atomic operations to the same address in parallel.
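
A minimal sketch of why hoisting helps, assuming the disable and re-enable steps each cost one atomic update on shared state. The `toggles` counter and the `with_disabled` helper are hypothetical and only count those updates; they are not Polyester's implementation.

```julia
using Base.Threads

# Hypothetical helper: counts one atomic update for "disable" and one for
# "re-enable" around the call to f. Not part of Polyester.
function with_disabled(f, toggles::Atomic{Int})
    atomic_add!(toggles, 1)       # stands in for disabling
    try
        return f()
    finally
        atomic_add!(toggles, 1)   # stands in for re-enabling
    end
end

# Toggling inside the loop: every iteration hits the shared counter twice,
# possibly from several threads at once.
function toggle_inside(n, toggles)
    @threads for j in 1:n
        with_disabled(() -> nothing, toggles)
    end
    return toggles[]
end

# Toggling hoisted out of the loop: two updates in total.
function toggle_outside(n, toggles)
    with_disabled(toggles) do
        @threads for j in 1:n
            nothing
        end
    end
    return toggles[]
end

println(toggle_inside(1000, Atomic{Int}(0)))    # 2 * n updates
println(toggle_outside(1000, Atomic{Int}(0)))   # 2 updates
```

The count of updates grows with the loop length in the first form and stays constant in the second, which is the contention the hoisted version avoids.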

Member

Although that is effectively what you're benchmarking with Polyester.disable_polyester_threads() do; threads_of_polyester($x,$y) end;, my concern is that people are just going to copy/paste the examples, which means they'll likely use this slightly worse form, because it is there.

Contributor Author

Noted! The two versions (inside and outside the loop) were given exactly so that they can be compared. I will add a note to the readme advising people to use it outside the loop.

@btime Polyester.disable_polyester_threads() do; threads_of_polyester($x,$y) end; # 71.141 ms (47 allocations: 4.50 KiB)
@btime threads_of_sequential($x,$y) # 70.857 ms (46 allocations: 4.47 KiB)
@btime threads_of_thread($x,$y) # 45.116 ms (219 allocations: 22.00 KiB)
Member

I'm guessing this is fastest because you have more than 4 threads?

Contributor Author

Yes. I thought it was a reasonable edge case to show.

function inner_thread(x,y,j)
    @threads for i ∈ axes(x,1)
        y[i,j] = sin(x[i,j])
Member

Obviously just an example, but @tturbo should win these benchmarks (and would also be single threaded from this!). ;)

Contributor Author

Indeed! I will mention it as a caveat in the benchmarks readme.

@Krastanov
Contributor Author

The tests will probably fail because they depend on the new PolyesterWeave. See JuliaSIMD/PolyesterWeave.jl#4

@chriselrod
Member

Great, thanks!
