Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

disable_polyester_threads #86

Merged
merged 4 commits into from
Jul 20, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions Project.toml
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
name = "Polyester"
uuid = "f517fe37-dbe3-4b94-8317-1923a5111588"
authors = ["Chris Elrod <elrodc@gmail.com> and contributors"]
version = "0.6.13"
version = "0.6.14"

[deps]
ArrayInterface = "4fba245c-0d91-5ea0-9b3e-6abc04ee57a9"
Expand All @@ -21,7 +21,7 @@ BitTwiddlingConvenienceFunctions = "0.1"
CPUSummary = "0.1.2 - 0.1.8, 0.1.11"
IfElse = "0.1"
ManualMemory = "0.1.3"
PolyesterWeave = "0.1"
PolyesterWeave = "0.1.7"
Requires = "1"
Static = "0.7"
StrideArraysCore = "0.3.11"
Expand Down
165 changes: 164 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -413,6 +413,8 @@ Note that `@batch` defaults to using up to one thread per physical core, instead
is because [LoopVectorization.jl](https://github.com/JuliaSIMD/LoopVectorization.jl) currently only uses up to 1 thread per physical core, and switching the number of
threads incurs some overhead. See the docstring on `@batch` (i.e., `?@batch` in a Julia REPL) for some more discussion.

## Local per-thread storage

You also can define local storage for each thread, providing a vector containing each of the local storages at the end.

```julia
Expand Down Expand Up @@ -446,4 +448,165 @@ julia> let
end

Float16[83.0, 90.0, 27.0, 65.0]
```
```

## Disabling Polyester threads

When running many repetitions of a Polyester-multithreaded function (e.g. in an embarrassingly parallel problem that repeatedly executes a small already Polyester-multithreaded function), it can be beneficial to disable Polyester (the inner multithreaded loop) and multithread only at the outer level (e.g. with `Base.Threads`). This can be done with the `disable_polyester_threads` context manager. In the expandable section below you can see examples with benchmarks.

It is best to call `disable_polyester_threads` only once, before any `@thread` uses happen, to avoid overhead. E.g. best to do it as:
```julia
disable_polyester_threads() do
@threads for i in 1:n
f()
end
end
```
instead of doing it in the following unnecessarily slow manner:
```julia
@threads for i in 1:n # DO NOT DO THIS
disable_polyester_threads() do # IT HAS UNNECESSARY OVERHEAD
f()
end
end
```


<details>
<summary>Benchmarks of nested multi-threading with Polyester</summary>

```julia
# Big inner problem, repeated only a few times

y = rand(10000000,4);
x = rand(size(y)...);

@btime inner($x,$y,1) # 73.319 ms (0 allocations: 0 bytes)
@btime inner_polyester($x,$y,1) # 8.936 ms (0 allocations: 0 bytes)
@btime inner_thread($x,$y,1) # 11.206 ms (49 allocations: 4.56 KiB)

@btime sequential_sequential($x,$y) # 274.926 ms (0 allocations: 0 bytes)
@btime sequential_polyester($x,$y) # 36.963 ms (0 allocations: 0 bytes)
@btime sequential_thread($x,$y) # 49.373 ms (196 allocations: 18.25 KiB)

@btime threads_of_polyester($x,$y) # 78.828 ms (58 allocations: 4.84 KiB)
# the following is a purposefully suboptimal way to disable threads
@btime threads_of_polyester_inner_disable($x,$y) # 70.182 ms (47 allocations: 4.50 KiB)
# the following is a good way to disable threads (the disable call happening once in the outer scope)
@btime Polyester.disable_polyester_threads() do; threads_of_polyester($x,$y) end; # 71.141 ms (47 allocations: 4.50 KiB)
@btime threads_of_sequential($x,$y) # 70.857 ms (46 allocations: 4.47 KiB)
@btime threads_of_thread($x,$y) # 45.116 ms (219 allocations: 22.00 KiB)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing this is fastest because you have more than 4 threads?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I thought it is a reasonable edge case to show.


# Small inner problem, repeated many times

y = rand(1000,1000);
x = rand(size(y)...);

@btime inner($x,$y,1) # 7.028 μs (0 allocations: 0 bytes)
@btime inner_polyester($x,$y,1) # 1.917 μs (0 allocations: 0 bytes)
@btime inner_thread($x,$y,1) # 7.544 μs (45 allocations: 4.44 KiB)

@btime sequential_sequential($x,$y) # 6.790 ms (0 allocations: 0 bytes)
@btime sequential_polyester($x,$y) # 2.070 ms (0 allocations: 0 bytes)
@btime sequential_thread($x,$y) # 9.296 ms (49002 allocations: 4.46 MiB)

@btime threads_of_polyester($x,$y) # 2.090 ms (42 allocations: 4.34 KiB)
# the following is a purposefully suboptimal way to disable threads
@btime threads_of_polyester_inner_disable($x,$y) # 1.065 ms (42 allocations: 4.34 KiB)
# the following is a good way to disable threads (the disable call happening once in the outer scope)
@btime Polyester.disable_polyester_threads() do; threads_of_polyester($x,$y) end; # 997.918 μs (49 allocations: 4.56 KiB)
@btime threads_of_sequential($x,$y) # 1.057 ms (48 allocations: 4.53 KiB)
@btime threads_of_thread($x,$y) # 4.105 ms (42059 allocations: 4.25 MiB)

# The tested functions
# All of these would be better implemented by just using @tturbo,
# but these suboptimal implementations serve as good test case for
# Polyster-vs-Base thread scheduling.

function inner(x,y,j)
for i ∈ axes(x,1)
y[i,j] = sin(x[i,j])
end
end

function inner_polyester(x,y,j)
@batch for i ∈ axes(x,1)
y[i,j] = sin(x[i,j])
end
end

function inner_thread(x,y,j)
@threads for i ∈ axes(x,1)
y[i,j] = sin(x[i,j])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obviously just an example, but @tturbo should win these benchmarks (and would also be single threaded from this!). ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed! I will mention it as a caveat in the benchmarks readme.

end
end

function sequential_sequential(x,y)
for j ∈ axes(x,2)
inner(x,y,j)
end
end

function sequential_polyester(x,y)
for j ∈ axes(x,2)
inner_polyester(x,y,j)
end
end

function sequential_thread(x,y)
for j ∈ axes(x,2)
inner_thread(x,y,j)
end
end

function threads_of_polyester(x,y)
@threads for j ∈ axes(x,2)
inner_polyester(x,y,j)
end
end

function threads_of_polyester_inner_disable(x,y)
# XXX This is a bad way to disable Polyester threads as
# it causes unnecessary overhead for each @threads thread.
# See the benchmarks above for a better way.
@threads for j ∈ axes(x,2)
Polyester.disable_polyester_threads() do
inner_polyester(x,y,j)
end
end
Comment on lines +572 to +576
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@threads for j ∈ axes(x,2)
Polyester.disable_polyester_threads() do
inner_polyester(x,y,j)
end
end
Polyester.disable_polyester_threads() do
@threads for j ∈ axes(x,2)
inner_polyester(x,y,j)
end
end

May as well hoist the disabling and re-enabling out of the loop, to avoid doing atomic operations to the same address in parallel.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although that is effectively what you're benchmarking with Polyester.disable_polyester_threads() do; threads_of_polyester($x,$y) end;, my concern is that people are just going to copy/paste the examples, which means they'll likely use this slightly worse form, because it is there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted! The two versions (inside and outside the loop) were given exactly so that this can be compared. I will add a note to the readme to warn people about using it outside the loop.

end

function threads_of_thread(x,y)
@threads for j ∈ axes(x,2)
inner_thread(x,y,j)
end
end

function threads_of_thread(x,y)
@threads for j ∈ axes(x,2)
inner_thread(x,y,j)
end
end

function threads_of_sequential(x,y)
@threads for j ∈ axes(x,2)
inner(x,y,j)
end
end
```
Benchmarks executed on:
```
Julia Version 1.9.0-DEV.998
Commit e1739aa42a1 (2022-07-18 10:27 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 16 × AMD Ryzen 7 1700 Eight-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.5 (ORCJIT, znver1)
Threads: 8 on 16 virtual cores
Environment:
JULIA_EDITOR = code
JULIA_NUM_THREADS = 8
```
</details>
5 changes: 3 additions & 2 deletions src/Polyester.jl
Original file line number Diff line number Diff line change
Expand Up @@ -8,10 +8,11 @@ using ManualMemory: Reference
using Static
using Requires
using PolyesterWeave:
request_threads, free_threads!, mask, UnsignedIteratorEarlyStop, assume
request_threads, free_threads!, mask, UnsignedIteratorEarlyStop, assume,
disable_polyester_threads
using CPUSummary: num_threads, num_cores

export batch, @batch, num_threads
export batch, @batch, num_threads, disable_polyester_threads


include("batch.jl")
Expand Down
94 changes: 92 additions & 2 deletions test/runtests.jl
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@ println(
"Starting tests with $(Threads.nthreads()) threads out of `Sys.CPU_THREADS = $(Sys.CPU_THREADS)`...",
)
using Polyester, Aqua, ForwardDiff
using Base.Threads: @threads
using Test

function bsin!(y, x, r = eachindex(y, x))
Expand Down Expand Up @@ -395,10 +396,10 @@ end
# issue 78 (lack of support for keyword arguments using only variable names without `=`)

f(a; b=10.0, c=100.0) = a + b + c

buf = [0, 0]
b = 0.0

Threads.nthreads() == 1 && println("the issue arises only on multithreading runs")

@batch for i in 1:2
Expand All @@ -408,6 +409,95 @@ end
@test buf == [1, 2]
end

@testset "disable_polyester_threads" begin
function inner(x,y,j)
for i ∈ axes(x,1)
y[i,j] = sin(x[i,j])
end
end

function inner_polyester(x,y,j)
@batch for i ∈ axes(x,1)
y[i,j] = sin(x[i,j])
end
end

function inner_thread(x,y,j)
@threads for i ∈ axes(x,1)
y[i,j] = sin(x[i,j])
end
end

function sequential_sequential(x,y)
for j ∈ axes(x,2)
inner(x,y,j)
end
end

function sequential_polyester(x,y)
for j ∈ axes(x,2)
inner_polyester(x,y,j)
end
end

function sequential_thread(x,y)
for j ∈ axes(x,2)
inner_thread(x,y,j)
end
end

function threads_of_polyester(x,y)
@threads for j ∈ axes(x,2)
inner_polyester(x,y,j)
end
end

function threads_of_polyester_inner_disable(x,y)
@threads for j ∈ axes(x,2)
Polyester.disable_polyester_threads() do
inner_polyester(x,y,j)
end
end
end

function threads_of_thread(x,y)
@threads for j ∈ axes(x,2)
inner_thread(x,y,j)
end
end

function threads_of_sequential(x,y)
@threads for j ∈ axes(x,2)
inner(x,y,j)
end
end

y = rand(10,10); # (size of inner problem, size of outer problem)
x = rand(size(y)...);
inner(x,y,1)
good_y = copy(y)
inner_polyester(x,y,1)
@assert good_y == y
inner_thread(x,y,1)
@assert good_y == y
sequential_sequential(x,y)
good_y = copy(y)
sequential_polyester(x,y)
@assert good_y == y
sequential_thread(x,y)
@assert good_y == y
threads_of_polyester(x,y)
@assert good_y == y
threads_of_polyester_inner_disable(x,y)
@assert good_y == y
disable_polyester_threads() do; threads_of_polyester(x,y) end
@assert good_y == y
threads_of_sequential(x,y)
@assert good_y == y
threads_of_thread(x,y)
@assert good_y == y
end

if VERSION ≥ v"1.6"
println("Package tests complete. Running `Aqua` checks.")
Aqua.test_all(Polyester)
Expand Down