Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

@batch uses only the first 64 threads #83

Open
Keluaa opened this issue Jun 29, 2022 · 9 comments
Open

@batch uses only the first 64 threads #83

Keluaa opened this issue Jun 29, 2022 · 9 comments

Comments

@Keluaa
Copy link
Contributor

Keluaa commented Jun 29, 2022

I am currently working with a 128 cores (64 cores x 2 sockets) CPU, and I noticed that Polyester.jl will only use the first 64 threads available :

julia> using Polyester

julia> Threads.nthreads()
128

julia> tids = zeros(Int, Threads.nthreads());

julia> @batch for _ in 1:Threads.nthreads()
           tids[Threads.threadid()] += 1
       end

julia> all(tids .== 1)
false

julia> [tids[1:64]' ; tids[65:end]']
2x64 Matrix{Int64}
2  2  2  2  2  2  2  2 ... 2  2  2  2  2  2  2  2
0  0  0  0  0  0  0  0 ... 0  0  0  0  0  0  0  0

julia> tids .= 0;

julia> Threads.@threads for _ in Threads.nthreads()
           tids[Threads.threadid()] += 1
       end

julia> all(tids .== 1)
true

julia> [tids[1:64]' ; tids[65:end]']
2x64 Matrix{Int64}
1  1  1  1  1  1  1  1 ... 1  1  1  1  1  1  1  1
1  1  1  1  1  1  1  1 ... 1  1  1  1  1  1  1  1

Notice how the threads 65 to 128 were ignored by @batch but not by Threads.@threads.
I get similar results with hyperthreading with 256 threads.

I am quite sure that this also affects LoopVectorization.jl since I am getting the same time for some very simple benchmarks :

julia> a = rand(Float64, 10000); b = rand(Float64, 10000); c = rand(Float64, 10000);

julia> @btime @tturbo for i in 1:10000
           c[i] = a[i] * b[i]
       end

Gives ~3.0 µs for 64 or 128 threads.

I suppose that it is related to the behaviors mentioned in #22. In any case I would be happy to help resolving this issue.

@chriselrod
Copy link
Member

Hmm.
What do you get for

using Polyester
Polyester.num_threads()
length(Polyester.ThreadingUtilities.TASKS)

?

@chriselrod
Copy link
Member

The two most likely places this can be going wrong:

  1. Multiple sockets not recognized, so if you have 2x 64 cores, it thinks you only have 1x 64 cores. That's why I wanted to see Polyester.num_threads() above.
  2. PolyesterWeave uses 64-bit masks to indicate which ThreadingUtilities.TASKS are available for use. If you have >64 TASKS, it needs to use a tuple of such masks. Maybe it's only using the first element of this tuple.

@chriselrod
Copy link
Member

chriselrod commented Jun 29, 2022

Sorry about repeatedly closing this issue. Apparently C-ret to comment now chooses "Close with comment" instead of "Comment"?

@Keluaa
Copy link
Contributor Author

Keluaa commented Jun 30, 2022

I get :

julia> using Polyester

julia> Polyester.num_threads()
static(128)

julia> length(Polyester.ThreadingUtilities.TASKS)
127

So it seems that it correctly detects the number of threads.
I also tried it with 64 threads and I got static(64) and 63. With hyperthreading enabled I get static(256) and 255.
Everything seems to be okay here.

@Keluaa
Copy link
Contributor Author

Keluaa commented Aug 2, 2022

I finally found some time to try to debug this issue.

The source of the problem seems to be the function free_threads! of PolyesterWeave, as it operates only on the first element of the worker bit masks tuple. Therefore at the first call of @batch, all 128 threads are running, but only the first 64 are freed, and so the next calls will only use the first 64 threads available.

I also spotted some weird behaviors caused by this line: since Nr is unsigned, if it is less than nthread (which is always is, if I understood the code correctly) it will overflow, making the next threads iterate over a bigger range than they should because of this line.

Should I create pull requests for my fixes?

@chriselrod
Copy link
Member

The source of the problem seems to be the function free_threads! of PolyesterWeave, as it operates only on the first element of the worker bit masks tuple.

It is called in a loop here:

Polyester.jl/src/batch.jl

Lines 159 to 172 in 2926208

for (threadmask, nthread, torelease)
zip(threadmask_tuple, nthread_tuple, torelease_tuple)
tm = mask(UnsignedIteratorEarlyStop(threadmask, nthread))
tid = 0x00000000
while tm zero(tm)
# assume(tm ≠ zero(tm))
tz = trailing_zeros(tm) % UInt32
tz += 0x00000001
tm >>>= tz
tid += tz
# @show tid, ThreadingUtilities._atomic_state(tid)
ThreadingUtilities.wait(tid)
end
free_threads!(torelease)

I also spotted some weird behaviors caused by this line: since Nr is unsigned,

Yeah, that definitely looks like a bug/like I wrote it assuming Nr was signed. I guess just cast it into a signed integeger, Nr%Int.
PRs welcome!

@Keluaa
Copy link
Contributor Author

Keluaa commented Aug 2, 2022

I just tried to run the original code where I first encountered this issue, and my fixes didn't work. Turns out that somehow I was using the per=thread option while debugging (at least that's what I think happened), and without it the number of threads is limited by the number of cores, but Polyester.num_cores() == 64 instead of 128.

The node has a CPU (AMD EPYC 7H12) with 2 sockets, each with 64 cores on them, but because of hyperthreading there is 2 threads per core. Therefore Polyester.num_threads() should return 256 and Polyester.num_cores() should return 128, meaning that the second socket is not recognized, in contrary to what I said in my first reply.

I am on a x86_64 architecture, and by looking at CPUSummary.jl, it seems that the number of sockets is hardcoded, and since CpuId.cpucores() only returns the number of cores for the current socket (as per the docs), we get 64 cores...

It is called in a loop here:

But the problem is in the function itself:

https://github.com/JuliaSIMD/PolyesterWeave.jl/blob/3eb4d66fffe82c2097e142ff8dc3ad008e504ea3/src/request.jl#L20-L23

It only considers the first element of the bit mask tuple since worker_pointer() always returns the same value. Therefore the loop applies the masks torelease to the same worker mask repeatedly.

@chriselrod
Copy link
Member

At one point in time, it was important to support WINE, but I believe this is no longer the case.
using Hwloc causes Julia to crash on WINE, which meant that CPUSummary had to switch from Hwloc.jl (which can correctly get the number of sockets) to CpuId.jl, for which I do not know how to do that.

We can switch back. Feel free to make a PR reverting main to version 0.1.8 or so.

@chriselrod
Copy link
Member

since worker_pointer() always returns the same value. Therefore the loop applies the masks torelease to the same worker mask repeatedly.

Ah, you're right. That's definitely a bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants