Threading #15

Closed
antoine-levitt opened this issue Aug 5, 2019 · 26 comments
Labels
performance Performance regression or performance-related

Comments

@antoine-levitt
Member

New developments:
https://julialang.org/blog/2019/07/multithreading
JuliaMath/FFTW.jl#105
JuliaLang/julia#32786
There's also the Strided.jl package, whose @strided macro parallelizes broadcasts over threads.
Note that there's significant overhead for now (https://discourse.julialang.org/t/multithreaded-broadcast/26786), which appears to be a known issue that should improve in the future: JuliaLang/julia#32701 (comment)

So it looks like the preferred model will be that Julia's scheduler handles all the threading, and the underlying libraries use Julia's threads. Essentially this means we will be able to just set JULIA_NUM_THREADS and get threaded FFT/BLAS from there. If we find out that this is too fine-grained to yield good speedup, we can add explicit annotations (e.g. @threads on the loop over bands for the Hamiltonian application, or @strided on selected time-intensive broadcasts), and that should work fine.
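
To make the "explicit annotations" concrete, a rough sketch (not DFTK code; the sizes, the array names and the local-potential stand-in V are made up for illustration) of threading the band loop with @threads and parallelizing a broadcast with Strided.jl's @strided:

  using Base.Threads
  using Strided

  # Hypothetical sketch: nbasis, nbands, ψ, Hψ and V are placeholders, not DFTK API.
  nbasis, nbands = 400_000, 40
  ψ  = randn(ComplexF64, nbasis, nbands)
  Hψ = similar(ψ)
  V  = randn(nbasis)                 # stand-in for a local potential

  @threads for n in 1:nbands         # one band per Julia thread
      @views Hψ[:, n] .= V .* ψ[:, n]
  end

  ρ = zeros(nbasis)
  @strided ρ .= abs2.(ψ[:, 1])       # Strided.jl runs this broadcast on all threads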

@mfherbst added the performance label on Aug 5, 2019
@mfherbst
Member

mfherbst commented Aug 5, 2019

I agree: the new partr framework seems to be the direction people are heading for threading support in the lower-level libraries as well, so it only makes sense to follow along with it, especially since users of our code could do all sorts of things on top. Regarding @strided: I think that will really only be helpful in a few places (e.g. in the application of the non-local projectors) where a lot of classical array operations happen on all the bands at once. We'll have to benchmark, of course.

@antoine-levitt
Member Author

So, I did some very basic experiments. For a system with 400,000 plane waves, FFTW's own threading doesn't seem to do much: setting both FFTW and BLAS threads to the number of cores on my computer gave me only a 20% speedup. So we should either do #9 or do our own threading.

@mfherbst
Member

mfherbst commented Dec 1, 2019

Hmm, 20% is surprisingly little, but maybe I misunderstand what you did.

Could you perhaps commit a small benchmark script? I think it would be good to have a few "benchmark cases" or to integrate with https://github.com/JuliaCI/PkgBenchmark.jl so that one can track performance better. What do you think?

@antoine-levitt
Member Author

That's set_num_threads for both FFTW and BLAS, set to the max number of cores vs 1. Benchmarking is easy: take any example and make it bigger (e.g. set a supercell). I don't think we need to set up performance tracking, because essentially the only thing that matters right now is how we do the FFTs and how many of them we do, which is simpler to track by hand. The top priorities right now are the convergence criteria for the eigensolver (we do way too many iterations per SCF step; by comparison, abinit by default does 8 in the first two iterations and then 4), and batching/threading the FFTs.
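
Roughly, the toggle described above amounts to something like the following (a sketch, not our actual setup code; Sys.CPU_THREADS may count hyperthreads, so adjust to physical cores if needed):

  using FFTW, LinearAlgebra

  ncores = Sys.CPU_THREADS          # or the number of physical cores
  FFTW.set_num_threads(ncores)      # only affects plans created after this call
  BLAS.set_num_threads(ncores)
  # ... run the SCF and compare against FFTW.set_num_threads(1) / BLAS.set_num_threads(1)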

@antoine-levitt
Member Author

@mfherbst can you try the following benchmarking script on the machine you have? https://gist.github.com/antoine-levitt/88086895dd98f746d6c795c99a10fd9f

Here I get

4 threads
N=128, M=40
Single FFT: no threads
  26.611 ms (0 allocations: 0 bytes)
Single FFT: threads
  15.158 ms (78 allocations: 6.66 KiB)
Multiple FFTs: manual, no threads
  1.080 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
  631.769 ms (112 allocations: 8.06 KiB)
Multiple FFTs: auto, no threads
  1.083 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
  696.880 ms (3281 allocations: 272.52 KiB)
Multiple FFTs: manual_threaded, threads
  679.694 ms (3323 allocations: 275.73 KiB)
Multiple FFTs: auto, threads
  633.797 ms (39 allocations: 3.33 KiB)

So the good news is that all methods of parallelization are essentially the same. The bad news is that they all suck :-) It looks like FFTs are almost memory-bound, and so do not benefit much from parallelization (at least on my machine). That's on Julia 1.3. I'd test on the lab's cluster, but I'm getting proxy errors...
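
The gist is not reproduced here, but the three variants plausibly boil down to something like the following (a reconstruction under assumptions, not the actual script; the "threads" vs "no threads" columns would toggle FFTW.set_num_threads before planning):

  using FFTW, BenchmarkTools, Base.Threads

  N, M = 128, 40
  ψ = randn(ComplexF64, N, N, N, M)            # M "bands" on an N×N×N grid
  plan_single  = plan_fft(ψ[:, :, :, 1])       # plan for a single band
  plan_batched = plan_fft(ψ, 1:3)              # one plan transforming all bands at once

  # "manual": serial loop of single-band FFTs
  manual(ψ) = for m in 1:M; plan_single * ψ[:, :, :, m]; end

  # "manual_threaded": same loop, distributed over Julia threads
  manual_threaded(ψ) = @threads for m in 1:M
      plan_single * ψ[:, :, :, m]
  end

  # "auto": let FFTW handle all M transforms through the batched plan
  auto(ψ) = plan_batched * ψ

  @btime manual($ψ); @btime manual_threaded($ψ); @btime auto($ψ)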

@mfherbst
Member

mfherbst commented Dec 1, 2019

My machine (julia 1.3, fftw)

4 threads
N=128, M=40
Single FFT: no threads
  19.447 ms (0 allocations: 0 bytes)
Single FFT: threads
  9.175 ms (78 allocations: 6.66 KiB)
Multiple FFTs: manual, no threads
  792.030 ms (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
  373.974 ms (110 allocations: 8.03 KiB)
Multiple FFTs: auto, no threads
  792.610 ms (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
  391.993 ms (3243 allocations: 271.92 KiB)
Multiple FFTs: manual_threaded, threads
  377.248 ms (3318 allocations: 275.66 KiB)
Multiple FFTs: auto, threads
  375.433 ms (40 allocations: 3.34 KiB)

@mfherbst
Member

mfherbst commented Dec 1, 2019

Cluster08 (julia 1.2, MKL)

16 threads
N=128, M=40
Single FFT: no threads
  43.748 ms (0 allocations: 0 bytes)
Single FFT: threads
  6.878 ms (0 allocations: 0 bytes)
Multiple FFTs: manual, no threads
  1.774 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
  418.949 ms (182 allocations: 15.58 KiB)
Multiple FFTs: auto, no threads
  1.781 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
  370.278 ms (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, threads
  325.279 ms (183 allocations: 15.50 KiB)
Multiple FFTs: auto, threads
  287.693 ms (0 allocations: 0 bytes)

and (again 1.2, MKL)

4 threads
N=128, M=40
Single FFT: no threads
  39.283 ms (0 allocations: 0 bytes)
Single FFT: threads
  11.205 ms (0 allocations: 0 bytes)
Multiple FFTs: manual, no threads
  1.751 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
  584.677 ms (111 allocations: 7.97 KiB)
Multiple FFTs: auto, no threads
  1.765 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
  549.380 ms (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, threads
  298.015 ms (107 allocations: 7.80 KiB)
Multiple FFTs: auto, threads
  496.712 ms (0 allocations: 0 bytes)

@antoine-levitt
Member Author

clustern20 (with Julia 1.1; I can't make 1.3 work with the proxy for some reason):

16 threads
N=128, M=40
Single FFT: no threads
  32.266 ms (0 allocations: 0 bytes)
Single FFT: threads
  4.336 ms (0 allocations: 0 bytes)
Multiple FFTs: manual, no threads
  1.386 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
  151.490 ms (53 allocations: 3.23 KiB)
Multiple FFTs: auto, no threads
  1.396 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
  248.748 ms (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, threads
  208.934 ms (56 allocations: 3.17 KiB)
Multiple FFTs: auto, threads
  143.142 ms (0 allocations: 0 bytes)
32 threads
N=128, M=40
Single FFT: no threads
  32.257 ms (0 allocations: 0 bytes)
Single FFT: threads
  3.193 ms (0 allocations: 0 bytes)
Multiple FFTs: manual, no threads
  1.361 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
  156.536 ms (23 allocations: 1.42 KiB)
Multiple FFTs: auto, no threads
  1.550 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
  151.108 ms (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, threads
  156.481 ms (24 allocations: 1.48 KiB)
Multiple FFTs: auto, threads
  150.511 ms (0 allocations: 0 bytes)

That's much better. I think that's consistent with FFTs being memory-bound, with memory bandwidth scaling differently across machines.

Takeaways: oversubscription is fine, and FFTW's internal threading doesn't do better than outer threading. So my suggestion is to keep planning single FFTs (as we do now), make them threaded by setting FFTW.set_num_threads to JULIA_NUM_THREADS, and add our own threading on top of that. That was fine on 1.1, and should be even better on 1.3. Pity I can't test it on the cluster...

@mfherbst
Member

mfherbst commented Dec 1, 2019

Be careful with the 32 threads on cluster 20: it has hyperthreading enabled, so effectively it's only 16 cores.

@antoine-levitt
Member Author

Yeah, I know; that was basically to test oversubscription.

@mfherbst
Member

mfherbst commented Dec 1, 2019

Julia 1.3 has changed the way it updates the registries, and it now seems to ignore the proxy settings... I've had the same issues.

@mfherbst
Member

mfherbst commented Dec 1, 2019

For FFTW I think you are right, but for MKL's FFT the picture seems to be different.

@antoine-levitt
Member Author

A bit, but maybe the results are just too noisy. Can you run the 16-thread test again? I want to see if

Multiple FFTs: manual_threaded, threads
  325.279 ms (183 allocations: 15.50 KiB)
Multiple FFTs: auto, threads
  287.693 ms (0 allocations: 0 bytes)

should be trusted or not.

@mfherbst
Member

mfherbst commented Dec 1, 2019

Another run:

Multiple FFTs: manual_threaded, threads
  312.535 ms (182 allocations: 15.58 KiB)
Multiple FFTs: auto, threads
  261.955 ms (0 allocations: 0 bytes)

and yet one more:

Multiple FFTs: manual_threaded, threads
  368.037 ms (180 allocations: 15.16 KiB)
Multiple FFTs: auto, threads
  305.578 ms (0 allocations: 0 bytes)

and on another machine (cc09):

Multiple FFTs: manual_threaded, threads
  211.597 ms (173 allocations: 14.36 KiB)
Multiple FFTs: auto, threads
  147.225 ms (0 allocations: 0 bytes)

@mfherbst
Member

mfherbst commented Dec 1, 2019

The difference is similar in each case: 50 to 60 ms.

@antoine-levitt
Member Author

Hm. So the results are inconsistent, but always in the same direction. I'm tempted to ignore it... We really should see what it does with 1.3 (or even better, master). There are a few open issues on the Julia GitHub about proxies; I posted in one, but proxies are a universal pain.

@antoine-levitt
Member Author

But really, what this all shows is that a single FFT is already pretty well parallelized. That means we can just ignore this and not do any threading at all (i.e. what we have now), and it'll be within a factor of 2 of optimal (at least for these sizes). If we just add @threads to the for loop over the FFTs, we'll probably be optimal (or very close, especially with post-1.2 improvements to threading). Then we should run a large-ish computation on the cluster, see if new bottlenecks appear, and maybe add threading accordingly.

@antoine-levitt
Member Author

For the proxy issues, see Julia issue 33111; that fixed it for me.

@antoine-levitt
Member Author

So 1.3 improves the manual_threaded for me:

16 threads
N=128, M=40
Single FFT: no threads
  32.412 ms (0 allocations: 0 bytes)
Single FFT: threads
  3.564 ms (298 allocations: 26.22 KiB)
Multiple FFTs: manual, no threads
  1.423 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
  155.690 ms (194 allocations: 16.94 KiB)
Multiple FFTs: auto, no threads
  1.415 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
  177.217 ms (12010 allocations: 1.03 MiB)
Multiple FFTs: manual_threaded, threads
  143.499 ms (12359 allocations: 1.04 MiB)
Multiple FFTs: auto, threads
  173.176 ms (453 allocations: 37.64 KiB)
32 threads
N=128, M=40
Single FFT: no threads
  34.014 ms (0 allocations: 0 bytes)
Single FFT: threads
  2.989 ms (588 allocations: 52.25 KiB)
Multiple FFTs: manual, no threads
  1.442 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
  156.606 ms (306 allocations: 28.81 KiB)
Multiple FFTs: auto, no threads
  1.451 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
  170.377 ms (23837 allocations: 2.05 MiB)
Multiple FFTs: manual_threaded, threads
  154.102 ms (24331 allocations: 2.08 MiB)
Multiple FFTs: auto, threads
  144.152 ms (622 allocations: 51.50 KiB)

Still a slight edge for auto FFTW on 32 cores, but that changes from benchmark to benchmark, and when I repeated it manual_threaded was faster. So let's go with #77 and not bother too much.

@mfherbst
Member

mfherbst commented Dec 1, 2019

I agree, especially since this keeps more control on our end and opens the way to integrate with developments happening in Julia in the future.

@antoine-levitt
Member Author

OK, let's close this one for now then. We can revisit according to profiling.

@antoine-levitt
Member Author

One thing is that FFTW defaults to no threading. Let's keep that manual for now, but note for later that we have to call FFTW.set_num_threads and BLAS.set_num_threads. Also, FFTW threading is fixed at plan creation.
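
A minimal sketch of that last point: the thread count in effect when a plan is created is the one the plan keeps, so set it before planning and reuse the plan afterwards.

  using FFTW

  FFTW.set_num_threads(Threads.nthreads())    # must happen before planning
  x = randn(ComplexF64, 128, 128, 128)
  p = plan_fft(x; flags=FFTW.MEASURE)         # thread count is baked into this plan
  y = p * x                                   # later set_num_threads calls don't affect p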

@mfherbst
Member

mfherbst commented Dec 1, 2019

That is not true; for me it does use threading by default.

@mfherbst
Member

mfherbst commented Dec 1, 2019

See https://github.com/JuliaMath/FFTW.jl/blob/master/src/FFTW.jl#L59. This is activated if nthreads() > 1, and I have export JULIA_NUM_THREADS=4 set by default, which I think is the way to go for this issue.

@antoine-levitt
Member Author

Oh, you're absolutely right; I stopped reading at https://github.com/JuliaMath/FFTW.jl/blob/master/src/FFTW.jl#L41. They're really confident oversubscription is not a problem, then!

@mfherbst
Member

mfherbst commented Dec 1, 2019

Indeed. I just saw that, too.
