Deadlock with multiple FFTW plan construction / destruction in serial and threaded runs #141

Closed
mfherbst opened this issue Mar 10, 2020 · 16 comments · Fixed by #160

Comments

@mfherbst

Running the following code example causes a deadlock in conjunction with Julia 1.3.1, DFTK 0.0.5 and FFTW 1.2.0. The stacktrace (after a Ctrl-C) indicates that the deadlock occurs in FFTW. Note that this behaviour is encountered both in serial and threaded runs.

Code example:

using DFTK

function main()
    for Ecut in 0.0 .+ collect(3:20)
        println("Ecut = $Ecut")
        a = 10.263141334305942
        lattice = a / 2 .* [[0 1 1.]; [1 0 1.]; [1 1 0.]]
        model = Model(lattice; terms=[Kinetic()], n_electrons=4)
        basis = PlaneWaveBasis(model, Ecut)
    end
end

main()

Output / Stacktrace: https://gist.github.com/mfherbst/4015ff891001597fccf66648abff1aa1

Sorry about the complicated reproducer btw ... I tried to reduce as much as I could.

@mfherbst
Author

Note: The issue is also present on master as of now.

@mfherbst
Author

Could be related to #135.

@mfherbst
Author

mfherbst commented Mar 11, 2020

The issue disappears when explicitly setting FFTW.set_num_threads(1) (also when Julia runs with JULIA_NUM_THREADS=4).
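
For reference, a minimal sketch of how I apply the workaround (the exact placement is my choice; it only needs to run before any plans are created):

using DFTK
using FFTW

FFTW.set_num_threads(1)   # workaround: keep FFTW itself single-threaded

# ... then run main() from the reproducer above unchanged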

@stevengj
Member

stevengj commented Mar 11, 2020

@vtjnash, any ideas? FFTW has an internal pthreads mutex lock around plan creation and destruction, but I don't see how that could deadlock Julia…

Also, the interaction with FFTW.set_num_threads(1) is weird, since that doesn't affect the plan creation/destruction lock, and only changes whether FFTW creates plans that use the @spawn callback.

@EthanAnderes

EthanAnderes commented Mar 26, 2020

I think I'm seeing the same issue. Here is a reduced toy case.

using FFTW 
import Base: * 

struct rFFT{nᵢ} end

@generated function plan(::Type{F}) where {nᵢ, F<:rFFT{nᵢ}}
    X  = zeros(Float64, nᵢ...) 
    FT = plan_rfft(X; flags=FFTW.MEASURE, timelimit=10)
    return :( $FT )
end

nᵢ = (256, 50)
FT = rFFT{nᵢ}
matx = rand(Float64, nᵢ...)

plan(FT) * matx

The last line hangs with no CPU activity coming from Julia. If, however, I replace the
last line with the following two lines

*(::Type{FT}, x::Array) where FT<:rFFT = plan(FT) * x
FT * matx

then I get this segfault.

julia> FT * matx

signal (11): Segmentation fault: 11
in expression starting at REPL[9]:1
spawn_apply at /Users/ethananderes/.julia/artifacts/686f343230e1853cc948cd0736b57067c66bca4b/lib/libfftw3.3.dylib (unknown line)
#1 at ./threadingconstructs.jl:126
unknown function (ip: 0x115f16cf4)
jl_apply at /Users/ethananderes/Software/juliaMaster/src/./julia.h:1685 [inlined]
start_task at /Users/ethananderes/Software/juliaMaster/src/task.c:687
Allocations: 7684063 (Pool: 7682495; Big: 1568); GC: 7
zsh: segmentation fault  /Users/ethananderes/Software/juliaMaster/julia -O3

I tried to track down the issue further but I hit my limit of understanding. Hope
this helps.

julia> versioninfo()
Julia Version 1.5.0-DEV.513
Commit 5ecc17f0e0 (2020-03-26 20:16 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.4.0)
  CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4
  JULIA_FFTW_PROVIDER = FFTW

@mfherbst
Author

mfherbst commented Mar 27, 2020

I can reproduce @EthanAnderes's first example on Julia 1.4 and FFTW 1.2.0, but I do not get a segfault with the second example (with the explicitly defined * operator); only a hanging Julia process again.

As before, FFTW.set_num_threads(1) makes the code run through smoothly in both cases.

Update: I also get the segfault if I start from a fresh session.

@EthanAnderes

Note: to get the segfault I need to use a fresh session and not
evaluate plan(FT) * matx first.

using FFTW 
import Base: * 
struct rFFT{nᵢ} end
@generated function plan(::Type{F}) where {nᵢ, F<:rFFT{nᵢ}}
    X  = zeros(Float64, nᵢ...) 
    FT = plan_rfft(X; flags=FFTW.MEASURE, timelimit=10)
    return :( $FT )
end
nᵢ = (256, 50)
FT = rFFT{nᵢ}
matx = rand(Float64, nᵢ...)
*(::Type{FT}, x::Array) where FT<:rFFT = plan(FT) * x
FT * matx

On Julia v1.4.0 I don't get as much readout from the segfault:

julia> FT * matx
zsh: segmentation fault  /Users/ethananderes/Software/julia1.4/julia -O3

Perhaps I should also remark that by using a generated function for plan
I'm trying to memoize it for quick instantiation inside other functions.
If anyone knows of a non-segfault way to get around this I'd love to hear it.

@vtjnash
Contributor

vtjnash commented Mar 27, 2020

Attempting to access FFTW inside @generated is clearly invalid per the documentation of what is allowed in generated functions. If you want memoization, you should write your own rather than cause undefined behavior in the compiler. I suspect it should be printing that as an error message, but I'm not certain why that fails.
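
To illustrate the kind of hand-rolled memoization meant here, a minimal sketch (the cache, lock, and function names are my own illustration, not code from FFTW.jl or this thread): build the plans lazily at run time, keyed on the array size, instead of inside @generated.

using FFTW

# Illustrative plan cache keyed by array size; the lock keeps concurrent
# callers from racing on the Dict. Plan creation now happens in ordinary
# run-time code instead of inside the compiler.
const PLAN_CACHE = Dict{Tuple{Vararg{Int}},Any}()
const PLAN_LOCK = ReentrantLock()

function cached_plan(nᵢ::Tuple{Vararg{Int}})
    lock(PLAN_LOCK) do
        get!(PLAN_CACHE, nᵢ) do
            plan_rfft(zeros(Float64, nᵢ...); flags=FFTW.MEASURE, timelimit=10)
        end
    end
end

# Usage, mirroring the example above:
# cached_plan((256, 50)) * rand(Float64, 256, 50)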

@vtjnash
Contributor

vtjnash commented Mar 27, 2020

@stevengj It looks like it's possibly trying to hold the lock recursively, since it's in the middle of creating one plan while trying to finalize a different one. I'd like to move all finalizers to a separate thread to solve this, but that's rather far away on my roadmap right now. For now, I recommend only doing trivial things in a finalizer, like appending the object to a queue to be cleaned up later.
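
A minimal sketch of that queue pattern, purely as an illustration (the queue and function names are mine, and FFTW.destroy_plan refers to the internal function that appears in the finalizer quoted further down; a real version would also need to make the queue itself safe to touch from a finalizer, e.g. via trylock plus re-registering the finalizer):

using FFTW

# The finalizer only records the plan; it never calls into FFTW or takes
# the planner lock itself.
const PENDING_DESTROY = Any[]

deferred_destroy(p) = (push!(PENDING_DESTROY, p); nothing)

# Ordinary code (e.g. just before creating the next plan) drains the queue
# and performs the actual destruction while it can safely take the lock.
function drain_pending!()
    while !isempty(PENDING_DESTROY)
        FFTW.destroy_plan(pop!(PENDING_DESTROY))
    end
end

# i.e. register finalizer(deferred_destroy, p) instead of
# finalizer(destroy_plan, p)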

@EthanAnderes

Attempting to access FFTW inside @generated is clearly invalid per the documentation of what is allowed in generated functions.

I guess you're referring to the FFTW documentation saying that fft plans have global mutable state?
So if, instead, they were just regular structs which held information about which
fft method to apply, then it would be fine?

Anyway, these generated functions for caching fft plans have been working flawlessly for me
for a number of years now and only started giving me problems with the new threading stuff.
(see here for example)

@IanButterworth
Contributor

Has anyone found a workaround other than setting FFTW.set_num_threads(1)? I'm guessing not, but thought I'd check, given the performance hit.

@IanButterworth
Contributor

IanButterworth commented Jun 21, 2020

Edit: I'm realizing now that @EthanAnderes's issue seems to be different from the deadlock.


@vtjnash in case it's insightful at all: commenting out the finalizer here doesn't seem to change the segfault seen with @EthanAnderes's example:

finalizer(destroy_plan, p)

@IanButterworth
Contributor

Ok, commenting out the finalizer fixes @mfherbst's main example with threaded FFTW

@stevengj
Member

This issue should be fixed by #160.

@antoine-levitt
Contributor

Great, thanks for the fix! I can't reproduce the original issue anymore on FFTW 1.2.2. I don't think we changed anything in DFTK in that respect, so the only other difference seems to be that I'm now running Julia 1.5.

@mfherbst
Author

@stevengj @vtjnash Thanks for the update and your effort on this stuff!
