
IJulia kernel doesn't work for Julia 1.6 on macOS #968

Closed
carstenbauer opened this issue Dec 6, 2020 · 55 comments · Fixed by #985

Comments

@carstenbauer (Member) commented Dec 6, 2020

I can't get the IJulia kernel to work with Julia 1.6/master. I've tried the latest IJulia release and IJulia#master. The kernel either fails to start at all or dies immediately.

Julia <= 1.5.3 kernels work just fine.

julia> versioninfo()
Julia Version 1.6.0-DEV.1661
Commit 56dd7d71a7* (2020-12-04 13:18 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.6.0)
  CPU: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.0 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = subl
  JULIA_NUM_THREADS = 6
  JULIA_PROJECT = /Users/crstnbr/.julia/environments/base
  JULIA_DQMC = /Users/crstnbr/sciebo/codes/julia-sdw-dqmc
  JULIA_PKG_SERVER = pkg.julialang.org

[screenshot: kernel_issues]

@stevengj (Member) commented Dec 6, 2020

Does the IJulia build succeed? Mine is failing due to JuliaLang/Pkg.jl#2270 (i.e. an unrelated Pkg issue)

@carstenbauer (Member, Author)

The build succeeds.

[screenshot: ijulia 2020-12-06 22_03_54]

@stevengj (Member) commented Dec 6, 2020

@carstenbauer (Member, Author)

Hm, I can't manage to get any useful debugging output. It seems like setting ENV["IJULIA_DEBUG"]=true doesn't do anything?

I removed all instances of IJulia (in ~/.julia/packages and ~/.julia/compiled) and used a fresh environment to add and build IJulia with the environment variable set. When using IJulia.notebook() to start the jupyter server from the REPL, I only get this output, with no useful information:

[Screenshot 2020-12-07 at 10:38:06]

Starting the notebook server directly from the command line (jupyter notebook), I only get:

[Screenshot 2020-12-07 at 10:36:08]

Is there something else that I could try?

@tomyun commented Dec 7, 2020

I have the same issue with IJulia 1.23.1 on Julia 1.6-DEV.1678.

Running jupyter notebook from the terminal produces messages like these:

[I 02:23:21.060 NotebookApp] Kernel started: 7d26ab67-be88-4e9d-ae9f-fe61be79db8b, name: julia-1.6
[I 02:23:27.059 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
[I 02:23:33.068 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
[I 02:23:39.073 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
[I 02:23:45.079 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
[I 02:23:51.088 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
[I 02:23:57.097 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
[I 02:24:03.108 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
[I 02:24:09.116 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
[I 02:24:15.127 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports

Calling IJulia.notebook() from the REPL launched a notebook, but the kernel didn't work either.

julia> versioninfo()
Julia Version 1.6.0-DEV.1678
Commit 6eef7b69ab* (2020-12-06 20:10 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.0 (ORCJIT, skylake)

@fredrikekre (Member)

Julia master is currently pretty broken so I wouldn't be surprised if that is the problem here rather than IJulia.

@carstenbauer (Member, Author) commented Dec 10, 2020

Julia master is currently pretty broken so I wouldn't be surprised if that is the problem here rather than IJulia.

If that's the case, fair enough. However, let me note that I see the same issue on the fresh 1.6 release branch. Isn't the latter supposed to be at least "beta" stable?

@carstenbauer changed the title from "Uulia kernel for Julia 1.6 (master) doesn't work" to "Uulia kernel doesn't work for Julia 1.6" on Dec 22, 2020
@carstenbauer (Member, Author)

Still seeing this for the 1.6 release branch. Any pointers to how I could investigate this further would be great.

@fredrikekre changed the title from "Uulia kernel doesn't work for Julia 1.6" to "IJulia kernel doesn't work for Julia 1.6" on Dec 22, 2020
@fredrikekre (Member)

It works for me.

@carstenbauer (Member, Author)

It works for me.

Interesting... What OS are you on and what version of IJulia are you using?

@tomyun commented Dec 22, 2020

I tried it again with IJulia 1.23.1 on the latest Julia nightly (1.7.0-DEV.136), but still have the same issue. I'm on macOS.

julia> versioninfo()
Julia Version 1.7.0-DEV.136
Commit 549a73b99d (2020-12-22 08:49 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.0 (ORCJIT, skylake)

@carstenbauer (Member, Author)

Alright, I have tested julia#master (Version 1.7.0-DEV.197 (2020-12-30)) with IJulia v1.23.1 on Ubuntu 20.04, Windows 10, and macOS 11.1 (Big Sur, x86). As it turns out, everything works fine on Linux and Windows, so this appears to be a macOS-specific issue. @fredrikekre I assume you're on Linux or Windows?

Given that @tomyun saw the same issue on macOS 10.15 Catalina I assume it's a "generic" macOS issue and not tied to a particular macOS version.

Any ideas how to debug/fix this?

@carstenbauer changed the title from "IJulia kernel doesn't work for Julia 1.6" to "IJulia kernel doesn't work for Julia 1.6 on macOS" on Dec 30, 2020
@xukai92 commented Dec 31, 2020

It might be security related (just a guess) because IJulia doesn't even manage to create the 1.7 kernel file in my case.

@carstenbauer (Member, Author) commented Dec 31, 2020

It might be security related (just a guess) because IJulia doesn't even manage to create the 1.7 kernel file in my case.

If by kernel file you mean the kernel spec (kernel.json in ~/Library/Jupyter/kernels/julia-1.7/) I don't see any issues with its creation. Try an explicit ] build IJulia. I have no issues selecting the kernel in jupyter, it just doesn't work.

@carstenbauer (Member, Author) commented Dec 31, 2020

I'm trying to debug this. I started by manually executing every line in IJulia.init(String[]). What I found is that this line, in which start_heartbeat is called, causes a segmentation fault on julia#master, i.e. 1.7-dev:

julia> start_heartbeat(heartbeat[])
[1]    50603 segmentation fault  $HOME/repos/julia/usr/bin/Julia

@carstenbauer (Member, Author) commented Dec 31, 2020

Here is a MWE which segfaults on macOS with 1.7 but works fine with 1.5.3:
(The same MWE works with Julia 1.7 on Ubuntu 20.04 and Windows 10, confirming that this is a macOS issue.)

using ZMQ # v1.2.1
using ZMQ: libzmq

const threadid = zeros(Int, 128)

function heartbeat_thread(sock::Ptr{Cvoid})
    ccall((:zmq_proxy,libzmq), Cint, (Ptr{Cvoid}, Ptr{Cvoid}, Ptr{Cvoid}),
          sock, sock, C_NULL)
    nothing
end

const heartbeat = Ref{Socket}()
heartbeat[] = Socket(ROUTER)

sock = heartbeat[]

# function start_heartbeat(sock)
heartbeat_c = @cfunction(heartbeat_thread, Cvoid, (Ptr{Cvoid},))
ccall(:uv_thread_create, Cint, (Ptr{Int}, Ptr{Cvoid}, Ptr{Cvoid}), threadid, heartbeat_c, sock) # this line segfaults

@xukai92 commented Dec 31, 2020

It might be security related (just a guess) because IJulia doesn't even manage to create the 1.7 kernel file in my case.

If by kernel file you mean the kernel spec (kernel.json in ~/Library/Jupyter/kernels/julia-1.7/) I don't see any issues with its creation. Try an explicit ] build IJulia. I have no issues selecting the kernel in jupyter, it just doesn't work.

Yes, and I had to create them manually on my side by copying previous ones. But even after that, it doesn't work for me, failing in a similar way to yours.

@carstenbauer (Member, Author)

Yes and I have to create them manually on my side by basically copying previous ones.

Hm, strange, but I think this is an orthogonal issue.

@stevengj (Member) commented Jan 1, 2021

The following crashes even without involving ZMQ:

const threadid = zeros(Int, 128)

function heartbeat_thread(sock::Ptr{Cvoid})
    ccall(:printf, Cint, (Cstring, Ptr{Cvoid}), "got sock = %p\n", sock)
    nothing
end

sock = Ptr{Cvoid}(0x0123456789)

# function start_heartbeat(sock)
heartbeat_c = @cfunction(heartbeat_thread, Cvoid, (Ptr{Cvoid},))

ccall(:uv_thread_create, Cint, (Ptr{Int}, Ptr{Cvoid}, Ptr{Cvoid}), threadid, heartbeat_c, sock) # this line segfaults

@Keno, has something changed in Julia 1.6 that would affect calling uv_thread_create?

Correction: the above crashes even in Julia 1.5, so may be unrelated.

@rgobbel (Contributor) commented Jan 9, 2021

I've been trying to chase this down, and it looks like the problem is in IJulia.jl/src/heartbeat.jl. I think the call to uv_thread_create is segfaulting. Here's the code in question (lines 20 and 21):

    result = ccall(:uv_thread_create, Cint, (Ptr{Int}, Ptr{Cvoid}, Ptr{Cvoid}),
                   threadid, heartbeat_c, sock)

I've been trying to pin it down more precisely without much success, but adding println calls all over the place reveals that the call to uv_thread_create never returns, and lldb reports a segfault. I'm continuing to look at this with the latest sources.

@carstenbauer (Member, Author)

@rgobbel You don't seem to have noticed that I had already pinned it down to this ccall above.

If you want to investigate this further (which would be great) it's probably easier to focus on the MWE above.

@rgobbel (Contributor) commented Jan 10, 2021

My bad, I hadn't read this whole thread. Still digging, but it looks like the address of the called cfunction is getting clobbered, so pthread_create crashes instantly. Hopefully I'll have more details once I get a better MWE; so far I haven't been able to find one that works in 1.5 but fails in 1.6 without involving ZMQ.

@stevengj (Member) commented Jan 15, 2021

Can we re-implement the heartbeat thread in terms of the @threadcall macro?

Update: no, @threadcall((:zmq_proxy,libzmq), Cint, (Ptr{Cvoid}, Ptr{Cvoid}, Ptr{Cvoid}), sock, sock, C_NULL) still crashes.

@rgobbel (Contributor) commented Jan 15, 2021

I've been trying a few things: compiling with clang on Linux, compiling with gcc on macOS, and building with the thread sanitizer. So far none of those have worked: clang builds on Linux run into some "not a compile-time constant" issues, gcc on macOS hits a problem with library versions, and the thread sanitizer fails because it isn't loaded early enough in some phases of the build process. By the way, I'm also seeing the MWE fail intermittently on 1.5.3:

Julia version = 1.5.3
Assertion failed: nbytes == sizeof (dummy) (src/signaler.cpp:391)

signal (6): Abort trap: 6
in expression starting at none:0
zsh: segmentation fault  julia mwe.jl

I'm seeing that about 20% of the time. The only change I made to the code listed above is that it prints the value of VERSION at the beginning. With that sort of intermittent failure, it seems very likely that this is a race condition of some sort.

@stevengj (Member)

We could just omit the heartbeat thread on macOS — I think it is optional in Jupyter these days?

It's just a bit frustrating not to know why this is crashing.

@rgobbel (Contributor) commented Jan 16, 2021

Still trying to build with thread sanitization turned on, but it's tripping over (among other things) this (src/task.c, lines 55-65):

#if defined(JL_TSAN_ENABLED)
static inline void tsan_destroy_ctx(jl_ptls_t ptls, void *state) {
    if (state != &ptls->root_task->state) {
        __tsan_destroy_fiber(ctx->state);
    }
    ctx->state = NULL;
}
static inline void tsan_switch_to_ctx(void *state)  {
    __tsan_switch_to_fiber(state, 0);
}
#endif

ctx is not defined in tsan_destroy_ctx. How on earth is this supposed to work?

@rgobbel (Contributor) commented Jan 18, 2021

Breaking news: I have a working version, though it needs a bunch of cleanup to get rid of all the debugging junk I threw in. I commented out all of the heartbeat stuff, which got it to this:

ERROR: LoadError: ArgumentError: expecting stdout stream
Stacktrace:
 [1] (::Base.redirect_stdio)(io::IJulia.IJuliaStdio{Base.TTY})
   @ IJulia ~/src/IJulia.jl/src/stdio.jl:31
 [2] init(args::Vector{String})
   @ IJulia ~/src/IJulia.jl/src/init.jl:120
 [3] top-level scope
   @ ~/src/IJulia.jl/src/kernel.jl:30
in expression starting at /Users/gobbel/src/IJulia.jl/src/kernel.jl:30

For debugging purposes, I pulled apart the function definition loop in stdio.jl into three separate functions, which got me to:

WARNING: Method definition Any(IJulia.IJuliaStdio{IO_t} where IO_t<:IO) in module IJulia at /Users/gobbel/.julia/packages/IJulia/IDNmS/src/stdio.jl:38 overwritten at /Users/gobbel/.julia/packages/IJulia/IDNmS/src/stdio.jl:45.
  ** incremental compilation may be fatally broken for this module **

WARNING: Method definition Any(IJulia.IJuliaStdio{IO_t} where IO_t<:IO) in module IJulia at /Users/gobbel/.julia/packages/IJulia/IDNmS/src/stdio.jl:45 overwritten at /Users/gobbel/.julia/packages/IJulia/IDNmS/src/stdio.jl:52.
  ** incremental compilation may be fatally broken for this module **

I ditched the three separate functions in favor of a single, parameterized one:

function redirect_one(io::IJuliaStdio, which::String)
    js = io[:jupyter_stream]
    js != which && throw(ArgumentError("expecting $(which) stream, got $(js)"))
    Core.eval(Base, Expr(:(=), Symbol(which), io))
    return io
end

and that works! In init.jl, the calls to redirect_stdout(), etc. didn't need any changes, so I just substituted the calls to the new parameterized version as follows:

    if capture_stdout
        read_stdout[], = redirect_stdout()
        redirect_one(IJuliaStdio(stdout,"stdout"), "stdout")
    end
    if capture_stderr
        read_stderr[], = redirect_stderr()
        redirect_one(IJuliaStdio(stderr,"stderr"), "stderr")
    end
    redirect_one(IJuliaStdio(stdin,"stdin"), "stdin")

I'll do a PR once I get this cleaned up and tested for compatibility with Linux and Windows. I also tried putting the heartbeat back in after getting a working version and verified that it's still causing a segfault, so whatever's causing that, it's still at large.

@vtjnash (Member) commented Feb 9, 2021

I ditched the three separate functions in favor of a single, parameterized one:

FWIW, that was the goal of the changes to Base: to make it easier to consolidate the code and use less metaprogramming for these. To that end, there's also now Base._redirect_io_global for changing the global (instead of eval).

@stevengj (Member) commented Feb 9, 2021

Note that the heartbeat problem still remains.

@staticfloat (Member)

Yeah, I'm looking into it. I believe it actually has nothing to do with ZMQ and is instead compiler-internal changes around what you're allowed to do inside of uv_thread_create callbacks.

@stevengj (Member) commented Feb 11, 2021

@staticfloat, it's confusing to me that ccall-ing a thread-safe function could be disallowed in a thread. Does ccall do something thread-unsafe besides just calling the function?

(As mentioned above, this needs to be in a thread, not a task, because otherwise a long-running task that fails to yield could cause Jupyter to think that Julia has died and restart us. Also, in a task we couldn't use zmq_proxy — we'd have to implement the socket response ourselves.)
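To illustrate what the task-based alternative dismissed above would entail, here is a hypothetical sketch (not IJulia code; task_heartbeat is an illustrative name) of the echo loop that would have to replace zmq_proxy:

```julia
# Hypothetical sketch only: a task-based heartbeat echo.
using ZMQ

function task_heartbeat(sock::Socket)
    @async while true
        msg = recv(sock)   # block (yielding) until the client pings
        send(sock, msg)    # echo the message back unchanged
    end
end
```

Because @async tasks only run when the event loop gets a chance to yield, a long-running computation in a notebook cell would silently pause this loop, which is exactly the failure mode described above.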

@staticfloat (Member)

Does ccall do something thread-unsafe besides just calling the function?

@vtjnash tells me that looking up global values is not allowed within a uv_thread_create callback. This includes, for instance, the lookup of libzmq from global scope. I briefly experimented with passing through a struct that contains both the sock and the result of dlsym(:zmq_proxy, libzmq), but I ran into so many segfaults due to lookups occurring that I'm nervous; perhaps it would be better for us to not use a uv_thread_create callback at all, if there is any way we can avoid it.

As for the reason why 1.5 didn't crash here, it's because the dlsym() lookup would happen at compile-time, whereas it now occurs at first run, to allow for dynamic libname values in the (:funcname, libname) tuple.
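The struct experiment described above might have looked roughly like this hypothetical sketch (names such as HeartbeatArgs are illustrative; it assumes ZMQ.jl's Socket converts to Ptr{Cvoid} via unsafe_convert, and per the report above even this variant still segfaulted):

```julia
using ZMQ, Libdl

# All lookups happen on the main thread; the callback receives
# plain pointers and touches no Julia globals or dlsym machinery.
struct HeartbeatArgs
    sock::Ptr{Cvoid}
    proxy::Ptr{Cvoid}
end

function heartbeat_thread(p::Ptr{HeartbeatArgs})
    args = unsafe_load(p)
    ccall(args.proxy, Cint, (Ptr{Cvoid}, Ptr{Cvoid}, Ptr{Cvoid}),
          args.sock, args.sock, C_NULL)
    nothing
end

const threadid = zeros(Int, 128)
const args = Ref{HeartbeatArgs}()   # global root so the GC keeps it alive

function start_heartbeat(sock::Socket)
    args[] = HeartbeatArgs(Base.unsafe_convert(Ptr{Cvoid}, sock),
                           dlsym(dlopen(ZMQ.libzmq), :zmq_proxy))
    cb = @cfunction(heartbeat_thread, Cvoid, (Ptr{HeartbeatArgs},))
    ccall(:uv_thread_create, Cint, (Ptr{Int}, Ptr{Cvoid}, Ptr{Cvoid}),
          threadid, cb, args)
end
```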

@stevengj (Member) commented Feb 11, 2021

(This is just a pthread under the hood on macOS.) Why can't a thread access a constant global, @vchuravy?

A runtime dlsym lookup seems like a more likely culprit to be writing to thread-unsafe global state. Couldn't we just pass in the dlsym pointer of the function?

@stevengj (Member)

Doing the dlsym lookup outside the thread seems to fix the problem for me. That is, this version of @crstnbr's MWE now succeeds on macOS with Julia 1.6 for me:

using ZMQ, Libdl
const zmq_proxy = dlsym(dlopen(ZMQ.libzmq), :zmq_proxy)
const threadid = zeros(Int, 128)

function heartbeat_thread(sock::Ptr{Cvoid})
    ccall(zmq_proxy, Cint, (Ptr{Cvoid}, Ptr{Cvoid}, Ptr{Cvoid}),
          sock, sock, C_NULL)
    nothing
end

const heartbeat = Ref{Socket}()
heartbeat[] = Socket(ROUTER)

# function start_heartbeat(sock)
heartbeat_c = @cfunction(heartbeat_thread, Cvoid, (Ptr{Cvoid},))
ccall(:uv_thread_create, Cint, (Ptr{Int}, Ptr{Cvoid}, Ptr{Cvoid}), threadid, heartbeat_c, heartbeat[])

@rgobbel (Contributor) commented Feb 12, 2021

I'm finding that I get very different results depending on the contents of ~/.julia/compiled. After deleting all precompiled code, the code below works in both v1.5.3 and v1.6, mostly:

using ZMQ, Libdl # ZMQ v1.2.1
const zmq_proxy = dlsym(dlopen(ZMQ.libzmq), :zmq_proxy)

const threadid = zeros(Int, 128)

ccall(:jl_safe_printf, Cint, (Ptr{UInt8},), "TRAP1\n")
@ccall jl_safe_printf("TRAP2\n"::Ptr{UInt8})::Cint


function heartbeat_thread(sock::Ptr{Cvoid})
    ccall(zmq_proxy, Cint, (Ptr{Cvoid}, Ptr{Cvoid}, Ptr{Cvoid}),
          sock, sock, C_NULL)
    ccall(:jl_safe_printf, Cint, (Ptr{UInt8},), "TRAP3\n")
    @ccall jl_safe_printf("TRAP4\n"::Ptr{UInt8})::Cint
    nothing
end

const heartbeat = Ref{Socket}()
heartbeat[] = Socket(ROUTER)

sock = heartbeat[]

# function start_heartbeat(sock)
heartbeat_c = @cfunction(heartbeat_thread, Cvoid, (Ptr{Cvoid},))
ccall(:uv_thread_create, Cint, (Ptr{Int}, Ptr{Cvoid}, Ptr{Cvoid}), threadid, heartbeat_c, sock) # this line segfaults

If I comment out one or both of the TRAP1 and TRAP2 calls to jl_safe_printf, I get erratic results: some segfaults, some fine. This seems to be partly dependent on what has been precompiled. I haven't managed to pin down exactly what combination makes the difference. In some cases, I've seen it alternate between running and not, with no other actions, so that has to be some sort of race condition. I think precompilation is failing sometimes, though I can't be more precise than that.

@stevengj (Member)

I think precompilation is failing sometimes

I'm confused — are you putting the MWE into a module (IJulia?) and precompiling that? This won't work because zmq_proxy is a pointer that has to be initialized at runtime.

#985 seems to work fine for me with both 1.6 and 1.5.3.
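For a precompiled module, the conventional pattern is to allocate the Ref at compile time and fill it in at runtime from __init__; a minimal sketch of that pattern (illustrative module name, not IJulia's actual code):

```julia
module HeartbeatExample

using ZMQ, Libdl

# Precompile-safe: the Ref is serialized into the cache holding C_NULL;
# the dlsym lookup runs afresh every time the module is loaded.
const zmq_proxy = Ref(C_NULL)

function __init__()
    zmq_proxy[] = dlsym(dlopen(ZMQ.libzmq), :zmq_proxy)
end

end # module
```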

@rgobbel (Contributor) commented Feb 13, 2021

I think precompilation is failing sometimes

I'm confused — are you putting the MWE into a module (IJulia?) and precompiling that? This won't work because zmq_proxy is a pointer that has to be initialized at runtime.

#985 seems to work fine for me with both 1.6 and 1.5.3.

I'm still seeing an intermittent failure, even with my test case modified per #985.

Here's the complete source of what I'm running, warts and all:

using ZMQ, Libdl #: libzmq
#const zmq_proxy = dlsym(dlopen(ZMQ.libzmq), :zmq_proxy)
const zmq_proxy = Ref(C_NULL)

const threadid = zeros(Int, 128)

ccall(:jl_safe_printf, Cint, (Ptr{UInt8},), "TRAP1\n")
#@ccall jl_safe_printf("TRAP2\n"::Ptr{UInt8})::Cint


function heartbeat_thread(sock::Ptr{Cvoid})
    ccall(zmq_proxy[], Cint, (Ptr{Cvoid}, Ptr{Cvoid}, Ptr{Cvoid}),
          sock, sock, C_NULL)
    #ccall(:jl_safe_printf, Cint, (Ptr{UInt8},), "TRAP3\n")
    #@ccall jl_safe_printf("TRAP4\n"::Ptr{UInt8})::Cint
    nothing
end

const heartbeat = Ref{Socket}()
heartbeat[] = Socket(ROUTER)

sock = heartbeat[]

# function start_heartbeat(sock)
zmq_proxy[] = Libdl.dlsym(Libdl.dlopen(ZMQ.libzmq), :zmq_proxy)
heartbeat_c = @cfunction(heartbeat_thread, Cvoid, (Ptr{Cvoid},))
ccall(:uv_thread_create, Cint, (Ptr{Int}, Ptr{Cvoid}, Ptr{Cvoid}), threadid, heartbeat_c, sock) # this line segfaults

Here's what's happening when I run it:

gobbel@saul:~/src/julia-stashed/mwe$ /Applications/Julia-1.6.app/Contents/Resources/julia/bin/julia mwe1.jl
TRAP1
√ gobbel@saul:~/src/julia-stashed/mwe$ /Applications/Julia-1.6.app/Contents/Resources/julia/bin/julia mwe1.jl
TRAP1
Assertion failed: nbytes == sizeof (dummy) (src/signaler.cpp:391)

signal (6): Abort trap: 6
in expression starting at none:0
zsh: segmentation fault  /Applications/Julia-1.6.app/Contents/Resources/julia/bin/julia mwe1.jl
?139 gobbel@saul:~/src/julia-stashed/mwe$ /Applications/Julia-1.6.app/Contents/Resources/julia/bin/julia mwe1.jl
TRAP1
Assertion failed: nbytes == sizeof (dummy) (src/signaler.cpp:391)

signal (6): Abort trap: 6
in expression starting at none:0
zsh: segmentation fault  /Applications/Julia-1.6.app/Contents/Resources/julia/bin/julia mwe1.jl
?139 gobbel@saul:~/src/julia-stashed/mwe$ /Applications/Julia-1.6.app/Contents/Resources/julia/bin/julia mwe1.jl
TRAP1
√ gobbel@saul:~/src/julia-stashed/mwe$

Sometimes it runs without blowing up, sometimes not. Before the first run, I cleaned out ~/.julia/compiled, then all of these were done back-to-back, without touching the code or anything related to Julia's persistent state. I can keep doing that as long as I like, and I get an apparently random sequence of good and bad results.

@stevengj (Member) commented Feb 13, 2021

Could it be that Julia is calling the atexit hook closing the ZMQ context (invalidating the socket) before the thread starts? Try just adding a sleep(10) at the end of your script to make sure that the thread has a chance to run before Julia exits.

@rgobbel (Contributor) commented Feb 13, 2021

Could it be that Julia is calling the atexit hook closing the ZMQ context (invalidating the socket) before the thread starts? Try just adding a sleep(10) at the end of your script to make sure that the thread has a chance to run before Julia exits.

Tried that, didn’t seem to make any difference. It’s still about 50/50 whether it runs to completion or crashes.

@stevengj (Member)

@rgobbel, I can't reproduce your problem (on macOS 10.15.7 with Julia release-1.6/a58bdd9010*). It works every time for me.

@carstenbauer (Member, Author)

@stevengj Unfortunately, I can reproduce on macOS 11.1 with Julia release-1.6/de69b02a48. With a sleep(10) at the end of @rgobbel's script, it failed 2 out of 5 times:

➜  segfaulttest julia-dev --project=. segfaulttest.jl
TRAP1
Assertion failed: nbytes == sizeof (dummy) (src/signaler.cpp:391)

signal (6): Abort trap: 6
in expression starting at none:0
[1]    61009 segmentation fault  $HOME/repos/julia/usr/bin/julia --project=. segfaulttest.jl

➜  segfaulttest julia-dev --project=. segfaulttest.jl
TRAP1

➜  segfaulttest julia-dev --project=. segfaulttest.jl
TRAP1
Assertion failed: nbytes == sizeof (dummy) (src/signaler.cpp:391)

signal (6): Abort trap: 6
in expression starting at none:0
[1]    61018 segmentation fault  $HOME/repos/julia/usr/bin/julia --project=. segfaulttest.jl

➜  segfaulttest julia-dev --project=. segfaulttest.jl
TRAP1

➜  segfaulttest julia-dev --project=. segfaulttest.jl
TRAP1

@stevengj (Member)

I tried release-1.6/de69b02a48 and still no luck at reproducing.

Is IJulia crashing for you with #985 too?

@carstenbauer (Member, Author)

@stevengj I just tested it. #985 works for me, i.e. restarting jupyter and launching a 1.6 kernel worked 5 out of 5 times.

@stevengj (Member)

In that case I'm going to merge #985, since getting something working is an urgent issue. We can open a new issue if people see intermittent crashes in actual practice.

@stevengj (Member) commented Feb 13, 2021

(The redirect_stdio changes described in #968 (comment) are a separate issue #986.)

@rgobbel (Contributor) commented Feb 13, 2021

I tried release-1.6/de69b02a48 and still no luck at reproducing.

Is IJulia crashing for you with #985 too?

After doing a deep clean of everything (rm -rf ~/.julia /Applications/Julia-*; uninstalling every possible version of libzmq), I get a working IJulia, but only if I run using IJulia; notebook(). Starting jupyter notebook from a shell command line and then running the kernel from Jupyter crashes with complaints about ZMQ precompilation failures. I don't want to open a new issue until I can be sure this isn't the result of some quirk in my environment, but thought I should mention it. Now trying harder to get a clean install of Jupyter, so we'll see.

@stevengj (Member)

running the kernel from Jupyter is crashing with complaints about ZMQ precompilation failures

(Sounds like a separate issue in any case from this one — the problems in this issue happened in processes where ZMQ was already precompiled.)

@carstenbauer (Member, Author) commented Feb 13, 2021

I tested both jupyter notebook and using IJulia; IJulia.notebook(). As I didn't delete my old installation, I had to make sure to start the jupyter notebook in the directory whose Project.toml had the correct dev --local IJulia (with #985 checked out). Otherwise I would get a warning that the kernel file specified in kernel.json and the one provided by IJulia (in my global default environment) don't match.

@fkguo commented Feb 14, 2021

I have the same problem with version 1.7.0-dev. jupyter notebook gives

[I 00:09:09.745 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
[I 00:09:18.752 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
[W 00:09:21.796 NotebookApp] Replacing stale connection: 4bdb1c3d-9fd4-42c3-a3d8-9ca6c416a387:00af73c1c2564009971baad4b813e97b
[I 00:09:27.759 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports

Here is the versioninfo

Julia Version 1.7.0-DEV.526
Commit 6468dcb04e (2021-02-13 02:44 UTC)
Platform Info:
  OS: Linux (x86_64-conda-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)

@fkguo commented Feb 14, 2021

Running IJulia in the debug mode gives the following error message:

 404 GET /static/components/MathJax/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf (127.0.0.1) 2.300000ms referer=http://127.0.0.1:8888/notebooks/test.ipynb
PROFILE = Dict{String, Any}("key" => "9f40421e-5dbfc8ccfee2d738844f7a2b", "transport" => "tcp", "signature_scheme" => "hmac-sha256", "shell_port" => 51045, "hb_port" => 36143, "control_port" => 40477, "ip" => "127.0.0.1", "stdin_port" => 59569, "iopub_port" => 60855, "kernel_name" => "julia-1.7")
ERROR: LoadError: ArgumentError: expecting stdout stream
Stacktrace:
 [1] (::Base.redirect_stdio)(io::IJulia.IJuliaStdio{Base.PipeEndpoint})
   @ IJulia ~/.julia/packages/IJulia/IDNmS/src/stdio.jl:30
 [2] init(args::Vector{String})
   @ IJulia ~/.julia/packages/IJulia/IDNmS/src/init.jl:109
 [3] top-level scope
   @ ~/.julia/packages/IJulia/IDNmS/src/kernel.jl:24
in expression starting at /home/name/.julia/packages/IJulia/IDNmS/src/kernel.jl:24
[I 00:35:55.311 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports

@carstenbauer (Member, Author)

@fkguo I would open a new issue because you are using Julia 1.7 / master while this issue is about 1.6.

(BTW, I can't reproduce on Julia#master, where even IJulia precompilation fails for me.)

@fkguo commented Feb 15, 2021

@crstnbr this seems to be the same issue as #986 for Julia 1.7
