ZMQ tests seem to fail with zeromq 4.1 #83

Closed
mbaz opened this issue Jun 18, 2015 · 13 comments

Comments

@mbaz

mbaz commented Jun 18, 2015

When upgrading from zeromq 4.0 to 4.1 on my system, ZMQ started to fail. I tried running its tests (with a freshly added ZMQ.jl, on Julia 0.3.9):

~/.julia/ZMQ/test $ julia ./runtests.jl
Testing with ZMQ version 4.1.2

signal (11): Segmentation fault
allocobj at /home/miguel/disk2/Sources/julia/usr/bin/../lib/libjulia.so (unknown line)
jl_alloc_tuple_uninit at /home/miguel/disk2/Sources/julia/usr/bin/../lib/libjulia.so (unknown line)
jl_alloc_tuple at /home/miguel/disk2/Sources/julia/usr/bin/../lib/libjulia.so (unknown line)
jl_f_apply at /home/miguel/disk2/Sources/julia/usr/bin/../lib/libjulia.so (unknown line)

The entire error is at https://gist.github.com/mbaz/b1018299d025d81e4efd

@tkelman
Contributor

tkelman commented Jun 18, 2015

Can you identify the specific test that causes the segfault? Can you get a backtrace by running julia-debug under gdb?

@mbaz
Author

mbaz commented Jun 19, 2015

The segfault is produced at line 61:

@assert (bytestring(ZMQ.recv(s1)) == "test request")

This is a minimal test case:

using ZMQ, Compat
ctx2=Context(1)
s1=Socket(ctx2, REP)
s2=Socket(ctx2, REQ)
ZMQ.bind(s1, "tcp://*:5555")
ZMQ.connect(s2, "tcp://localhost:5555")
ZMQ.send(s2, Message("test request"))
ZMQ.recv(s1)

How do I get julia-debug? I did a backtrace with regular julia; I know it probably won't be of much help, but just in case:

julia> ZMQ.recv(s1)
[New Thread 0x7fffeb76a700 (LWP 22063)]
[New Thread 0x7fffebf6b700 (LWP 22062)]
[New Thread 0x7ffff1be5700 (LWP 22060)]

Program received signal SIGSEGV, Segmentation fault.
pool_alloc (p=0x7ffff7dafec0 <norm_pools+288>) at gc.c:507
507     p->freelist = p->freelist->next;
(gdb) bt
#0  pool_alloc (p=0x7ffff7dafec0 <norm_pools+288>) at gc.c:507
#1  allocobj (sz=sz@entry=56) at gc.c:1036
#2  0x00007ffff6e7dfbe in jl_method_list_insert (pml=0x4ff26b8, type=type@entry=0x27be858, 
    method=method@entry=0x57f4e40, tvars=0x68f760, check_amb=0) at gf.c:1165
#3  0x00007ffff6e7e39e in jl_method_cache_insert (mt=mt@entry=0xc39890, type=0x27be858, 
    method=0x57f4e40) at gf.c:360
#4  0x00007ffff6e7eae4 in cache_method (mt=mt@entry=0xc39890,    type=type@entry=0x27be858, 
method=<optimized out>, decl=<optimized out>, sparams=<optimized out>) at gf.c:819
#5  0x00007ffff6e7fae3 in jl_mt_assoc_by_type (mt=mt@entry=0xc39890, tt=0x27be858, 
cache=cache@entry=1, inexact=inexact@entry=0) at gf.c:950
#6  0x00007ffff6e807b2 in jl_apply_generic (F=0xc496f0, args=0x7fffffffdc20, nargs=<optimized out>)
    at gf.c:1419
#7  0x00007ffff4df394e in ?? ()
#8  0x0000000000000001 in ?? ()
#9  0x0000000000000018 in ?? ()
#10 0x00007fffffffda38 in ?? ()
#11 0x00000000056251b0 in ?? ()
#12 0x0000000000000000 in ?? ()
(gdb) 

@tkelman
Contributor

tkelman commented Jun 19, 2015

> How do I get julia-debug?

That depends on how you got Julia. Most binary distribution channels should include julia-debug, if not in the same package then possibly in a separate -debug package. If you built from source, run make debug. Also, it's best practice to provide the output of versioninfo() with all bug reports. This is presumably Linux, but which distribution and what system compiler versions?

@mbaz
Author

mbaz commented Jun 19, 2015

Sorry for not providing all the info. Here it is. This is actually running inside virtualbox.

julia> versioninfo()
Julia Version 0.3.9
Commit 31efe69* (2015-05-30 11:24 UTC)
Platform Info:
System: Linux (x86_64-unknown-linux-gnu)
CPU: Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Nehalem)
LAPACK: libopenblas
LIBM: libopenlibm
LLVM: libLLVM-3.3

@mbaz
Author

mbaz commented Jun 19, 2015

I built Julia from source; thanks for the info on building julia-debug. I ran a backtrace but it looks the same as the one I posted above. Please let me know what you need me to do to help.

@tkelman
Contributor

tkelman commented Jun 19, 2015

The method cache code in the backtrace looks related to some code that changed recently on Julia master, but since you're on 0.3.9 it might be unrelated. I'm not sure about enough of the details of how ZMQ is implemented.

Could you try with Julia nightly, if you don't mind? You can download a binary from https://status.julialang.org/download/linux-x86_64 if you don't want to build from source.

@mbaz
Author

mbaz commented Jun 19, 2015

I don't mind trying with nightly. I'll try to get to it in the next couple of days. Thanks for looking at this issue.

@mbaz
Author

mbaz commented Jun 23, 2015

I just tried with latest master. The segmentation fault still occurs, but it seems to hang Julia. I have to use kill -9; kill by itself is not enough.

julia> versioninfo()
Julia Version 0.4.0-dev+5530
Commit fc604d1* (2015-06-22 21:44 UTC)
Platform Info:
System: Linux (x86_64-unknown-linux-gnu)
CPU: Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Nehalem)
LAPACK: libopenblas
LIBM: libopenlibm
LLVM: libLLVM-3.3

@mbaz
Author

mbaz commented Jun 26, 2015

The problem still exists with Julia 0.3.10.

julia> versioninfo()
Julia Version 0.3.10
Commit c8ceeef* (2015-06-24 13:54 UTC)
Platform Info:
System: Linux (x86_64-unknown-linux-gnu)
CPU: Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Nehalem)
LAPACK: libopenblas
LIBM: libopenlibm
LLVM: libLLVM-3.3

julia> include("mintestcase.jl")
Testing with ZMQ version 4.1.2

signal (11): Segmentation fault
allocobj at /home/miguel/disk2/Sources/julia/usr/bin/../lib/libjulia.so (unknown line)
unknown function (ip: 1819092798)
unknown function (ip: 1819093790)
unknown function (ip: 1819095652)
unknown function (ip: 1819099747)
jl_apply_generic at /home/miguel/disk2/Sources/julia/usr/bin/../lib/libjulia.so (unknown line)
unknown function (ip: 1819444983)
unknown function (ip: 1819610173)
unknown function (ip: 1819610334)
uv__io_poll at /home/miguel/disk2/Sources/julia/usr/bin/../lib/libjulia.so (unknown line)
uv_run at /home/miguel/disk2/Sources/julia/usr/bin/../lib/libjulia.so (unknown line)
process_events at ./stream.jl:537
wait at ./task.jl:287
wait at ./task.jl:194
wait_full at ./multi.jl:602
take! at ./multi.jl:776
jl_apply_generic at /home/miguel/disk2/Sources/julia/usr/bin/../lib/libjulia.so (unknown line)
take_ref at ./multi.jl:783
jlcall_take_ref_18992 at /home/miguel/disk2/Sources/julia/usr/bin/../lib/julia/sys.so (unknown line)
jl_apply_generic at /home/miguel/disk2/Sources/julia/usr/bin/../lib/libjulia.so (unknown line)
jl_f_apply at /home/miguel/disk2/Sources/julia/usr/bin/../lib/libjulia.so (unknown line)
call_on_owner at ./multi.jl:749
send_to_backend at REPL.jl:573
send_to_backend at REPL.jl:570
jl_apply_generic at /home/miguel/disk2/Sources/julia/usr/bin/../lib/libjulia.so (unknown line)
anonymous at REPL.jl:584
run_interface at ./LineEdit.jl:1379
jlcall_run_interface_18860 at /home/miguel/disk2/Sources/julia/usr/bin/../lib/julia/sys.so (unknown line)
jl_apply_generic at /home/miguel/disk2/Sources/julia/usr/bin/../lib/libjulia.so (unknown line)
run_frontend at ./REPL.jl:818
run_repl at ./REPL.jl:169
jl_apply_generic at /home/miguel/disk2/Sources/julia/usr/bin/../lib/libjulia.so (unknown line)
_start at ./client.jl:400
jlcall__start_17481 at /home/miguel/disk2/Sources/julia/usr/bin/../lib/julia/sys.so (unknown line)
jl_apply_generic at /home/miguel/disk2/Sources/julia/usr/bin/../lib/libjulia.so (unknown line)
unknown function (ip: 4200466)
julia_trampoline at /home/miguel/disk2/Sources/julia/usr/bin/../lib/libjulia.so (unknown line)
unknown function (ip: 4199453)
__libc_start_main at /usr/lib/libc.so.6 (unknown line)
unknown function (ip: 4199513)
unknown function (ip: 0)
Segmentation fault (core dumped)

@yuyichao
Contributor

yuyichao commented Jul 7, 2015

Hmm, I seem to be getting this one only after a recent upgrade on master. Will see if I can figure anything out.

@yuyichao
Contributor

yuyichao commented Jul 7, 2015

I think I've found the problem.

The segfault happens after constructing a Message, in particular after calling zmq_msg_init_data.

From the zmq header (from 4.1.2 on ArchLinux), zmq_msg_t (the structure mirrored by ZMQ.Message) is defined as

typedef struct zmq_msg_t {unsigned char _ [64];} zmq_msg_t;

(The real definition, zmq::msg_t, is too long and I haven't checked it yet.)

However, on a 64-bit machine, sizeof(Message) returns 48, so I think there's an out-of-bounds access during this ccall: zmq treats the buffer as 64 bytes, while the Julia side provides only 48. Adding some padding to Message seems to suppress the error.
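
The size mismatch described here can be illustrated outside Julia. What follows is a minimal Python ctypes sketch, not the ZMQ.jl code. ZmqMsg41 mirrors the 64-byte typedef quoted from the 4.1.2 header. Mirror48 and PaddedMirror are hypothetical stand-ins for a 48-byte caller-side struct and a padded version of it, and overrun_bytes is a helper invented for this illustration. Only the sizes 64 and 48 come from the thread; the real field layout of Message is not modeled.

```python
import ctypes


class ZmqMsg41(ctypes.Structure):
    # Mirrors: typedef struct zmq_msg_t {unsigned char _ [64];} zmq_msg_t;
    _fields_ = [("_", ctypes.c_ubyte * 64)]


class Mirror48(ctypes.Structure):
    # Hypothetical caller-side mirror that is only 48 bytes long.
    _fields_ = [("data", ctypes.c_ubyte * 48)]


class PaddedMirror(ctypes.Structure):
    # Same hypothetical mirror with 16 bytes of padding appended.
    _fields_ = [("data", ctypes.c_ubyte * 48),
                ("pad", ctypes.c_ubyte * 16)]


def overrun_bytes(library_struct, mirror_struct):
    """Bytes the library could touch past the end of the caller's buffer."""
    diff = ctypes.sizeof(library_struct) - ctypes.sizeof(mirror_struct)
    return max(0, diff)


if __name__ == "__main__":
    print("unpadded overrun:", overrun_bytes(ZmqMsg41, Mirror48))
    print("padded overrun:  ", overrun_bytes(ZmqMsg41, PaddedMirror))
```

A check of this kind, run once when the wrapper loads, would flag a header change such as the one between zeromq 4.0 and 4.1 before any message is constructed.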

@carnaval's ccall sanitizer idea can probably catch this...

It might also be worth checking this: http://upstream.rosalinux.ru/versions/zeromq.html

@yuyichao
Contributor

yuyichao commented Jul 7, 2015

Actually, since the region we expect zmq to write to is smaller than the structure itself (we append a pointer at the end), the ccall sanitizer might not be able to fully catch it...
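
To make this caveat concrete, here is a small Python sketch under stated assumptions. It models a bounds check that only knows the total size of the struct passed to a foreign call. The layout is hypothetical: a 32-byte region that the library is expected to fill, followed by caller-owned fields, for 48 bytes in total. Only the 48- and 64-byte totals come from the thread; the 32-byte figure, the 40-byte write, and all names below are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MirrorLayout:
    library_region: int  # bytes the callee is expected to write (assumed)
    total_size: int      # size of the whole caller-side struct


def flagged_by_size_check(layout, bytes_written):
    """Model of a check that only knows the struct's total size."""
    return bytes_written > layout.total_size


def clobbers_caller_fields(layout, bytes_written):
    """True when the write runs past the region reserved for the library."""
    return bytes_written > layout.library_region


# Hypothetical layout: 32 bytes for the library, 48 bytes overall.
layout = MirrorLayout(library_region=32, total_size=48)

if __name__ == "__main__":
    for written in (32, 40, 64):
        print(written,
              flagged_by_size_check(layout, written),
              clobbers_caller_fields(layout, written))
```

In this model a 64-byte write exceeds the total size and is flagged. A 40-byte write would overwrite the trailing caller-owned fields while staying inside the 48-byte struct, so a size-only check would miss it.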

@mbaz
Author

mbaz commented Jul 7, 2015

The problem is fixed -- thanks!
