SegFault caused by tupocolypse #11313

Closed
yuyichao opened this issue May 17, 2015 · 22 comments · Fixed by #11314

Comments

@yuyichao
Contributor

Saw this when replying to a mailing list email.

Edit: See #11313 (comment) below for a minimal example. First appears after #10380 was merged.

Script to reproduce

using DataArrays

xd = @data rand(1000000)
xd = @data rand(1000000)
xd = @data rand(1000000)
xd = @data rand(1000000)
xd = @data rand(1000000)
xd = @data rand(1000000)
xd = @data rand(1000000)
xd = @data rand(1000000)
xd = @data rand(1000000)
xd = @data rand(1000000)
xd = @data rand(1000000)
xd = @data rand(1000000)
xd = @data rand(1000000)
xd = @data rand(1000000)
xd = @data rand(1000000)
xd = @data rand(1000000)
xd = @data rand(1000000)
xd = @data rand(1000000)
xd = @data rand(1000000)

Putting this in a function or running it in a loop does not trigger the issue (even if the variable is declared global).

Backtrace

#0  jl_object_id (v=v@entry=0x7ffdf568c888) at builtins.c:1123
#1  0x00007ffff6a880d9 in typekey_compare (n=2, key=0x7fffffffc600, 
    tt=0x7ffdf3006050) at jltypes.c:1770
#2  lookup_type_idx (key=key@entry=0x7fffffffc600, n=n@entry=2, 
    ordered=ordered@entry=1, tn=0x7ffdf23a4290, tn=0x7ffdf23a4290)
    at jltypes.c:1806
#3  0x00007ffff6a8a4dc in lookup_type (n=2, key=0x7fffffffc600, 
    tn=0x7ffdf23a4290) at jltypes.c:1833
#4  inst_datatype (dt=0x7ffdf239c370, p=p@entry=0x0, 
    iparams=iparams@entry=0x7fffffffc600, ntp=ntp@entry=2, 
    cacheable=cacheable@entry=1, isabstract=isabstract@entry=0, n=0, env=0x0, 
    stack=0x0) at jltypes.c:1942
#5  0x00007ffff6a8c050 in jl_inst_concrete_tupletype_v (
    p=p@entry=0x7fffffffc600, np=np@entry=2) at jltypes.c:2066
#6  0x00007ffff6a938bc in arg_type_tuple (args=args@entry=0x7fffffffc748, 
    nargs=nargs@entry=2) at gf.c:1455
#7  0x00007ffff6a9795b in jl_apply_generic (F=0x7ffdf2a3b110, 
    args=0x7fffffffc748, nargs=<optimized out>) at gf.c:1747
#8  0x00007ffff6aea7d3 in jl_apply (nargs=2, args=0x7fffffffc748, 
    f=0x7ffdf2a3b110) at julia.h:1290
#9  do_call (f=f@entry=0x7ffdf2a3b110, args=args@entry=0x7ffdf59a5008, 
    nargs=nargs@entry=2, eval0=eval0@entry=0x0, 
    locals=locals@entry=0x7fffffffd340, nl=nl@entry=0, ngensym=1)
    at interpreter.c:65
#10 0x00007ffff6ae9b50 in eval (e=0x7ffdf5587150, 
    locals=locals@entry=0x7fffffffd340, nl=nl@entry=0, ngensym=ngensym@entry=1)
    at interpreter.c:212
#11 0x00007ffff6aea7b9 in do_call (f=0x7ffdf37940b0, 
    args=args@entry=0x7ffdf59a4f40, nargs=nargs@entry=3, 
    eval0=eval0@entry=0x7ffdf55c9750, locals=locals@entry=0x7fffffffd340, 
    nl=nl@entry=0, ngensym=1) at interpreter.c:63
#12 0x00007ffff6ae96b6 in eval (e=0x7ffdf5587110, 
    locals=locals@entry=0x7fffffffd340, nl=nl@entry=0, ngensym=ngensym@entry=1)
    at interpreter.c:214
#13 0x00007ffff6ae9971 in eval (e=e@entry=0x7ffdf55870f0, 
    locals=locals@entry=0x7fffffffd340, nl=nl@entry=0, ngensym=ngensym@entry=1)
    at interpreter.c:218
#14 0x00007ffff6aeae61 in eval_body (stmts=stmts@entry=0x7ffdf594fc00, 
    locals=locals@entry=0x7fffffffd340, nl=nl@entry=0, ngensym=ngensym@entry=1, 
    toplevel=1, start=0) at interpreter.c:579
#15 0x00007ffff6aeb2a9 in jl_toplevel_eval_body (stmts=0x7ffdf594fc00)
    at interpreter.c:512
#16 0x00007ffff6afd727 in jl_toplevel_eval_flex (e=<optimized out>, 
    fast=fast@entry=1) at toplevel.c:501
#17 0x00007ffff6afde44 in jl_toplevel_eval_flex (fast=1, e=<optimized out>)
    at toplevel.c:551
#18 jl_parse_eval_all (
    fname=fname@entry=0x7ffdf4146b20 "/home/yuyichao/tmp/julia/dataarray.jl", 
    len=<optimized out>) at toplevel.c:555
#19 0x00007ffff6afe04a in jl_load (
    fname=0x7ffdf4146b20 "/home/yuyichao/tmp/julia/dataarray.jl")
    at toplevel.c:594
#20 0x00007ffff3ea7db0 in julia_include_23643 (fname=<optimized out>)
    at boot.jl:252
#21 0x00007ffff6a978fb in jl_apply (nargs=1, args=0x7fffffffd870, 
    f=<optimized out>) at julia.h:1290
#22 jl_apply_generic (F=0x7ffdf2400330, args=0x7fffffffd870, 
    nargs=<optimized out>) at gf.c:1743
#23 0x00007ffff7e6d23a in julia_include_from_node1_44369 (path=<optimized out>)
    at loading.jl:134
#24 0x00007ffff6a979c5 in jl_apply (nargs=1, args=0x7fffffffda50, 
    f=<optimized out>) at julia.h:1290
#25 jl_apply_generic (F=0x7ffdf30dd5b0, args=0x7fffffffda50, 
    nargs=<optimized out>) at gf.c:1767
#26 0x00007ffff40dddd6 in julia_process_options_42349 (opts=<optimized out>, 
    args=<optimized out>) at client.jl:310
#27 0x00007ffff40dca73 in julia__start_42348 () at client.jl:409
#28 0x00007ffff40dd409 in jlcall.start_42348 () from /usr/lib/julia/sys.so
#29 0x00007ffff6a978fb in jl_apply (nargs=0, args=0x0, f=<optimized out>)
    at julia.h:1290
#30 jl_apply_generic (F=0x7ffdf3b32e70, args=0x0, nargs=<optimized out>)
    at gf.c:1743
#31 0x0000000000401ba3 in jl_apply (nargs=0, args=0x0, f=<optimized out>)
    at ../src/julia.h:1290
#32 true_main (argc=<optimized out>, argv=0x7fffffffdf68) at repl.c:441
#33 0x000000000040171f in main (argc=1, argv=0x7fffffffdf68) at repl.c:507

GC_VERIFY did not catch it.
Still investigating.

@yuyichao
Contributor Author

OK

Working (SegFaulting) example without DataArrays

macro m(ex)
    quote
        tmp = $(esc(ex))
        (x->1, tmp)
    end
end

println(macroexpand(:(@m rand(1000000)))) # Adding this delays the SegFault

xd = @m rand(1000000)
xd = @m rand(1000000)
xd = @m rand(1000000) # Removing this might suppress the SegFault

Might be related to macro handling.

Edit: Updated to the smallest example I can come up with. Very likely related to macro/global handling, because trivial changes to the code change whether it segfaults.

@yuyichao
Contributor Author

Technically a regression, since I cannot reproduce it on 0.3. (Although I guess whether it is a regression doesn't matter that much for a SegFault...)

@yuyichao
Contributor Author

Goes back at least to 803193e

@yuyichao
Contributor Author

Actually, probably not macro-related (it might just be some random interference from memory usage, etc.).

I can reproduce it with this simple script:

xd = begin
    tmp1 = rand(1000000)
    (x->1, tmp1)
end

@yuyichao
Contributor Author

The backtraces for these different versions differ slightly, but they all end up in jl_object_id.

@yuyichao
Contributor Author

The direct cause seems to be that jl_typeof(v) is NULL, where v is the argument to jl_object_id.

@yuyichao
Contributor Author

Just went back and checked a few commits, and sure enough it was introduced by #10380.

So @JeffBezanson

I finally remember the issue number of tupocolypse by heart now. Not sure if it is a good sign or a bad one......

@yuyichao changed the title from "SegFault using DataArrays" to "SegFault caused by tupocolypse" on May 17, 2015
@timholy
Member

timholy commented May 17, 2015

Nice debugging, @yuyichao.

@yuyichao
Contributor Author

@timholy I'll probably stop here for now, unless no one fixes it in a few days.

carnaval added a commit that referenced this issue May 17, 2015
in interpreted global var assignment. Fix #11313.
@carnaval
Contributor

Found it. Thanks a lot. Those are so much easier to track down with a small deterministic repro.

@yuyichao
Contributor Author

Wow. That's fast. Thanks. A little surprised that my full-collect-every-time didn't catch it.....

@carnaval
Contributor

It should have if: you collected at each pool alloc && the rhs of the assign && this thing is not rooted anywhere else && this is the first time this particular global is assigned to.

In normal operation this can only happen if the rhs allocates something big enough to get right up to the collection interval, but not quite past it, so that allocating the binding is what triggers the GC.
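
For readers following along, the sequence described above can be sketched with a toy mark-and-sweep model in C. This is only an illustration under stated assumptions, not Julia's runtime code: `Object`, `toy_alloc`, `toy_collect`, `toy_assign_new_global`, the sizes 90 and 20, and the interval of 100 units are all made up for the example. It shows an unrooted rhs being swept when the allocation of the binding is what pushes the heap past the collection interval.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in Julia's runtime. */
enum { HEAP_SLOTS = 4, MAX_ROOTS = 4, GC_INTERVAL = 100 };

typedef struct {
    int id;      /* 0 means the slot is free */
    bool marked; /* set during the mark phase */
} Object;

static Object heap[HEAP_SLOTS];
static Object *roots[MAX_ROOTS];
static size_t nroots;
static size_t allocated; /* units allocated since the last collection */
static int next_id = 1;

/* Mark everything on the root list, then free every unmarked slot. */
void toy_collect(void) {
    for (size_t i = 0; i < nroots; i++)
        roots[i]->marked = true;
    for (size_t i = 0; i < HEAP_SLOTS; i++) {
        if (!heap[i].marked)
            heap[i].id = 0;
        heap[i].marked = false;
    }
    allocated = 0;
}

/* Allocate `size` units, collecting first if that would exceed the interval. */
Object *toy_alloc(size_t size) {
    if (allocated + size > GC_INTERVAL)
        toy_collect();
    allocated += size;
    for (size_t i = 0; i < HEAP_SLOTS; i++) {
        if (heap[i].id == 0) {
            heap[i].id = next_id++;
            return &heap[i];
        }
    }
    assert(!"toy heap exhausted");
    return NULL;
}

/* Models the first assignment to a global: evaluate the rhs, then allocate
   the binding. Returns true if the rhs object is still alive afterwards. */
bool toy_assign_new_global(bool root_rhs) {
    /* Start from an empty heap so each call is independent. */
    for (size_t i = 0; i < HEAP_SLOTS; i++)
        heap[i] = (Object){0, false};
    nroots = 0;
    allocated = 0;

    Object *rhs = toy_alloc(90); /* large value, just under the interval */
    int rhs_id = rhs->id;
    if (root_rhs)
        roots[nroots++] = rhs;   /* keep rhs reachable during the next allocation */

    Object *binding = toy_alloc(20); /* exceeds the interval: collection runs here */
    (void)binding;

    /* If rhs was swept, its slot was reused by the binding and the id differs. */
    bool alive = (rhs->id == rhs_id);
    nroots = 0;
    return alive;
}
```

In this model, `toy_assign_new_global(false)` returns `false` because the rhs slot is swept and then reused by the binding, which loosely mirrors a stale pointer ending up at unrelated memory. `toy_assign_new_global(true)` returns `true` because the rhs is on the root list while the binding is allocated.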

@yuyichao
Contributor Author

I guess I've only run that mode for very limited cases, and it was probably just (un)lucky that this pattern didn't appear...

@yuyichao
Contributor Author

@carnaval Also curious how you found the fix. It's not a write barrier issue, so (as I tried) GC_VERIFY didn't help. Did you turn off randomization and track the pointer?

@carnaval
Contributor

Yep. That's why I like short examples: since the program doesn't allocate much, there is almost no address reuse, so it's easier.

@tkelman
Contributor

tkelman commented May 17, 2015

Nice work to both of you. Are these repro examples small/fast-running enough to be worth adding as tests?

@yuyichao
Contributor Author

Probably not this particular one.

  1. I don't think this is something that will easily be broken unless someone does a major rewrite (e.g. WIP: redesign of tuples and tuple types #10380), and even in that case this doesn't feel like a particularly vulnerable place (although I might be wrong).
  2. This one is also pretty sensitive to the allocation pattern and to what has been done before (see @carnaval's comment above on the conditions that trigger it), so it might not fit very well in a big test suite.

It is still on my todo list to clean up my GC debugging branch and make it easier to run some of these (stress) tests on buildbots. (Hopefully next week.)

@yuyichao
Contributor Author

It is small and fast-running though. Just paste the code I had above in #11313 (comment) into a file and run it, and you should reproduce this almost 100% of the time within one second. I just don't think it would help catch future bugs.

@tkelman
Contributor

tkelman commented May 17, 2015

Okay, sounds good. Does this look related to you, by any chance? https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.4793/job/vfqa988mt0tdr9ok

It looks like it started building immediately before I hit merge on the fix though.

@yuyichao
Contributor Author

The backtrace does look almost identical.

@yuyichao
Contributor Author

IIRC the CI merges master before the build? Might be worth restarting it. (Saw that you've already done that.)

@yuyichao
Contributor Author

Just confirmed that it does fix the original script (which uses DataArrays) as well. So fixed indeed =).

mbauman pushed a commit to mbauman/julia that referenced this issue Jun 6, 2015
in interpreted global var assignment. Fix JuliaLang#11313.
tkelman pushed a commit to tkelman/julia that referenced this issue Jun 6, 2015
in interpreted global var assignment. Fix JuliaLang#11313.