
broker segfault during t0015-cron.t failures on arm7l #1176

Closed

garlick opened this issue Aug 31, 2017 · 10 comments

Comments


garlick commented Aug 31, 2017

On arm7l, the t0015-cron.t test reproducibly fails every subtest after the first.

A core file is generated indicating a SIGSEGV, but the stack always appears to be corrupt:

(gdb) bt
#0  0x76aa6630 in ?? ()
#1  0x76e3a478 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Test output

./t0015-cron.t --verbose
expecting success:
    flux module load cron

ok 1 - load cron module

expecting success:
    id=$(flux_cron interval 0.05s echo cron-test-xxx) &&
    sleep .1 &&
    flux cron stop ${id} &&
    flux dmesg | grep "cron-1.*test-xxx"  &&
    cron_entry_check ${id} stopped true

flux-cron stop: No such file or directory
flux-start: 0 (pid 31679) Segmentation fault
not ok 2 - cron interval create
#
#	    id=$(flux_cron interval 0.05s echo cron-test-xxx) &&
#	    sleep .1 &&
#	    flux cron stop ${id} &&
#	    flux dmesg | grep "cron-1.*test-xxx"  &&
#	    cron_entry_check ${id} stopped true
#

expecting success:
    flux cron list | grep "^ *1.*echo"

lua: /home/garlick/proj/flux-core/src/cmd/flux-cron:435: attempt to index upvalue 'f' (a nil value)
stack traceback:
	/home/garlick/proj/flux-core/src/cmd/flux-cron:435: in function 'handler'
	...roj/flux-core/src/bindings/lua/flux/Subcommander.lua:146: in function <...roj/flux-core/src/bindings/lua/flux/Subcommander.lua:127>
	(tail call): ?
	/home/garlick/proj/flux-core/src/cmd/flux-cron:592: in main chunk
	[C]: ?
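
(The lua error is presumably just collateral damage: with the broker dead, flux-cron's handle upvalue 'f' was never set, so the list handler ends up indexing a nil value.)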

Last few lines of t0015-cron.broker.log

2017-08-31T16:13:16.438126Z broker.info[0]: Run level 2 starting
2017-08-31T16:13:16.442506Z broker.info[0]: runlevel 2 (300.0s) timer started
2017-08-31T16:13:16.878383Z broker.debug[0]: insmod cron
2017-08-31T16:13:17.084550Z cron.info[0]: cron-1[echo]: rank=0: command="echo cron-test-xxx": "cron-test-xxx"
2017-08-31T16:13:17.133968Z cron.info[0]: cron-1[echo]: rank=0: command="echo cron-test-xxx": "cron-test-xxx"
2017-08-31T16:13:17.187336Z cron.info[0]: cron-1[echo]: rank=0: command="echo cron-test-xxx": "cron-test-xxx"
2017-08-31T16:13:17.235360Z cron.info[0]: cron-1[echo]: rank=0: command="echo cron-test-xxx": "cron-test-xxx"
2017-08-31T16:13:17.285574Z cron.info[0]: cron-1[echo]: rank=0: command="echo cron-test-xxx": "cron-test-xxx"

grondo commented Aug 31, 2017

Does valgrind work on arm7l?


garlick commented Aug 31, 2017

Good question. It's available and packaged, but it seems to spew a lot of nonsense when driven by our sharness test, culminating in an illegal instruction and leaving a "vgcore" in the trash directory.

==32538== ERROR SUMMARY: 180685 errors from 262 contexts (suppressed: 2515 from 5)
Illegal instruction
not ok 1 - valgrind reports no new errors on single broker run


garlick commented Aug 31, 2017

Many of these are printed, for example:

==386== Memcheck, a memory error detector
==386== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==386== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
==386== Command: /home/garlick/proj/flux-core/src/broker/.libs/lt-flux-broker --shutdown-grace=4 /home/garlick/proj/flux-core/t/valgrind/valgrind-workload.sh 10
==386==
==386== Conditional jump or move depends on uninitialised value(s)
==386==    at 0x4019AB0: index (strchr.S:99)
==386==    by 0x4008653: expand_dynamic_string_token (dl-load.c:425)
==386==    by 0x4009323: _dl_map_object (dl-load.c:2538)
==386==    by 0x4000FD3: map_doit (rtld.c:719)
==386==    by 0x401049F: _dl_catch_error (dl-error.c:187)
==386==    by 0x4000F5B: do_preload (rtld.c:908)
==386==    by 0x4001CA3: handle_ld_preload (rtld.c:1006)
==386==    by 0x40039B3: dl_main (rtld.c:1750)
==386==    by 0x401713B: _dl_sysdep_start (dl-sysdep.c:249)
==386==    by 0x400137B: _dl_start_final (rtld.c:424)
==386==    by 0x400160B: _dl_start (rtld.c:652)
==386==    by 0x4000CAF: ??? (in /lib/arm-linux-gnueabihf/ld-2.19.so)
==386==
{
   <insert_a_suppression_name_here>
   Memcheck:Cond
   fun:index
   fun:expand_dynamic_string_token
   fun:_dl_map_object
   fun:map_doit
   fun:_dl_catch_error
   fun:do_preload
   fun:handle_ld_preload
   fun:dl_main
   fun:_dl_sysdep_start
   fun:_dl_start_final
   fun:_dl_start
   obj:/lib/arm-linux-gnueabihf/ld-2.19.so
}
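
(These are the suppression blocks valgrind emits when run with --gen-suppressions; they could be collected into a file and passed back via --suppressions=FILE to quiet the dynamic-loader noise.)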


grondo commented Aug 31, 2017

I was thinking we could just replace t/valgrind/valgrind-workload.sh with a set of commands that mimic where the cron tests were failing, then run ./t5000-valgrind.sh -d -v and see if it catches some kind of stack access error or some other hint.

I'd do this myself but don't have access to an arm7l system just now...


grondo commented Aug 31, 2017

Another idea might be to see if we can reproduce the crash with interactive commands. If so, start a single broker session, attach to flux-broker with gdb, then reproduce the crash.

I tried running under valgrind and poking at some cron tests, but no errors yet.


grondo commented Aug 31, 2017

> I was thinking just replace t/valgrind/valgrind-workload.sh ...

Oh, sorry, I misread your initial statement about valgrind. Ignore what I said above...


garlick commented Aug 31, 2017

I've been unsuccessful trying to reproduce this by running cron commands interactively. I may look at it again tonight - heading in very late now!

On the one hand, I don't want to waste our time running down problems on a "toy" platform (especially if they are merely test issues). On the other hand, the altered timing sometimes uncovers real bugs that might pop up, say, on a busy compute node, so I wanted to at least capture this one for now.


garlick commented Sep 1, 2017

Ran again, and this time got a backtrace from the core file:

(gdb) bt full
#0  strlen () at ../ports/sysdeps/arm/armv6/strlen.S:26
No locals.
#1  0x76e76478 in read_string (s=s@entry=0x744feaf0, ap=ap@entry=0x744feaec,
    purpose=0x76f49f10 <__stack_chk_guard> "",
    purpose@entry=0x76e7b458 "object key", out_len=out_len@entry=0x744feaac,
    ours=ours@entry=0x744feab0) at pack_unpack.c:148
        t = <optimized out>
        strbuff = {value = 0x3 <error: Cannot access memory at address 0x3>,
          length = 1991106072, size = 0}
        str = 0x5 <error: Cannot access memory at address 0x5>
        length = <optimized out>
#2  0x76e773dc in pack_object (ap=0x744feaec, s=0x744feaf0) at pack_unpack.c:228
        key = <optimized out>
        len = 5
        ours = 0
        value = <optimized out>
        object = 0x73b0c5c0
#3  pack (s=s@entry=0x744feaf0, ap=0x744feaec, ap@entry=0x744feae4)
    at pack_unpack.c:298
No locals.
#4  0x76e77790 in json_vpack_ex (error=error@entry=0x0, flags=flags@entry=0,
    fmt=0x74f59ee8 "{ s:f, s:f, s:f, s:i, s:i, s:i, s:i, s:i, s:i }", fmt@entry=0x0,
    ap=..., ap@entry=...) at pack_unpack.c:705
        s = {start = 0x74f59ee8 "{ s:f, s:f, s:f, s:i, s:i, s:i, s:i, s:i, s:i }",
          fmt = 0x74f59f01 ", s:i, s:i, s:i, s:i }", prev_token = {line = 1,
            column = 23, pos = 23, token = 115 's'}, token = {line = 1, column = 23,
            pos = 23, token = 115 's'}, next_token = {line = 1, column = 25,
            pos = 25, token = 105 'i'}, error = 0x0, flags = 0, line = 1,
          column = 25, pos = 25}
        ap_copy = {__ap = 0x744febb4}
        value = <optimized out>
#5  0x76e77904 in json_pack (
    fmt=0x74f59ee8 "{ s:f, s:f, s:f, s:i, s:i, s:i, s:i, s:i, s:i }")
    at pack_unpack.c:739
        value = <optimized out>
        ap = {__ap = 0x744feb7c}
#6  0x74f2bfc8 in cron_stats_to_json (stats=0x73b0b6a0) at cron.c:608
No locals.
#7  cron_entry_to_json (e=e@entry=0x73b0b698) at cron.c:642
        t = <optimized out>
        o = 0x73b0b4c0
        to = <optimized out>
        tasks = <optimized out>
#8  0x74f2c304 in cron_stop_handler (h=0x73b00d40, w=<optimized out>,
    msg=0x73b0c4f8, arg=<optimized out>) at cron.c:832
        e = 0x73b0b698
        ctx = <optimized out>
        out = 0x0
        json_str = 0x0
        saved_errno = 0
        rc = <optimized out>
#9  0x76eb0d10 in call_handler (w=w@entry=0x73b0b288, msg=msg@entry=0x73b0c4f8)
    at msg_handler.c:302
        rolemask = 1
        matchtag = 1951395016
#10 0x76eb0e88 in dispatch_message (type=1, msg=0x73b0c4f8, d=0x73b0aaf0)
    at msg_handler.c:325
        w = 0x73b0b288
        match = <optimized out>
#11 handle_cb (r=0x73b0a320, hw=<optimized out>, revents=<optimized out>,
    arg=0x73b0aaf0) at msg_handler.c:391
        d = 0x73b0aaf0
        msg = 0x73b0c4f8
        rc = -1
        type = 1
        topic = 0x73b0c7a0 "cron.stop"
---Type <return> to continue, or q <return> to quit---
        __FUNCTION__ = "handle_cb"
#12 0x76ed6ca4 in ev_invoke_pending (loop=0x73b0a330) at ev.c:3314
        p = <optimized out>
#13 0x76ed9ef4 in ev_run (loop=0x73b0a330, flags=1995284212) at ev.c:3717
        flags = 1995284212
        loop = 0x73b0a330
#14 0x76eb00c8 in flux_reactor_run (r=0x73b0a320, flags=flags@entry=0)
    at reactor.c:128
        ev_flags = 0
        count = <optimized out>
#15 0x74f2d40c in mod_main (h=0x73b00d40, ac=0, av=<optimized out>) at cron.c:952
        rc = -1
#16 0x000190e8 in module_thread (arg=0x1070138) at module.c:158
        p = 0x1070138
        __PRETTY_FUNCTION__ = "module_thread"
        signal_set = {__val = {2147483647, 4294967294,
            4294967295 <repeats 30 times>}}
        errnum = 0
        uri = 0x0
        av = 0x73b0b2e8
        rankstr = 0x0
        ac = 0
        mod_main_errno = 0
        msg = <optimized out>
#17 0x76c53e90 in start_thread (arg=0x744ff450) at pthread_create.c:311
        pd = 0x744ff450
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {1951397048, 0, 2122955112, 338,
                17236320, 1989539008, 0, 1951395724, 949470388, 974984324, 0,
                1072693248, 0, -1074790400, 0, 1071644672, 1564570730, 1104567544,
                0, 0, -600214679, 1095402171, 956190287, 1075054507, 630506365,
                1420970413, 0 <repeats 38 times>}, mask_was_saved = 0}}, priv = {
            pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0,
              canceltype = 0}}}
        not_first_call = 0
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#18 0x76b3b598 in ?? () at ../ports/sysdeps/unix/sysv/linux/arm/nptl/../clone.S:92
   from /lib/arm-linux-gnueabihf/libc.so.6
No locals.
Backtrace stopped: previous frame identical to this frame (corrupt stack?)


garlick commented Sep 1, 2017

Hmm, possibly due to uint64_t values being packed as "i".

This seems to fix it:

diff --git a/src/modules/cron/cron.c b/src/modules/cron/cron.c
index 7eb516e..862c42d 100644
--- a/src/modules/cron/cron.c
+++ b/src/modules/cron/cron.c
@@ -605,7 +605,7 @@ error:

 static json_t *cron_stats_to_json (struct cron_stats *stats)
 {
-    return json_pack ("{ s:f, s:f, s:f, s:i, s:i, s:i, s:i, s:i, s:i }",
+    return json_pack ("{ s:f, s:f, s:f, s:I, s:I, s:I, s:I, s:I, s:I }",
                       "ctime", stats->ctime,
                       "starttime", stats->starttime,
                       "lastrun", stats->lastrun,


grondo commented Sep 1, 2017

Nice find. Sorry!

garlick added a commit to garlick/flux-core that referenced this issue Sep 1, 2017
Problem: cron module segfaults during travis testing on arm7l.

Change a json_pack() format string to use "I" not "i" to
refer to uint64_t values.

Fixes flux-framework#1176