This repository has been archived by the owner on May 3, 2024. It is now read-only.

Link to Venky fork?!? #11

Closed
johnbent opened this issue Sep 15, 2020 · 1 comment

Comments

@johnbent
Contributor

The README is looking nice! However, it is linking to a Venky fork:
https://github.com/Seagate/cortx-motr/blob/dev/README.rst

Says this: "Refer Reading - list for complete information."

And then "Reading list" is a link to a file in Venky's fork: https://github.com/VenkyOS/cortx-motr/blob/dev/doc/reading-list.md and that file contains tons of 404 URLs.

@VenkyOS , please clean all this up. There should not be links to your fork. You should get reading-list.md added to cortx-motr repo and then you need to remove all the links to files which can't be found.

@nikitadanilov @max-seagate , please feel free to provide links to files that the community can read to learn about motr and Venky can get them into github if they aren't already.

@VenkyOS
Contributor

VenkyOS commented Sep 16, 2020

@johnbent

Comments have been addressed and a PR has been raised.

#373

@gauravchaudhari02 gauravchaudhari02 transferred this issue from another repository Sep 22, 2020
madhavemuri added a commit that referenced this issue Sep 22, 2020
Following crash occurred in multi confd configuration
during restart/failure of a confd m0d, when initialisation is
in progress,

 #6  0x00007f99f255ad24 in m0_panic (ctx=ctx@entry=0x7f99f29758c0 <__pctx.20770>) at lib/assert.c:50
 #7  0x00007f99f24e16a4 in m0_confc_fini (confc=confc@entry=0x7f981c008650) at conf/confc.c:562
 #8  0x00007f99f24f2918 in rconfc_herd_link_fini (lnk=lnk@entry=0x7f981c008650) at conf/rconfc.c:1387
 #9  0x00007f99f24f6797 in rconfc_link_fom_fini (fom=0x7f981c009688) at conf/rconfc_link_fom.c:100
 #10 0x00007f99f252f141 in fom_exec (fom=0x7f981c009688) at fop/fom.c:771
 #11 loc_handler_thread (th=0xc138d0) at fop/fom.c:887
 #12 0x00007f99f256117e in m0_thread_trampoline (arg=arg@entry=0xc138d8) at lib/thread.c:116
 #13 0x00007f99f256b436 in uthread_trampoline (arg=0xc138d8) at lib/user_space/uthread.c:93
 #14 0x00007f99f349bea5 in start_thread () from /lib64/libpthread.so.0
 #15 0x00007f99f07a78cd in clone () from /lib64/libc.so.6

To fix the link finalisation issue, fini is done through an AST,
and the link AST is posted only once, by checking the "rl_finalised"
flag.

Signed-off-by: Madhavrao Vemuri <madhav.vemuri@seagate.com>
madhavemuri added a commit that referenced this issue Oct 3, 2020
Following crash occurred in multi confd configuration
during restart/failure of a confd m0d, when initialisation is
in progress,

 #6  0x00007f99f255ad24 in m0_panic (ctx=ctx@entry=0x7f99f29758c0 <__pctx.20770>) at lib/assert.c:50
 #7  0x00007f99f24e16a4 in m0_confc_fini (confc=confc@entry=0x7f981c008650) at conf/confc.c:562
 #8  0x00007f99f24f2918 in rconfc_herd_link_fini (lnk=lnk@entry=0x7f981c008650) at conf/rconfc.c:1387
 #9  0x00007f99f24f6797 in rconfc_link_fom_fini (fom=0x7f981c009688) at conf/rconfc_link_fom.c:100
 #10 0x00007f99f252f141 in fom_exec (fom=0x7f981c009688) at fop/fom.c:771
 #11 loc_handler_thread (th=0xc138d0) at fop/fom.c:887
 #12 0x00007f99f256117e in m0_thread_trampoline (arg=arg@entry=0xc138d8) at lib/thread.c:116
 #13 0x00007f99f256b436 in uthread_trampoline (arg=0xc138d8) at lib/user_space/uthread.c:93
 #14 0x00007f99f349bea5 in start_thread () from /lib64/libpthread.so.0
 #15 0x00007f99f07a78cd in clone () from /lib64/libc.so.6

To fix the link finalisation issue, fini is done through an AST,
and the link AST is posted only once, by checking the "rl_finalised"
flag.

Signed-off-by: Madhavrao Vemuri <madhav.vemuri@seagate.com>
madhavemuri added a commit that referenced this issue Oct 5, 2020
Currently the BE seg size MOTR_M0D_IOS_BESEG_SIZE from /etc/sysconfig/motr
is not being used and the default value of 8GB is used instead, due to a bug
in the motr-server script, which checks for an "^ios" regex in the systemd
unit file that is no longer used with a hare-based cluster.
The following crash is observed,

 #3  0x00007fca296780b4 in m0_panic (ctx=ctx@entry=0x7fca29a72020 <__pctx.8247>) at lib/assert.c:52
 #4  0x00007fca295bdf20 in mem_alloc (zonemask=2, size=<optimized out>, tx=0x7fc84c2266b8, btree=0x40000015f210) at be/btree.c:127
 #5  btree_save (tree=tree@entry=0x40000015f210, tx=tx@entry=0x7fc84c2266b8, op=op@entry=0x7fc92bffe460,
    key=key@entry=0x7fc92bffe6d0, val=val@entry=0x7fc92bffe6e0, anchor=anchor@entry=0x0, optype=optype@entry=BTREE_SAVE_INSERT,
    zonemask=zonemask@entry=2) at be/btree.c:1453
 #6  0x00007fca295bdfe8 in be_btree_insert (tree=tree@entry=0x40000015f210, tx=tx@entry=0x7fc84c2266b8,
    op=op@entry=0x7fc92bffe460, key=key@entry=0x7fc92bffe6d0, val=val@entry=0x7fc92bffe6e0, anchor=anchor@entry=0x0,
    zonemask=zonemask@entry=2) at be/btree.c:1907
 #7  0x00007fca295bf018 in m0_be_btree_insert (tree=tree@entry=0x40000015f210, tx=tx@entry=0x7fc84c2266b8,
    op=op@entry=0x7fc92bffe460, key=key@entry=0x7fc92bffe6d0, val=val@entry=0x7fc92bffe6e0) at be/btree.c:1938
 #8  0x00007fca295f7931 in cob_table_insert (tree=0x40000015f210, tx=tx@entry=0x7fc84c2266b8, key=key@entry=0x7fc92bffe6d0,
    val=val@entry=0x7fc92bffe6e0) at cob/cob.c:721
 #9  0x00007fca295fa5cf in m0_cob_name_add (cob=cob@entry=0x7fc9241ffc50, nskey=nskey@entry=0x7fc9241ffdc0,
    nsrec=nsrec@entry=0x7fc92bffe8e0, tx=tx@entry=0x7fc84c2266b8) at cob/cob.c:1640
 #10 0x00007fca295fa85a in m0_cob_create (cob=0x7fc9241ffc50, nskey=0x7fc9241ffdc0, nsrec=nsrec@entry=0x7fc92bffe8e0,
    fabrec=fabrec@entry=0x0, omgrec=omgrec@entry=0x0, tx=tx@entry=0x7fc84c2266b8) at cob/cob.c:1428
 #11 0x00007fca2966b097 in m0_cc_cob_setup (cc=cc@entry=0x7fc84c2265d0, cdom=<optimized out>, attr=attr@entry=0x7fc92bffec20,
    ctx=ctx@entry=0x7fc84c2266b8) at ioservice/cob_foms.c:1058
 #12 0x00007fca2966b275 in cc_cob_create (attr=0x7fc92bffec20, cc=0x7fc84c2265d0, fom=0x7fc84c2265f0) at ioservice/cob_foms.c:1016
 #13 cob_stob_create (fom=fom@entry=0x7fc84c2265f0, attr=attr@entry=0x7fc92bffec20) at ioservice/cob_foms.c:517
 #14 0x00007fca2966c40b in cob_ops_fom_tick (fom=0x7fc84c2265f0) at ioservice/cob_foms.c:767
 #15 0x00007fca2964c24b in fom_exec (fom=0x7fc84c2265f0) at fop/fom.c:785
 #16 loc_handler_thread (th=0x1b5c1e0) at fop/fom.c:925
 #17 0x00007fca2967e50e in m0_thread_trampoline (arg=arg@entry=0x1b5c1e8) at lib/thread.c:117
 #18 0x00007fca2968875d in uthread_trampoline (arg=0x1b5c1e8) at lib/user_space/uthread.c:98
 #19 0x00007fca28e36ea5 in start_thread () from /lib64/libpthread.so.0
 #20 0x00007fca27f278cd in clone () from /lib64/libc.so.6

So for LDR R1, since LVM paths are passed only for ioservices, the BE
seg size is applied at the same place.

Signed-off-by: Madhavrao Vemuri <madhav.vemuri@seagate.com>
madhavemuri added a commit that referenced this issue Oct 5, 2020
Currently the BE seg size MOTR_M0D_IOS_BESEG_SIZE from /etc/sysconfig/motr
is not being used and the default value of 8GB is used instead, due to a bug
in the motr-server script, which checks for an "^ios" regex in the systemd
unit file that is no longer used with a hare-based cluster.
The following crash is observed,

 #3  0x00007fca296780b4 in m0_panic (ctx=ctx@entry=0x7fca29a72020 <__pctx.8247>) at lib/assert.c:52
 #4  0x00007fca295bdf20 in mem_alloc (zonemask=2, size=<optimized out>, tx=0x7fc84c2266b8, btree=0x40000015f210) at be/btree.c:127
 #5  btree_save (tree=tree@entry=0x40000015f210, tx=tx@entry=0x7fc84c2266b8, op=op@entry=0x7fc92bffe460,
    key=key@entry=0x7fc92bffe6d0, val=val@entry=0x7fc92bffe6e0, anchor=anchor@entry=0x0, optype=optype@entry=BTREE_SAVE_INSERT,
    zonemask=zonemask@entry=2) at be/btree.c:1453
 #6  0x00007fca295bdfe8 in be_btree_insert (tree=tree@entry=0x40000015f210, tx=tx@entry=0x7fc84c2266b8,
    op=op@entry=0x7fc92bffe460, key=key@entry=0x7fc92bffe6d0, val=val@entry=0x7fc92bffe6e0, anchor=anchor@entry=0x0,
    zonemask=zonemask@entry=2) at be/btree.c:1907
 #7  0x00007fca295bf018 in m0_be_btree_insert (tree=tree@entry=0x40000015f210, tx=tx@entry=0x7fc84c2266b8,
    op=op@entry=0x7fc92bffe460, key=key@entry=0x7fc92bffe6d0, val=val@entry=0x7fc92bffe6e0) at be/btree.c:1938
 #8  0x00007fca295f7931 in cob_table_insert (tree=0x40000015f210, tx=tx@entry=0x7fc84c2266b8, key=key@entry=0x7fc92bffe6d0,
    val=val@entry=0x7fc92bffe6e0) at cob/cob.c:721
 #9  0x00007fca295fa5cf in m0_cob_name_add (cob=cob@entry=0x7fc9241ffc50, nskey=nskey@entry=0x7fc9241ffdc0,
    nsrec=nsrec@entry=0x7fc92bffe8e0, tx=tx@entry=0x7fc84c2266b8) at cob/cob.c:1640
 #10 0x00007fca295fa85a in m0_cob_create (cob=0x7fc9241ffc50, nskey=0x7fc9241ffdc0, nsrec=nsrec@entry=0x7fc92bffe8e0,
    fabrec=fabrec@entry=0x0, omgrec=omgrec@entry=0x0, tx=tx@entry=0x7fc84c2266b8) at cob/cob.c:1428
 #11 0x00007fca2966b097 in m0_cc_cob_setup (cc=cc@entry=0x7fc84c2265d0, cdom=<optimized out>, attr=attr@entry=0x7fc92bffec20,
    ctx=ctx@entry=0x7fc84c2266b8) at ioservice/cob_foms.c:1058
 #12 0x00007fca2966b275 in cc_cob_create (attr=0x7fc92bffec20, cc=0x7fc84c2265d0, fom=0x7fc84c2265f0) at ioservice/cob_foms.c:1016
 #13 cob_stob_create (fom=fom@entry=0x7fc84c2265f0, attr=attr@entry=0x7fc92bffec20) at ioservice/cob_foms.c:517
 #14 0x00007fca2966c40b in cob_ops_fom_tick (fom=0x7fc84c2265f0) at ioservice/cob_foms.c:767
 #15 0x00007fca2964c24b in fom_exec (fom=0x7fc84c2265f0) at fop/fom.c:785
 #16 loc_handler_thread (th=0x1b5c1e0) at fop/fom.c:925
 #17 0x00007fca2967e50e in m0_thread_trampoline (arg=arg@entry=0x1b5c1e8) at lib/thread.c:117
 #18 0x00007fca2968875d in uthread_trampoline (arg=0x1b5c1e8) at lib/user_space/uthread.c:98
 #19 0x00007fca28e36ea5 in start_thread () from /lib64/libpthread.so.0
 #20 0x00007fca27f278cd in clone () from /lib64/libc.so.6

So for LDR R1, since LVM paths are passed only for ioservices, the BE
seg size is applied at the same place.

Signed-off-by: Madhavrao Vemuri <madhav.vemuri@seagate.com>
yeshpal-jain-seagate pushed a commit that referenced this issue Oct 7, 2020
Following crash occurred in multi confd configuration
during restart/failure of a confd m0d, when initialisation is
in progress,

 #6  0x00007f99f255ad24 in m0_panic (ctx=ctx@entry=0x7f99f29758c0 <__pctx.20770>) at lib/assert.c:50
 #7  0x00007f99f24e16a4 in m0_confc_fini (confc=confc@entry=0x7f981c008650) at conf/confc.c:562
 #8  0x00007f99f24f2918 in rconfc_herd_link_fini (lnk=lnk@entry=0x7f981c008650) at conf/rconfc.c:1387
 #9  0x00007f99f24f6797 in rconfc_link_fom_fini (fom=0x7f981c009688) at conf/rconfc_link_fom.c:100
 #10 0x00007f99f252f141 in fom_exec (fom=0x7f981c009688) at fop/fom.c:771
 #11 loc_handler_thread (th=0xc138d0) at fop/fom.c:887
 #12 0x00007f99f256117e in m0_thread_trampoline (arg=arg@entry=0xc138d8) at lib/thread.c:116
 #13 0x00007f99f256b436 in uthread_trampoline (arg=0xc138d8) at lib/user_space/uthread.c:93
 #14 0x00007f99f349bea5 in start_thread () from /lib64/libpthread.so.0
 #15 0x00007f99f07a78cd in clone () from /lib64/libc.so.6

To fix the link finalisation issue, fini is done through an AST,
and the link AST is posted only once, by checking the "rl_finalised"
flag.

Signed-off-by: Madhavrao Vemuri <madhav.vemuri@seagate.com>
madhavemuri added a commit that referenced this issue Oct 9, 2020

Following crash occurred in multi confd configuration
during restart/failure of a confd m0d, when initialization is in progress,

 #6  0x00007f99f255ad24 in m0_panic (ctx=ctx@entry=0x7f99f29758c0 <__pctx.20770>) at lib/assert.c:50
 #7  0x00007f99f24e16a4 in m0_confc_fini (confc=confc@entry=0x7f981c008650) at conf/confc.c:562
 #8  0x00007f99f24f2918 in rconfc_herd_link_fini (lnk=lnk@entry=0x7f981c008650) at conf/rconfc.c:1387
 #9  0x00007f99f24f6797 in rconfc_link_fom_fini (fom=0x7f981c009688) at conf/rconfc_link_fom.c:100
 #10 0x00007f99f252f141 in fom_exec (fom=0x7f981c009688) at fop/fom.c:771
 #11 loc_handler_thread (th=0xc138d0) at fop/fom.c:887
 #12 0x00007f99f256117e in m0_thread_trampoline (arg=arg@entry=0xc138d8) at lib/thread.c:116
 #13 0x00007f99f256b436 in uthread_trampoline (arg=0xc138d8) at lib/user_space/uthread.c:93
 #14 0x00007f99f349bea5 in start_thread () from /lib64/libpthread.so.0
 #15 0x00007f99f07a78cd in clone () from /lib64/libc.so.6

To fix the link finalization issue, fini is done through an AST,
and the link AST is posted only once, by checking the "rl_finalised" flag.
* Handle sm fini from the wait-for-reply state
* Use a different AST for rconfc_cctx_fini()
* rmservice fid fetch can fail in case confd quorum is not possible.
   To resolve this, return from initlift_resource_manager() when
   confc is not initialized.

Co-authored-by: Hua Huang <hua.huang@seagate.com>

Signed-off-by: Madhavrao Vemuri <madhav.vemuri@seagate.com>
madhavemuri added a commit that referenced this issue Oct 9, 2020

Following crash occurred in multi confd configuration
during restart/failure of a confd m0d, when initialization is in progress,

 #6  0x00007f99f255ad24 in m0_panic (ctx=ctx@entry=0x7f99f29758c0 <__pctx.20770>) at lib/assert.c:50
 #7  0x00007f99f24e16a4 in m0_confc_fini (confc=confc@entry=0x7f981c008650) at conf/confc.c:562
 #8  0x00007f99f24f2918 in rconfc_herd_link_fini (lnk=lnk@entry=0x7f981c008650) at conf/rconfc.c:1387
 #9  0x00007f99f24f6797 in rconfc_link_fom_fini (fom=0x7f981c009688) at conf/rconfc_link_fom.c:100
 #10 0x00007f99f252f141 in fom_exec (fom=0x7f981c009688) at fop/fom.c:771
 #11 loc_handler_thread (th=0xc138d0) at fop/fom.c:887
 #12 0x00007f99f256117e in m0_thread_trampoline (arg=arg@entry=0xc138d8) at lib/thread.c:116
 #13 0x00007f99f256b436 in uthread_trampoline (arg=0xc138d8) at lib/user_space/uthread.c:93
 #14 0x00007f99f349bea5 in start_thread () from /lib64/libpthread.so.0
 #15 0x00007f99f07a78cd in clone () from /lib64/libc.so.6

To fix the link finalization issue, fini is done through an AST,
and the link AST is posted only once, by checking the "rl_finalised" flag.
* Handle sm fini from the wait-for-reply state
* Use a different AST for rconfc_cctx_fini()
* rmservice fid fetch can fail in case confd quorum is not possible.
   To resolve this, return from initlift_resource_manager() when
   confc is not initialized.

Co-authored-by: Hua Huang <hua.huang@seagate.com>

Signed-off-by: Madhavrao Vemuri <madhav.vemuri@seagate.com>
madhavemuri added a commit that referenced this issue Oct 9, 2020

Following crash occurred in multi confd configuration
during restart/failure of a confd m0d, when initialization is in progress,

 #6  0x00007f99f255ad24 in m0_panic (ctx=ctx@entry=0x7f99f29758c0 <__pctx.20770>) at lib/assert.c:50
 #7  0x00007f99f24e16a4 in m0_confc_fini (confc=confc@entry=0x7f981c008650) at conf/confc.c:562
 #8  0x00007f99f24f2918 in rconfc_herd_link_fini (lnk=lnk@entry=0x7f981c008650) at conf/rconfc.c:1387
 #9  0x00007f99f24f6797 in rconfc_link_fom_fini (fom=0x7f981c009688) at conf/rconfc_link_fom.c:100
 #10 0x00007f99f252f141 in fom_exec (fom=0x7f981c009688) at fop/fom.c:771
 #11 loc_handler_thread (th=0xc138d0) at fop/fom.c:887
 #12 0x00007f99f256117e in m0_thread_trampoline (arg=arg@entry=0xc138d8) at lib/thread.c:116
 #13 0x00007f99f256b436 in uthread_trampoline (arg=0xc138d8) at lib/user_space/uthread.c:93
 #14 0x00007f99f349bea5 in start_thread () from /lib64/libpthread.so.0
 #15 0x00007f99f07a78cd in clone () from /lib64/libc.so.6

To fix the link finalization issue, fini is done through an AST,
and the link AST is posted only once, by checking the "rl_finalised" flag.
* Handle sm fini from the wait-for-reply state
* Use a different AST for rconfc_cctx_fini()
* rmservice fid fetch can fail in case confd quorum is not possible.
   To resolve this, return from initlift_resource_manager() when
   confc is not initialized.

Co-authored-by: Hua Huang <hua.huang@seagate.com>

Signed-off-by: Madhavrao Vemuri <madhav.vemuri@seagate.com>
max-seagate added a commit that referenced this issue Sep 29, 2021
It was possible for the op, passed to m0_be_io_sched_add(), to become
DONE before sched lock is taken in be_io_sched_cb() because parent op
could become DONE before the child op M0_BTS_GC callback is called.
This situation led to a failed invariant, assertion failures [1], or just
spinning in an infinite loop in the m0_list invariant due to a broken list [2].

The patch fixes the issue by moving the op (passed to
m0_be_io_sched_add()) to DONE only after finishing processing of the
corresponding m0_be_io.

The bug was discovered during EOS-25005 patch testing.

[1]

be-ut
  io_sched  motr[15961]:  e4b0  FATAL  [lib/assert.c:50:m0_panic]  panic: (io->bio_ext.e_start == sched->bis_pos) at be_io_sched_cb() (be/io_sched.c:132)  [git: 2.0.0-307-37-gd35074f6-dirty] /w/var/m0ut/m0trace.15961
Motr panic: (io->bio_ext.e_start == sched->bis_pos) at be_io_sched_cb() be/io_sched.c:132 (errno: 0) (last failed: none) [git: 2.0.0-307-37-gd35074f6-dirty] pid: 15961  /w/var/m0ut/m0trace.15961
...

[2]

  (gdb) bt
  #0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  #1  0x00007f432af53e1b in _L_lock_812 () from /lib64/libpthread.so.0
  #2  0x00007f432af53ce8 in __GI___pthread_mutex_lock (mutex=0x7f432d714a80 <be_ut_io_sched_scheduler+32>) at ../nptl/pthread_mutex_lock.c:79
  #3  0x00007f432b8c3520 in m0_arch_mutex_lock (mutex=0x7f432d714a80 <be_ut_io_sched_scheduler+32>) at lib/user_space/umutex.c:56
  #4  0x00007f432b8aee6f in m0_mutex_lock (mutex=0x7f432d714a80 <be_ut_io_sched_scheduler+32>) at lib/mutex.c:55
  #5  0x00007f432b76673b in m0_be_io_sched_lock (sched=0x7f432d714a60 <be_ut_io_sched_scheduler>) at be/io_sched.c:72
  #6  0x00007f432b766a75 in be_io_sched_cb (op=0x7f4288000998, param=0x7f42880008c0) at be/io_sched.c:123
  #7  0x00007f432b771bb3 in be_op_done (op=0x7f4288000998) at be/op.c:257
  #8  0x00007f432b771cb6 in m0_be_op_done (op=0x7f4288000998) at be/op.c:276
  #9  0x00007f432b765715 in be_io_finished (bio=0x7f42880008c0) at be/io.c:555
  #10 0x00007f432b7658e1 in be_io_cb (link=0x7f4288001040) at be/io.c:587
  #11 0x00007f432b8a53a0 in clink_signal (clink=0x7f4288001040) at lib/chan.c:135
  #12 0x00007f432b8a54ff in chan_signal_nr (chan=0x7f4288000f38, nr=0) at lib/chan.c:154
  #13 0x00007f432b8a55d4 in m0_chan_broadcast (chan=0x7f4288000f38) at lib/chan.c:174
  #14 0x00007f432b8a55fa in m0_chan_broadcast_lock (chan=0x7f4288000f38) at lib/chan.c:181
  #15 0x00007f432ba03cfe in ioq_complete (ioq=0x1367bc0, qev=0x7f42886a3fe0, res=8, res2=0) at stob/ioq.c:587
  #16 0x00007f432ba041d1 in stob_ioq_thread (ioq=0x1367bc0) at stob/ioq.c:669
  #17 0x00007f432b8b01b8 in m0_thread_trampoline (arg=0x1367be0) at lib/thread.c:117
  #18 0x00007f432b8c37fc in uthread_trampoline (arg=0x1367be0) at lib/user_space/uthread.c:98
  #19 0x00007f432af51e25 in start_thread (arg=0x7f42ccff9700) at pthread_create.c:308
  #20 0x00007f4329999bad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

  (gdb) thread 2
  [Switching to thread 2 (Thread 0x7f428ffff700 (LWP 8730))]
  #0  0x00007f432b8b270b in m0_tlist_invariant (d=0x7f432bd94600 <sched_io_tl>, list=0x7f432d714a68 <be_ut_io_sched_scheduler+8>) at lib/tlist.c:253
  253                     if (!M0_CHECK_EX(m0_tlink_invariant(d, amb(d, scan))))
  (gdb) bt
  #0  0x00007f432b8b270b in m0_tlist_invariant (d=0x7f432bd94600 <sched_io_tl>, list=0x7f432d714a68 <be_ut_io_sched_scheduler+8>) at lib/tlist.c:253
  #1  0x00007f432b8b1f13 in m0_tlist_add (d=0x7f432bd94600 <sched_io_tl>, list=0x7f432d714a68 <be_ut_io_sched_scheduler+8>, obj=0x7f42880008c0) at lib/tlist.c:128
  #2  0x00007f432b7664c8 in sched_io_tlist_add (list=0x7f432d714a68 <be_ut_io_sched_scheduler+8>, amb=0x7f42880008c0) at be/io_sched.c:48
  #3  0x00007f432b766d88 in be_io_sched_insert (sched=0x7f432d714a60 <be_ut_io_sched_scheduler>, io=0x7f42880008c0) at be/io_sched.c:161
  #4  0x00007f432b767144 in m0_be_io_sched_add (sched=0x7f432d714a60 <be_ut_io_sched_scheduler>, io=0x7f42880008c0, ext=0x7f428fffed00, op=0x7f4288000c00) at be/io_sched.c:193
  #5  0x00007f432cf8a008 in be_ut_io_sched_thread (param=0x136a2b0) at be/ut/io_sched.c:241
  #6  0x00007f432b8b01b8 in m0_thread_trampoline (arg=0x136bad0) at lib/thread.c:117
  #7  0x00007f432b8c37fc in uthread_trampoline (arg=0x136bad0) at lib/user_space/uthread.c:98
  #8  0x00007f432af51e25 in start_thread (arg=0x7f428ffff700) at pthread_create.c:308
  #9  0x00007f4329999bad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
max-seagate added a commit that referenced this issue Sep 29, 2021
It was possible for the op, passed to m0_be_io_sched_add(), to become
DONE before sched lock is taken in be_io_sched_cb() because parent op
could become DONE before the child op M0_BTS_GC callback is called.
This situation led to a failed invariant, assertion failures [1], or just
spinning in an infinite loop in the m0_list invariant due to a broken list [2].

The patch fixes the issue by moving the op (passed to
m0_be_io_sched_add()) to DONE only after finishing processing of the
corresponding m0_be_io.

[1]

be-ut
  io_sched  motr[15961]:  e4b0  FATAL  [lib/assert.c:50:m0_panic]  panic: (io->bio_ext.e_start == sched->bis_pos) at be_io_sched_cb() (be/io_sched.c:132)  [git: 2.0.0-307-37-gd35074f6-dirty] /w/var/m0ut/m0trace.15961
Motr panic: (io->bio_ext.e_start == sched->bis_pos) at be_io_sched_cb() be/io_sched.c:132 (errno: 0) (last failed: none) [git: 2.0.0-307-37-gd35074f6-dirty] pid: 15961  /w/var/m0ut/m0trace.15961
...

[2]

  (gdb) bt
  #0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  #1  0x00007f432af53e1b in _L_lock_812 () from /lib64/libpthread.so.0
  #2  0x00007f432af53ce8 in __GI___pthread_mutex_lock (mutex=0x7f432d714a80 <be_ut_io_sched_scheduler+32>) at ../nptl/pthread_mutex_lock.c:79
  #3  0x00007f432b8c3520 in m0_arch_mutex_lock (mutex=0x7f432d714a80 <be_ut_io_sched_scheduler+32>) at lib/user_space/umutex.c:56
  #4  0x00007f432b8aee6f in m0_mutex_lock (mutex=0x7f432d714a80 <be_ut_io_sched_scheduler+32>) at lib/mutex.c:55
  #5  0x00007f432b76673b in m0_be_io_sched_lock (sched=0x7f432d714a60 <be_ut_io_sched_scheduler>) at be/io_sched.c:72
  #6  0x00007f432b766a75 in be_io_sched_cb (op=0x7f4288000998, param=0x7f42880008c0) at be/io_sched.c:123
  #7  0x00007f432b771bb3 in be_op_done (op=0x7f4288000998) at be/op.c:257
  #8  0x00007f432b771cb6 in m0_be_op_done (op=0x7f4288000998) at be/op.c:276
  #9  0x00007f432b765715 in be_io_finished (bio=0x7f42880008c0) at be/io.c:555
  #10 0x00007f432b7658e1 in be_io_cb (link=0x7f4288001040) at be/io.c:587
  #11 0x00007f432b8a53a0 in clink_signal (clink=0x7f4288001040) at lib/chan.c:135
  #12 0x00007f432b8a54ff in chan_signal_nr (chan=0x7f4288000f38, nr=0) at lib/chan.c:154
  #13 0x00007f432b8a55d4 in m0_chan_broadcast (chan=0x7f4288000f38) at lib/chan.c:174
  #14 0x00007f432b8a55fa in m0_chan_broadcast_lock (chan=0x7f4288000f38) at lib/chan.c:181
  #15 0x00007f432ba03cfe in ioq_complete (ioq=0x1367bc0, qev=0x7f42886a3fe0, res=8, res2=0) at stob/ioq.c:587
  #16 0x00007f432ba041d1 in stob_ioq_thread (ioq=0x1367bc0) at stob/ioq.c:669
  #17 0x00007f432b8b01b8 in m0_thread_trampoline (arg=0x1367be0) at lib/thread.c:117
  #18 0x00007f432b8c37fc in uthread_trampoline (arg=0x1367be0) at lib/user_space/uthread.c:98
  #19 0x00007f432af51e25 in start_thread (arg=0x7f42ccff9700) at pthread_create.c:308
  #20 0x00007f4329999bad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

  (gdb) thread 2
  [Switching to thread 2 (Thread 0x7f428ffff700 (LWP 8730))]
  #0  0x00007f432b8b270b in m0_tlist_invariant (d=0x7f432bd94600 <sched_io_tl>, list=0x7f432d714a68 <be_ut_io_sched_scheduler+8>) at lib/tlist.c:253
  253                     if (!M0_CHECK_EX(m0_tlink_invariant(d, amb(d, scan))))
  (gdb) bt
  #0  0x00007f432b8b270b in m0_tlist_invariant (d=0x7f432bd94600 <sched_io_tl>, list=0x7f432d714a68 <be_ut_io_sched_scheduler+8>) at lib/tlist.c:253
  #1  0x00007f432b8b1f13 in m0_tlist_add (d=0x7f432bd94600 <sched_io_tl>, list=0x7f432d714a68 <be_ut_io_sched_scheduler+8>, obj=0x7f42880008c0) at lib/tlist.c:128
  #2  0x00007f432b7664c8 in sched_io_tlist_add (list=0x7f432d714a68 <be_ut_io_sched_scheduler+8>, amb=0x7f42880008c0) at be/io_sched.c:48
  #3  0x00007f432b766d88 in be_io_sched_insert (sched=0x7f432d714a60 <be_ut_io_sched_scheduler>, io=0x7f42880008c0) at be/io_sched.c:161
  #4  0x00007f432b767144 in m0_be_io_sched_add (sched=0x7f432d714a60 <be_ut_io_sched_scheduler>, io=0x7f42880008c0, ext=0x7f428fffed00, op=0x7f4288000c00) at be/io_sched.c:193
  #5  0x00007f432cf8a008 in be_ut_io_sched_thread (param=0x136a2b0) at be/ut/io_sched.c:241
  #6  0x00007f432b8b01b8 in m0_thread_trampoline (arg=0x136bad0) at lib/thread.c:117
  #7  0x00007f432b8c37fc in uthread_trampoline (arg=0x136bad0) at lib/user_space/uthread.c:98
  #8  0x00007f432af51e25 in start_thread (arg=0x7f428ffff700) at pthread_create.c:308
  #9  0x00007f4329999bad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
max-seagate added a commit that referenced this issue Sep 30, 2021
It was possible for the op, passed to m0_be_io_sched_add(), to become
DONE before sched lock is taken in be_io_sched_cb() because parent op
could become DONE before the child op M0_BTS_GC callback is called.
This situation led to a failed invariant, assertion failures [1], or just
spinning in an infinite loop in the m0_list invariant due to a broken list [2].

The patch fixes the issue by moving the op (passed to
m0_be_io_sched_add()) to DONE only after finishing processing of the
corresponding m0_be_io.

The bug was discovered during EOS-25005 patch testing.

[1]

be-ut
  io_sched  motr[15961]:  e4b0  FATAL  [lib/assert.c:50:m0_panic]  panic: (io->bio_ext.e_start == sched->bis_pos) at be_io_sched_cb() (be/io_sched.c:132)  [git: 2.0.0-307-37-gd35074f6-dirty] /w/var/m0ut/m0trace.15961
Motr panic: (io->bio_ext.e_start == sched->bis_pos) at be_io_sched_cb() be/io_sched.c:132 (errno: 0) (last failed: none) [git: 2.0.0-307-37-gd35074f6-dirty] pid: 15961  /w/var/m0ut/m0trace.15961
...

[2]

  (gdb) bt
  #0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  #1  0x00007f432af53e1b in _L_lock_812 () from /lib64/libpthread.so.0
  #2  0x00007f432af53ce8 in __GI___pthread_mutex_lock (mutex=0x7f432d714a80 <be_ut_io_sched_scheduler+32>) at ../nptl/pthread_mutex_lock.c:79
  #3  0x00007f432b8c3520 in m0_arch_mutex_lock (mutex=0x7f432d714a80 <be_ut_io_sched_scheduler+32>) at lib/user_space/umutex.c:56
  #4  0x00007f432b8aee6f in m0_mutex_lock (mutex=0x7f432d714a80 <be_ut_io_sched_scheduler+32>) at lib/mutex.c:55
  #5  0x00007f432b76673b in m0_be_io_sched_lock (sched=0x7f432d714a60 <be_ut_io_sched_scheduler>) at be/io_sched.c:72
  #6  0x00007f432b766a75 in be_io_sched_cb (op=0x7f4288000998, param=0x7f42880008c0) at be/io_sched.c:123
  #7  0x00007f432b771bb3 in be_op_done (op=0x7f4288000998) at be/op.c:257
  #8  0x00007f432b771cb6 in m0_be_op_done (op=0x7f4288000998) at be/op.c:276
  #9  0x00007f432b765715 in be_io_finished (bio=0x7f42880008c0) at be/io.c:555
  #10 0x00007f432b7658e1 in be_io_cb (link=0x7f4288001040) at be/io.c:587
  #11 0x00007f432b8a53a0 in clink_signal (clink=0x7f4288001040) at lib/chan.c:135
  #12 0x00007f432b8a54ff in chan_signal_nr (chan=0x7f4288000f38, nr=0) at lib/chan.c:154
  #13 0x00007f432b8a55d4 in m0_chan_broadcast (chan=0x7f4288000f38) at lib/chan.c:174
  #14 0x00007f432b8a55fa in m0_chan_broadcast_lock (chan=0x7f4288000f38) at lib/chan.c:181
  #15 0x00007f432ba03cfe in ioq_complete (ioq=0x1367bc0, qev=0x7f42886a3fe0, res=8, res2=0) at stob/ioq.c:587
  #16 0x00007f432ba041d1 in stob_ioq_thread (ioq=0x1367bc0) at stob/ioq.c:669
  #17 0x00007f432b8b01b8 in m0_thread_trampoline (arg=0x1367be0) at lib/thread.c:117
  #18 0x00007f432b8c37fc in uthread_trampoline (arg=0x1367be0) at lib/user_space/uthread.c:98
  #19 0x00007f432af51e25 in start_thread (arg=0x7f42ccff9700) at pthread_create.c:308
  #20 0x00007f4329999bad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

  (gdb) thread 2
  [Switching to thread 2 (Thread 0x7f428ffff700 (LWP 8730))]
  #0  0x00007f432b8b270b in m0_tlist_invariant (d=0x7f432bd94600 <sched_io_tl>, list=0x7f432d714a68 <be_ut_io_sched_scheduler+8>) at lib/tlist.c:253
  253                     if (!M0_CHECK_EX(m0_tlink_invariant(d, amb(d, scan))))
  (gdb) bt
  #0  0x00007f432b8b270b in m0_tlist_invariant (d=0x7f432bd94600 <sched_io_tl>, list=0x7f432d714a68 <be_ut_io_sched_scheduler+8>) at lib/tlist.c:253
  #1  0x00007f432b8b1f13 in m0_tlist_add (d=0x7f432bd94600 <sched_io_tl>, list=0x7f432d714a68 <be_ut_io_sched_scheduler+8>, obj=0x7f42880008c0) at lib/tlist.c:128
  #2  0x00007f432b7664c8 in sched_io_tlist_add (list=0x7f432d714a68 <be_ut_io_sched_scheduler+8>, amb=0x7f42880008c0) at be/io_sched.c:48
  #3  0x00007f432b766d88 in be_io_sched_insert (sched=0x7f432d714a60 <be_ut_io_sched_scheduler>, io=0x7f42880008c0) at be/io_sched.c:161
  #4  0x00007f432b767144 in m0_be_io_sched_add (sched=0x7f432d714a60 <be_ut_io_sched_scheduler>, io=0x7f42880008c0, ext=0x7f428fffed00, op=0x7f4288000c00) at be/io_sched.c:193
  #5  0x00007f432cf8a008 in be_ut_io_sched_thread (param=0x136a2b0) at be/ut/io_sched.c:241
  #6  0x00007f432b8b01b8 in m0_thread_trampoline (arg=0x136bad0) at lib/thread.c:117
  #7  0x00007f432b8c37fc in uthread_trampoline (arg=0x136bad0) at lib/user_space/uthread.c:98
  #8  0x00007f432af51e25 in start_thread (arg=0x7f428ffff700) at pthread_create.c:308
  #9  0x00007f4329999bad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Signed-off-by: Maksym Medvied <maksym.medvied@seagate.com>
max-seagate added a commit that referenced this issue Sep 30, 2021
m0_be_log_discard_flush() may finish waiting after the io is done, but
before lds_discard_ast() is posted. This would cause be_log_discard_wait()
to not wait for the AST to finish, which leads to an assertion failure [1].
In this case the traces look like [2] and the stack backtrace looks like [3].

This patch moves the flush op to DONE at the end of the AST, which makes
flush() wait until the AST has actually executed.

The bug was discovered during EOS-25005 patch testing.

[1]

max-seagate added a commit that referenced this issue Sep 30, 2021
m0_be_log_discard_flush() may finish waiting after the io is done, but
before lds_discard_ast() is posted. As a result, be_log_discard_wait()
would not wait for the ast to finish, which leads to an assertion
failure [1]. In this case the traces look like [2] and the stack
backtrace looks like [3].

This patch moves the flush op to DONE at the end of the ast, which
makes flush() wait until the ast has actually executed.

The bug was discovered during EOS-25005 patch testing.

[1]

motr[24531]:  a3b0  FATAL  [lib/assert.c:50:m0_panic]  panic: (!ld->lds_sync_in_progress) at m0_be_log_discard_fini() (be/log_discard.c:143)  [git: 2.0.0-307-72-g0e307f7b-dirty] /w/var/m0ut/m0trace.24531
Motr panic: (!ld->lds_sync_in_progress) at m0_be_log_discard_fini() be/log_discard.c:143 (errno: 4) (last failed: none) [git: 2.0.0-307-72-g0e307f7b-dirty] pid: 24531  /w/var/m0ut/m0trace.24531

[2]

92998540  157294.289.381909  7f86d97f98e0  DEBUG   be/log_discard.c:288:be_log_discard_sync_done_cb    ld=0x7f86f709d8f8
92998541  157294.289.382245  7f86d97f98b0  CALL    be/op.c:154:m0_be_op_reset                          > op=0x7f86f709da40
...
92998549  157294.289.431085  7f86d97f9c40  DEBUG   sm/sm.c:158:m0_sm_asts_run                          grp=0x22d96b0 forkq_pop: 0x7f86f709de10 <- 0x7f86f6049da0
92998550  157294.289.432156  7f86d97f9c00  CALL    be/log_discard.c:256:be_log_discard_ast             > ld=0x7f86f709d8f8
92998551  157294.289.433624  7ffd2e9e10c0  FATAL   lib/assert.c:50:m0_panic                            panic: (!ld->lds_sync_in_progress) at m0_be_log_discard_fini() (be/log_discard.c:143)  [git: 2.0.0-307-37-gd35074f6-dirty] /w/var/m0ut/m0trace.22409
92998552  157294.289.435003  7f86d97f9ba0  CALL    be/log_discard.c:229:be_log_discard_item_discard    > ld=0x7f86f709d8f8 ldi=0x7f862039bf40
92998553  157294.289.436948  7f86d97f9b30  DEBUG   be/tx_group_format.c:549:m0_be_group_format_discard   log=0x7f86f709a838 ldi=0x7f862039bf40 ext=[0, 7000)
92998554  157294.289.439155  7f86d97f9ad0  CALL    be/log.c:526:m0_be_log_record_discard               > log=(lg_current=118784 lg_discarded=0 lg_free=134082560) size=28672
...
92998588  157294.289.570768  7f86d97f9ba0  CALL    be/log_discard.c:229:be_log_discard_item_discard    > ld=0x7f86f709d8f8 ldi=0x7f862039bb80
92998589  157294.289.571102  7f86d97f9b30  DEBUG   be/tx_group_format.c:549:m0_be_group_format_discard   log=0x7f86f709a838 ldi=0x7f862039bb80 ext=[1b000, 1d000)
92998590  157294.289.571601  7f86d97f9ad0  CALL    be/log.c:526:m0_be_log_record_discard               > log=(lg_current=118784 lg_discarded=110592 lg_free=134193152) size=8192
92998591  157294.289.572156  7f86d97f9ad0  DEBUG   be/log.c:545:m0_be_log_record_discard               log=(lg_current=118784 lg_discarded=118784 lg_free=134201344) size=8192
92998592  157294.289.572524  7f86d97f9a90  CALL    be/engine.c:865:m0_be_engine_got_log_space_cb       > en=0x7f86f709a6c8 log=0x7f86f709a838
92998593  157294.289.573643  7f86d97f9b70  DEBUG   be/log_discard.c:451:m0_be_log_discard_item_put     ld=0x7f86f709d8f8 ldi=0x7f862039bb80

[3]

  Thread 1 (Thread 0x7f8701248840 (LWP 22409)):
  #0  0x00007f86f310f277 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
  #1  0x00007f86f3110968 in __GI_abort () at abort.c:90
  #2  0x00007f86f50ffe1e in m0_arch_panic (c=0x7f86f55d4840 <__pctx.7291>, ap=0x7ffd2e9e1168) at lib/user_space/uassert.c:131
  #3  0x00007f86f50e0a78 in m0_panic (ctx=0x7f86f55d4840 <__pctx.7291>) at lib/assert.c:52
  #4  0x00007f86f4fab1a2 in m0_be_log_discard_fini (ld=0x7f86f709d8f8 <be+17944>) at be/log_discard.c:143
  #5  0x00007f86f4f96403 in be_domain_level_leave (module=0x7f86f70992e0 <be>) at be/domain.c:906
  #6  0x00007f86f5171e09 in m0_module_fini (module=0x7f86f70992e0 <be>, level=-1) at module/module.c:154
  #7  0x00007f86f67c61f3 in m0_be_ut_backend_fini (ut_be=0x7f86f70992e0 <be>) at be/ut/helper.c:448
  #8  0x00007f86f6816645 in iter_ut_fini (svc=0x7f857252b168) at dix/cm/ut/iter_ut.c:708
  #9  0x00007f86f6816a39 in empty_cctg () at dix/cm/ut/iter_ut.c:786
  #10 0x00007f86f69a3736 in run_test (test=0x64a708 <dix_cm_iter_ut+4552>, max_name_len=43) at ut/ut.c:390
  #11 0x00007f86f69a3a28 in run_suite (suite=0x649540 <dix_cm_iter_ut>, max_name_len=43) at ut/ut.c:459
  #12 0x00007f86f69a3ccc in tests_run_all (m=0x7f870102f4a0 <ut.12684>) at ut/ut.c:513
  #13 0x00007f86f69a3d2a in m0_ut_run () at ut/ut.c:539
  #14 0x0000000000404a42 in main (argc=3, argv=0x7ffd2e9e1d58) at ut/m0ut.c:525

Signed-off-by: Maksym Medvied <maksym.medvied@seagate.com>
max-seagate added a commit that referenced this issue Oct 4, 2021
It was possible for the op passed to m0_be_io_sched_add() to become
DONE before the sched lock was taken in be_io_sched_cb(), because the
parent op could become DONE before the child op's M0_BTS_GC callback
was called. This led to a failed invariant, assertion failures [1], or
spinning in an infinite loop in the m0_list invariant due to a broken
list [2].

The patch fixes the issue by moving the op (passed to
m0_be_io_sched_add()) to DONE only after processing of the
corresponding m0_be_io has finished.

The bug was discovered during EOS-25005 patch testing.

[1]

be-ut
  io_sched  motr[15961]:  e4b0  FATAL  [lib/assert.c:50:m0_panic]  panic: (io->bio_ext.e_start == sched->bis_pos) at be_io_sched_cb() (be/io_sched.c:132)  [git: 2.0.0-307-37-gd35074f6-dirty] /w/var/m0ut/m0trace.15961
Motr panic: (io->bio_ext.e_start == sched->bis_pos) at be_io_sched_cb() be/io_sched.c:132 (errno: 0) (last failed: none) [git: 2.0.0-307-37-gd35074f6-dirty] pid: 15961  /w/var/m0ut/m0trace.15961
...

[2]

  (gdb) bt
  #0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  #1  0x00007f432af53e1b in _L_lock_812 () from /lib64/libpthread.so.0
  #2  0x00007f432af53ce8 in __GI___pthread_mutex_lock (mutex=0x7f432d714a80 <be_ut_io_sched_scheduler+32>) at ../nptl/pthread_mutex_lock.c:79
  #3  0x00007f432b8c3520 in m0_arch_mutex_lock (mutex=0x7f432d714a80 <be_ut_io_sched_scheduler+32>) at lib/user_space/umutex.c:56
  #4  0x00007f432b8aee6f in m0_mutex_lock (mutex=0x7f432d714a80 <be_ut_io_sched_scheduler+32>) at lib/mutex.c:55
  #5  0x00007f432b76673b in m0_be_io_sched_lock (sched=0x7f432d714a60 <be_ut_io_sched_scheduler>) at be/io_sched.c:72
  #6  0x00007f432b766a75 in be_io_sched_cb (op=0x7f4288000998, param=0x7f42880008c0) at be/io_sched.c:123
  #7  0x00007f432b771bb3 in be_op_done (op=0x7f4288000998) at be/op.c:257
  #8  0x00007f432b771cb6 in m0_be_op_done (op=0x7f4288000998) at be/op.c:276
  #9  0x00007f432b765715 in be_io_finished (bio=0x7f42880008c0) at be/io.c:555
  #10 0x00007f432b7658e1 in be_io_cb (link=0x7f4288001040) at be/io.c:587
  #11 0x00007f432b8a53a0 in clink_signal (clink=0x7f4288001040) at lib/chan.c:135
  #12 0x00007f432b8a54ff in chan_signal_nr (chan=0x7f4288000f38, nr=0) at lib/chan.c:154
  #13 0x00007f432b8a55d4 in m0_chan_broadcast (chan=0x7f4288000f38) at lib/chan.c:174
  #14 0x00007f432b8a55fa in m0_chan_broadcast_lock (chan=0x7f4288000f38) at lib/chan.c:181
  #15 0x00007f432ba03cfe in ioq_complete (ioq=0x1367bc0, qev=0x7f42886a3fe0, res=8, res2=0) at stob/ioq.c:587
  #16 0x00007f432ba041d1 in stob_ioq_thread (ioq=0x1367bc0) at stob/ioq.c:669
  #17 0x00007f432b8b01b8 in m0_thread_trampoline (arg=0x1367be0) at lib/thread.c:117
  #18 0x00007f432b8c37fc in uthread_trampoline (arg=0x1367be0) at lib/user_space/uthread.c:98
  #19 0x00007f432af51e25 in start_thread (arg=0x7f42ccff9700) at pthread_create.c:308
  #20 0x00007f4329999bad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

  (gdb) thread 2
  [Switching to thread 2 (Thread 0x7f428ffff700 (LWP 8730))]
  #0  0x00007f432b8b270b in m0_tlist_invariant (d=0x7f432bd94600 <sched_io_tl>, list=0x7f432d714a68 <be_ut_io_sched_scheduler+8>) at lib/tlist.c:253
  253                     if (!M0_CHECK_EX(m0_tlink_invariant(d, amb(d, scan))))
  (gdb) bt
  #0  0x00007f432b8b270b in m0_tlist_invariant (d=0x7f432bd94600 <sched_io_tl>, list=0x7f432d714a68 <be_ut_io_sched_scheduler+8>) at lib/tlist.c:253
  #1  0x00007f432b8b1f13 in m0_tlist_add (d=0x7f432bd94600 <sched_io_tl>, list=0x7f432d714a68 <be_ut_io_sched_scheduler+8>, obj=0x7f42880008c0) at lib/tlist.c:128
  #2  0x00007f432b7664c8 in sched_io_tlist_add (list=0x7f432d714a68 <be_ut_io_sched_scheduler+8>, amb=0x7f42880008c0) at be/io_sched.c:48
  #3  0x00007f432b766d88 in be_io_sched_insert (sched=0x7f432d714a60 <be_ut_io_sched_scheduler>, io=0x7f42880008c0) at be/io_sched.c:161
  #4  0x00007f432b767144 in m0_be_io_sched_add (sched=0x7f432d714a60 <be_ut_io_sched_scheduler>, io=0x7f42880008c0, ext=0x7f428fffed00, op=0x7f4288000c00) at be/io_sched.c:193
  #5  0x00007f432cf8a008 in be_ut_io_sched_thread (param=0x136a2b0) at be/ut/io_sched.c:241
  #6  0x00007f432b8b01b8 in m0_thread_trampoline (arg=0x136bad0) at lib/thread.c:117
  #7  0x00007f432b8c37fc in uthread_trampoline (arg=0x136bad0) at lib/user_space/uthread.c:98
  #8  0x00007f432af51e25 in start_thread (arg=0x7f428ffff700) at pthread_create.c:308
  #9  0x00007f4329999bad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Signed-off-by: Maksym Medvied <maksym.medvied@seagate.com>
max-seagate added a commit that referenced this issue Oct 4, 2021
mehjoshi pushed a commit that referenced this issue Oct 5, 2021
mehjoshi pushed a commit that referenced this issue Oct 5, 2021
andriytk pushed a commit to andriytk/cortx-motr that referenced this issue Oct 6, 2021
andriytk pushed a commit to andriytk/cortx-motr that referenced this issue Oct 6, 2021
…gate#1092)

m0_be_log_discard_flush() may finish waiting after io is done, but
before lds_discard_ast() is posted. It would lead to
be_log_discard_wait() to not to wait for the ast to be finished, which
leads to an assertion failure [1]. In this case traces look like [2] and
stack backtrace looks like [3].

This patch moves flush op to DONE at the end of the ast, which makes
flush() wait until the ast is actually executed.

The bug was discovered during EOS-25005 patch testing.

[1]

motr[24531]:  a3b0  FATAL  [lib/assert.c:50:m0_panic]  panic: (!ld->lds_sync_in_progress) at m0_be_log_discard_fini() (be/log_discard.c:143)  [git: 2.0.0-307-72-g0e307f7b-dirty] /w/var/m0ut/m0trace.24531
Motr panic: (!ld->lds_sync_in_progress) at m0_be_log_discard_fini() be/log_discard.c:143 (errno: 4) (last failed: none) [git: 2.0.0-307-72-g0e307f7b-dirty] pid: 24531  /w/var/m0ut/m0trace.24531

[2]

92998540  157294.289.381909  7f86d97f98e0  DEBUG   be/log_discard.c:288:be_log_discard_sync_done_cb    ld=0x7f86f709d8f8
92998541  157294.289.382245  7f86d97f98b0  CALL    be/op.c:154:m0_be_op_reset                          > op=0x7f86f709da40
...
92998549  157294.289.431085  7f86d97f9c40  DEBUG   sm/sm.c:158:m0_sm_asts_run                          grp=0x22d96b0 forkq_pop: 0x7f86f709de10 <- 0x7f86f6049da0
92998550  157294.289.432156  7f86d97f9c00  CALL    be/log_discard.c:256:be_log_discard_ast             > ld=0x7f86f709d8f8
92998551  157294.289.433624  7ffd2e9e10c0  FATAL   lib/assert.c:50:m0_panic                            panic: (!ld->lds_sync_in_progress) at m0_be_log_discard_fini() (be/log_discard.c:143)  [git: 2.0.0-307-37-gd35074f6-dirty] /w/var/m0ut/m0trace.22409
92998552  157294.289.435003  7f86d97f9ba0  CALL    be/log_discard.c:229:be_log_discard_item_discard    > ld=0x7f86f709d8f8 ldi=0x7f862039bf40
92998553  157294.289.436948  7f86d97f9b30  DEBUG   be/tx_group_format.c:549:m0_be_group_format_discard   log=0x7f86f709a838 ldi=0x7f862039bf40 ext=[0, 7000)
92998554  157294.289.439155  7f86d97f9ad0  CALL    be/log.c:526:m0_be_log_record_discard               > log=(lg_current=118784 lg_discarded=0 lg_free=134082560) size=28672
...
92998588  157294.289.570768  7f86d97f9ba0  CALL    be/log_discard.c:229:be_log_discard_item_discard    > ld=0x7f86f709d8f8 ldi=0x7f862039bb80
92998589  157294.289.571102  7f86d97f9b30  DEBUG   be/tx_group_format.c:549:m0_be_group_format_discard   log=0x7f86f709a838 ldi=0x7f862039bb80 ext=[1b000, 1d000)
92998590  157294.289.571601  7f86d97f9ad0  CALL    be/log.c:526:m0_be_log_record_discard               > log=(lg_current=118784 lg_discarded=110592 lg_free=134193152) size=8192
92998591  157294.289.572156  7f86d97f9ad0  DEBUG   be/log.c:545:m0_be_log_record_discard               log=(lg_current=118784 lg_discarded=118784 lg_free=134201344) size=8192
92998592  157294.289.572524  7f86d97f9a90  CALL    be/engine.c:865:m0_be_engine_got_log_space_cb       > en=0x7f86f709a6c8 log=0x7f86f709a838
92998593  157294.289.573643  7f86d97f9b70  DEBUG   be/log_discard.c:451:m0_be_log_discard_item_put     ld=0x7f86f709d8f8 ldi=0x7f862039bb80

[3]

  Thread 1 (Thread 0x7f8701248840 (LWP 22409)):
  #0  0x00007f86f310f277 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
  #1  0x00007f86f3110968 in __GI_abort () at abort.c:90
  #2  0x00007f86f50ffe1e in m0_arch_panic (c=0x7f86f55d4840 <__pctx.7291>, ap=0x7ffd2e9e1168) at lib/user_space/uassert.c:131
  #3  0x00007f86f50e0a78 in m0_panic (ctx=0x7f86f55d4840 <__pctx.7291>) at lib/assert.c:52
  #4  0x00007f86f4fab1a2 in m0_be_log_discard_fini (ld=0x7f86f709d8f8 <be+17944>) at be/log_discard.c:143
  #5  0x00007f86f4f96403 in be_domain_level_leave (module=0x7f86f70992e0 <be>) at be/domain.c:906
  #6  0x00007f86f5171e09 in m0_module_fini (module=0x7f86f70992e0 <be>, level=-1) at module/module.c:154
  #7  0x00007f86f67c61f3 in m0_be_ut_backend_fini (ut_be=0x7f86f70992e0 <be>) at be/ut/helper.c:448
  #8  0x00007f86f6816645 in iter_ut_fini (svc=0x7f857252b168) at dix/cm/ut/iter_ut.c:708
  #9  0x00007f86f6816a39 in empty_cctg () at dix/cm/ut/iter_ut.c:786
  #10 0x00007f86f69a3736 in run_test (test=0x64a708 <dix_cm_iter_ut+4552>, max_name_len=43) at ut/ut.c:390
  #11 0x00007f86f69a3a28 in run_suite (suite=0x649540 <dix_cm_iter_ut>, max_name_len=43) at ut/ut.c:459
  #12 0x00007f86f69a3ccc in tests_run_all (m=0x7f870102f4a0 <ut.12684>) at ut/ut.c:513
  #13 0x00007f86f69a3d2a in m0_ut_run () at ut/ut.c:539
  #14 0x0000000000404a42 in main (argc=3, argv=0x7ffd2e9e1d58) at ut/m0ut.c:525

Signed-off-by: Maksym Medvied <maksym.medvied@seagate.com>
mehjoshi pushed a commit that referenced this issue Oct 8, 2021
It was possible for the op passed to m0_be_io_sched_add() to become
DONE before the sched lock was taken in be_io_sched_cb(), because the
parent op could become DONE before the child op's M0_BTS_GC callback
was called. This situation led to a failed invariant and assertion
failures [1], or to spinning in an infinite loop in the m0_list
invariant due to a broken list [2].

The patch fixes the issue by moving the op (passed to
m0_be_io_sched_add()) to DONE only after processing of the
corresponding m0_be_io has finished.

The bug was discovered during EOS-25005 patch testing.

[1]

be-ut
  io_sched  motr[15961]:  e4b0  FATAL  [lib/assert.c:50:m0_panic]  panic: (io->bio_ext.e_start == sched->bis_pos) at be_io_sched_cb() (be/io_sched.c:132)  [git: 2.0.0-307-37-gd35074f6-dirty] /w/var/m0ut/m0trace.15961
Motr panic: (io->bio_ext.e_start == sched->bis_pos) at be_io_sched_cb() be/io_sched.c:132 (errno: 0) (last failed: none) [git: 2.0.0-307-37-gd35074f6-dirty] pid: 15961  /w/var/m0ut/m0trace.15961
...

[2]

  (gdb) bt
  #0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  #1  0x00007f432af53e1b in _L_lock_812 () from /lib64/libpthread.so.0
  #2  0x00007f432af53ce8 in __GI___pthread_mutex_lock (mutex=0x7f432d714a80 <be_ut_io_sched_scheduler+32>) at ../nptl/pthread_mutex_lock.c:79
  #3  0x00007f432b8c3520 in m0_arch_mutex_lock (mutex=0x7f432d714a80 <be_ut_io_sched_scheduler+32>) at lib/user_space/umutex.c:56
  #4  0x00007f432b8aee6f in m0_mutex_lock (mutex=0x7f432d714a80 <be_ut_io_sched_scheduler+32>) at lib/mutex.c:55
  #5  0x00007f432b76673b in m0_be_io_sched_lock (sched=0x7f432d714a60 <be_ut_io_sched_scheduler>) at be/io_sched.c:72
  #6  0x00007f432b766a75 in be_io_sched_cb (op=0x7f4288000998, param=0x7f42880008c0) at be/io_sched.c:123
  #7  0x00007f432b771bb3 in be_op_done (op=0x7f4288000998) at be/op.c:257
  #8  0x00007f432b771cb6 in m0_be_op_done (op=0x7f4288000998) at be/op.c:276
  #9  0x00007f432b765715 in be_io_finished (bio=0x7f42880008c0) at be/io.c:555
  #10 0x00007f432b7658e1 in be_io_cb (link=0x7f4288001040) at be/io.c:587
  #11 0x00007f432b8a53a0 in clink_signal (clink=0x7f4288001040) at lib/chan.c:135
  #12 0x00007f432b8a54ff in chan_signal_nr (chan=0x7f4288000f38, nr=0) at lib/chan.c:154
  #13 0x00007f432b8a55d4 in m0_chan_broadcast (chan=0x7f4288000f38) at lib/chan.c:174
  #14 0x00007f432b8a55fa in m0_chan_broadcast_lock (chan=0x7f4288000f38) at lib/chan.c:181
  #15 0x00007f432ba03cfe in ioq_complete (ioq=0x1367bc0, qev=0x7f42886a3fe0, res=8, res2=0) at stob/ioq.c:587
  #16 0x00007f432ba041d1 in stob_ioq_thread (ioq=0x1367bc0) at stob/ioq.c:669
  #17 0x00007f432b8b01b8 in m0_thread_trampoline (arg=0x1367be0) at lib/thread.c:117
  #18 0x00007f432b8c37fc in uthread_trampoline (arg=0x1367be0) at lib/user_space/uthread.c:98
  #19 0x00007f432af51e25 in start_thread (arg=0x7f42ccff9700) at pthread_create.c:308
  #20 0x00007f4329999bad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

  (gdb) thread 2
  [Switching to thread 2 (Thread 0x7f428ffff700 (LWP 8730))]
  #0  0x00007f432b8b270b in m0_tlist_invariant (d=0x7f432bd94600 <sched_io_tl>, list=0x7f432d714a68 <be_ut_io_sched_scheduler+8>) at lib/tlist.c:253
  253                     if (!M0_CHECK_EX(m0_tlink_invariant(d, amb(d, scan))))
  (gdb) bt
  #0  0x00007f432b8b270b in m0_tlist_invariant (d=0x7f432bd94600 <sched_io_tl>, list=0x7f432d714a68 <be_ut_io_sched_scheduler+8>) at lib/tlist.c:253
  #1  0x00007f432b8b1f13 in m0_tlist_add (d=0x7f432bd94600 <sched_io_tl>, list=0x7f432d714a68 <be_ut_io_sched_scheduler+8>, obj=0x7f42880008c0) at lib/tlist.c:128
  #2  0x00007f432b7664c8 in sched_io_tlist_add (list=0x7f432d714a68 <be_ut_io_sched_scheduler+8>, amb=0x7f42880008c0) at be/io_sched.c:48
  #3  0x00007f432b766d88 in be_io_sched_insert (sched=0x7f432d714a60 <be_ut_io_sched_scheduler>, io=0x7f42880008c0) at be/io_sched.c:161
  #4  0x00007f432b767144 in m0_be_io_sched_add (sched=0x7f432d714a60 <be_ut_io_sched_scheduler>, io=0x7f42880008c0, ext=0x7f428fffed00, op=0x7f4288000c00) at be/io_sched.c:193
  #5  0x00007f432cf8a008 in be_ut_io_sched_thread (param=0x136a2b0) at be/ut/io_sched.c:241
  #6  0x00007f432b8b01b8 in m0_thread_trampoline (arg=0x136bad0) at lib/thread.c:117
  #7  0x00007f432b8c37fc in uthread_trampoline (arg=0x136bad0) at lib/user_space/uthread.c:98
  #8  0x00007f432af51e25 in start_thread (arg=0x7f428ffff700) at pthread_create.c:308
  #9  0x00007f4329999bad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Signed-off-by: Maksym Medvied <maksym.medvied@seagate.com>
mehjoshi pushed a commit that referenced this issue Oct 8, 2021
madhavemuri pushed a commit that referenced this issue Nov 18, 2021
#1231)

To fix the following panic, the size of FOL_REC_MAXSIZE was increased:
#3 0x00007f901fbdcd24 in m0_panic (ctx=ctx@entry=0x7f9020060320 <__pctx.10675>) at lib/assert.c:52
#4 0x00007f901fbaa1c4 in fol_record_pack_size (rec=rec@entry=0x7f8fc842a6f8) at fol/fol.c:254
#5 0x00007f901fbaa979 in m0_fol_rec_encode (rec=rec@entry=0x7f8fc842a6f8, at=at@entry=0x7f8fc842a578) at fol/fol.c:322
#6 0x00007f901fb2f6aa in m0_be_tx_fol_add (tx=tx@entry=0x7f8fc842a3e0, rec=rec@entry=0x7f8fc842a6f8) at be/tx.c:635
#7 0x00007f901fb917f5 in m0_dtx_fol_add (tx=tx@entry=0x7f8fc842a3d8) at dtm/dtm.c:171
#8 0x00007f901fbaf224 in m0_fom_fol_rec_add (fom=fom@entry=0x7f8fc842a310) at fop/fom.c:1739
#9 0x00007f901fbafd10 in fom_fol_rec_add (fom=0x7f8fc842a310) at fop/fom_generic.c:413
#10 0x00007f901fbb0112 in m0_fom_tick_generic (fom=fom@entry=0x7f8fc842a310) at fop/fom_generic.c:860
#11 0x00007f901fb3e00f in cas_fom_tick (fom0=0x7f8fc842a310) at cas/service.c:1218
#12 0x00007f901fbae61b in fom_exec (fom=0x7f8fc842a310) at fop/fom.c:791
#13 loc_handler_thread (th=0x1a0cbb0) at fop/fom.c:931

Signed-off-by: Yeshpal Jain <yeshpal.jain@seagate.com>
madhavemuri pushed a commit that referenced this issue Nov 29, 2021
madhavemuri pushed a commit that referenced this issue Dec 6, 2021
madhavemuri pushed a commit that referenced this issue Dec 9, 2021
madhavemuri pushed a commit that referenced this issue Dec 10, 2021
madhavemuri pushed a commit that referenced this issue Dec 13, 2021
madhavemuri pushed a commit that referenced this issue Dec 13, 2021
madhavemuri pushed a commit that referenced this issue Dec 13, 2021
andriytk added a commit to andriytk/cortx-motr that referenced this issue Apr 27, 2022
All m0d-ios processes may crash if one of them crashes during write
i/o. For example, here is the scenario:

setup: 1 node, 5 ios processes configured with 2 disks each.

1) Start write:

    m0cp -l <local-endpoint> -H <hax-endpoint> -p <profile-fid> -P <process-fid> -s 32k -c 1024 -L 2 /dev/zero -o 0x12345678:0x678900202

2) After about 5 secs (to allow m0_client_init() to finish and start the i/o),
   kill -9 <PID> of one of the m0d-ios processes.

Panic msg: `((cr->tc_balance[cu]) != 0) at btree_save() (be/btree.c:1393)`

Stack:

    #3  0x00007f9454ccbe37 in m0_panic (ctx=ctx@entry=0x7f94550e15a0 <__pctx.8664>) at lib/assert.c:52
    #4  0x00007f9454c0747c in btree_save (tree=tree@entry=0x40000010ed80, tx=tx@entry=0x7f94200a7f90, op=op@entry=0x7f944ca77ec0,
        val=val@entry=0x7f944ca77e70, anchor=0x0, optype=BTREE_SAVE_UPDATE, zonemask=2, key=<optimized out>, key=<optimized out>)
        at be/btree.c:339
    #5  0x00007f9454c086f7 in m0_be_btree_update (tree=tree@entry=0x40000010ed80, tx=tx@entry=0x7f94200a7f90,
        op=op@entry=0x7f944ca77ec0, key=key@entry=0x7f944ca77e60, val=val@entry=0x7f944ca77e70) at be/btree.c:1952
    #6  0x00007f9454bfd62b in btree_update_sync (val=0x7f944ca77e70, key=0x7f944ca77e60, tx=0x7f94200a7f90, tree=0x40000010ed80)
        at balloc/balloc.c:95
    #7  balloc_gi_sync (cb=cb@entry=0x40000010eb40, tx=tx@entry=0x7f94200a7f90, gi=gi@entry=0x13ff860) at balloc/balloc.c:928
    #8  0x00007f9454bfe36e in balloc_free_db_update (motr=motr@entry=0x40000010eb40, tx=tx@entry=0x7f94200a7f90,
        grp=grp@entry=0x13ff860, tgt=tgt@entry=0x7f944ca78470, alloc_flag=<optimized out>) at balloc/balloc.c:1934
    #9  0x00007f9454bff9c6 in balloc_free_internal (req=<synthetic pointer>, req=<synthetic pointer>, tx=0x7f94200a7f90,
        ctx=0x40000010eb40) at balloc/balloc.c:2716
    #10 balloc_free (ballroom=0x40000010ec68, tx=0x7f94200a7f88, ext=0x7f944ca78560) at balloc/balloc.c:2929
    #11 0x00007f9454d97681 in stob_ad_bfree (adom=<optimized out>, adom=<optimized out>, ext=0x7f944ca78530, ext=0x7f944ca78530,
        tx=0x7f94200a7f88) at stob/ad.c:1098
    #12 stob_ad_seg_free (tx=0x7f94200a7f88, adom=<optimized out>, ext=ext@entry=0x7f944ca79160, val=1594, seg=<optimized out>)
        at stob/ad.c:1647
    #13 0x00007f9454d9783d in __lambda (seg=0x7f944ca79150) at stob/ad.c:1719
    #14 0x00007f9454c10802 in m0_be_emap_paste (it=it@entry=0x7f944ca79140, tx=0x7f94200a7f90, ext=ext@entry=0x7f944ca78a90,
        val=1794, del=del@entry=0x7f944ca78b1c, cut_left=cut_left@entry=0x7f944ca78b38, cut_right=0x7f944ca78b54)
        at be/extmap.c:628
    #15 0x00007f9454d9a546 in stob_ad_write_map_ext (orig=<optimized out>, off=464, adom=0x4000001120d8) at stob/ad.c:1731
    #16 stob_ad_write_map (map=0x7f944ca78900, frags=18, wc=0x7f944ca78920, dst=0x7f944ca789b0, adom=0x4000001120d8,
        io=0x7f94340b4298) at stob/ad.c:1858
    #17 stob_ad_write_prepare (map=0x7f944ca78900, src=0x7f944ca78970, adom=0x4000001120d8, io=<optimized out>) at stob/ad.c:2006
    #18 stob_ad_io_launch_prepare (io=<optimized out>) at stob/ad.c:2052
    #19 0x00007f9454d9ca47 in m0_stob_io_prepare (io=io@entry=0x7f94340b4298, obj=obj@entry=0x7f94341170a0,
        tx=tx@entry=0x7f94200a7f88, scope=scope@entry=0x0) at stob/io.c:178
    #20 0x00007f9454d9ce92 in m0_stob_io_prepare_and_launch (io=io@entry=0x7f94340b4298, obj=0x7f94341170a0,
        tx=tx@entry=0x7f94200a7f88, scope=scope@entry=0x0) at stob/io.c:226
    #21 0x00007f9454cb702c in io_launch (fom=0x7f94200a7ec0) at ioservice/io_foms.c:1837
    #22 0x00007f9454cb47a0 in m0_io_fom_cob_rw_tick (fom=0x7f94200a7ec0) at ioservice/io_foms.c:2333
    #23 0x00007f9454c9edf1 in fom_exec (fom=0x7f94200a7ec0) at fop/fom.c:791
    #24 loc_handler_thread (th=0x11ed150) at fop/fom.c:931

RCA: a regression in the BE credit calculation in the
stob/ad.c:stob_ad_write_credit() code was introduced by
commit ab22d23.

Solution: roll back the change in stob/ad.c:stob_ad_write_credit()
introduced by commit ab22d23.

Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
andriytk added a commit to andriytk/cortx-motr that referenced this issue Apr 27, 2022
Panic: ((cr->tc_balance[cu]) != 0) at btree_save() (be/btree.c:1393)

Stack:

    #3  0x00007f9454ccbe37 in m0_panic (ctx=ctx@entry=0x7f94550e15a0 <__pctx.8664>) at lib/assert.c:52
    #4  0x00007f9454c0747c in btree_save (tree=tree@entry=0x40000010ed80, tx=tx@entry=0x7f94200a7f90, op=op@entry=0x7f944ca77ec0,
        val=val@entry=0x7f944ca77e70, anchor=0x0, optype=BTREE_SAVE_UPDATE, zonemask=2, key=<optimized out>, key=<optimized out>)
        at be/btree.c:339
    #5  0x00007f9454c086f7 in m0_be_btree_update (tree=tree@entry=0x40000010ed80, tx=tx@entry=0x7f94200a7f90,
        op=op@entry=0x7f944ca77ec0, key=key@entry=0x7f944ca77e60, val=val@entry=0x7f944ca77e70) at be/btree.c:1952
    #6  0x00007f9454bfd62b in btree_update_sync (val=0x7f944ca77e70, key=0x7f944ca77e60, tx=0x7f94200a7f90, tree=0x40000010ed80)
        at balloc/balloc.c:95
    #7  balloc_gi_sync (cb=cb@entry=0x40000010eb40, tx=tx@entry=0x7f94200a7f90, gi=gi@entry=0x13ff860) at balloc/balloc.c:928
    #8  0x00007f9454bfe36e in balloc_free_db_update (motr=motr@entry=0x40000010eb40, tx=tx@entry=0x7f94200a7f90,
        grp=grp@entry=0x13ff860, tgt=tgt@entry=0x7f944ca78470, alloc_flag=<optimized out>) at balloc/balloc.c:1934
    #9  0x00007f9454bff9c6 in balloc_free_internal (req=<synthetic pointer>, req=<synthetic pointer>, tx=0x7f94200a7f90,
        ctx=0x40000010eb40) at balloc/balloc.c:2716
    #10 balloc_free (ballroom=0x40000010ec68, tx=0x7f94200a7f88, ext=0x7f944ca78560) at balloc/balloc.c:2929
    #11 0x00007f9454d97681 in stob_ad_bfree (adom=<optimized out>, adom=<optimized out>, ext=0x7f944ca78530, ext=0x7f944ca78530,
        tx=0x7f94200a7f88) at stob/ad.c:1098
    #12 stob_ad_seg_free (tx=0x7f94200a7f88, adom=<optimized out>, ext=ext@entry=0x7f944ca79160, val=1594, seg=<optimized out>)
        at stob/ad.c:1647
    #13 0x00007f9454d9783d in __lambda (seg=0x7f944ca79150) at stob/ad.c:1719
    #14 0x00007f9454c10802 in m0_be_emap_paste (it=it@entry=0x7f944ca79140, tx=0x7f94200a7f90, ext=ext@entry=0x7f944ca78a90,
        val=1794, del=del@entry=0x7f944ca78b1c, cut_left=cut_left@entry=0x7f944ca78b38, cut_right=0x7f944ca78b54)
        at be/extmap.c:628
    #15 0x00007f9454d9a546 in stob_ad_write_map_ext (orig=<optimized out>, off=464, adom=0x4000001120d8) at stob/ad.c:1731
    #16 stob_ad_write_map (map=0x7f944ca78900, frags=18, wc=0x7f944ca78920, dst=0x7f944ca789b0, adom=0x4000001120d8,
        io=0x7f94340b4298) at stob/ad.c:1858
    #17 stob_ad_write_prepare (map=0x7f944ca78900, src=0x7f944ca78970, adom=0x4000001120d8, io=<optimized out>) at stob/ad.c:2006
    #18 stob_ad_io_launch_prepare (io=<optimized out>) at stob/ad.c:2052
    #19 0x00007f9454d9ca47 in m0_stob_io_prepare (io=io@entry=0x7f94340b4298, obj=obj@entry=0x7f94341170a0,
        tx=tx@entry=0x7f94200a7f88, scope=scope@entry=0x0) at stob/io.c:178
    #20 0x00007f9454d9ce92 in m0_stob_io_prepare_and_launch (io=io@entry=0x7f94340b4298, obj=0x7f94341170a0,
        tx=tx@entry=0x7f94200a7f88, scope=scope@entry=0x0) at stob/io.c:226
    #21 0x00007f9454cb702c in io_launch (fom=0x7f94200a7ec0) at ioservice/io_foms.c:1837
    #22 0x00007f9454cb47a0 in m0_io_fom_cob_rw_tick (fom=0x7f94200a7ec0) at ioservice/io_foms.c:2333
    #23 0x00007f9454c9edf1 in fom_exec (fom=0x7f94200a7ec0) at fop/fom.c:791
    #24 loc_handler_thread (th=0x11ed150) at fop/fom.c:931

Setup: 1-node, 5 ios processes configured with 2 disks each,
       4+2+0 EC data pool.

Scenario: write the same object twice like this:

    $ m0cp <motr-conn-params> -s 1m -c 40 -L 4 /dev/zero -o 0x12345678:0x678900207
    $ m0cp <motr-conn-params> -s 1m -c 40 -L 4 /dev/zero -o 0x12345678:0x678900207 -u

RCA: a regression in the BE credit calculation in the
stob_ad_write_credit() code was introduced by commit ab22d23.

Solution: roll back the regression change.

Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
andriytk added a commit to andriytk/cortx-motr that referenced this issue Apr 27, 2022
Panic: ((cr->tc_balance[cu]) != 0) at btree_save() (be/btree.c:1393)

Stack:

    #3  0x00007f9454ccbe37 in m0_panic (ctx=ctx@entry=0x7f94550e15a0 <__pctx.8664>) at lib/assert.c:52
    #4  0x00007f9454c0747c in btree_save (tree=tree@entry=0x40000010ed80, tx=tx@entry=0x7f94200a7f90, op=op@entry=0x7f944ca77ec0,
        val=val@entry=0x7f944ca77e70, anchor=0x0, optype=BTREE_SAVE_UPDATE, zonemask=2, key=<optimized out>, key=<optimized out>)
        at be/btree.c:339
    #5  0x00007f9454c086f7 in m0_be_btree_update (tree=tree@entry=0x40000010ed80, tx=tx@entry=0x7f94200a7f90,
        op=op@entry=0x7f944ca77ec0, key=key@entry=0x7f944ca77e60, val=val@entry=0x7f944ca77e70) at be/btree.c:1952
    #6  0x00007f9454bfd62b in btree_update_sync (val=0x7f944ca77e70, key=0x7f944ca77e60, tx=0x7f94200a7f90, tree=0x40000010ed80)
        at balloc/balloc.c:95
    #7  balloc_gi_sync (cb=cb@entry=0x40000010eb40, tx=tx@entry=0x7f94200a7f90, gi=gi@entry=0x13ff860) at balloc/balloc.c:928
    #8  0x00007f9454bfe36e in balloc_free_db_update (motr=motr@entry=0x40000010eb40, tx=tx@entry=0x7f94200a7f90,
        grp=grp@entry=0x13ff860, tgt=tgt@entry=0x7f944ca78470, alloc_flag=<optimized out>) at balloc/balloc.c:1934
    #9  0x00007f9454bff9c6 in balloc_free_internal (req=<synthetic pointer>, req=<synthetic pointer>, tx=0x7f94200a7f90,
        ctx=0x40000010eb40) at balloc/balloc.c:2716
    #10 balloc_free (ballroom=0x40000010ec68, tx=0x7f94200a7f88, ext=0x7f944ca78560) at balloc/balloc.c:2929
    #11 0x00007f9454d97681 in stob_ad_bfree (adom=<optimized out>, adom=<optimized out>, ext=0x7f944ca78530, ext=0x7f944ca78530,
        tx=0x7f94200a7f88) at stob/ad.c:1098
    #12 stob_ad_seg_free (tx=0x7f94200a7f88, adom=<optimized out>, ext=ext@entry=0x7f944ca79160, val=1594, seg=<optimized out>)
        at stob/ad.c:1647
    #13 0x00007f9454d9783d in __lambda (seg=0x7f944ca79150) at stob/ad.c:1719
    #14 0x00007f9454c10802 in m0_be_emap_paste (it=it@entry=0x7f944ca79140, tx=0x7f94200a7f90, ext=ext@entry=0x7f944ca78a90,
        val=1794, del=del@entry=0x7f944ca78b1c, cut_left=cut_left@entry=0x7f944ca78b38, cut_right=0x7f944ca78b54)
        at be/extmap.c:628
    #15 0x00007f9454d9a546 in stob_ad_write_map_ext (orig=<optimized out>, off=464, adom=0x4000001120d8) at stob/ad.c:1731
    #16 stob_ad_write_map (map=0x7f944ca78900, frags=18, wc=0x7f944ca78920, dst=0x7f944ca789b0, adom=0x4000001120d8,
        io=0x7f94340b4298) at stob/ad.c:1858
    #17 stob_ad_write_prepare (map=0x7f944ca78900, src=0x7f944ca78970, adom=0x4000001120d8, io=<optimized out>) at stob/ad.c:2006
    #18 stob_ad_io_launch_prepare (io=<optimized out>) at stob/ad.c:2052
    #19 0x00007f9454d9ca47 in m0_stob_io_prepare (io=io@entry=0x7f94340b4298, obj=obj@entry=0x7f94341170a0,
        tx=tx@entry=0x7f94200a7f88, scope=scope@entry=0x0) at stob/io.c:178
    #20 0x00007f9454d9ce92 in m0_stob_io_prepare_and_launch (io=io@entry=0x7f94340b4298, obj=0x7f94341170a0,
        tx=tx@entry=0x7f94200a7f88, scope=scope@entry=0x0) at stob/io.c:226
    #21 0x00007f9454cb702c in io_launch (fom=0x7f94200a7ec0) at ioservice/io_foms.c:1837
    #22 0x00007f9454cb47a0 in m0_io_fom_cob_rw_tick (fom=0x7f94200a7ec0) at ioservice/io_foms.c:2333
    #23 0x00007f9454c9edf1 in fom_exec (fom=0x7f94200a7ec0) at fop/fom.c:791
    #24 loc_handler_thread (th=0x11ed150) at fop/fom.c:931

Setup: 1-node, 4+2+0 EC data pool with 10 disks.
Scenario: write the same object twice like this:

    $ m0cp <motr-conn-params> -s 1m -c 40 -L 4 /dev/zero -o 0x12345678:0x678900207
    $ m0cp <motr-conn-params> -s 1m -c 40 -L 4 /dev/zero -o 0x12345678:0x678900207 -u

RCA: a regression in the BE credit calculation in stob_ad_write_credit()
was introduced at commit ab22d23.

Solution: roll back the regression change.

Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
andriytk added a commit to andriytk/cortx-motr that referenced this issue Apr 27, 2022
Panic: ((cr->tc_balance[cu]) != 0) at btree_save() (be/btree.c:1393)

Stack:

    #3  m0_panic() at lib/assert.c:52
    #4  btree_save() at be/btree.c:339
    #5  m0_be_btree_update() at be/btree.c:1952
    #6  btree_update_sync() at balloc/balloc.c:95
    #7  balloc_gi_sync() at balloc/balloc.c:928
    #8  balloc_free_db_update() at balloc/balloc.c:1934
    #9  balloc_free_internal() at balloc/balloc.c:2716
    #10 balloc_free() at balloc/balloc.c:2929
    #11 stob_ad_bfree() at stob/ad.c:1098
    #12 stob_ad_seg_free (val=1594) at stob/ad.c:1647
    #13 __lambda() at stob/ad.c:1719
    #14 m0_be_emap_paste(val=1794) at be/extmap.c:628
    #15 stob_ad_write_map_ext(off=464) at stob/ad.c:1731
    #16 stob_ad_write_map(frags=18) at stob/ad.c:1858
    #17 stob_ad_write_prepare() at stob/ad.c:2006
    #18 stob_ad_io_launch_prepare() at stob/ad.c:2052
    #19 m0_stob_io_prepare() at stob/io.c:178
    #20 m0_stob_io_prepare_and_launch() at stob/io.c:226
    #21 io_launch() at ioservice/io_foms.c:1837
    #22 m0_io_fom_cob_rw_tick() at ioservice/io_foms.c:2333
    #23 fom_exec() at fop/fom.c:791
    #24 loc_handler_thread() at fop/fom.c:931

Setup: 1 node, 4+2+0 EC data pool with 10 disks.
Scenario: write the same object twice like this:

    $ m0cp <motr-conn-params> -s 1m -c 40 -L 4 /dev/zero -o 0x12345678:0x678900207
    $ m0cp <motr-conn-params> -s 1m -c 40 -L 4 /dev/zero -o 0x12345678:0x678900207 -u

RCA: regression of BE credit calculation in stob_ad_write_credit()
code was introduced at commit ab22d23.

Solution: rollback the regression change.

Co-authored-by: Madhavrao Vemuri <madhav.vemuri@seagate.com>
Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
madhavemuri pushed a commit that referenced this issue May 4, 2022
Problem: the following panic occurs when an update operation is performed:
Panic: ((cr->tc_balance[cu]) != 0) at btree_save() (be/btree.c:1393)

Stack:

    #3  m0_panic() at lib/assert.c:52
    #4  btree_save() at be/btree.c:339
    #5  m0_be_btree_update() at be/btree.c:1952
    #6  btree_update_sync() at balloc/balloc.c:95
    #7  balloc_gi_sync() at balloc/balloc.c:928
    #8  balloc_free_db_update() at balloc/balloc.c:1934
    #9  balloc_free_internal() at balloc/balloc.c:2716
    #10 balloc_free() at balloc/balloc.c:2929
    #11 stob_ad_bfree() at stob/ad.c:1098
    #12 stob_ad_seg_free (val=1594) at stob/ad.c:1647
    #13 __lambda() at stob/ad.c:1719
    #14 m0_be_emap_paste(val=1794) at be/extmap.c:628
    #15 stob_ad_write_map_ext(off=464) at stob/ad.c:1731
    #16 stob_ad_write_map(frags=18) at stob/ad.c:1858
    #17 stob_ad_write_prepare() at stob/ad.c:2006
    #18 stob_ad_io_launch_prepare() at stob/ad.c:2052
    #19 m0_stob_io_prepare() at stob/io.c:178
    #20 m0_stob_io_prepare_and_launch() at stob/io.c:226
    #21 io_launch() at ioservice/io_foms.c:1837
    #22 m0_io_fom_cob_rw_tick() at ioservice/io_foms.c:2333
    #23 fom_exec() at fop/fom.c:791
    #24 loc_handler_thread() at fop/fom.c:931

Setup: 1 node, 4+2+0 EC data pool with 10 disks.
Scenario: write the same object twice like this:

    $ m0cp <motr-conn-params> -s 1m -c 40 -L 4 /dev/zero -o 0x12345678:0x678900207
    $ m0cp <motr-conn-params> -s 1m -c 40 -L 4 /dev/zero -o 0x12345678:0x678900207 -u

RCA: regression of BE credit calculation in stob_ad_write_credit()
code was introduced at commit ab22d23.

Solution: rollback the regression change.

Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
andriytk added a commit to andriytk/cortx-motr that referenced this issue May 10, 2022
Panic: ((cr->tc_balance[cu]) != 0) at btree_save() (be/btree.c:1393)

mehjoshi pushed a commit that referenced this issue Jun 7, 2022
… gc callback of (#1820)

Problem: In m0_be_op_fini, when bos_tlink_fini is performed, it is expected that bo_set_link does not hold a link into the parent's m0_be_op::bo_children.

State seen at the time of crash:
Two gft_pd_io are in progress, with the corresponding two bio in the sched queue; the crash is hit while performing the gc callback processing for the gft whose gft_pd_io is in progress and whose bio is queued behind an active io.

Panic:
2022-04-24 11:19:15,672 - motr[00107]: e2e0 FATAL [lib/assert.c:50:m0_panic] panic: (!m0_list_link_is_in(link)) at m0_list_link_fini() (lib/list.c:178) [git: 2.0.0-670-27-g0012fe90] /etc/cortx/log/motr/0696b1d9e4744c59a92cb2bdded112ac/trace/m0d-0x7200000000000001:0x2e/m0trace.107
2022-04-24 11:19:15,672 - Motr panic: (!m0_list_link_is_in(link)) at m0_list_link_fini() lib/list.c:178 (errno: 0) (last failed: none) [git: 2.0.0-670-27-g0012fe90] pid: 107 /etc/cortx/log/motr/0696b1d9e4744c59a92cb2bdded112ac/trace/m0d-0x7200000000000001:0x2e/m0trace.107
2022-04-24 11:19:15,706 - /lib64/libmotr.so.2(m0_arch_backtrace+0x33)[0x7f7514e79c83]
2022-04-24 11:19:15,706 - /lib64/libmotr.so.2(m0_arch_panic+0xe9)[0x7f7514e79e59]
2022-04-24 11:19:15,706 - /lib64/libmotr.so.2(m0_panic+0x13d)[0x7f7514e6890d]
2022-04-24 11:19:15,706 - /lib64/libmotr.so.2(+0x3895f6)[0x7f7514e6c5f6]
2022-04-24 11:19:15,706 - /lib64/libmotr.so.2(m0_be_op_fini+0x1f)[0x7f7514dae66f]
2022-04-24 11:19:15,706 - /lib64/libmotr.so.2(+0x2cb826)[0x7f7514dae826]
2022-04-24 11:19:15,707 - /lib64/libmotr.so.2(+0x2c4c5b)[0x7f7514da7c5b]
2022-04-24 11:19:15,707 - /lib64/libmotr.so.2(+0x2cb826)[0x7f7514dae826]
2022-04-24 11:19:15,707 - /lib64/libmotr.so.2(+0x2c300a)[0x7f7514da600a]
2022-04-24 11:19:15,707 - /lib64/libmotr.so.2(+0x2c3119)[0x7f7514da6119]
2022-04-24 11:19:15,707 - /lib64/libmotr.so.2(+0x386f7f)[0x7f7514e69f7f]
2022-04-24 11:19:15,707 - /lib64/libmotr.so.2(+0x386ffa)[0x7f7514e69ffa]
2022-04-24 11:19:15,707 - /lib64/libmotr.so.2(m0_chan_broadcast_lock+0x1d)[0x7f7514e6a08d]

Backtrace:
(gdb) bt
#0 0x00007f7512d8938f in raise () from /lib64/libc.so.6
#1 0x00007f7512d73dc5 in abort () from /lib64/libc.so.6
#2 0x00007f7514e79e63 in m0_arch_panic (c=c@entry=0x7f751531ade0 <__pctx.4611>, ap=ap@entry=0x7f74afffe390)
at lib/user_space/uassert.c:131
#3 0x00007f7514e6890d in m0_panic (ctx=ctx@entry=0x7f751531ade0 <__pctx.4611>) at lib/assert.c:52
#4 0x00007f7514e6c5f6 in m0_list_link_fini (link=<optimized out>) at lib/list.c:178
#5 0x00007f7514e70310 in m0_tlink_fini (d=d@entry=0x7f75152880a0 <bos_tl>, obj=obj@entry=0x56523e641a90) at lib/tlist.c:283
#6 0x00007f7514dae66f in bos_tlink_fini (amb=0x56523e641a90) at be/op.c:109
#7 m0_be_op_fini (op=0x56523e641a90) at be/op.c:109
#8 0x00007f7514dae826 in be_op_state_change (op=<optimized out>, state=state@entry=M0_BOS_DONE) at be/op.c:213
#9 0x00007f7514daea17 in m0_be_op_done (op=<optimized out>) at be/op.c:231
#10 0x00007f7514da7c5b in be_io_sched_cb (op=op@entry=0x56523e5f7870, param=param@entry=0x56523e5f7798) at be/io_sched.c:141
#11 0x00007f7514dae826 in be_op_state_change (op=op@entry=0x56523e5f7870, state=state@entry=M0_BOS_DONE) at be/op.c:213
#12 0x00007f7514daea17 in m0_be_op_done (op=op@entry=0x56523e5f7870) at be/op.c:231
#13 0x00007f7514da600a in be_io_finished (bio=bio@entry=0x56523e5f7798) at be/io.c:555
#14 0x00007f7514da6119 in be_io_cb (link=0x56523e61ac60) at be/io.c:587
#15 0x00007f7514e69f7f in clink_signal (clink=clink@entry=0x56523e61ac60) at lib/chan.c:135
#16 0x00007f7514e69ffa in chan_signal_nr (chan=chan@entry=0x56523e61ab58, nr=0) at lib/chan.c:154
#17 0x00007f7514e6a06c in m0_chan_broadcast (chan=chan@entry=0x56523e61ab58) at lib/chan.c:174
#18 0x00007f7514e6a08d in m0_chan_broadcast_lock (chan=chan@entry=0x56523e61ab58) at lib/chan.c:181
#19 0x00007f7514f4209a in ioq_complete (res2=<optimized out>, res=<optimized out>, qev=<optimized out>, ioq=0x56523e5de610)
at stob/ioq.c:587
#20 stob_ioq_thread (ioq=0x56523e5de610) at stob/ioq.c:669
#21 0x00007f7514e6f49e in m0_thread_trampoline (arg=arg@entry=0x56523e5de6e8) at lib/thread.c:117
#22 0x00007f7514e7ab11 in uthread_trampoline (arg=0x56523e5de6e8) at lib/user_space/uthread.c:98
#23 0x00007f751454915a in start_thread () from /lib64/libpthread.so.0
#24 0x00007f7512e4edd3 in clone () from /lib64/libc.so.6

RCA - Sequence of Events:

be_tx_group_format_seg_io_op_gc invoked for gft_pd_io_op of tx_group_fom_1 (last_child is false)
(gdb) p &((struct m0_be_group_format *)cb_gc_param)->gft_pd_io_op
$29 = (struct m0_be_op *) 0x56523e641a90

be_tx_group_format_seg_io_op_gc handling of gft_pd_io_op invokes m0_be_op_done for gft_tmp_op (which has no callbacks), but last_child is now set to true for the parent, as both of its children (gft_tmp_op and gft_pd_io_op) have had their op-done invoked.

m0_be_op_done handling of gft_tmp_op invokes be_op_state_change with M0_BOS_DONE for parent(tgf_op)

During be_op_state_change processing for the main parent tgf_op, m0_sm_state_set updates the bo_sm state and unblocks tx_group_fom_1 by signalling op->bo_sm.sm_chan.
This recursive callback processing happens in the context of stob_ioq_thread, which is initialized on M0_STOB_IOQ_NR_THREADS.
Here, invoking the done processing of the gft_tmp_op (i.e. peer) child from within the gc processing of the gft_pd_io_op child results in the parent's callback being invoked early.

Parent callback processing:
6. This now unblocks tx_group_fom_1, which leads to m0_be_pd_io_put in m0_be_group_format_reset, and tx_group_fom_1 moves to TGS_OPEN.
So pd_io and tx_group_fom_1 are now ready for reuse.

Problem window:
7. The problem occurs in the window where the remaining gc callback processing of gft_pd_io_op,
i.e.
m0_be_op_fini(&gft->gft_tmp_op);
m0_be_op_fini(op);
is performed while pd_io and/or tx_group_fom_1 has already been reused with a new context.

Solution:
Removing gft_tmp_op altogether ensures that the parent's callback processing is never invoked ahead of its children's callback processing.
This way tx_group_fom is notified of seg I/O completion only after all the relevant child callback processing has completed, thereby
avoiding the crashes seen in the gc callback processing (be_tx_group_format_seg_io_op_gc) after m0_be_op_done(&gft->gft_tmp_op).
In the proposed solution the main parent op is made active at the start, at the same place where gft_tmp_op was being activated,
thereby making gft_tmp_op redundant and avoiding the out-of-order execution of child/parent callbacks.
RCA: Due to recursive calls to be_op_state_change, the gc callback of gft_op (child 1) invokes the done callback of gft_tmp_op (child 2), which subsequently invokes the parent's be_op_state_change. This causes the group fom to complete ahead of the child op callback processing, so the crash is observed when the group is reused before the child callback processing has finished.

Signed-off-by: Vidyadhar Pinglikar vidyadhar.pinglikar@seagate.com
t7ko-seagate added a commit to t7ko-seagate/cortx-motr that referenced this issue Jun 10, 2022
…ter merging domain/recovered, see #1744).

Problem: client-ut fails:

```
$ sudo ./utils/m0run -- m0ut -t client-ut
START Iteration: 1 out of 1
client-ut
  m0_client_init  motr[648725]:  6ba0  FATAL  [lib/assert.c:50:m0_panic]  panic: (confc->cc_root != ((void *)0)) at m0_confc_root_open() (conf/helpers.c:226)  [git: 2.0.0-794-16-g4bdd8326-dirty] /var/motr/m0ut/m0trace.648725.2022-06-10-03:33:52
Motr panic: (confc->cc_root != ((void *)0)) at m0_confc_root_open() conf/helpers.c:226 (errno: 0) (last failed: none) [git: 2.0.0-794-16-g4bdd8326-dirty] pid: 648725  /var/motr/m0ut/m0trace.648725.2022-06-10-03:33:52
```

```
(gdb) bt
#0  0x00007fffe9ba837f in raise () from /lib64/libc.so.6
#1  0x00007fffe9b92db5 in abort () from /lib64/libc.so.6
#2  0x00007fffebd8d2de in m0_arch_panic (c=0x7fffec2c12e0 <__pctx.14974>, ap=0x7fffffffdb18) at lib/user_space/uassert.c:131
#3  0x00007fffebd6d626 in m0_panic (ctx=0x7fffec2c12e0 <__pctx.14974>) at lib/assert.c:52
#4  0x00007fffebc91476 in m0_confc_root_open (confc=0x87f5f8, root=0x7fffffffdcc8) at conf/helpers.c:226
#5  0x00007fffebcefc2c in has_in_conf (reqh=0x87e6b8) at dtm0/domain.c:233
#6  0x00007fffebcefcdb in m0_dtm0_domain_is_recoverable (dod=0x887b88, reqh=0x87e6b8) at dtm0/domain.c:259
#7  0x00007fffed5b6baa in m0_client_init (m0c_p=0x7fffffffde20, conf=0x7ffff209be20 <default_config>, init_m0=false) at /root/cortx-motr/motr/client_init.c:1674
#8  0x00007fffed5ae4b3 in do_init (instance=0x7fffffffde20) at /root/cortx-motr/motr/ut/client.h:55
#9  0x00007fffed5b7931 in ut_test_m0_client_init () at motr/ut/client.c:255
#10 0x00007fffed6a917f in run_test (test=0x707908 <ut_suite+4488>, max_name_len=43) at ut/ut.c:390
#11 0x00007fffed6a9468 in run_suite (suite=0x706780 <ut_suite>, max_name_len=43) at ut/ut.c:459
#12 0x00007fffed6a9706 in tests_run_all (m=0x7ffff7dc0220 <ut>) at ut/ut.c:513
#13 0x00007fffed6a9764 in m0_ut_run () at ut/ut.c:539
#14 0x0000000000404b13 in main (argc=3, argv=0x7fffffffe598) at ut/m0ut.c:533
```

Solution: check that conf is initialized before accessing it.

Signed-off-by: Ivan Tishchenko <ivan.tishchenko@seagate.com>

mehjoshi pushed a commit that referenced this issue Jun 14, 2022
…ter merging domain/recovered, see #1744). (#1871)

Problem: client-ut fails:

```
$ sudo ./utils/m0run -- m0ut -t client-ut
START Iteration: 1 out of 1
client-ut
  m0_client_init  motr[648725]:  6ba0  FATAL  [lib/assert.c:50:m0_panic]  panic: (confc->cc_root != ((void *)0)) at m0_confc_root_open() (conf/helpers.c:226)  [git: 2.0.0-794-16-g4bdd8326-dirty] /var/motr/m0ut/m0trace.648725.2022-06-10-03:33:52
Motr panic: (confc->cc_root != ((void *)0)) at m0_confc_root_open() conf/helpers.c:226 (errno: 0) (last failed: none) [git: 2.0.0-794-16-g4bdd8326-dirty] pid: 648725  /var/motr/m0ut/m0trace.648725.2022-06-10-03:33:52
```

```
(gdb) bt
#0  0x00007fffe9ba837f in raise () from /lib64/libc.so.6
#1  0x00007fffe9b92db5 in abort () from /lib64/libc.so.6
#2  0x00007fffebd8d2de in m0_arch_panic (c=0x7fffec2c12e0 <__pctx.14974>, ap=0x7fffffffdb18) at lib/user_space/uassert.c:131
#3  0x00007fffebd6d626 in m0_panic (ctx=0x7fffec2c12e0 <__pctx.14974>) at lib/assert.c:52
#4  0x00007fffebc91476 in m0_confc_root_open (confc=0x87f5f8, root=0x7fffffffdcc8) at conf/helpers.c:226
#5  0x00007fffebcefc2c in has_in_conf (reqh=0x87e6b8) at dtm0/domain.c:233
#6  0x00007fffebcefcdb in m0_dtm0_domain_is_recoverable (dod=0x887b88, reqh=0x87e6b8) at dtm0/domain.c:259
#7  0x00007fffed5b6baa in m0_client_init (m0c_p=0x7fffffffde20, conf=0x7ffff209be20 <default_config>, init_m0=false) at /root/cortx-motr/motr/client_init.c:1674
#8  0x00007fffed5ae4b3 in do_init (instance=0x7fffffffde20) at /root/cortx-motr/motr/ut/client.h:55
#9  0x00007fffed5b7931 in ut_test_m0_client_init () at motr/ut/client.c:255
#10 0x00007fffed6a917f in run_test (test=0x707908 <ut_suite+4488>, max_name_len=43) at ut/ut.c:390
#11 0x00007fffed6a9468 in run_suite (suite=0x706780 <ut_suite>, max_name_len=43) at ut/ut.c:459
#12 0x00007fffed6a9706 in tests_run_all (m=0x7ffff7dc0220 <ut>) at ut/ut.c:513
#13 0x00007fffed6a9764 in m0_ut_run () at ut/ut.c:539
#14 0x0000000000404b13 in main (argc=3, argv=0x7fffffffe598) at ut/m0ut.c:533
```

Solution: check that conf is initialized before accessing it.

Signed-off-by: Ivan Tishchenko <ivan.tishchenko@seagate.com>
yatin-mahajan added a commit to yatin-mahajan/cortx-motr-1 that referenced this issue Jul 12, 2022
Problem:
  While a session was being cancelled, another thread tried to terminate
  the same session and moved the session state to M0_RPC_SESSION_TERMINATING,
  which hit the assert M0_POST(session->s_sm.sm_state == M0_RPC_SESSION_IDLE)
  in rpc_session_cancel().

  #5  in m0_arch_panic (c=c@entry=0x7f7276a91b40 <__pctx.14289>, ap=ap@entry=0x7f7268c05ce0) at lib/user_space/uassert.c:131
  #6  in m0_panic (ctx=ctx@entry=0x7f7276a91b40 <__pctx.14289>) at lib/assert.c:52
  #7  in m0_rpc_session_cancel (session=session@entry=0x56283c7c13d8) at rpc/session.c:863
  #8  in m0_rpc_conn_sessions_cancel (conn=conn@entry=0x56283c7c1030) at rpc/conn.c:1333
  #9  in rpc_conn__on_service_event_cb (clink=<optimized out>) at rpc/conn.c:1364
  #10 in clink_signal (clink=clink@entry=0x56283c7c12c0) at lib/chan.c:135
  #11 in chan_signal_nr (chan=chan@entry=0x56283c6a8768, nr=1) at lib/chan.c:154
  #12 in m0_chan_broadcast (chan=chan@entry=0x56283c6a8768) at lib/chan.c:174
  #13 in ha_state_accept (ignore_same_state=1, note=0x7f7268c06060, confc=0x56283816b028) at ha/note.c:18

Solution:
  It is possible that another thread is terminating the same session while it
  is being cancelled; only IDLE/BUSY sessions are allowed to be cancelled.
  Updated the pre-check to return from m0_rpc_session_cancel() instead of
  panicking/asserting, and replaced the M0_POST()/assert with a proper debug log.

Signed-off-by: Yatin Mahajan <yatin.mahajan@seagate.com>
mehjoshi pushed a commit to mehjoshi/cortx-motr that referenced this issue Jul 18, 2022
… gc callback of (#1820)

Problem : In m0_be_op_fini, when bos_tlink_fini is performed then its expected that bo_set_link should not have link for link for parent's m0_be_op::bo_children.

State seen at the time of crash:
Two gft_pd_io in progress state, with corresponding two bio in sched queue; crash is hit while performing the gc callback processing for gft whhose gft_pd_io is in progress state and bio is queued behind an active io.

Panic:
2022-04-24 11:19:15,672 - motr[00107]: e2e0 FATAL [lib/assert.c:50:m0_panic] panic: (!m0_list_link_is_in(link)) at m0_list_link_fini() (lib/list.c:178) [git: 2.0.0-670-27-g0012fe90] /etc/cortx/log/motr/0696b1d9e4744c59a92cb2bdded112ac/trace/m0d-0x7200000000000001:0x2e/m0trace.107
2022-04-24 11:19:15,672 - Motr panic: (!m0_list_link_is_in(link)) at m0_list_link_fini() lib/list.c:178 (errno: 0) (last failed: none) [git: 2.0.0-670-27-g0012fe90] pid: 107 /etc/cortx/log/motr/0696b1d9e4744c59a92cb2bdded112ac/trace/m0d-0x7200000000000001:0x2e/m0trace.107
2022-04-24 11:19:15,706 - /lib64/libmotr.so.2(m0_arch_backtrace+0x33)[0x7f7514e79c83]
2022-04-24 11:19:15,706 - /lib64/libmotr.so.2(m0_arch_panic+0xe9)[0x7f7514e79e59]
2022-04-24 11:19:15,706 - /lib64/libmotr.so.2(m0_panic+0x13d)[0x7f7514e6890d]
2022-04-24 11:19:15,706 - /lib64/libmotr.so.2(+0x3895f6)[0x7f7514e6c5f6]
2022-04-24 11:19:15,706 - /lib64/libmotr.so.2(m0_be_op_fini+0x1f)[0x7f7514dae66f]
2022-04-24 11:19:15,706 - /lib64/libmotr.so.2(+0x2cb826)[0x7f7514dae826]
2022-04-24 11:19:15,707 - /lib64/libmotr.so.2(+0x2c4c5b)[0x7f7514da7c5b]
2022-04-24 11:19:15,707 - /lib64/libmotr.so.2(+0x2cb826)[0x7f7514dae826]
2022-04-24 11:19:15,707 - /lib64/libmotr.so.2(+0x2c300a)[0x7f7514da600a]
2022-04-24 11:19:15,707 - /lib64/libmotr.so.2(+0x2c3119)[0x7f7514da6119]
2022-04-24 11:19:15,707 - /lib64/libmotr.so.2(+0x386f7f)[0x7f7514e69f7f]
2022-04-24 11:19:15,707 - /lib64/libmotr.so.2(+0x386ffa)[0x7f7514e69ffa]
2022-04-24 11:19:15,707 - /lib64/libmotr.so.2(m0_chan_broadcast_lock+0x1d)[0x7f7514e6a08d]

Backtrace:
(gdb) bt
#0 0x00007f7512d8938f in raise () from /lib64/libc.so.6
Seagate#1 0x00007f7512d73dc5 in abort () from /lib64/libc.so.6
Seagate#2 0x00007f7514e79e63 in m0_arch_panic (c=c@entry=0x7f751531ade0 <__pctx.4611>, ap=ap@entry=0x7f74afffe390)
at lib/user_space/uassert.c:131
Seagate#3 0x00007f7514e6890d in m0_panic (ctx=ctx@entry=0x7f751531ade0 <__pctx.4611>) at lib/assert.c:52
Seagate#4 0x00007f7514e6c5f6 in m0_list_link_fini (link=) at lib/list.c:178
Seagate#5 0x00007f7514e70310 in m0_tlink_fini (d=d@entry=0x7f75152880a0 <bos_tl>, obj=obj@entry=0x56523e641a90) at lib/tlist.c:283
Seagate#6 0x00007f7514dae66f in bos_tlink_fini (amb=0x56523e641a90) at be/op.c:109
Seagate#7 m0_be_op_fini (op=0x56523e641a90) at be/op.c:109
Seagate#8 0x00007f7514dae826 in be_op_state_change (op=, state=state@entry=M0_BOS_DONE) at be/op.c:213
Seagate#9 0x00007f7514daea17 in m0_be_op_done (op=) at be/op.c:231
Seagate#10 0x00007f7514da7c5b in be_io_sched_cb (op=op@entry=0x56523e5f7870, param=param@entry=0x56523e5f7798) at be/io_sched.c:141
Seagate#11 0x00007f7514dae826 in be_op_state_change (op=op@entry=0x56523e5f7870, state=state@entry=M0_BOS_DONE) at be/op.c:213
Seagate#12 0x00007f7514daea17 in m0_be_op_done (op=op@entry=0x56523e5f7870) at be/op.c:231
Seagate#13 0x00007f7514da600a in be_io_finished (bio=bio@entry=0x56523e5f7798) at be/io.c:555
Seagate#14 0x00007f7514da6119 in be_io_cb (link=0x56523e61ac60) at be/io.c:587
Seagate#15 0x00007f7514e69f7f in clink_signal (clink=clink@entry=0x56523e61ac60) at lib/chan.c:135
Seagate#16 0x00007f7514e69ffa in chan_signal_nr (chan=chan@entry=0x56523e61ab58, nr=0) at lib/chan.c:154
Seagate#17 0x00007f7514e6a06c in m0_chan_broadcast (chan=chan@entry=0x56523e61ab58) at lib/chan.c:174
Seagate#18 0x00007f7514e6a08d in m0_chan_broadcast_lock (chan=chan@entry=0x56523e61ab58) at lib/chan.c:181
Seagate#19 0x00007f7514f4209a in ioq_complete (res2=, res=, qev=, ioq=0x56523e5de610)
at stob/ioq.c:587
Seagate#20 stob_ioq_thread (ioq=0x56523e5de610) at stob/ioq.c:669
Seagate#21 0x00007f7514e6f49e in m0_thread_trampoline (arg=arg@entry=0x56523e5de6e8) at lib/thread.c:117
Seagate#22 0x00007f7514e7ab11 in uthread_trampoline (arg=0x56523e5de6e8) at lib/user_space/uthread.c:98
Seagate#23 0x00007f751454915a in start_thread () from /lib64/libpthread.so.0
Seagate#24 0x00007f7512e4edd3 in clone () from /lib64/libc.so.6

RCA - Sequence of Events:

be_tx_group_format_seg_io_op_gc invoked for gft_pd_io_op of tx_group_fom_1 (last_child is false)
(gdb) p &((struct m0_be_group_format *)cb_gc_param)->gft_pd_io_op
$29 = (struct m0_be_op *) 0x56523e641a90

be_tx_group_format_seg_io_op_gc handling of gft_pd_io_op invokes m0_be_op_done for gft_tmp_op (no callbacks for gft_tmp_op) but now last_child is set true for parent as its both child (gft_tmp_op and gft_pd_io_op) op dones have been invoked

m0_be_op_done handling of gft_tmp_op invokes be_op_state_change with M0_BOS_DONE for parent(tgf_op)

During be_op_state_change processing for main parent tgf_op, m0_sm_state_set will update bo_sm state and it will unblock the tx_group_fom_1 by triggering op->bo_sm.sm_chan
This recursive callback processing happens in context of stob_ioq_thread which is initialized on M0_STOB_IOQ_NR_THREADS.
Here due to invocation of gft_tmp_op (i.e peer) child done processing from gft_pd_io_op child gc processing results in their parent early callback invocation.

Parent Callback Prcoseeing:
6. This now unblocks tx_group_fom_1 which will lead to m0_be_pd_io_put in m0_be_group_format_reset and and tx_group_fom_1 will move to TGS_OPEN.
So pd_io and tx_group_fom_1 is now ready for reuse.

Problem window:
7. problem will now occur in window if remaining gc callback processing of gft_pd_io_op
i.e.
m0_be_op_fini(&gft->gft_tmp_op);
m0_be_op_fini(op);
is being done if the pd_io and/or tx_group_fom_1 is reused with new context.

Solution:
Removal of gft_tmp_op altogether will ensure that parent callback processing never invoked ahead of its child callback processing
This way tx_group_fom will always be notifed of seg io completion only after all the relevent child calbback processing is completed and thereby
will avoid the crashes seen in the gc callback processing(be_tx_group_format_seg_io_op_gc) after m0_be_op_done(&gft->gft_tmp_op);
In proposed solution main parent op is made active at the start at the same place where gft_tmp_op was being activated in order to put this parent
into active state; there by making gft_tmp_op redundent and avoiding the out of order execution of child/parent callback executions;
RCA: Due to recursive calls to be_op_state_change where gc callback of gft_op i.e. child1 invokes done callback of gft_tmp_op i.e. child 2 which subsequently results in invocation of parent be_op_state_change. This results in group fom getting completed ahead of child op callback processing. so the subsequently crash is observed when group is reused before child callback processing is finished.

Signed-off-by: Vidyadhar Pinglikar <vidyadhar.pinglikar@seagate.com>
mehjoshi pushed a commit to mehjoshi/cortx-motr that referenced this issue Jul 18, 2022
…ter merging domain/recovered, see Seagate#1744). (Seagate#1871)

Problem: client-ut fails:

```
$ sudo ./utils/m0run -- m0ut -t client-ut
START Iteration: 1 out of 1
client-ut
  m0_client_init  motr[648725]:  6ba0  FATAL  [lib/assert.c:50:m0_panic]  panic: (confc->cc_root != ((void *)0)) at m0_confc_root_open() (conf/helpers.c:226)  [git: 2.0.0-794-16-g4bdd8326-dirty] /var/motr/m0ut/m0trace.648725.2022-06-10-03:33:52
Motr panic: (confc->cc_root != ((void *)0)) at m0_confc_root_open() conf/helpers.c:226 (errno: 0) (last failed: none) [git: 2.0.0-794-16-g4bdd8326-dirty] pid: 648725  /var/motr/m0ut/m0trace.648725.2022-06-10-03:33:52
```

```
(gdb) bt
#0  0x00007fffe9ba837f in raise () from /lib64/libc.so.6
#1  0x00007fffe9b92db5 in abort () from /lib64/libc.so.6
#2  0x00007fffebd8d2de in m0_arch_panic (c=0x7fffec2c12e0 <__pctx.14974>, ap=0x7fffffffdb18) at lib/user_space/uassert.c:131
#3  0x00007fffebd6d626 in m0_panic (ctx=0x7fffec2c12e0 <__pctx.14974>) at lib/assert.c:52
#4  0x00007fffebc91476 in m0_confc_root_open (confc=0x87f5f8, root=0x7fffffffdcc8) at conf/helpers.c:226
#5  0x00007fffebcefc2c in has_in_conf (reqh=0x87e6b8) at dtm0/domain.c:233
#6  0x00007fffebcefcdb in m0_dtm0_domain_is_recoverable (dod=0x887b88, reqh=0x87e6b8) at dtm0/domain.c:259
#7  0x00007fffed5b6baa in m0_client_init (m0c_p=0x7fffffffde20, conf=0x7ffff209be20 <default_config>, init_m0=false) at /root/cortx-motr/motr/client_init.c:1674
#8  0x00007fffed5ae4b3 in do_init (instance=0x7fffffffde20) at /root/cortx-motr/motr/ut/client.h:55
#9  0x00007fffed5b7931 in ut_test_m0_client_init () at motr/ut/client.c:255
#10 0x00007fffed6a917f in run_test (test=0x707908 <ut_suite+4488>, max_name_len=43) at ut/ut.c:390
#11 0x00007fffed6a9468 in run_suite (suite=0x706780 <ut_suite>, max_name_len=43) at ut/ut.c:459
#12 0x00007fffed6a9706 in tests_run_all (m=0x7ffff7dc0220 <ut>) at ut/ut.c:513
#13 0x00007fffed6a9764 in m0_ut_run () at ut/ut.c:539
#14 0x0000000000404b13 in main (argc=3, argv=0x7fffffffe598) at ut/m0ut.c:533
```

Solution: check that conf is initialized before accessing it.

Signed-off-by: Ivan Tishchenko <ivan.tishchenko@seagate.com>
rkothiya pushed a commit that referenced this issue Jul 18, 2022
…ate (#1969)

Problem:
  While a session was being cancelled, another thread tried to terminate
  that session and moved the session state to M0_RPC_SESSION_TERMINATING, which
  led to hitting the assert M0_POST(session->s_sm.sm_state == M0_RPC_SESSION_IDLE)
  in rpc_session_cancel().

  #5  in m0_arch_panic (c=c@entry=0x7f7276a91b40 <__pctx.14289>, ap=ap@entry=0x7f7268c05ce0) at lib/user_space/uassert.c:131
  #6  in m0_panic (ctx=ctx@entry=0x7f7276a91b40 <__pctx.14289>) at lib/assert.c:52
  #7  in m0_rpc_session_cancel (session=session@entry=0x56283c7c13d8) at rpc/session.c:863
  #8  in m0_rpc_conn_sessions_cancel (conn=conn@entry=0x56283c7c1030) at rpc/conn.c:1333
  #9  in rpc_conn__on_service_event_cb (clink=<optimized out>) at rpc/conn.c:1364
  #10 in clink_signal (clink=clink@entry=0x56283c7c12c0) at lib/chan.c:135
  #11 in chan_signal_nr (chan=chan@entry=0x56283c6a8768, nr=1) at lib/chan.c:154
  #12 in m0_chan_broadcast (chan=chan@entry=0x56283c6a8768) at lib/chan.c:174
  #13 in ha_state_accept (ignore_same_state=1, note=0x7f7268c06060, confc=0x56283816b028) at ha/note.c:18

Solution:
  It is possible that some other thread is trying to terminate the same session
  while the session is being cancelled; only IDLE/BUSY sessions are allowed to be
  cancelled. Updated the pre-check to return from m0_rpc_session_cancel instead of
  panicking/asserting. Also replaced the M0_POST()/assert with a proper debug log.

Signed-off-by: Yatin Mahajan <yatin.mahajan@seagate.com>