Conversation
Signed-off-by: Madhavrao Vemuri <madhav.vemuri@seagate.com>
Force-pushed: 6825553 to 1eb42fe
Signed-off-by: Hua Huang <hua.huang@seagate.com>
Looks fine.
I can approve this.
Jenkins CI Result: PR-1#5 Motr Test Summary
CppCheck Summary: Cppcheck: No new warnings found 👍
Jenkins CI Result: PR-1#6 Motr Test Summary
CppCheck Summary: Cppcheck: No new warnings found 👍
Problem:
1) arg_fop was allocated on the stack in remote_invocation(), so the same fop was re-used on a further remote_invocation() call, which caused a hang in m0_tlist_invariant().
2) m0_fop_put() was missing for the fom in some sub-UTs of "isc-service-ut", which caused a panic in rpc_session_fini(): m0_fop_fini() was never called, and that is where the xid is deleted from session->xid_list.
```
#0  in raise () from /lib64/libc.so.6
#1  in abort () from /lib64/libc.so.6
#2  in m0_arch_panic () at lib/user_space/uassert.c:131
#3  in m0_panic () at lib/assert.c:52
#4  in m0_list_fini () at lib/list.c:38
#5  in m0_tlist_fini () at lib/tlist.c:56
#6  in xidl_tlist_fini () at rpc/item.c:85
#7  in m0_rpc_item_xid_list_fini () at rpc/item.c:1682
#8  in m0_rpc_session_fini_locked () at rpc/session.c:321
#9  in m0_rpc_session_fini () at rpc/session.c:304
#10 in m0_rpc_session_destroy () at rpc/session.c:567
```
Solution:
1) Use heap allocation for arg_fop instead of stack allocation.
2) Add m0_fop_put0_locked() at the missing places.
3) Fixing the "remote-comp-signature" UT revealed two other UT failures with a similar cause; fixed those as well.
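The two fixes follow one RPC-object lifetime pattern: allocate the fop on the heap per invocation, reference-count it, and drop the reference once the call is finished. A minimal self-contained sketch of that pattern, with illustrative stand-in types and helpers rather than the real m0_fop API:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for m0_fop: the actual fix replaced a stack-allocated
 * arg_fop in remote_invocation() with a heap allocation, so each call sends a
 * fresh fop instead of re-using one whose list links are still live. */
struct fop {
	int refcount;
	int payload;
};

/* Allocate a fresh fop per invocation (mirrors heap allocation + init). */
static struct fop *fop_alloc(int payload)
{
	struct fop *f = malloc(sizeof *f);

	if (f != NULL) {
		f->refcount = 1;
		f->payload  = payload;
	}
	return f;
}

/* Drop a reference; free on the last put (the role m0_fop_put0_locked()
 * plays in the real code, where fini also removes the xid from the
 * session's xid_list). */
static void fop_put(struct fop *f)
{
	if (f != NULL && --f->refcount == 0)
		free(f);
}

/* Each remote invocation gets its own fop, so no state leaks between calls. */
static int remote_invocation(int payload)
{
	struct fop *arg_fop = fop_alloc(payload);
	int         result;

	if (arg_fop == NULL)
		return -1;
	result = arg_fop->payload * 2;  /* stand-in for the RPC round-trip */
	fop_put(arg_fop);               /* balanced put: no leaked xid */
	return result;
}
```

With the fop on the stack, the second call would have re-initialised an object whose links were still threaded into lists from the first call; the per-call heap allocation avoids that entirely.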
Jenkins CI Result: TEST-MOTR-CODE-COVERAGE#13 Motr Test Summary
CppCheck Summary: Cppcheck: No new warnings found 👍
Jenkins CI Result: TEST-MOTR-CODE-COVERAGE#14 Motr Test Summary
CppCheck Summary: Cppcheck: No new warnings found 👍
lib: fix panic at m0_locality_chores_run() on NUMA nodes

On NUMA nodes with numad enabled, the CPU affinity of locality threads can be changed (especially for compute-intensive applications that use a lot of CPU and/or memory). As a result, this leads to the following panic:

```
Motr panic: (locality == m0_locality_here()) at m0_locality_chores_run() lib/locality.c:310 (errno: 0) (last failed: none) [git: sage-base-1.0-389-g09ff618]
#0 0x00002b25fbcb6387 in raise () from /usr/lib64/libc.so.6
#1 0x00002b25fbcb7a78 in abort () from /usr/lib64/libc.so.6
#2 0x00002b25fc5e7acd in m0_arch_panic (c=c@entry=0x2b25fca74400 <__pctx.15545>, ap=ap@entry=0x2b26032c7a18) at lib/user_space/uassert.c:131
#3 0x00002b25fc5d7e44 in m0_panic (ctx=ctx@entry=0x2b25fca74400 <__pctx.15545>) at lib/assert.c:52
#4 0x00002b25fc5dbf6a in m0_locality_chores_run (locality=locality@entry=0x106f778) at lib/locality.c:310
#5 0x00002b25fc5abcaa in loc_handler_thread (th=0xf5d160) at fop/fom.c:926
#6 0x00002b25fc5de2ee in m0_thread_trampoline (arg=arg@entry=0xf5d168) at lib/thread.c:117
#7 0x00002b25fc5e882d in uthread_trampoline (arg=0xf5d168) at lib/user_space/uthread.c:98
#8 0x00002b25fba6bea5 in start_thread () from /usr/lib64/libpthread.so.0
#9 0x00002b25fbd7e96d in clone () from /usr/lib64/libc.so.6
```

The reason is that we link our localities to the CPUs at the very beginning, when the application has just started and Motr is initialised, and we use the CPU id to figure out the current locality in m0_locality_here(). Of course, when the affinity is changed later and the thread is moved to another CPU, this gives the wrong locality result.

Solution: cache the locality pointer in TLS and return it from m0_locality_here(). Kudos to Nikita Danilov for the idea.

Reviewed-by: Nikita Danilov <nikita.danilov@seagate.com>
Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
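The fix can be sketched as follows, using illustrative stand-in types rather than the actual Motr code: the handler thread caches its locality pointer in C11 thread-local storage once at startup, while its affinity is still the initial one, and the lookup by CPU id in m0_locality_here() is replaced by a read of that cached value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the TLS-based fix. */
struct locality {
	int lo_idx;   /* index of this locality, for illustration only */
};

/* One slot per thread; survives migration of the thread between CPUs. */
static _Thread_local struct locality *locality_tls;

/* Called once by the handler thread when it starts, i.e. while it is still
 * running on the CPU it was originally bound to. */
static void locality_tls_set(struct locality *loc)
{
	locality_tls = loc;
}

/* Returns the same locality for the lifetime of the thread, regardless of
 * which CPU the scheduler (or numad) later moves it to, so the assertion
 * (locality == m0_locality_here()) in the chores loop holds again. */
static struct locality *locality_here(void)
{
	return locality_tls;
}
```

The design point is that TLS is keyed by thread identity, not by CPU identity, and thread identity is exactly the invariant the chores loop relies on.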
… gc callback of (#1820)

Problem: In m0_be_op_fini(), when bos_tlink_fini() is performed, it is expected that bo_set_link is no longer linked into the parent's m0_be_op::bo_children list.

State seen at the time of crash: two gft_pd_io in progress state, with the corresponding two bio in the sched queue; the crash is hit while performing gc callback processing for the gft whose gft_pd_io is in progress and whose bio is queued behind an active io.

Panic:
```
2022-04-24 11:19:15,672 - motr[00107]: e2e0 FATAL [lib/assert.c:50:m0_panic] panic: (!m0_list_link_is_in(link)) at m0_list_link_fini() (lib/list.c:178) [git: 2.0.0-670-27-g0012fe90] /etc/cortx/log/motr/0696b1d9e4744c59a92cb2bdded112ac/trace/m0d-0x7200000000000001:0x2e/m0trace.107
2022-04-24 11:19:15,672 - Motr panic: (!m0_list_link_is_in(link)) at m0_list_link_fini() lib/list.c:178 (errno: 0) (last failed: none) [git: 2.0.0-670-27-g0012fe90] pid: 107 /etc/cortx/log/motr/0696b1d9e4744c59a92cb2bdded112ac/trace/m0d-0x7200000000000001:0x2e/m0trace.107
2022-04-24 11:19:15,706 - /lib64/libmotr.so.2(m0_arch_backtrace+0x33)[0x7f7514e79c83]
2022-04-24 11:19:15,706 - /lib64/libmotr.so.2(m0_arch_panic+0xe9)[0x7f7514e79e59]
2022-04-24 11:19:15,706 - /lib64/libmotr.so.2(m0_panic+0x13d)[0x7f7514e6890d]
2022-04-24 11:19:15,706 - /lib64/libmotr.so.2(+0x3895f6)[0x7f7514e6c5f6]
2022-04-24 11:19:15,706 - /lib64/libmotr.so.2(m0_be_op_fini+0x1f)[0x7f7514dae66f]
2022-04-24 11:19:15,706 - /lib64/libmotr.so.2(+0x2cb826)[0x7f7514dae826]
2022-04-24 11:19:15,707 - /lib64/libmotr.so.2(+0x2c4c5b)[0x7f7514da7c5b]
2022-04-24 11:19:15,707 - /lib64/libmotr.so.2(+0x2cb826)[0x7f7514dae826]
2022-04-24 11:19:15,707 - /lib64/libmotr.so.2(+0x2c300a)[0x7f7514da600a]
2022-04-24 11:19:15,707 - /lib64/libmotr.so.2(+0x2c3119)[0x7f7514da6119]
2022-04-24 11:19:15,707 - /lib64/libmotr.so.2(+0x386f7f)[0x7f7514e69f7f]
2022-04-24 11:19:15,707 - /lib64/libmotr.so.2(+0x386ffa)[0x7f7514e69ffa]
2022-04-24 11:19:15,707 - /lib64/libmotr.so.2(m0_chan_broadcast_lock+0x1d)[0x7f7514e6a08d]
```

Backtrace:
```
(gdb) bt
#0  0x00007f7512d8938f in raise () from /lib64/libc.so.6
#1  0x00007f7512d73dc5 in abort () from /lib64/libc.so.6
#2  0x00007f7514e79e63 in m0_arch_panic (c=c@entry=0x7f751531ade0 <__pctx.4611>, ap=ap@entry=0x7f74afffe390) at lib/user_space/uassert.c:131
#3  0x00007f7514e6890d in m0_panic (ctx=ctx@entry=0x7f751531ade0 <__pctx.4611>) at lib/assert.c:52
#4  0x00007f7514e6c5f6 in m0_list_link_fini (link=) at lib/list.c:178
#5  0x00007f7514e70310 in m0_tlink_fini (d=d@entry=0x7f75152880a0 <bos_tl>, obj=obj@entry=0x56523e641a90) at lib/tlist.c:283
#6  0x00007f7514dae66f in bos_tlink_fini (amb=0x56523e641a90) at be/op.c:109
#7  m0_be_op_fini (op=0x56523e641a90) at be/op.c:109
#8  0x00007f7514dae826 in be_op_state_change (op=, state=state@entry=M0_BOS_DONE) at be/op.c:213
#9  0x00007f7514daea17 in m0_be_op_done (op=) at be/op.c:231
#10 0x00007f7514da7c5b in be_io_sched_cb (op=op@entry=0x56523e5f7870, param=param@entry=0x56523e5f7798) at be/io_sched.c:141
#11 0x00007f7514dae826 in be_op_state_change (op=op@entry=0x56523e5f7870, state=state@entry=M0_BOS_DONE) at be/op.c:213
#12 0x00007f7514daea17 in m0_be_op_done (op=op@entry=0x56523e5f7870) at be/op.c:231
#13 0x00007f7514da600a in be_io_finished (bio=bio@entry=0x56523e5f7798) at be/io.c:555
#14 0x00007f7514da6119 in be_io_cb (link=0x56523e61ac60) at be/io.c:587
#15 0x00007f7514e69f7f in clink_signal (clink=clink@entry=0x56523e61ac60) at lib/chan.c:135
#16 0x00007f7514e69ffa in chan_signal_nr (chan=chan@entry=0x56523e61ab58, nr=0) at lib/chan.c:154
#17 0x00007f7514e6a06c in m0_chan_broadcast (chan=chan@entry=0x56523e61ab58) at lib/chan.c:174
#18 0x00007f7514e6a08d in m0_chan_broadcast_lock (chan=chan@entry=0x56523e61ab58) at lib/chan.c:181
#19 0x00007f7514f4209a in ioq_complete (res2=, res=, qev=, ioq=0x56523e5de610) at stob/ioq.c:587
#20 stob_ioq_thread (ioq=0x56523e5de610) at stob/ioq.c:669
#21 0x00007f7514e6f49e in m0_thread_trampoline (arg=arg@entry=0x56523e5de6e8) at lib/thread.c:117
#22 0x00007f7514e7ab11 in uthread_trampoline (arg=0x56523e5de6e8) at lib/user_space/uthread.c:98
#23 0x00007f751454915a in start_thread () from /lib64/libpthread.so.0
#24 0x00007f7512e4edd3 in clone () from /lib64/libc.so.6
```

RCA - sequence of events:
1. be_tx_group_format_seg_io_op_gc is invoked for gft_pd_io_op of tx_group_fom_1 (last_child is false):
```
(gdb) p &((struct m0_be_group_format *)cb_gc_param)->gft_pd_io_op
$29 = (struct m0_be_op *) 0x56523e641a90
```
2. The be_tx_group_format_seg_io_op_gc handling of gft_pd_io_op invokes m0_be_op_done() for gft_tmp_op (gft_tmp_op has no callbacks), and last_child now becomes true for the parent, since done has been invoked for both of its children (gft_tmp_op and gft_pd_io_op).
3. The m0_be_op_done() handling of gft_tmp_op invokes be_op_state_change() with M0_BOS_DONE for the parent (tgf_op).
4. During be_op_state_change() processing for the main parent tgf_op, m0_sm_state_set() updates the bo_sm state and unblocks tx_group_fom_1 by signalling op->bo_sm.sm_chan.
5. This recursive callback processing happens in the context of a stob_ioq_thread (of which M0_STOB_IOQ_NR_THREADS are initialised). Because the done processing of the peer child gft_tmp_op is invoked from within the gc processing of the child gft_pd_io_op, the parent's callback is invoked early.

Parent callback processing:
6. This unblocks tx_group_fom_1, which leads to m0_be_pd_io_put() in m0_be_group_format_reset(), and tx_group_fom_1 moves to TGS_OPEN. The pd_io and tx_group_fom_1 are now ready for reuse.

Problem window:
7. The problem occurs if the remaining gc callback processing of gft_pd_io_op, i.e. m0_be_op_fini(&gft->gft_tmp_op); m0_be_op_fini(op);, runs after the pd_io and/or tx_group_fom_1 has already been reused in a new context.

Solution: Removing gft_tmp_op altogether ensures that parent callback processing is never invoked ahead of child callback processing. This way tx_group_fom is notified of seg io completion only after all the relevant child callback processing has completed, which avoids the crashes seen in the gc callback processing (be_tx_group_format_seg_io_op_gc) after m0_be_op_done(&gft->gft_tmp_op). In the proposed solution the main parent op is made active at the start, at the same place where gft_tmp_op used to be activated, thereby making gft_tmp_op redundant and avoiding the out-of-order execution of child/parent callbacks.

RCA: Due to recursive calls to be_op_state_change(), the gc callback of gft_op (child 1) invokes the done callback of gft_tmp_op (child 2), which subsequently invokes the parent's be_op_state_change(). As a result the group fom completes ahead of the child op callback processing, and the crash is observed when the group is reused before that processing has finished.

Signed-off-by: Vidyadhar Pinglikar <vidyadhar.pinglikar@seagate.com>
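The ordering invariant the solution restores can be modelled with a small self-contained parent/child completion counter (hypothetical names, not the real m0_be_op code): the parent is activated up front, and it completes only when its last remaining child is done, so there is no placeholder child whose done processing can complete the parent from inside another child's gc callback.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of m0_be_op parent/child completion. The crash came from
 * a redundant placeholder child (gft_tmp_op): marking it done from inside
 * the real child's gc callback completed the parent before that callback
 * had finished, letting the group fom reuse the ops early. */
struct be_op {
	struct be_op *bo_parent;
	int           bo_children_active;  /* children not yet done */
	int           bo_done;             /* 1 once this op completed */
};

/* Attach a child; the parent now cannot complete before this child does. */
static void op_child_add(struct be_op *parent, struct be_op *child)
{
	child->bo_parent = parent;
	parent->bo_children_active++;
}

/* Mark an op done; a parent completes only when its LAST child is done,
 * i.e. strictly after all child-side processing has run. */
static void op_done(struct be_op *op)
{
	struct be_op *p = op->bo_parent;

	op->bo_done = 1;
	if (p != NULL && --p->bo_children_active == 0)
		op_done(p);
}
```

In this model, dropping the placeholder child means the only path to the parent's completion runs through the end of the real child's processing, which is exactly the ordering the fix enforces.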
…ter merging domain/recovered, see #1744).

Problem: client-ut fails:
```
$ sudo ./utils/m0run -- m0ut -t client-ut
START Iteration: 1 out of 1
  client-ut
    m0_client_init
motr[648725]: 6ba0 FATAL [lib/assert.c:50:m0_panic] panic: (confc->cc_root != ((void *)0)) at m0_confc_root_open() (conf/helpers.c:226) [git: 2.0.0-794-16-g4bdd8326-dirty] /var/motr/m0ut/m0trace.648725.2022-06-10-03:33:52
Motr panic: (confc->cc_root != ((void *)0)) at m0_confc_root_open() conf/helpers.c:226 (errno: 0) (last failed: none) [git: 2.0.0-794-16-g4bdd8326-dirty] pid: 648725 /var/motr/m0ut/m0trace.648725.2022-06-10-03:33:52
```
```
(gdb) bt
#0  0x00007fffe9ba837f in raise () from /lib64/libc.so.6
#1  0x00007fffe9b92db5 in abort () from /lib64/libc.so.6
#2  0x00007fffebd8d2de in m0_arch_panic (c=0x7fffec2c12e0 <__pctx.14974>, ap=0x7fffffffdb18) at lib/user_space/uassert.c:131
#3  0x00007fffebd6d626 in m0_panic (ctx=0x7fffec2c12e0 <__pctx.14974>) at lib/assert.c:52
#4  0x00007fffebc91476 in m0_confc_root_open (confc=0x87f5f8, root=0x7fffffffdcc8) at conf/helpers.c:226
#5  0x00007fffebcefc2c in has_in_conf (reqh=0x87e6b8) at dtm0/domain.c:233
#6  0x00007fffebcefcdb in m0_dtm0_domain_is_recoverable (dod=0x887b88, reqh=0x87e6b8) at dtm0/domain.c:259
#7  0x00007fffed5b6baa in m0_client_init (m0c_p=0x7fffffffde20, conf=0x7ffff209be20 <default_config>, init_m0=false) at /root/cortx-motr/motr/client_init.c:1674
#8  0x00007fffed5ae4b3 in do_init (instance=0x7fffffffde20) at /root/cortx-motr/motr/ut/client.h:55
#9  0x00007fffed5b7931 in ut_test_m0_client_init () at motr/ut/client.c:255
#10 0x00007fffed6a917f in run_test (test=0x707908 <ut_suite+4488>, max_name_len=43) at ut/ut.c:390
#11 0x00007fffed6a9468 in run_suite (suite=0x706780 <ut_suite>, max_name_len=43) at ut/ut.c:459
#12 0x00007fffed6a9706 in tests_run_all (m=0x7ffff7dc0220 <ut>) at ut/ut.c:513
#13 0x00007fffed6a9764 in m0_ut_run () at ut/ut.c:539
#14 0x0000000000404b13 in main (argc=3, argv=0x7fffffffe598) at ut/m0ut.c:533
```

Solution: check that conf is initialized before accessing it.

Signed-off-by: Ivan Tishchenko <ivan.tishchenko@seagate.com>
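A minimal sketch of the guard the solution describes, using simplified stand-in types rather than the real confc API: the recoverability check returns early when the configuration cache was never populated, instead of tripping the (cc_root != NULL) precondition inside m0_confc_root_open().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct m0_confc. */
struct confc {
	void *cc_root;   /* NULL until the configuration is loaded */
};

/* Models the has_in_conf() check in dtm0/domain.c: when conf is absent,
 * report "not found in conf" instead of asserting inside root_open. */
static int has_in_conf(const struct confc *confc)
{
	if (confc == NULL || confc->cc_root == NULL)
		return 0;   /* conf not initialised: nothing to open */
	/* ... the real code would open the root object here and walk the
	 * configuration tree looking for the DTM0 service ... */
	return 1;
}
```

The point of the fix is that a client initialised without a populated confc (as in client-ut) now takes the early-return path rather than hitting the m0_confc_root_open() panic.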
Signed-off-by: Madhavrao Vemuri <madhav.vemuri@seagate.com>