This repository has been archived by the owner on May 3, 2024. It is now read-only.

Fix crash of all ios processes in some write scenarios #1686

Merged (1 commit) on May 4, 2022

@andriytk (Contributor) commented Apr 27, 2022

Panic: ((cr->tc_balance[cu]) != 0) at btree_save() (be/btree.c:1393)

Stack is the same in all processes core files:

#3  0x00007f9454ccbe37 in m0_panic (ctx=ctx@entry=0x7f94550e15a0 <__pctx.8664>) at lib/assert.c:52
#4  0x00007f9454c0747c in btree_save (tree=tree@entry=0x40000010ed80, tx=tx@entry=0x7f94200a7f90, op=op@entry=0x7f944ca77ec0,
    val=val@entry=0x7f944ca77e70, anchor=0x0, optype=BTREE_SAVE_UPDATE, zonemask=2, key=<optimized out>, key=<optimized out>)
    at be/btree.c:339
#5  0x00007f9454c086f7 in m0_be_btree_update (tree=tree@entry=0x40000010ed80, tx=tx@entry=0x7f94200a7f90,
    op=op@entry=0x7f944ca77ec0, key=key@entry=0x7f944ca77e60, val=val@entry=0x7f944ca77e70) at be/btree.c:1952
#6  0x00007f9454bfd62b in btree_update_sync (val=0x7f944ca77e70, key=0x7f944ca77e60, tx=0x7f94200a7f90, tree=0x40000010ed80)
    at balloc/balloc.c:95
#7  balloc_gi_sync (cb=cb@entry=0x40000010eb40, tx=tx@entry=0x7f94200a7f90, gi=gi@entry=0x13ff860) at balloc/balloc.c:928
#8  0x00007f9454bfe36e in balloc_free_db_update (motr=motr@entry=0x40000010eb40, tx=tx@entry=0x7f94200a7f90,
    grp=grp@entry=0x13ff860, tgt=tgt@entry=0x7f944ca78470, alloc_flag=<optimized out>) at balloc/balloc.c:1934
#9  0x00007f9454bff9c6 in balloc_free_internal (req=<synthetic pointer>, req=<synthetic pointer>, tx=0x7f94200a7f90,
    ctx=0x40000010eb40) at balloc/balloc.c:2716
#10 balloc_free (ballroom=0x40000010ec68, tx=0x7f94200a7f88, ext=0x7f944ca78560) at balloc/balloc.c:2929
#11 0x00007f9454d97681 in stob_ad_bfree (adom=<optimized out>, adom=<optimized out>, ext=0x7f944ca78530, ext=0x7f944ca78530,
    tx=0x7f94200a7f88) at stob/ad.c:1098
#12 stob_ad_seg_free (tx=0x7f94200a7f88, adom=<optimized out>, ext=ext@entry=0x7f944ca79160, val=1594, seg=<optimized out>)
    at stob/ad.c:1647
#13 0x00007f9454d9783d in __lambda (seg=0x7f944ca79150) at stob/ad.c:1719
#14 0x00007f9454c10802 in m0_be_emap_paste (it=it@entry=0x7f944ca79140, tx=0x7f94200a7f90, ext=ext@entry=0x7f944ca78a90,
    val=1794, del=del@entry=0x7f944ca78b1c, cut_left=cut_left@entry=0x7f944ca78b38, cut_right=0x7f944ca78b54)
    at be/extmap.c:628
#15 0x00007f9454d9a546 in stob_ad_write_map_ext (orig=<optimized out>, off=464, adom=0x4000001120d8) at stob/ad.c:1731
#16 stob_ad_write_map (map=0x7f944ca78900, frags=18, wc=0x7f944ca78920, dst=0x7f944ca789b0, adom=0x4000001120d8,
    io=0x7f94340b4298) at stob/ad.c:1858
#17 stob_ad_write_prepare (map=0x7f944ca78900, src=0x7f944ca78970, adom=0x4000001120d8, io=<optimized out>) at stob/ad.c:2006
#18 stob_ad_io_launch_prepare (io=<optimized out>) at stob/ad.c:2052
#19 0x00007f9454d9ca47 in m0_stob_io_prepare (io=io@entry=0x7f94340b4298, obj=obj@entry=0x7f94341170a0,
    tx=tx@entry=0x7f94200a7f88, scope=scope@entry=0x0) at stob/io.c:178
#20 0x00007f9454d9ce92 in m0_stob_io_prepare_and_launch (io=io@entry=0x7f94340b4298, obj=0x7f94341170a0,
    tx=tx@entry=0x7f94200a7f88, scope=scope@entry=0x0) at stob/io.c:226
#21 0x00007f9454cb702c in io_launch (fom=0x7f94200a7ec0) at ioservice/io_foms.c:1837
#22 0x00007f9454cb47a0 in m0_io_fom_cob_rw_tick (fom=0x7f94200a7ec0) at ioservice/io_foms.c:2333
#23 0x00007f9454c9edf1 in fom_exec (fom=0x7f94200a7ec0) at fop/fom.c:791
#24 loc_handler_thread (th=0x11ed150) at fop/fom.c:931

Setup: 1 node, a 10-disk data pool with 4+2+0 EC.
Scenario: write the same object twice like this:

$ m0cp <motr-conn-params> -s 1m -c 40 -L 4 /dev/zero -o 0x12345678:0x678900207
$ m0cp <motr-conn-params> -s 1m -c 40 -L 4 /dev/zero -o 0x12345678:0x678900207 -u

RCA: a regression in the BE credit calculation in stob_ad_write_credit() was introduced by commit ab22d23.

Solution: roll back the change that introduced the regression.

Closes #1678.
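
The assertion in the panic, `cr->tc_balance[cu] != 0`, reads like a guard on a credit balance that must still be positive when an update consumes it. Under that reading (an assumption; this description only states that the credit calculation in stob_ad_write_credit() regressed), an under-estimated reservation would fail on the first update past the estimate. A toy shell model, with all names hypothetical and unrelated to the Motr API:

```shell
#!/bin/sh
# Toy model, not Motr code: a transaction reserves a number of credits
# up front, and every update must consume one of them.
balance=0

tx_reserve() {            # tx_reserve N: credits estimated before the tx opens
    balance=$1
}

tx_update() {             # consume one credit, or fail the way the panic does
    if [ "$balance" -eq 0 ]; then
        echo "panic: balance != 0 violated" >&2
        return 1
    fi
    balance=$((balance - 1))
}

# An estimate of 2 credits against 3 actual updates: the third update fails.
tx_reserve 2
tx_update && tx_update && tx_update || echo "third update ran out of credits"
```

In this model the overwrite scenario corresponds to a transaction that performs more updates than the reservation anticipated; restoring a sufficient estimate makes every update find a non-zero balance.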

Coding

Checklist for Author

  • Coding conventions are followed and code is consistent

Testing

Checklist for Author

  • Unit and System Tests are added
  • Test Cases cover Happy Path, Non-Happy Path and Scalability
  • Testing was performed with RPM

Impact Analysis

Checklist for Author/Reviewer/GateKeeper

  • Interface changes (if any) are documented
  • Side effects on other features (deployment/upgrade)
  • Dependencies on other component(s)

Review Checklist

Checklist for Author

  • JIRA number/GitHub Issue added to PR
  • PR is self reviewed
  • JIRA state/status is updated and the PR link is added to JIRA
  • Check if the description is clear and explained

Documentation

Checklist for Author

  • Changes done to WIKI / Confluence page / Quick Start Guide

@cla-bot cla-bot bot added the cla-signed label Apr 27, 2022
@andriytk andriytk changed the title from "Fix crash of all m0d-ios processes during write i/o" to "Fix crash of all ios processes in some write i/o scenarios" Apr 27, 2022
@andriytk andriytk force-pushed the fix-all-ios-panic branch 2 times, most recently from 4f65eff to d3ff93e April 27, 2022 11:47
@madhavemuri (Contributor) commented Apr 27, 2022

@andriytk: With the following patch for the motr utils ST, the same crash can be reproduced.

diff --git a/motr/st/utils/motr_utils_st.sh b/motr/st/utils/motr_utils_st.sh
index c6a3cc4..aa0eed5 100755
--- a/motr/st/utils/motr_utils_st.sh
+++ b/motr/st/utils/motr_utils_st.sh
@@ -169,6 +169,32 @@ EOF
        }
        echo "motr r/w test with m0cp and m0cat is successful"
        rm -f $dest_file
+       echo "motr r/w test with update of m0cp and m0cat"
+        $motr_st_util_dir/m0cp $MOTR_PARAMS_V -o $object_id1 $src_file \
+                                 -s $block_size -c $block_count -L 2 \
+                                 -b $blks_per_io || {
+               error_handling $? "Failed to copy object"
+       }
+       $motr_st_util_dir/m0cp $MOTR_PARAMS_V -o $object_id1 $src_file \
+                                 -s $block_size -c $block_count -L 2 \
+                                 -b $blks_per_io -u || {
+               error_handling $? "Failed to copy object"
+       }
+       $motr_st_util_dir/m0cat $MOTR_PARAMS_V -o $object_id1 \
+                                 -s $block_size -c $block_count -L 2 -b $blks_per_io \
+                                 $dest_file || {
+               error_handling $? "Failed to read object"
+       }
+       $motr_st_util_dir/m0unlink $MOTR_PARAMS -o $object_id1 || {
+               error_handling $? "Failed to delete object"
+       }
+       diff $src_file $dest_file || {
+               rc=$?
+               error_handling $rc "Files are different"
+       }
+
+       echo "motr r/w test with update of m0cp and m0cat is successful"
+       rm -f $dest_file

        # Test m0cp_mt
        echo "m0cp_mt test"

I have changed the layout id to 2 because, with layout id 1, the test fails with delete timeout errors after the fix is applied, due to the large number of transactions.

$ ./scripts/m0 run-st 42motr-utils

Motr panic: ((cr->tc_balance[cu]) != 0) at btree_save() be/btree.c:1393 (errno: 0) (last failed: none) [git: 2.0.0-670-53-g74de5d4-dirty] pid: 3398  /var/motr/root/sandbox.st-42motr-utils/ios3/m0trace.3398
/root/cortx-motr/motr/.libs/libmotr.so.2(m0_arch_backtrace+0x20)[0x7fd6b42eb2e0]
/root/cortx-motr/motr/.libs/libmotr.so.2(m0_arch_panic+0xe6)[0x7fd6b42eb496]
/root/cortx-motr/motr/.libs/libmotr.so.2(+0x3a4f44)[0x7fd6b42d8f44]
/root/cortx-motr/motr/.libs/libmotr.so.2(+0x2daa14)[0x7fd6b420ea14]
/root/cortx-motr/motr/.libs/libmotr.so.2(m0_be_btree_update+0x97)[0x7fd6b420fa87]
/root/cortx-motr/motr/.libs/libmotr.so.2(+0x2e27f4)[0x7fd6b42167f4]
/root/cortx-motr/motr/.libs/libmotr.so.2(+0x2e2a4f)[0x7fd6b4216a4f]
/root/cortx-motr/motr/.libs/libmotr.so.2(m0_be_emap_paste+0x3ae)[0x7fd6b4217e7e]
/root/cortx-motr/motr/.libs/libmotr.so.2(+0x477c5f)[0x7fd6b43abc5f]
/root/cortx-motr/motr/.libs/libmotr.so.2(m0_stob_io_prepare+0x1b6)[0x7fd6b43ae136]
/root/cortx-motr/motr/.libs/libmotr.so.2(m0_stob_io_prepare_and_launch+0x52)[0x7fd6b43ae582]
/root/cortx-motr/motr/.libs/libmotr.so.2(+0x38f67e)[0x7fd6b42c367e]
/root/cortx-motr/motr/.libs/libmotr.so.2(+0x38d3bc)[0x7fd6b42c13bc]
/root/cortx-motr/motr/.libs/libmotr.so.2(+0x376cfb)[0x7fd6b42aacfb]
/root/cortx-motr/motr/.libs/libmotr.so.2(m0_thread_trampoline+0x63)[0x7fd6b42e0753]
/root/cortx-motr/motr/.libs/libmotr.so.2(+0x3b815d)[0x7fd6b42ec15d]
/lib64/libpthread.so.0(+0x7ea5)[0x7fd6b3a1dea5]
/lib64/libc.so.6(clone+0x6d)[0x7fd6b23a396d]
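
Several frames above appear only as offsets into libmotr.so.2, for example `(+0x3a4f44)`. A small sketch for turning such frames into `addr2line` invocations, assuming binutils is installed and the library was built with debug info; `frames_to_addr2line` is a hypothetical helper name, not part of the Motr tooling:

```shell
#!/bin/sh
# Turn raw backtrace frames of the form
#   /path/to/lib.so(+0xOFFSET)[0xADDRESS]
# into addr2line commands. Frames that already carry a symbol name
# (e.g. "m0_be_btree_update+0x97") do not match and are skipped.
frames_to_addr2line() {
    sed -n 's|^\(/[^(]*\)(+\(0x[0-9a-fA-F]*\))\[0x[0-9a-fA-F]*].*$|addr2line -f -e \1 \2|p'
}

# Example: two frames taken from the backtrace above; only the first
# one is an unresolved offset, so only one command is printed.
printf '%s\n' \
    '/root/cortx-motr/motr/.libs/libmotr.so.2(+0x3a4f44)[0x7fd6b42d8f44]' \
    '/root/cortx-motr/motr/.libs/libmotr.so.2(m0_be_btree_update+0x97)[0x7fd6b420fa87]' |
    frames_to_addr2line
```

Piping the printed commands to `sh` on a machine that holds the same libmotr.so.2 build would resolve the offsets to function names.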

@andriytk (Contributor, Author) commented

Thanks, @madhavemuri! Will add it.

@huanghua78 commented

Asking @gshipra and @yatin-mahajan to review this patch.

@andriytk (Contributor, Author) commented

Thanks @huanghua78, good idea!

@andriytk andriytk force-pushed the fix-all-ios-panic branch from d3ff93e to c2f8e32 April 27, 2022 15:23
@andriytk andriytk changed the title from "Fix crash of all ios processes in some write i/o scenarios" to "Fix crash of all ios processes in some write scenarios" Apr 27, 2022
Panic: ((cr->tc_balance[cu]) != 0) at btree_save() (be/btree.c:1393)

Stack:

    #3  m0_panic() at lib/assert.c:52
    #4  btree_save() at be/btree.c:339
    #5  m0_be_btree_update() at be/btree.c:1952
    #6  btree_update_sync() at balloc/balloc.c:95
    #7  balloc_gi_sync() at balloc/balloc.c:928
    #8  balloc_free_db_update() at balloc/balloc.c:1934
    #9  balloc_free_internal() at balloc/balloc.c:2716
    #10 balloc_free() at balloc/balloc.c:2929
    #11 stob_ad_bfree() at stob/ad.c:1098
    #12 stob_ad_seg_free (val=1594) at stob/ad.c:1647
    #13 __lambda() at stob/ad.c:1719
    #14 m0_be_emap_paste(val=1794) at be/extmap.c:628
    #15 stob_ad_write_map_ext(off=464) at stob/ad.c:1731
    #16 stob_ad_write_map(frags=18) at stob/ad.c:1858
    #17 stob_ad_write_prepare() at stob/ad.c:2006
    #18 stob_ad_io_launch_prepare() at stob/ad.c:2052
    #19 m0_stob_io_prepare() at stob/io.c:178
    #20 m0_stob_io_prepare_and_launch() at stob/io.c:226
    #21 io_launch() at ioservice/io_foms.c:1837
    #22 m0_io_fom_cob_rw_tick() at ioservice/io_foms.c:2333
    #23 fom_exec() at fop/fom.c:791
    #24 loc_handler_thread() at fop/fom.c:931

Setup: 1 node, 4+2+0 EC data pool with 10 disks.
Scenario: write the same object twice like this:

    $ m0cp <motr-conn-params> -s 1m -c 40 -L 4 /dev/zero -o 0x12345678:0x678900207
    $ m0cp <motr-conn-params> -s 1m -c 40 -L 4 /dev/zero -o 0x12345678:0x678900207 -u

RCA: regression of BE credit calculation in stob_ad_write_credit()
code was introduced at commit ab22d23.

Solution: rollback the regression change.

Co-authored-by: Madhavrao Vemuri <madhav.vemuri@seagate.com>
Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
@andriytk andriytk force-pushed the fix-all-ios-panic branch from c2f8e32 to 710f283 April 27, 2022 15:25

@cortx-admin commented

Jenkins CI Result: Motr#1153

Motr Test Summary

❌ Failed (2):

01motr-single-node/52motr-singlenode-sanity
01motr-single-node/00userspace-tests

🏁 Skipped (32):

01motr-single-node/28sys-kvs
01motr-single-node/35m0singlenode
01motr-single-node/04initscripts
01motr-single-node/37protocol
02motr-single-node/51kem
02motr-single-node/20rpc-session-cancel
02motr-single-node/10pver-assign
02motr-single-node/21fsync-single-node
02motr-single-node/13dgmode-io
02motr-single-node/14poolmach
02motr-single-node/11m0t1fs
02motr-single-node/26motr-user-kernel-tests
02motr-single-node/08spiel
03motr-single-node/06conf
03motr-single-node/36spare-reservation
04motr-single-node/34sns-repair-1n-1f
04motr-single-node/08spiel-sns-repair-quiesce
04motr-single-node/28sys-kvs-kernel
04motr-single-node/11m0t1fs-rconfc-fail
04motr-single-node/08spiel-sns-repair
04motr-single-node/19sns-repair-abort
04motr-single-node/22sns-repair-ios-fail
05motr-single-node/18sns-repair-quiesce
05motr-single-node/12fwait
05motr-single-node/16sns-repair-multi
05motr-single-node/07mount-fail
05motr-single-node/15sns-repair-single
05motr-single-node/23sns-abort-quiesce
05motr-single-node/17sns-repair-concurrent-io
05motr-single-node/07mount
05motr-single-node/07mount-multiple
05motr-single-node/12fsync

✔️ Passed (41):

01motr-single-node/43m0crate
01motr-single-node/05confgen
01motr-single-node/06hagen
01motr-single-node/01net
01motr-single-node/01kernel-tests
01motr-single-node/03console
01motr-single-node/02rpcping
02motr-single-node/07m0d-fatal
02motr-single-node/67fdmi-plugin-multi-filters
02motr-single-node/53clusterusage-alert
02motr-single-node/41motr-conf-update
03motr-single-node/61sns-repair-motr-1n-1f
03motr-single-node/72spiel-sns-motr-repair-quiesce
03motr-single-node/08spiel-multi-confd
03motr-single-node/69sns-repair-motr-quiesce
03motr-single-node/62sns-repair-motr-mf
03motr-single-node/70sns-failure-after-repair-quiesce
03motr-single-node/63sns-repair-motr-1k-1f
03motr-single-node/60sns-repair-motr-1f
03motr-single-node/66sns-repair-motr-abort-quiesce
03motr-single-node/24motr-dix-repair-lookup-insert-spiel
03motr-single-node/68sns-repair-motr-shutdown
03motr-single-node/64sns-repair-motr-ios-fail
03motr-single-node/71spiel-sns-motr-repair
03motr-single-node/24motr-dix-repair-lookup-insert-m0repair
03motr-single-node/04sss
03motr-single-node/65sns-repair-motr-abort
04motr-single-node/48motr-raid0-io
04motr-single-node/49motr-rpc-cancel
04motr-single-node/25m0kv
04motr-single-node/44motr-rm-lock-cc-io
04motr-single-node/45motr-rmw
05motr-single-node/23dix-repair-m0repair
05motr-single-node/43motr-sync-replication
05motr-single-node/42motr-utils
05motr-single-node/45motr-sns-repair-N-1
05motr-single-node/40motr-dgmode
05motr-single-node/23dix-repair-quiesce-m0repair
05motr-single-node/23spiel-dix-repair-quiesce
05motr-single-node/44motr-sns-repair
05motr-single-node/23spiel-dix-repair

Total: 75

CppCheck Summary

   Cppcheck: No new warnings found 👍

@andriytk (Contributor, Author) commented

UTs failed:

----- run_ut -----
motr[103510]:  2b20  FATAL  [lib/assert.c:50:m0_panic]  panic: (ld_start_tlist_is_empty(&ld->lds_start_q)) at be_log_discard_flush_finished() (be/log_discard.c:182)  [git: sage-base-1.0-804-g710f283-dirty] /var/motr/m0ut/m0trace.103510
Motr panic: (ld_start_tlist_is_empty(&ld->lds_start_q)) at be_log_discard_flush_finished() be/log_discard.c:182 (errno: 4) (last failed: none) [git: sage-base-1.0-804-g710f283-dirty] pid: 103510  /var/motr/m0ut/m0trace.103510
/root/motr/motr_test_github_workdir/workdir/src/motr/.libs/libmotr.so.2(m0_arch_backtrace+0x20)[0x7f39ad1e6470]
/root/motr/motr_test_github_workdir/workdir/src/motr/.libs/libmotr.so.2(m0_arch_panic+0xe6)[0x7f39ad1e6626]
/root/motr/motr_test_github_workdir/workdir/src/motr/.libs/libmotr.so.2(+0x399614)[0x7f39ad1d4614]
/root/motr/motr_test_github_workdir/workdir/src/motr/.libs/libmotr.so.2(+0x2e206c)[0x7f39ad11d06c]
/root/motr/motr_test_github_workdir/workdir/src/motr/.libs/libmotr.so.2(+0x2e2b9c)[0x7f39ad11db9c]
/root/motr/motr_test_github_workdir/workdir/src/motr/.libs/libmotr.so.2(m0_sm_asts_run+0x131)[0x7f39ad27a711]
/root/motr/motr_test_github_workdir/workdir/src/motr/.libs/libmotr.so.2(+0x39e600)[0x7f39ad1d9600]
/root/motr/motr_test_github_workdir/workdir/src/motr/.libs/libmotr.so.2(m0_thread_trampoline+0x5e)[0x7f39ad1dba4e]
/root/motr/motr_test_github_workdir/workdir/src/motr/.libs/libmotr.so.2(+0x3ac2dd)[0x7f39ad1e72dd]
/lib64/libpthread.so.0(+0x7e65)[0x7f39ac924e65]
/lib64/libc.so.6(clone+0x6d)[0x7f39ab2aa88d]
/root/motr/motr_test_github_workdir/workdir/src/utils/m0run: line 425: 103510 Aborted                 (core dumped) $(srcdir_path_of $binary) "$@"

But this looks like an old known issue - #1660.

@truptiatseagate commented

Shipra no longer works at Seagate.
I have asked @yatin-mahajan to review this patch by Monday.

@chandradharraval commented

Hi @andriytk,
I think the UT failure is a known issue that has no quick fix. There is a draft PR for it that still needs to be taken forward.
Meanwhile, I believe we can merge your PR #1686 with an exception (for the known UT failure).
cc: @madhavemuri, @mehjoshi

@yatin-mahajan (Contributor) left a comment

Looks good to me.

@madhavemuri madhavemuri merged commit 317b5ca into Seagate:main May 4, 2022
@andriytk andriytk deleted the fix-all-ios-panic branch May 4, 2022 08:55
Linked issue: All m0d-ios processes crash if one of them crashes during write i/o