
panic at m0_balloc_load_extents() (balloc/balloc.c:1347) #1845

Closed
andriytk opened this issue May 31, 2022 — with Board Genius Sync · 39 comments

Comments

Contributor

andriytk commented May 31, 2022

[root@ssc-vm-g4-rhev4-1490 ~]# kubectl logs cortx-data-ssc-vm-g4-rhev4-1491-66bbb5b6c-wjk8v -c cortx-motr-io-002 -f &
[1] 49242
...
[root@ssc-vm-g4-rhev4-1490 ~]# aws s3api put-object --bucket test-bucket --key 256m --body /tmp/256m --endpoint-url http://192.168.60.187:30080
2022-05-31 16:20:28,370 - motr[00035]:  a070  ERROR  [balloc/balloc.c:1346:m0_balloc_load_extents]  Invalid extent
2022-05-31 16:20:28,371 - motr[00035]:  9f40  FATAL  [lib/assert.c:50:m0_panic]  panic: (0) at m0_balloc_load_extents() (balloc/balloc.c:1347)  [git: 2.0.0-790-9-g662e7a18] /etc/cortx/log/motr/def3fa765c954db3812b45500924c243/trace/m0d-0x7200000000000001:0x69/m0trace.35.2022-05-31-16:09:28
2022-05-31 16:20:28,371 - Motr panic: (0) at m0_balloc_load_extents() balloc/balloc.c:1347 (errno: 0) (last failed: none) [git: 2.0.0-790-9-g662e7a18] pid: 35  /etc/cortx/log/motr/def3fa765c954db3812b45500924c243/trace/m0d-0x7200000000000001:0x69/m0trace.35.2022-05-31-16:09:28
2022-05-31 16:20:28,371 - /lib64/libmotr.so.2(m0_arch_backtrace+0x33)[0x7ff1e241d1a3]
2022-05-31 16:20:28,371 - /lib64/libmotr.so.2(m0_arch_panic+0xe9)[0x7ff1e241d379]
2022-05-31 16:20:28,371 - /lib64/libmotr.so.2(m0_panic+0x13d)[0x7ff1e240be2d]
2022-05-31 16:20:28,372 - /lib64/libmotr.so.2(m0_balloc_load_extents+0x50a)[0x7ff1e23344ba]
2022-05-31 16:20:28,372 - /lib64/libmotr.so.2(+0x2b027f)[0x7ff1e233527f]
2022-05-31 16:20:28,372 - /lib64/libmotr.so.2(+0x45b78e)[0x7ff1e24e078e]
2022-05-31 16:20:28,372 - /lib64/libmotr.so.2(m0_stob_io_prepare+0x1bf)[0x7ff1e24e3a7f]
2022-05-31 16:20:28,374 - /lib64/libmotr.so.2(m0_stob_io_prepare_and_launch+0x65)[0x7ff1e24e3f25]
2022-05-31 16:20:28,374 - /lib64/libmotr.so.2(+0x371209)[0x7ff1e23f6209]
2022-05-31 16:20:28,374 - /lib64/libmotr.so.2(+0x36e8f3)[0x7ff1e23f38f3]
2022-05-31 16:20:28,374 - /lib64/libmotr.so.2(+0x357814)[0x7ff1e23dc814]
2022-05-31 16:20:28,374 - /lib64/libmotr.so.2(m0_thread_trampoline+0x5e)[0x7ff1e24129be]
2022-05-31 16:20:28,375 - /lib64/libmotr.so.2(+0x399061)[0x7ff1e241e061]
2022-05-31 16:20:28,375 - /lib64/libpthread.so.0(+0x815a)[0x7ff1e1aeb15a]
2022-05-31 16:20:28,375 - /lib64/libc.so.6(clone+0x43)[0x7ff1e03f0dd3]
...

pods:

[root@ssc-vm-g4-rhev4-1490 ~]# kubectl get pods
NAME                                                 READY   STATUS             RESTARTS        AGE
cortx-client-ssc-vm-g4-rhev4-1490-b58999bcb-x97bq    2/2     Running            0               29h
cortx-client-ssc-vm-g4-rhev4-1491-9fcb4879-wdh94     2/2     Running            0               29h
cortx-client-ssc-vm-g4-rhev4-1492-8699679c96-gbgdj   2/2     Running            0               29h
cortx-consul-client-4nr7v                            1/1     Running            0               29h
cortx-consul-client-c4dtr                            1/1     Running            0               29h
cortx-consul-client-mrtjx                            1/1     Running            0               29h
cortx-consul-server-0                                1/1     Running            0               29h
cortx-consul-server-1                                1/1     Running            0               29h
cortx-consul-server-2                                1/1     Running            0               29h
cortx-control-5c46bf94ff-nv4tb                       1/1     Running            0               29h
cortx-data-ssc-vm-g4-rhev4-1490-687d6d8944-ssgng     4/4     Running            32 (49m ago)    29h
cortx-data-ssc-vm-g4-rhev4-1491-66bbb5b6c-wjk8v      3/4     CrashLoopBackOff   46 (101s ago)   29h
cortx-data-ssc-vm-g4-rhev4-1492-6449bbcc56-x2n4n     4/4     Running            43 (23h ago)    29h
cortx-ha-d95fc745c-p95c4                             3/3     Running            0               29h
cortx-server-ssc-vm-g4-rhev4-1490-59469c57cd-9xdz6   2/2     Running            3 (23h ago)     29h
cortx-server-ssc-vm-g4-rhev4-1491-7b4cbf54db-2z74j   2/2     Running            5 (23h ago)     29h
cortx-server-ssc-vm-g4-rhev4-1492-58dcf46874-cqc82   2/2     Running            5 (23h ago)     29h
kafka-0                                              1/1     Running            0               29h
kafka-1                                              1/1     Running            0               29h
kafka-2                                              1/1     Running            0               29h
zookeeper-0                                          1/1     Running            0               29h
zookeeper-1                                          1/1     Running            0               29h
zookeeper-2                                          1/1     Running            0               29h

This state was reached while trying to reproduce issue #1838.


For the convenience of the Seagate development team, this issue has been mirrored in a private Seagate Jira Server: https://jts.seagate.com/browse/CORTX-31906. Note that community members will not be able to access that Jira server but that is not a problem since all activity in that Jira mirror will be copied into this GitHub issue.

@andriytk andriytk transferred this issue from Seagate/cortx-hare May 31, 2022
@andriytk andriytk changed the title Undetected m0d process connectivity issue panic: (0) at m0_balloc_load_extents() (balloc/balloc.c:1347) May 31, 2022
@andriytk andriytk changed the title panic: (0) at m0_balloc_load_extents() (balloc/balloc.c:1347) panic at m0_balloc_load_extents() (balloc/balloc.c:1347) May 31, 2022

For the convenience of the Seagate development team, this issue has been mirrored in a private Seagate Jira Server: https://jts.seagate.com/browse/CORTX-31907. Note that community members will not be able to access that Jira server but that is not a problem since all activity in that Jira mirror will be copied into this GitHub issue.

@andriytk
Contributor Author

Last m0trace records:

2132746  2865676.886.587786  7ff1d67fb8d0  CALL    fop/fom_generic.c:858:m0_fom_tick_generic   > fom=0x7ff18c079920 phase=tx_open
2132747  2865676.886.589734  7ff1d67fb840  CALL    be/tx.c:273:m0_be_tx_payload_prep   > tx=0x7ff18c0799f0 size=262144
2132748  2865676.886.592750  7ff1d67fb7f0  CALL    lib/memory.c:130:m0_alloc   > size=48
2132749  2865676.886.595670  7ff1d67fb7f0  CALL    lib/memory.c:141:m0_alloc   < ptr=0x559735a67fd0 size=48
2132750  2865676.886.595816  7ff1d67fb7f0  CALL    lib/memory.c:130:m0_alloc   > size=24
2132751  2865676.886.597830  7ff1d67fb7f0  CALL    lib/memory.c:141:m0_alloc   < ptr=0x7ff1d81f30a0 size=24
2132752  2865676.886.602684  7ff1d67fb820  CALL    be/tx.c:263:m0_be_tx_prep   > tx=0x7ff18c0799f0 credit=(31022,8301076)
2132753  2865676.886.604154  7ff1d67fb810  CALL    be/tx.c:284:m0_be_tx_open   > tx=0x7ff18c0799f0 t_prepared=(31022,8301076) t_payload_prepared=262144
2132754  2865676.886.606134  7ff1d67fb760  CALL    be/tx.c:489:be_tx_state_move   > tx=0x7ff18c0799f0 rc=0 prepare -> opening
2132755  2865676.886.612314  7ff1d67fb700  DEBUG   be/engine.c:646:m0_be_engine__tx_state_set   tx 0x7ff18c0799f0: => opening
2132756  2865676.886.641030  7ff1d67fb600  CALL    be/log.c:854:m0_be_log_reserve   > log=(lg_current=156909568 lg_discarded=156827648 lg_free=134119424) size=9314304
2132757  2865676.886.642834  7ff1d67fb600  CALL    be/log.c:865:m0_be_log_reserve   < rc=0 log=(lg_current=156909568 lg_discarded=156827648 lg_free=124805120)
2132758  2865676.886.646894  7ff1d67fb630  DEBUG   be/tx.c:550:m0_be_tx__state_post   tx=0x7ff18c0799f0 state=grouping
2132759  2865676.886.648864  7ff1d67fb760  CALL    be/tx.c:521:be_tx_state_move   < 
2132760  2865676.886.649400  7ff1d67fb810  CALL    be/tx.c:296:m0_be_tx_open   < 
2132761  2865676.886.650772  7ff1d67fb880  DEBUG   fop/fom.c:1694:m0_fom_phase_set   fom=0x7ff18c079920, item 0x7ff18c077bb0[42] phase set: tx_open -> tx_wait
2132762  2865676.886.654332  7ff1d67fb8d0  CALL    fop/fom_generic.c:874:m0_fom_tick_generic   < fom=0x7ff18c079920 phase=tx_wait rc=2
2132763  2865676.886.654646  7ff1d67fb8d0  CALL    fop/fom_generic.c:875:m0_fom_tick_generic   < rc=2
2132764  2865676.886.654922  7ff1d67fb940  DEBUG   ioservice/io_foms.c:2324:m0_io_fom_cob_rw_tick   fom = 0x7ff18c079920, stob = 0x7ff1d80877c0, rc = 0
2132765  2865676.886.655016  7ff1d67fb940  CALL    ioservice/io_foms.c:2326:m0_io_fom_cob_rw_tick   < rc=2
2132766  2865676.886.656194  7ff1d67fb940  CALL    ioservice/io_foms.c:2258:m0_io_fom_cob_rw_tick   > fom 0x7ff18c079920, fop 0x7ff18c077b90, item 0x7ff18c077bb0[42] tx_wait<4300000a04b51096:f02400000900000a>
2132767  2865676.886.656580  7ff1d67fb8d0  CALL    fop/fom_generic.c:858:m0_fom_tick_generic   > fom=0x7ff18c079920 phase=tx_wait
2132768  2865676.886.657166  7ff1d67fb870  CALL    fop/fom_generic.c:294:fom_tx_wait   > fom=0x7ff18c079920, tx_state 3
2132769  2865676.886.660114  7ff1d67fb870  CALL    fop/fom_generic.c:305:fom_tx_wait   < 
2132770  2865676.886.660388  7ff1d67fb8d0  CALL    fop/fom_generic.c:874:m0_fom_tick_generic   < fom=0x7ff18c079920 phase=tx_wait rc=1
2132771  2865676.886.660530  7ff1d67fb8d0  CALL    fop/fom_generic.c:875:m0_fom_tick_generic   < rc=1
2132772  2865676.886.660876  7ff1d67fb940  CALL    ioservice/io_foms.c:2301:m0_io_fom_cob_rw_tick   < rc=1
2132773  2865676.886.670314  7ff1d67fba10  CALL    be/tx.c:427:be_tx_state_move_ast   > tx=0x7ff18c0799f0 opening -> grouping
2132774  2865676.886.671314  7ff1d67fb960  CALL    be/tx.c:489:be_tx_state_move   > tx=0x7ff18c0799f0 rc=0 opening -> grouping
2132775  2865676.886.680802  7ff1d67fb900  DEBUG   be/engine.c:646:m0_be_engine__tx_state_set   tx 0x7ff18c0799f0: => grouping
2132776  2865676.886.687224  7ff1d67fb7a0  CALL    be/tx_group.c:297:m0_be_tx_group_tx_add   > gr=0x559735aff570 tx=0x7ff18c0799f0 t_prepared=(31022,8301076) t_payload_prepared=262144 tg_used=(0,0) tg_payload_prepared=0 group_tx_nr=0
2132777  2865676.886.688698  7ff1d67fb760  DEBUG   be/tx_group.c:279:be_tx_group_tx_add   tx=0x7ff18c0799f0 group=0x559735aff570
2132778  2865676.886.692538  7ff1d67fb7a0  CALL    be/tx_group.c:322:m0_be_tx_group_tx_add   < rc=0
2132779  2865676.886.694420  7ff1d67fb860  CALL    be/engine.c:430:be_engine_group_timeout_arm   > en=0x7ffe12ca1f98 gr=0x559735aff570 sm_grp=0x5597354d8ce0
2132780  2865676.886.719360  7ff1d67fb860  CALL    be/engine.c:443:be_engine_group_timeout_arm   < grouping_q_length=1 delay=1382812
2132781  2865676.886.721078  7ff1d67fb820  DEBUG   be/tx.c:550:m0_be_tx__state_post   tx=0x7ff18c0799f0 state=active
2132782  2865676.886.721534  7ff1d67fb860  CALL    be/engine.c:496:be_engine_tx_trygroup   < rc=0
2132783  2865676.886.722276  7ff1d67fb960  CALL    be/tx.c:521:be_tx_state_move   < 
2132784  2865676.886.725698  7ff1d67fba10  CALL    be/tx.c:427:be_tx_state_move_ast   > tx=0x7ff18c0799f0 grouping -> active
2132785  2865676.886.726212  7ff1d67fb960  CALL    be/tx.c:489:be_tx_state_move   > tx=0x7ff18c0799f0 rc=0 grouping -> active
2132786  2865676.886.727398  7ff1d67fb910  CALL    lib/memory.c:119:m0_alloc_nz   > size=262144
2132787  2865676.886.793174  7ff1d67fb910  CALL    lib/memory.c:122:m0_alloc_nz   < ptr=0x7ff1d8268410 size=262144
2132788  2865676.886.794394  7ff1d67fb840  CALL    lib/memory.c:119:m0_alloc_nz   > size=2481720
2132789  2865676.886.841146  7ff1d67fb840  CALL    lib/memory.c:122:m0_alloc_nz   < ptr=0x7ff1d82a8420 size=2481720
2132790  2865676.886.842684  7ff1d67fb8c0  CALL    be/tx_regmap.c:392:m0_be_regmap_init   < rc=0
2132791  2865676.886.842950  7ff1d67fb8c0  CALL    lib/memory.c:119:m0_alloc_nz   > size=8301076
2132792  2865676.886.874600  7ff1d67fb8c0  CALL    lib/memory.c:122:m0_alloc_nz   < ptr=0x7ff1d8506260 size=8301076
2132793  2865676.886.875332  7ff1d67fb910  CALL    be/tx_regmap.c:597:m0_be_reg_area_init   < rc=0
2132794  2865676.886.875746  7ff1d67fb960  CALL    be/tx.c:462:be_tx_memory_allocate   < rc=0
2132795  2865676.886.882872  7ff1d67fb900  DEBUG   be/engine.c:646:m0_be_engine__tx_state_set   tx 0x7ff18c0799f0: => active
2132796  2865676.886.883876  7ff1d67fb960  CALL    be/tx.c:521:be_tx_state_move   < 
2132797  2865676.886.886660  7ff1d67fba10  CALL    be/engine.c:360:be_engine_group_timer_arm   > en=0x7ffe12ca1f98 gr=0x559735aff570 sm_grp=0x5597354d8ce0
2132798  2865676.886.888262  7ff1d67fb9d0  CALL    lib/timer.c:67:m0_timer_fini   > 0x559735b02360
2132799  2865676.886.902822  7ff1d67fb9d0  CALL    lib/timer.c:72:m0_timer_fini   < 0x559735b02360
2132800  2865676.886.917540  7ff1d67fb8f0  CALL    lib/user_space/timer.c:198:timer_posix_init   < rc=0
2132801  2865676.886.918064  7ff1d67fb8f0  CALL    lib/user_space/timer.c:304:timer_hard_init   < rc=0
2132802  2865676.886.918432  7ff1d67fb980  CALL    lib/timer.c:61:m0_timer_init   < 0x559735b02360, rc=0
2132803  2865676.886.924754  7ff1d67fba10  CALL    be/engine.c:366:be_engine_group_timer_arm   < 
2132804  2865676.886.944460  7ff1d67fb940  CALL    ioservice/io_foms.c:2258:m0_io_fom_cob_rw_tick   > fom 0x7ff18c079920, fop 0x7ff18c077b90, item 0x7ff18c077bb0[42] tx_wait<4300000a04b51096:f02400000900000a>
2132805  2865676.886.945308  7ff1d67fb8d0  CALL    fop/fom_generic.c:858:m0_fom_tick_generic   > fom=0x7ff18c079920 phase=tx_wait
2132806  2865676.886.946038  7ff1d67fb870  CALL    fop/fom_generic.c:294:fom_tx_wait   > fom=0x7ff18c079920, tx_state 5
2132807  2865676.886.947242  7ff1d67fb870  CALL    fop/fom_generic.c:310:fom_tx_wait   < 
2132808  2865676.886.948028  7ff1d67fb880  DEBUG   fop/fom.c:1694:m0_fom_phase_set   fom=0x7ff18c079920, item 0x7ff18c077bb0[42] phase set: tx_wait -> io-prepare
2132809  2865676.886.952738  7ff1d67fb8d0  CALL    fop/fom_generic.c:874:m0_fom_tick_generic   < fom=0x7ff18c079920 phase=io-prepare rc=2
2132810  2865676.886.953094  7ff1d67fb8d0  CALL    fop/fom_generic.c:875:m0_fom_tick_generic   < rc=2
2132811  2865676.886.953638  7ff1d67fb8f0  DEBUG   fop/fom.c:1694:m0_fom_phase_set   fom=0x7ff18c079920, item 0x7ff18c077bb0[42] phase set: io-prepare -> stobio-launch
2132812  2865676.886.956906  7ff1d67fb940  CALL    ioservice/io_foms.c:2301:m0_io_fom_cob_rw_tick   < rc=2
2132813  2865676.886.957934  7ff1d67fb940  CALL    ioservice/io_foms.c:2258:m0_io_fom_cob_rw_tick   > fom 0x7ff18c079920, fop 0x7ff18c077b90, item 0x7ff18c077bb0[42] stobio-launch<4300000a04b51096:f02400000900000a>
2132814  2865676.886.960362  7ff1d67fb7b0  CALL    ioservice/io_foms.c:1710:io_launch   > fom=0x7ff18c079920
2132815  2865676.886.964728  7ff1d67fb6a0  CALL    ioservice/io_foms.c:965:m0_io_cob_stob_create   > COB:<4300000a04b51096:f02400000900000a>pver:<7600000000000001:b8>
2132816  2865676.886.966750  7ff1d67fb610  CALL    cob/cob.c:1417:m0_cob_locate   > dom=0x400000131490 oikey=(<4300000a04b51096:f02400000900000a>, 0)
2132817  2865676.886.967378  7ff1d67fb580  CALL    lib/memory.c:130:m0_alloc   > size=416
2132818  2865676.886.968240  7ff1d67fb580  CALL    lib/memory.c:141:m0_alloc   < ptr=0x7ff1d815b0c0 size=416
2132819  2865676.886.969462  7ff1d67fb560  DEBUG   file/file.c:485:m0_file_init   <0:0>
2132820  2865676.887.005808  7ff1d67fb250  CALL    be/btree.c:2338:m0_be_btree_cursor_get_sync   < rc=0
2132821  2865676.887.008052  7ff1d67fb290  DEBUG   cob/cob.c:1264:cob_oi_lookup   found: fid=<4300000a06ce1096:7b2e000008000001> lno=0 pfid=<4700000006ce1096:7b2e000008000001> name='10'
2132822  2865676.887.008586  7ff1d67fb290  DEBUG   cob/cob.c:1272:cob_oi_lookup   old fid=<4300000a04b51096:f02400000900000a> fid=<4300000a06ce1096:7b2e000008000001>
2132823  2865676.887.009026  7ff1d67fb290  CALL    cob/cob.c:1283:cob_oi_lookup   < rc=-2
2132824  2865676.887.009594  7ff1d67fb610  DEBUG   cob/cob.c:1434:m0_cob_locate   cob_oi_lookup() failed with -2
2132825  2865676.887.011542  7ff1d67fb5e0  DEBUG   lib/memory.c:151:m0_free   0x7ff1d815b0c0
2132826  2865676.887.012316  7ff1d67fb610  CALL    cob/cob.c:1436:m0_cob_locate   < rc=-2
2132827  2865676.887.018554  7ff1d67fb650  CALL    ioservice/cob_foms.c:828:m0_cc_stob_create   > stob create fid=<200000a04b51096:f02400000900000a>
2132828  2865676.887.021230  7ff1d67fb5e0  CALL    ioservice/storage_dev.c:817:m0_storage_dev_stob_create   > stob_id={<200000000000000:a>,<200000a04b51096:f02400000900000a>}
2132829  2865676.887.035558  7ff1d67fb560  CALL    stob/stob.c:89:m0_stob_find_by_key   < rc=0
2132830  2865676.887.036726  7ff1d67fb540  CALL    stob/stob.c:164:m0_stob_create   > stob=0x7ff1d80877c0 so_id={<200000000000000:a>,<200000a04b51096:f02400000900000a>} so_ref=5
2132831  2865676.887.038322  7ff1d67fb2d0  DEBUG   stob/ad.c:732:stob_ad_create   200000a04b51096:f02400000900000a
2132832  2865676.887.046062  7ff1d67fb050  DEBUG   be/extmap.c:763:m0_be_emap_obj_insert   Nob: key = 56 val = 64 
2132833  2865676.887.049924  7ff1d67fafc0  CALL    be/btree.c:1900:be_btree_insert   > tree=0x400000108468
2132834  2865676.887.051234  7ff1d67fabb0  CALL    be/btree.c:1380:btree_save   > tree=0x400000108468
2132835  2865676.887.051788  7ff1d67fabb0  DEBUG   be/btree.c:1390:btree_save   DEC cr=0x7ff18c079a98 user=0 balance=24
2132836  2865676.887.079056  7ff1d67faa20  DEBUG   be/fl.c:213:m0_be_fl_pick   chunk=0x4000001a5310 size=120 bac_size=26842868896 index=128 iterations=1
2132837  2865676.887.133034  7ff1d67faa90  DEBUG   be/alloc.c:1068:m0_be_alloc_aligned   allocator=0x55973595e708 size=120 shift=3 c=0x4000001a5310 c->bac_size=120 ptr=0x4000001a5360
2132838  2865676.887.138256  7ff1d67fab20  CALL    format/format.c:146:m0_format_footer_verify_generic   < rc=0
2132839  2865676.887.164508  7ff1d67faad0  CALL    format/format.c:146:m0_format_footer_verify_generic   < rc=0
2132840  2865676.889.210860  7ff1d67fab20  CALL    format/format.c:146:m0_format_footer_verify_generic   < rc=0
2132841  2865676.889.240734  7ff1d67faad0  CALL    format/format.c:146:m0_format_footer_verify_generic   < rc=0
2132842  2865676.889.252824  7ff1d67fabb0  CALL    be/btree.c:1479:btree_save   < tree=0x400000108468
2132843  2865676.889.254682  7ff1d67fafc0  CALL    be/btree.c:1909:be_btree_insert   < tree=0x400000108468
2132844  2865676.889.271688  7ff1d67fb540  CALL    stob/stob.c:184:m0_stob_create   < rc=0
2132845  2865676.889.274590  7ff1d67fb550  CALL    stob/stob.c:299:m0_stob_put   > stob=0x7ff1d80877c0 so_id={<200000000000000:a>,<200000a04b51096:f02400000900000a>} so_ref=5
2132846  2865676.889.275256  7ff1d67fb550  DEBUG   stob/stob.c:307:m0_stob_put   stob 0x7ff1d80877c0, fid=<200000a04b51096:f02400000900000a> so_ref 4, released ref, chan_waiters 0
2132847  2865676.889.276750  7ff1d67fb5e0  CALL    ioservice/storage_dev.c:833:m0_storage_dev_stob_create   < rc=0
2132848  2865676.889.278148  7ff1d67fb650  CALL    ioservice/cob_foms.c:831:m0_cc_stob_create   < rc=0
2132849  2865676.889.279684  7ff1d67fb6a0  INFO    ioservice/io_foms.c:979:m0_io_cob_stob_create   Create on write if cob doesn't exists
2132850  2865676.889.281034  7ff1d67fb560  CALL    ioservice/io_foms.c:915:m0_io_cob_create   > COB fid:<4300000a04b51096:f02400000900000a>pver:<7600000000000001:b8>
2132851  2865676.889.282426  7ff1d67fb4d0  CALL    lib/memory.c:130:m0_alloc   > size=416
2132852  2865676.889.287726  7ff1d67fb4d0  CALL    lib/memory.c:141:m0_alloc   < ptr=0x7ff1d815b0c0 size=416
2132853  2865676.889.289840  7ff1d67fb4b0  DEBUG   file/file.c:485:m0_file_init   <0:0>
2132854  2865676.889.308944  7ff1d67fb440  CALL    lib/memory.c:130:m0_alloc   > size=26
2132855  2865676.889.311152  7ff1d67fb440  CALL    lib/memory.c:141:m0_alloc   < ptr=0x7ff1d8266c90 size=26
2132856  2865676.889.316514  7ff1d67fb480  CALL    cob/cob.c:1702:m0_cob_create   > nskey=(<4700000004b51096:f02400000900000a>, '10') nsrec=(<4300000a04b51096:f02400000900000a>, 0)
2132857  2865676.889.326090  7ff1d67fb0a0  CALL    be/btree.c:1900:be_btree_insert   > tree=0x4000001314b8
2132858  2865676.889.328726  7ff1d67fac90  CALL    be/btree.c:1380:btree_save   > tree=0x4000001314b8
2132859  2865676.889.329092  7ff1d67fac90  DEBUG   be/btree.c:1390:btree_save   DEC cr=0x7ff18c079a98 user=0 balance=23
2132860  2865676.889.346514  7ff1d67fab00  DEBUG   be/fl.c:213:m0_be_fl_pick   chunk=0x4000001a53d8 size=56 bac_size=26842868696 index=128 iterations=1
2132861  2865676.889.634178  7ff1d67fab70  DEBUG   be/alloc.c:1068:m0_be_alloc_aligned   allocator=0x55973595e708 size=50 shift=3 c=0x4000001a53d8 c->bac_size=56 ptr=0x4000001a5428
2132862  2865676.889.678732  7ff1d67fac00  CALL    format/format.c:146:m0_format_footer_verify_generic   < rc=0
2132863  2865676.889.896452  7ff1d67fabb0  CALL    format/format.c:146:m0_format_footer_verify_generic   < rc=0
2132864  2865676.890.026326  7ff1d67fac00  CALL    format/format.c:146:m0_format_footer_verify_generic   < rc=0
2132865  2865676.890.097606  7ff1d67fabb0  CALL    format/format.c:146:m0_format_footer_verify_generic   < rc=0
2132866  2865676.890.110522  7ff1d67fac90  CALL    be/btree.c:1479:btree_save   < tree=0x4000001314b8
2132867  2865676.890.111416  7ff1d67fb0a0  CALL    be/btree.c:1909:be_btree_insert   < tree=0x4000001314b8
2132868  2865676.890.163468  7ff1d67fb0a0  CALL    be/btree.c:1900:be_btree_insert   > tree=0x4000001315b8
2132869  2865676.890.164676  7ff1d67fac90  CALL    be/btree.c:1380:btree_save   > tree=0x4000001315b8
2132870  2865676.890.165082  7ff1d67fac90  DEBUG   be/btree.c:1390:btree_save   DEC cr=0x7ff18c079a98 user=0 balance=22
2132871  2865676.890.255014  7ff1d67fab00  DEBUG   be/fl.c:213:m0_be_fl_pick   chunk=0x4000001a5460 size=184 bac_size=26842868560 index=128 iterations=1
2132872  2865676.890.356460  7ff1d67fab70  DEBUG   be/alloc.c:1068:m0_be_alloc_aligned   allocator=0x55973595e708 size=184 shift=3 c=0x4000001a5460 c->bac_size=184 ptr=0x4000001a54b0
2132873  2865676.890.364958  7ff1d67fac00  CALL    format/format.c:146:m0_format_footer_verify_generic   < rc=0
2132874  2865676.890.391780  7ff1d67fabb0  CALL    format/format.c:146:m0_format_footer_verify_generic   < rc=0
2132875  2865676.890.473704  7ff1d67fac00  CALL    format/format.c:146:m0_format_footer_verify_generic   < rc=0
2132876  2865676.890.496294  7ff1d67fabb0  CALL    format/format.c:146:m0_format_footer_verify_generic   < rc=0
2132877  2865676.890.508352  7ff1d67fac90  CALL    be/btree.c:1479:btree_save   < tree=0x4000001315b8
2132878  2865676.890.508732  7ff1d67fb0a0  CALL    be/btree.c:1909:be_btree_insert   < tree=0x4000001315b8
2132879  2865676.890.651216  7ff1d67fb480  CALL    cob/cob.c:1786:m0_cob_create   < rc=0
2132880  2865676.890.654646  7ff1d67fb510  DEBUG   lib/memory.c:151:m0_free   0x7ff1d8266c90
2132881  2865676.890.656730  7ff1d67fb530  DEBUG   lib/memory.c:151:m0_free   0x7ff1d815b0c0
2132882  2865676.890.658008  7ff1d67fb560  CALL    ioservice/io_foms.c:938:m0_io_cob_create   < rc=0
2132883  2865676.890.660786  7ff1d67fb610  CALL    cob/cob.c:1417:m0_cob_locate   > dom=0x400000131490 oikey=(<4300000a04b51096:f02400000900000a>, 0)
2132884  2865676.890.661334  7ff1d67fb580  CALL    lib/memory.c:130:m0_alloc   > size=416
2132885  2865676.890.662332  7ff1d67fb580  CALL    lib/memory.c:141:m0_alloc   < ptr=0x7ff1d815b0c0 size=416
2132886  2865676.890.663352  7ff1d67fb560  DEBUG   file/file.c:485:m0_file_init   <0:0>
2132887  2865676.890.680364  7ff1d67fb250  CALL    be/btree.c:2338:m0_be_btree_cursor_get_sync   < rc=0
2132888  2865676.890.683018  7ff1d67fb290  DEBUG   cob/cob.c:1264:cob_oi_lookup   found: fid=<4300000a04b51096:f02400000900000a> lno=0 pfid=<4700000004b51096:f02400000900000a> name='10'
2132889  2865676.890.683678  7ff1d67fb1f0  CALL    lib/memory.c:130:m0_alloc   > size=26
2132890  2865676.890.683974  7ff1d67fb1f0  CALL    lib/memory.c:141:m0_alloc   < ptr=0x7ff1d8266c90 size=26
2132891  2865676.890.684760  7ff1d67fb290  CALL    cob/cob.c:1283:cob_oi_lookup   < rc=0
2132892  2865676.890.688306  7ff1d67fb050  CALL    be/btree.c:1998:be_btree_lookup   > tree=0x4000001315b8 key_in=0x7ff1d67fb620 key_out=(nil) value=0x7ff1d67fb630
2132893  2865676.890.695882  7ff1d67fb050  CALL    be/btree.c:2030:be_btree_lookup   < rc=0
2132894  2865676.890.698640  7ff1d67fb570  CALL    cob/cob.c:1368:cob_get_fabomg   < rc=0
2132895  2865676.890.698992  7ff1d67fb610  CALL    cob/cob.c:1454:m0_cob_locate   < rc=0
2132896  2865676.890.699842  7ff1d67fb6a0  CALL    ioservice/io_foms.c:992:m0_io_cob_stob_create   < rc=0
2132897  2865676.890.701224  7ff1d67fb7b0  CALL    ioservice/io_foms.c:1016:io_fom_cob2file   < rc=0
2132898  2865676.890.764816  7ff1d67fb6f0  CALL    lib/memory.c:130:m0_alloc   > size=2048
2132899  2865676.890.769062  7ff1d67fb6f0  CALL    lib/memory.c:141:m0_alloc   < ptr=0x7ff1d81851c0 size=2048
2132900  2865676.890.769424  7ff1d67fb6f0  CALL    lib/memory.c:130:m0_alloc   > size=2048
2132901  2865676.890.770734  7ff1d67fb6f0  CALL    lib/memory.c:141:m0_alloc   < ptr=0x7ff1d815a180 size=2048
2132902  2865676.890.782858  7ff1d67fb7b0  DEBUG   ioservice/io_foms.c:1816:io_launch   launch fom: 0x7ff18c079920, start_time 1654014028237796989, req_count: 0, count: 0, submitted: 100, expect: 100
2132903  2865676.890.784764  7ff1d67fb700  CALL    lib/memory.c:130:m0_alloc   > size=424
2132904  2865676.890.787134  7ff1d67fb700  CALL    lib/memory.c:141:m0_alloc   < ptr=0x7ff1d82192a0 size=424
2132905  2865676.890.788224  7ff1d67fb750  CALL    stob/ad.c:1039:stob_ad_io_init   < rc=0
2132906  2865676.890.795026  7ff1d67fb6a0  CALL    stob/io.c:164:m0_stob_io_prepare   > stob=0x7ff1d80877c0 so_id={<200000000000000:a>,<200000a04b51096:f02400000900000a>} si_opcode=2 io=0x7ff1d81c80c8 tx=0x7ff18c0799e8
2132907  2865676.890.796486  7ff1d67fa760  CALL    stob/ad.c:2021:stob_ad_io_launch_prepare   > op=2, stob 0x7ff1d80877c0, stob_id={<200000000000000:a>,<200000a04b51096:f02400000900000a>}
2132908  2865676.890.800550  7ff1d67fa690  DEBUG   stob/ad.c:1111:stob_ad_cursor   <200000a04b51096:f02400000900000a>
2132909  2865676.890.869388  7ff1d67fa510  CALL    lib/memory.c:130:m0_alloc   > size=64
2132910  2865676.890.870284  7ff1d67fa510  CALL    lib/memory.c:141:m0_alloc   < ptr=0x7ff18c01a1b0 size=64
2132911  2865676.890.874160  7ff1d67fa600  CALL    be/extmap.c:1149:be_emap_lookup   < rc=0
2132912  2865676.890.876684  7ff1d67fa690  CALL    stob/ad.c:1117:stob_ad_cursor   < rc=0
2132913  2865676.890.878202  7ff1d67fa700  CALL    stob/ad.c:1229:stob_ad_cursors_init   < rc=0
2132914  2865676.890.881774  7ff1d67fa760  CALL    stob/ad.c:1924:stob_ad_write_prepare   > op=2 sz=256
2132915  2865676.890.884364  7ff1d67fa760  DEBUG   stob/ad.c:1072:stob_ad_balloc   count=256
2132916  2865676.890.887830  7ff1d67fa3d0  CALL    balloc/balloc.c:2877:balloc_alloc   > bal=0x400000103e38 goal=0x0 count=256
2132917  2865676.890.888156  7ff1d67fa3d0  CALL    balloc/balloc.c:2607:balloc_allocate_internal   > 
2132918  2865676.890.890090  7ff1d67fa380  CALL    balloc/balloc.c:1077:balloc_got_freespace   > 
2132919  2865676.890.890384  7ff1d67fa380  DEBUG   balloc/balloc.c:1081:balloc_got_freespace   bsb_freeblocks=6385152 blocks=256
2132920  2865676.890.890700  7ff1d67fa380  CALL    balloc/balloc.c:1090:balloc_got_freespace   < 
2132921  2865676.890.890996  7ff1d67fa380  CALL    balloc/balloc.c:1091:balloc_got_freespace   < rc=1
2132922  2865676.890.892424  7ff1d67fa3d0  CALL    balloc/balloc.c:1012:balloc_init_ac   > 
2132923  2865676.890.893794  7ff1d67fa3d0  CALL    balloc/balloc.c:1040:balloc_init_ac   < rc=0
2132924  2865676.890.894250  7ff1d67fa3d0  CALL    balloc/balloc.c:1110:balloc_normalize_request   > 
2132925  2865676.890.894570  7ff1d67fa3d0  CALL    balloc/balloc.c:1173:balloc_normalize_request   < 
2132926  2865676.890.896316  7ff1d67fa3d0  CALL    balloc/balloc.c:2411:balloc_regular_allocator   > goal=0x2af00 len=256
2132927  2865676.890.897722  7ff1d67fa3d0  DEBUG   balloc/balloc.c:2439:balloc_regular_allocator   cr=0
2132928  2865676.890.967040  7ff1d67fa070  CALL    balloc/balloc.c:1290:m0_balloc_load_extents   > grp=2 non-spare-frags=2 spare-frags=0
2132929  2865676.890.967436  7ff1d67fa020  CALL    lib/memory.c:130:m0_alloc   > size=216
2132930  2865676.890.968106  7ff1d67fa020  CALL    lib/memory.c:141:m0_alloc   < ptr=0x7ff1d8087690 size=216
2132931  2865676.890.986634  7ff1d67f9d60  CALL    be/btree.c:1998:be_btree_lookup   > tree=0x400000103f78 key_in=0x7ff1d67fa150 key_out=0x7ff1d67fa150 value=0x7ff1d67fa160
2132932  2865676.891.066758  7ff1d67f9d60  CALL    be/btree.c:2030:be_btree_lookup   < rc=0
2132933  2865676.891.070996  7ff1d67f9d60  CALL    be/btree.c:1998:be_btree_lookup   > tree=0x400000103f78 key_in=0x7ff1d67fa150 key_out=0x7ff1d67fa150 value=0x7ff1d67fa160
2132934  2865676.891.072528  7ff1d67f9d60  CALL    be/btree.c:2030:be_btree_lookup   < rc=0
2132935  2865676.891.074092  7ff1d67fa070  ERROR   balloc/balloc.c:1346:m0_balloc_load_extents   Invalid extent
2132936  2865676.891.349106  7ff1d67f9f40  FATAL   lib/assert.c:50:m0_panic   panic: (0) at m0_balloc_load_extents() (balloc/balloc.c:1347)  [git: 2.0.0-790-9-g662e7a18] /etc/cortx/log/motr/def3fa765c954db3812b45500924c243/trace/m0d-0x7200000000000001:0x69/m0trace.35.2022-05-31-16:09:28

@andriytk
Contributor Author

from gdb:

(gdb) bt
#0  0x00007ff1e032b38f in raise () from /lib64/libc.so.6
#1  0x00007ff1e0315dc5 in abort () from /lib64/libc.so.6
#2  0x00007ff1e241d383 in m0_arch_panic (c=c@entry=0x7ff1e281f420 <__pctx.12199>, ap=ap@entry=0x7ff1d67f9ff0)
    at lib/user_space/uassert.c:131
#3  0x00007ff1e240be2d in m0_panic (ctx=ctx@entry=0x7ff1e281f420 <__pctx.12199>) at lib/assert.c:52
#4  0x00007ff1e23344ba in m0_balloc_load_extents (cb=<optimized out>, grp=grp@entry=0x559735b07310) at balloc/balloc.c:1347
#5  0x00007ff1e233527f in balloc_regular_allocator (bac=0x7ff1d67fa690) at balloc/balloc.c:2471
#6  balloc_allocate_internal (req=0x7ff1d67fa630, tx=<optimized out>, ctx=0x400000103e38) at balloc/balloc.c:2626
#7  balloc_alloc (ballroom=0x400000103f60, tx=<optimized out>, count=<optimized out>, out=0x7ff1d67faa40, 
    alloc_zone=<optimized out>) at balloc/balloc.c:2891
#8  0x00007ff1e24e078e in stob_ad_balloc (adom=0x4000001073d0, adom=0x4000001073d0, alloc_type=<optimized out>, 
    out=0x7ff1d67faa40, count=256, tx=<optimized out>) at stob/ad.c:1074
#9  stob_ad_write_prepare (map=0x7ff1d67fa8c0, src=0x7ff1d67fa8f0, adom=0x4000001073d0, io=0x7ff1d81c80c8) at stob/ad.c:1935
#10 stob_ad_io_launch_prepare (io=0x7ff1d81c80c8) at stob/ad.c:2039
#11 0x00007ff1e24e3a7f in m0_stob_io_prepare (io=io@entry=0x7ff1d81c80c8, obj=obj@entry=0x7ff1d80877c0, 
    tx=tx@entry=0x7ff18c0799e8, scope=scope@entry=0x0) at stob/io.c:178
#12 0x00007ff1e24e3f25 in m0_stob_io_prepare_and_launch (io=io@entry=0x7ff1d81c80c8, obj=0x7ff1d80877c0, 
    tx=tx@entry=0x7ff18c0799e8, scope=scope@entry=0x0) at stob/io.c:226
#13 0x00007ff1e23f6209 in io_launch (fom=0x7ff18c079920) at ioservice/io_foms.c:1837
#14 0x00007ff1e23f38f3 in m0_io_fom_cob_rw_tick (fom=0x7ff18c079920) at ioservice/io_foms.c:2333
#15 0x00007ff1e23dc814 in fom_exec (fom=0x7ff18c079920) at fop/fom.c:791
#16 loc_handler_thread (th=0x5597354d5b20) at fop/fom.c:931
#17 0x00007ff1e24129be in m0_thread_trampoline (arg=arg@entry=0x5597354d5b28) at lib/thread.c:117
#18 0x00007ff1e241e061 in uthread_trampoline (arg=0x5597354d5b28) at lib/user_space/uthread.c:98
#19 0x00007ff1e1aeb15a in start_thread () from /lib64/libpthread.so.0
#20 0x00007ff1e03f0dd3 in clone () from /lib64/libc.so.6
(gdb) f 4
#4  0x00007ff1e23344ba in m0_balloc_load_extents (cb=<optimized out>, grp=grp@entry=0x559735b07310) at balloc/balloc.c:1347
1347				M0_ASSERT(false);
(gdb) l
1342				++spare_frags;
1343				zone_params_update(grp, &ex->le_ext,
1344						   M0_BALLOC_SPARE_ZONE);
1345			} else {
1346				M0_LOG(M0_ERROR, "Invalid extent");
1347				M0_ASSERT(false);
1348			}
1349			next_key = ex->le_ext.e_end + 1;
1350			balloc_debug_dump_extent("loading...", &ex->le_ext);
1351		}
(gdb) p /x *ex
$3 = {le_is_alloc = 0x0, le_link = {ll_next = 0x0, ll_prev = 0x0}, le_ext = {e_header = {hd_magic = 0x33011ca5e511de77, 
      hd_bits = 0x1001100000020}, e_start = 0x30c00, e_end = 0x31000, e_footer = {ft_magic = 0x33f007e7f007e777, 
      ft_checksum = 0xc8ac140dfdbf97a3}}}
(gdb) p /x normal_range
$5 = {e_header = {hd_magic = 0x33011ca5e511de77, hd_bits = 0x1001100000020}, e_start = 0x20000, e_end = 0x30000, e_footer = {
    ft_magic = 0x33f007e7f007e777, ft_checksum = 0x701361c50db696ac}}
(gdb) p /x spare_range
$6 = {e_header = {hd_magic = 0x33011ca5e511de77, hd_bits = 0x1001100000020}, e_start = 0x30000, e_end = 0x30000, e_footer = {
    ft_magic = 0x33f007e7f007e777, ft_checksum = 0xf97fcd9fbbeb16af}}
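
To spell out why the assert fires: plugging the values above into a minimal standalone model of the check (assuming m0_ext_is_partof() is plain range containment; this is only an illustration, not the Motr code) shows that the extent fits neither the normal nor the spare range of group 2, so the else branch with M0_ASSERT(false) is taken:

#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for struct m0_ext: a block range [e_start, e_end). */
struct ext {
	uint64_t e_start;
	uint64_t e_end;
};

/* Assumed containment semantics of m0_ext_is_partof(super, sub). */
static int is_partof(const struct ext *super, const struct ext *sub)
{
	return sub->e_start >= super->e_start && sub->e_end <= super->e_end;
}

int main(void)
{
	/* Values taken from the gdb output above (group 2). */
	struct ext ex           = { 0x30c00, 0x31000 };
	struct ext normal_range = { 0x20000, 0x30000 };
	struct ext spare_range  = { 0x30000, 0x30000 };

	if (is_partof(&normal_range, &ex))
		puts("normal extent");
	else if (is_partof(&spare_range, &ex))
		puts("spare extent");
	else
		puts("Invalid extent -> M0_ASSERT(false)"); /* this branch fires */
	return 0;
}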

@andriytk
Contributor Author

Looks like there are no more free extents in the balloc group?

@r-wambui added the Triage: DevTeam and Status: L1 Triage labels Jun 8, 2022

Mukund Bala Kanekar commented in Jira Server:

Found a series of crash dumps on ssc-vm-g4-rhev4-1491 in the timeframe May 31 09:16 to May 31 10:20 (MDT time zone):
{noformat}
[root@ssc-vm-g4-rhev4-1491 ~]# ls -lrt /mnt/fs-local-volume/local-path-provisioner/pvc-c9282ee5-c9ff-4469-8d2f-f959c38559a5_default_cortx-data-fs-local-pvc-ssc-vm-g4-rhev4-1491/motr/m0d-0x7200000000000001:0x69
total 6886792
drwx------ 3 root root 4096 May 30 05:02 db
drwx------ 3 root root 4096 May 30 05:02 db-log
drwx------ 3 root root 4096 May 30 05:02 stobs
-rw------- 1 root root 1014378496 May 30 10:10 core.1653927029.35
-rw------- 1 root root 952832000 May 30 10:24 core.1653927860.35
-rw------- 1 root root 925487104 May 30 10:24 core.1653927873.35
-rw------- 1 root root 925384704 May 30 10:29 core.1653928191.35
-rw------- 1 root root 925483008 May 30 10:30 core.1653928246.35
-rw------- 1 root root 936599552 May 30 10:32 core.1653928353.35
-rw------- 1 root root 943222784 May 30 10:58 core.1653929926.35
-rw------- 1 root root 925409280 May 30 10:59 core.1653929967.35
-rw------- 1 root root 947154944 May 30 10:59 core.1653929993.34
-rw------- 1 root root 926494720 May 30 11:00 core.1653930038.35
-rw------- 1 root root 937992192 May 31 09:16 core.1654010193.35
-rw------- 1 root root 936472576 May 31 09:18 core.1654010284.35
-rw------- 1 root root 936955904 May 31 09:18 core.1654010334.35
-rw------- 1 root root 933593088 May 31 09:47 core.1654012026.35
-rw------- 1 root root 924852224 May 31 09:47 core.1654012037.35
-rw------- 1 root root 926212096 May 31 09:49 core.1654012156.35
-rw------- 1 root root 936136704 May 31 09:50 core.1654012203.35
-rw------- 1 root root 925474816 May 31 09:54 core.1654012449.34
-rw------- 1 root root 937316352 May 31 09:58 core.1654012684.35
-rw------- 1 root root 926699520 May 31 10:04 core.1654013065.35
-rw------- 1 root root 942231552 May 31 10:20 core.1654014028.35
[root@ssc-vm-g4-rhev4-1491 ~]# date
Thu Jun 9 02:31:09 MDT 2022
[root@ssc-vm-g4-rhev4-1491 ~]#{noformat}
The backtrace for core.1654010193.35 (first) and core.1654014028.35 (last) core dumps on May 31 is the same, i.e. the assert is from "m0_balloc_load_extents (cb=<optimized out>, grp=grp@entry=0x55a329499350) at balloc/balloc.c:1347".

Further analysis from core.1654010193.35:
The extent under process at the time of the crash has e_start = 0x30c00, e_end = 0x31000, but group 2 (the group under process) has the normal range e_start = 0x20000, e_end = 0x30000
and 0 blocks in the spare range e_start = 0x30000, e_end = 0x30000.

 
{noformat}
(gdb) f 4
#4 0x00007f5cbe59b4ba in m0_balloc_load_extents (cb=<optimized out>, grp=grp@entry=0x55a329499350) at balloc/balloc.c:1347
1347 M0_ASSERT(false);
(gdb) l
1342 ++spare_frags;
1343 zone_params_update(grp, &ex->le_ext,
1344 M0_BALLOC_SPARE_ZONE);
1345 } else {
1346 M0_LOG(M0_ERROR, "Invalid extent");
1347 M0_ASSERT(false);
1348 }
1349 next_key = ex->le_ext.e_end + 1;
1350 balloc_debug_dump_extent("loading...", &ex->le_ext);
1351 }
(gdb) p/x *ex
$28 = {le_is_alloc = 0x0, le_link = {ll_next = 0x0, ll_prev = 0x0}, le_ext = {e_header = {hd_magic = 0x33011ca5e511de77, hd_bits = 0x1001100000020}, e_start = 0x30c00,
e_end = 0x31000, e_footer = {ft_magic = 0x33f007e7f007e777, ft_checksum = 0xc8ac140dfdbf97a3}}}
(gdb) p/x *grp
$29 = {bgi_state = 0x1, bgi_groupno = 0x2, bgi_normal = {bzp_type = 0x2000, bzp_range = {e_header = {hd_magic = 0x0, hd_bits = 0x0}, e_start = 0x20000, e_end = 0x30000,
e_footer = {ft_magic = 0x0, ft_checksum = 0x0}}, bzp_freeblocks = 0x200, bzp_fragments = 0x2, bzp_maxchunk = 0x200, bzp_extents = {l_head = 0x7f5c8c09f838,
l_tail = 0x7f5c8c09f838}}, bgi_spare = {bzp_type = 0x1000, bzp_range = {e_header = {hd_magic = 0x0, hd_bits = 0x0}, e_start = 0x30000, e_end = 0x30000, e_footer = {
ft_magic = 0x0, ft_checksum = 0x0}}, bzp_freeblocks = 0x0, bzp_fragments = 0x0, bzp_maxchunk = 0x0, bzp_extents = {l_head = 0x55a329499410,
l_tail = 0x55a329499410}}, bgi_extents = 0x7f5c8c09f830, bgi_mutex = {bm_u = {mutex = {m_arch = {m_impl = {__data = {__lock = 0x1, __count = 0x0, __owner = 0x4a,
__nusers = 0x1, __kind = 0x0, __spins = 0x0, __elision = 0x0, __list = {__prev = 0x0, __next = 0x0}}, __size = {0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x4a,
0x0, 0x0, 0x0, 0x1, 0x0 <repeats 27 times>}, __align = 0x1}}, m_owner = 0x55a328fe7938, m_addb2 = 0x0}, pad = {0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x4a,
0x0, 0x0, 0x0, 0x1, 0x0 <repeats 27 times>, 0x38, 0x79, 0xfe, 0x28, 0xa3, 0x55, 0x0 <repeats 122 times>}}}}
(gdb){noformat}
The extent under process appears to belong to the next group (i.e. group 3), which has the normal range e_start = 0x30000, e_end = 0x40000:
{noformat}
(gdb) p/x *(grp+1)
$30 = {bgi_state = 0x1, bgi_groupno = 0x3, bgi_normal = {bzp_type = 0x2000, bzp_range = {e_header = {hd_magic = 0x0, hd_bits = 0x0}, e_start = 0x30000, e_end = 0x40000,
e_footer = {ft_magic = 0x0, ft_checksum = 0x0}}, bzp_freeblocks = 0xd400, bzp_fragments = 0x2, bzp_maxchunk = 0xd000, bzp_extents = {l_head = 0x55a329499530,
l_tail = 0x55a329499530}}, bgi_spare = {bzp_type = 0x1000, bzp_range = {e_header = {hd_magic = 0x0, hd_bits = 0x0}, e_start = 0x40000, e_end = 0x40000, e_footer = {
ft_magic = 0x0, ft_checksum = 0x0}}, bzp_freeblocks = 0x0, bzp_fragments = 0x0, bzp_maxchunk = 0x0, bzp_extents = {l_head = 0x55a329499590,
l_tail = 0x55a329499590}}, bgi_extents = 0x0, bgi_mutex = {bm_u = {mutex = {m_arch = {m_impl = {__data = {__lock = 0x0, __count = 0x0, __owner = 0x0, __nusers = 0x0,
__kind = 0x0, __spins = 0x0, __elision = 0x0, __list = {__prev = 0x0, __next = 0x0}}, __size = {0x0 <repeats 40 times>}, __align = 0x0}}, m_owner = 0x0,
m_addb2 = 0x0}, pad = {0x0 <repeats 168 times>}}}}
(gdb){noformat}
According to the group info for group 2, bzp_fragments = 0x2 for the normal range.
The 0th extent has already been added to the normal-range free extent list of group 2, and it has a valid range belonging to the normal range of group 2:
{noformat}
(gdb) p/x grp->bgi_normal.bzp_extents
$18 = {l_head = 0x7f5c8c09f838, l_tail = 0x7f5c8c09f838}
(gdb) p/x &grp->bgi_normal.bzp_extents
$19 = 0x55a3294993b0
(gdb) p sizeof(struct m0_list_link)
$20 = 16
(gdb) p/x *(struct m0_ext *)0x7f5c8c09f848
$21 = {e_header = {hd_magic = 0x33011ca5e511de77, hd_bits = 0x1001100000020}, e_start = 0x2ae00,
e_end = 0x2b000, e_footer = {ft_magic = 0x33f007e7f007e777, ft_checksum = 0xe4460464fcf98771}}
(gdb){noformat}
The crash occurred while processing the 2nd extent, at index = 1:
{noformat}
(gdb) p i
$31 = 1{noformat}
There are 2 possibilities:

  1. The fragment count of 2 stored in 'struct m0_be_btree cb_db_group_desc' is invalid,
    OR
  2. The free extent list stored in 'struct m0_be_btree cb_db_group_extents' is invalid.
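
(For reference, the address arithmetic used above to walk from the list head to the extent can be expressed as the standalone sketch below; the field layout of the extent node — an allocation flag, then the list link, then the extent — is only inferred from the gdb dumps and is an illustration, not the Motr definition.)

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-ins for the structures seen in the dumps above. */
struct list_link { void *ll_next; void *ll_prev; };              /* 16 bytes on x86_64   */
struct ext       { uint64_t hdr[2], e_start, e_end, ftr[2]; };    /* header/extent/footer */
struct lext      { uint64_t le_is_alloc; struct list_link le_link; struct ext le_ext; };

int main(void)
{
	/* l_head of grp->bgi_normal.bzp_extents points at le_link of the first
	 * node, so the extent itself starts sizeof(struct list_link) bytes
	 * later -- the same arithmetic done by hand in gdb:
	 * 0x7f5c8c09f838 + 16 == 0x7f5c8c09f848. */
	uintptr_t l_head   = 0x7f5c8c09f838;
	uintptr_t ext_addr = l_head + sizeof(struct list_link);

	printf("extent at 0x%lx (gdb printed it at 0x7f5c8c09f848)\n",
	       (unsigned long)ext_addr);
	printf("offset of le_ext from le_link: %zu bytes\n",
	       offsetof(struct lext, le_ext) - offsetof(struct lext, le_link));
	return 0;
}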


Mukund Bala Kanekar commented in Jira Server:

From the core dump, the details of the group_extents and group_desc btrees are as follows:
{noformat}
(gdb) p db_ext
$1 = (struct m0_be_btree *) 0x400000103f78
(gdb) p/x ((struct m0_balloc *)((char *)(db_ext)-(char *)(&((struct m0_balloc *)0)->cb_db_group_extents)))
$2 = 0x400000103e38
(gdb) p/x &((struct m0_balloc *)0x400000103e38)->cb_db_group_extents
$3 = 0x400000103f78
(gdb) p/x ((struct m0_balloc *)0x400000103e38)->cb_db_group_extents
$4 = {bb_header = {hd_magic = 0x33011ca5e511de77, hd_bits = 0x1000300000050}, bb_cookie_gen = 0x16f3dd167a5ffc27, bb_backlink = {bli_tree = {co_addr = 0x400000103f88,
co_generation = 0x16f3dd167a5ffc27}, bli_type = 0x2, bli_gen = 0x16f3dd16f8c98fd3, bli_fid = {f_container = 0x6200000000000002, f_key = 0xa}},
bb_root = 0x400000104280, bb_footer = {ft_magic = 0x33f007e7f007e777, ft_checksum = 0xe4844cbdb08c9889}, bb_lock = {bl_u = {rwlock = {rw_lock = {__data = {
__readers = 0x0, __writers = 0x0, __wrphase_futex = 0x0, __writers_futex = 0x0, __pad3 = 0x0, __pad4 = 0x0, __cur_writer = 0x0, __shared = 0x0,
__rwelision = 0x0, __pad1 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, __pad2 = 0x0, __flags = 0x0}, __size = {0x0 <repeats 56 times>}, __align = 0x0}}, pad = {
0x0 <repeats 144 times>}}}, bb_seg = 0x55a32943a010, bb_ops = 0x7f5cbea86880}
(gdb) p/x ((struct m0_balloc *)0x400000103e38)->cb_db_group_desc
$5 = {bb_header = {hd_magic = 0x33011ca5e511de77, hd_bits = 0x1000300000050}, bb_cookie_gen = 0x16f3dd167a5ffc28, bb_backlink = {bli_tree = {co_addr = 0x400000104088,
co_generation = 0x16f3dd167a5ffc28}, bli_type = 0x3, bli_gen = 0x16f3dd16f8c98fd3, bli_fid = {f_container = 0x6200000000000003, f_key = 0xa}},
bb_root = 0x400000105b28, bb_footer = {ft_magic = 0x33f007e7f007e777, ft_checksum = 0x6bffc6d17c8d7acf}, bb_lock = {bl_u = {rwlock = {rw_lock = {__data = {
__readers = 0x0, __writers = 0x0, __wrphase_futex = 0x0, __writers_futex = 0x0, __pad3 = 0x0, __pad4 = 0x0, __cur_writer = 0x0, __shared = 0x0,
__rwelision = 0x0, __pad1 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, __pad2 = 0x0, __flags = 0x0}, __size = {0x0 <repeats 56 times>}, __align = 0x0}}, pad = {
0x0 <repeats 144 times>}}}, bb_seg = 0x55a32943a010, bb_ops = 0x7f5cbea86860}
(gdb){noformat}
The next step is to get the on-disk data of group 2 from the 'cb_db_group_desc' btree using m0beck, to identify any discrepancies.


Mukund Bala Kanekar commented in Jira Server:

Further debugging shows that the root node of the Group Descriptor btree holds the group 2 descriptor at kv array index 2:
{noformat}
(gdb) p/x ((struct m0_balloc *)0x400000103e38)->cb_db_group_desc
$1 = {bb_header = {hd_magic = 0x33011ca5e511de77, hd_bits = 0x1000300000050}, bb_cookie_gen = 0x16f3dd167a5ffc28, bb_backlink = {bli_tree = {co_addr = 0x400000104088,
co_generation = 0x16f3dd167a5ffc28}, bli_type = 0x3, bli_gen = 0x16f3dd16f8c98fd3, bli_fid = {f_container = 0x6200000000000003, f_key = 0xa}},
bb_root = 0x400000105b28, bb_footer = {ft_magic = 0x33f007e7f007e777, ft_checksum = 0x6bffc6d17c8d7acf}, bb_lock = {bl_u = {rwlock = {rw_lock = {__data = {
__readers = 0x0, __writers = 0x0, __wrphase_futex = 0x0, __writers_futex = 0x0, __pad3 = 0x0, __pad4 = 0x0, __cur_writer = 0x0, __shared = 0x0,
__rwelision = 0x0, __pad1 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, __pad2 = 0x0, __flags = 0x0}, __size = {0x0 <repeats 56 times>}, __align = 0x0}}, pad = {
0x0 <repeats 144 times>}}}, bb_seg = 0x55a32943a010, bb_ops = 0x7f5cbea86860}
(gdb) p/x ((struct m0_balloc *)0x400000103e38)->cb_db_group_desc.bb_root
$2 = 0x400000105b28
(gdb) p/x *((struct m0_be_bnode *)0x400000105b28)
$3 = {bt_header = {hd_magic = 0x33011ca5e511de77, hd_bits = 0x1000400001848}, bt_backlink = {bli_tree = {co_addr = 0x400000104088, co_generation = 0x16f3dd167a5ffc28},
bli_type = 0x3, bli_gen = 0x16f3dd16f8c98fd3, bli_fid = {f_container = 0x6200000000000003, f_key = 0xa}}, bt_next = 0x0, bt_num_active_key = 0x64, bt_level = 0x0,
bt_isleaf = 0x1, bt_pad = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, bt_kv_arr = {{btree_key = 0x40000010a010, btree_val = 0x40000010a018}, {btree_key = 0x40000010a2f8,
btree_val = 0x40000010a300}, {btree_key = 0x40000010a108, btree_val = 0x40000010a110}, {btree_key = 0x40000010a9c0, btree_val = 0x40000010a9c8}, {
btree_key = 0x40000010b560, btree_val = 0x40000010b568}, {btree_key = 0x40000010bc88, btree_val = 0x40000010bc90}, {btree_key = 0x40000010ce90,
btree_val = 0x40000010ce98}, {btree_key = 0x40000010a200, btree_val = 0x40000010a208}, {btree_key = 0x40000010a3f0, btree_val = 0x40000010a3f8}, {
btree_key = 0x40000010d270, btree_val = 0x40000010d278}, {btree_key = 0x40000010d460, btree_val = 0x40000010d468}, {btree_key = 0x40000010a4e8,
btree_val = 0x40000010a4f0}, {btree_key = 0x40000010b3d0, btree_val = 0x40000010b3d8}, {btree_key = 0x40000010b088, btree_val = 0x40000010b090}, {
btree_key = 0x40000010d558, btree_val = 0x40000010d560}, {btree_key = 0x40000010a5e0, btree_val = 0x40000010a5e8}, {btree_key = 0x40000010bd20,
btree_val = 0x40000010bd28}, {btree_key = 0x40000010b940, btree_val = 0x40000010b948}, {btree_key = 0x40000010bb30, btree_val = 0x40000010bb38}, {
btree_key = 0x40000010b848, btree_val = 0x40000010b850}, {btree_key = 0x40000010aca8, btree_val = 0x40000010acb0}, {btree_key = 0x40000010c7c8,
btree_val = 0x40000010c7d0}, {btree_key = 0x40000010b468, btree_val = 0x40000010b470}, {btree_key = 0x40000010cf88, btree_val = 0x40000010cf90}, {
btree_key = 0x40000010d178, btree_val = 0x40000010d180}, {btree_key = 0x40000010ae98, btree_val = 0x40000010aea0}, {btree_key = 0x40000010b278,
btree_val = 0x40000010b280}, {btree_key = 0x40000010a7d0, btree_val = 0x40000010a7d8}, {btree_key = 0x40000010a8c8, btree_val = 0x40000010a8d0}, {
btree_key = 0x40000010b658, btree_val = 0x40000010b660}, {btree_key = 0x40000010c5d8, btree_val = 0x40000010c5e0}, {btree_key = 0x40000010dc20,
btree_val = 0x40000010dc28}, {btree_key = 0x40000010dd18, btree_val = 0x40000010dd20}, {btree_key = 0x40000010aab8, btree_val = 0x40000010aac0}, {
btree_key = 0x40000010c2f0, btree_val = 0x40000010c2f8}, {btree_key = 0x40000010c3e8, btree_val = 0x40000010c3f0}, {btree_key = 0x40000010af90,
btree_val = 0x40000010af98}, {btree_key = 0x40000010a6d8, btree_val = 0x40000010a6e0}, {btree_key = 0x40000010b180, btree_val = 0x40000010b188}, {
btree_key = 0x40000010b750, btree_val = 0x40000010b758}, {btree_key = 0x40000010ba38, btree_val = 0x40000010ba40}, {btree_key = 0x40000010abb0,
btree_val = 0x40000010abb8}, {btree_key = 0x40000010c6d0, btree_val = 0x40000010c6d8}, {btree_key = 0x40000010c8c0, btree_val = 0x40000010c8c8}, {
btree_key = 0x40000010c9b8, btree_val = 0x40000010c9c0}, {btree_key = 0x40000010cab0, btree_val = 0x40000010cab8}, {btree_key = 0x40000010ada0,
btree_val = 0x40000010ada8}, {btree_key = 0x40000010c4e0, btree_val = 0x40000010c4e8}, {btree_key = 0x40000010c1f8, btree_val = 0x40000010c200}, {
btree_key = 0x40000010c100, btree_val = 0x40000010c108}, {btree_key = 0x40000010c008, btree_val = 0x40000010c010}, {btree_key = 0x40000010bf10,
btree_val = 0x40000010bf18}, {btree_key = 0x40000010be18, btree_val = 0x40000010be20}, {btree_key = 0x40000010d080, btree_val = 0x40000010d088}, {
btree_key = 0x40000010cba8, btree_val = 0x40000010cbb0}, {btree_key = 0x40000010cca0, btree_val = 0x40000010cca8}, {btree_key = 0x40000010cd98,
btree_val = 0x40000010cda0}, {btree_key = 0x40000010d3c8, btree_val = 0x40000010d3d0}, {btree_key = 0x40000010d650, btree_val = 0x40000010d658}, {
btree_key = 0x40000010d748, btree_val = 0x40000010d750}, {btree_key = 0x40000010d840, btree_val = 0x40000010d848}, {btree_key = 0x40000010d938,
btree_val = 0x40000010d940}, {btree_key = 0x40000010da30, btree_val = 0x40000010da38}, {btree_key = 0x40000010db28, btree_val = 0x40000010db30}, {
btree_key = 0x40000010eee8, btree_val = 0x40000010eef0}, {btree_key = 0x40000010f078, btree_val = 0x40000010f080}, {btree_key = 0x40000010edf0,
btree_val = 0x40000010edf8}, {btree_key = 0x40000010f930, btree_val = 0x40000010f938}, {btree_key = 0x40000010eaa8, btree_val = 0x40000010eab0}, {
btree_key = 0x40000010df08, btree_val = 0x40000010df10}, {btree_key = 0x40000010f838, btree_val = 0x40000010f840}, {btree_key = 0x40000010e820,
btree_val = 0x40000010e828}, {btree_key = 0x40000010fff8, btree_val = 0x400000110000}, {btree_key = 0x40000010ff00, btree_val = 0x40000010ff08}, {
btree_key = 0x40000010f170, btree_val = 0x40000010f178}, {btree_key = 0x40000010fe08, btree_val = 0x40000010fe10}, {btree_key = 0x40000010e9b0,
btree_val = 0x40000010e9b8}, {btree_key = 0x40000010f740, btree_val = 0x40000010f748}, {btree_key = 0x40000010eba0, btree_val = 0x40000010eba8}, {
btree_key = 0x40000010f648, btree_val = 0x40000010f650}, {btree_key = 0x40000010fd10, btree_val = 0x40000010fd18}, {btree_key = 0x40000010e538,
btree_val = 0x40000010e540}, {btree_key = 0x40000010e8b8, btree_val = 0x40000010e8c0}, {btree_key = 0x40000010f550, btree_val = 0x40000010f558}, {
btree_key = 0x40000010efe0, btree_val = 0x40000010efe8}, {btree_key = 0x40000010f458, btree_val = 0x40000010f460}, {btree_key = 0x40000010fc18,
btree_val = 0x40000010fc20}, {btree_key = 0x40000010e6c8, btree_val = 0x40000010e6d0}, {btree_key = 0x40000010e250, btree_val = 0x40000010e258}, {
btree_key = 0x40000010f360, btree_val = 0x40000010f368}, {btree_key = 0x40000010fb20, btree_val = 0x40000010fb28}, {btree_key = 0x40000010e630,
btree_val = 0x40000010e638}, {btree_key = 0x40000010ecf8, btree_val = 0x40000010ed00}, {btree_key = 0x40000010e440, btree_val = 0x40000010e448}, {
btree_key = 0x40000010f268, btree_val = 0x40000010f270}, {btree_key = 0x40000010fa28, btree_val = 0x40000010fa30}, {btree_key = 0x40000010e158,
btree_val = 0x40000010e160}, {btree_key = 0x40000010e348, btree_val = 0x40000010e350}, {btree_key = 0x40000010e060, btree_val = 0x40000010e068}, {
btree_key = 0x40000010de10, btree_val = 0x40000010de18}, {btree_key = 0x0, btree_val = 0x0} <repeats 155 times>}, bt_child_arr = {0x0 <repeats 256 times>},
bt_footer = {ft_magic = 0x33f007e7f007e777, ft_checksum = 0xffb7b81c66e7d45a}}
(gdb) p/x ((struct m0_be_bnode *)0x400000105b28)->bt_kv_arr[2].btree_key
$4 = 0x40000010a108
(gdb) p/x *((m0_bindex_t *)0x40000010a108)
$5 = 0x2
(gdb) p/x ((struct m0_be_bnode *)0x400000105b28)->bt_kv_arr[2].btree_val
$6 = 0x40000010a110
(gdb) p/x *((struct m0_balloc_group_desc *)0x40000010a110)
$7 = {bgd_header = {hd_magic = 0x33011ca5e511de77, hd_bits = 0x1001000000030}, bgd_groupno = 0x2, bgd_freeblocks = 0x300, bgd_fragments = 0x2, bgd_maxchunk = 0x200,
bgd_footer = {ft_magic = 0x33f007e7f007e777, ft_checksum = 0xddc9911d574def59}}
(gdb){noformat}
From the group 2 descriptor in the btree, the total free blocks are 0x300 and the max chunk is of length 0x200.
Referring to my previous comments, the extent btree has only one entry belonging to group 2, of length 0x200, which also matches the max chunk length.
It appears that a second extent entry of length 0x100 is missing for group 2.
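
A quick standalone arithmetic check of that conclusion (simplified stand-in types, values copied from the dumps above; an illustration, not Motr code): the descriptor promises 2 fragments totalling 0x300 free blocks, but only one extent of length 0x200 is present, so both the fragment count and the free-block sum disagree:

#include <stdint.h>
#include <stdio.h>

/* Fields as seen in struct m0_balloc_group_desc in the dump. */
struct group_desc {
	uint64_t bgd_groupno;
	uint64_t bgd_freeblocks;
	uint64_t bgd_fragments;
	uint64_t bgd_maxchunk;
};

struct ext { uint64_t e_start, e_end; };

int main(void)
{
	/* Group 2 as recorded in the group descriptor btree ... */
	struct group_desc gd = { 0x2, 0x300, 0x2, 0x200 };
	/* ... but only one group-2 extent was found in the extent btree. */
	struct ext found = { 0x2ae00, 0x2b000 };

	uint64_t found_blocks = found.e_end - found.e_start;       /* 0x200 */
	uint64_t missing      = gd.bgd_freeblocks - found_blocks;  /* 0x100 */

	printf("fragments expected=%llu found=1, free blocks missing=0x%llx\n",
	       (unsigned long long)gd.bgd_fragments, (unsigned long long)missing);
	return 0;
}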


Hua Huang commented in Jira Server:

Please use this patch to dump the ex before checking it:

diff --git a/balloc/balloc.c b/balloc/balloc.c
index fe5bdd14f..e583a7f60 100644
--- a/balloc/balloc.c
+++ b/balloc/balloc.c
@@ -1332,6 +1332,7 @@ M0_INTERNAL int m0_balloc_load_extents(struct m0_balloc *cb,
                if (rc != 0)
                        break;
                m0_ext_init(&ex->le_ext);
+               balloc_debug_dump_extent("loading...", &ex->le_ext);
                if (m0_ext_is_partof(&normal_range, &ex->le_ext)) {
                        m0_list_add_tail(group_normal_ext(grp), &ex->le_link);
                        ++normal_frags;
@@ -1347,7 +1348,6 @@ M0_INTERNAL int m0_balloc_load_extents(struct m0_balloc *cb,
                        M0_ASSERT(false);
                }
                next_key = ex->le_ext.e_end + 1;
-               balloc_debug_dump_extent("loading...", &ex->le_ext);
        }

        if (i != group_fragments_get(grp) + group_spare_fragments_get(grp))

So we can see what the 'ex' is.


Mukund Bala Kanekar commented in Jira Server:

[~520428] So far we are doing static analysis with the available core dump.

Also, for group 2 we were able to extract the extent details from the dump; please refer to my previous comments.


Hua Huang commented in Jira Server:

[~522123], Oh, I see.

{quote}
There are 2 possibilities,

  1. Fragment count 2 stored in 'struct m0_be_btree cb_db_group_desc' is invalid.
    OR
  2. The free extent list stored in 'struct m0_be_btree cb_db_group_extents' is invalid.
{quote}
I agree.
From the dump, we can see that the extent under processing belongs to the next group.
There must be something wrong.

We need to figure out whether the other extent (which is 0x100 blocks long) is allocated or not.


Mukund Bala Kanekar commented in Jira Server:

The other extent's length of 0x100 is derived from the group 2 descriptor in the btree (total free blocks are 0x300 and the first extent is of length 0x200).

There could be 2 possibilities:

  1. The extent with length 0x100 was removed from the extent btree, but the group descriptor btree update was missed (maybe due to an intermediate return),
    OR
  2. The group descriptor btree was updated correctly, but the extent with length 0x100 was not added to the extent btree.

Looking at the code so far, #1 looks to be the more likely case, as the btree update order is: the extent btree is updated first, then the group descriptor btree is updated later.

@stale

stale bot commented Jun 18, 2022

This issue/pull request has been marked as needs attention as it has been left pending without new activity for 4 days. Tagging @nkommuri @mehjoshi @huanghua78 for appropriate assignment. Sorry for the delay & Thank you for contributing to CORTX. We will get back to you as soon as possible.


Mukund Bala Kanekar commented in Jira Server:

The Balloc Extent btree and Balloc Group Desc btree can go out of sync with the following sequence of events (a sketch of the sequence follows this list):

  1. The first Balloc Extent btree operation (btree_delete_sync) succeeds and the 2nd Balloc Extent btree operation (btree_insert_sync) fails.
  2. The code does an intermediate return, without updating the Group Desc btree.
  3. The update from the first extent btree operation remains in memory for the failed transaction.
  4. This in-memory update can become part of a subsequent successful transaction.
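
A heavily simplified, hypothetical model of that sequence (only btree_delete_sync/btree_insert_sync are names taken from the description above; everything else is a placeholder, not the actual Motr call chain):

#include <stdio.h>

static int extent_btree_changes;   /* extent-btree updates captured in memory */
static int group_desc_changes;     /* group-descriptor-btree updates          */

static int btree_delete_sync(void) { extent_btree_changes++; return 0;   }
static int btree_insert_sync(void) { return -28; /* simulate a failure */ }
static int group_desc_update(void) { group_desc_changes++; return 0;     }

static int alloc_from_group(void)
{
	int rc;

	rc = btree_delete_sync();   /* 1. first extent-btree op succeeds       */
	if (rc != 0)
		return rc;
	rc = btree_insert_sync();   /*    second extent-btree op fails ...     */
	if (rc != 0)
		return rc;          /* 2. ... intermediate return: the group   */
				    /*    descriptor btree is never updated    */
	return group_desc_update();
}

int main(void)
{
	int rc = alloc_from_group();

	/* 3./4. The captured delete from step 1 still sits in memory and can be
	 * persisted by a later, successful transaction, at which point the two
	 * btrees disagree: one extent-btree change, zero descriptor changes. */
	printf("rc=%d extent_btree_changes=%d group_desc_changes=%d\n",
	       rc, extent_btree_changes, group_desc_changes);
	return 0;
}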

 

To confirm this sequence really occurs, a debug build was created with:

  1. an assert in the balloc code in case a btree operation fails
  2. an error message in the btree code for the balloc-related btrees.

The custom build details are as follows:
http://cortx-storage.colo.seagate.com/releases/cortx/github/integration-custom-ci/rockylinux-8.4/custom-build-6911/

CORTX images are available at,
cortx-docker.colo.seagate.com/seagate/cortx-all:2.0.0-6911-custom-ci
cortx-docker.colo.seagate.com/seagate/cortx-data:2.0.0-6911-custom-ci
cortx-docker.colo.seagate.com/seagate/cortx-control:2.0.0-6911-custom-ci
cortx-docker.colo.seagate.com/seagate/cortx-rgw:2.0.0-6911-custom-ci

[~520414] can you please use the above build for your experiments and also share the steps you performed before ending up in the balloc assert.


Hua Huang commented in Jira Server:

[~522123], where is the patch (used to build custom-build-6911)?
Then maybe I can try it on my setup too.

@stale stale bot removed the needs-attention label Jun 24, 2022

Mukund Bala Kanekar commented in Jira Server:

Sure, I have attached the patch "balloc_debug_0623.txt".
You can also refer to the cortx-motr branch "CORTX-31907_balloc_debug", commit #d313462192ad7d406d0b27403091e2159ac58409.


Mukund Bala Kanekar commented in Jira Server:

There was a regression issue with the previous custom build #6911 (due to recent changes in rgw).

Created a new custom build #6930 with the other components' commit #s from the last sanity-passed build #837 ([https://github.com/seagate/cortx/pkgs/container/cortx-rgw/26506950?tag=2.0.0-837])
(cortx-py-utils:31df3dd, cortx-motr:f9e83286, cortx-hare:f14ece8, cortx-provisioner:325a3e0b, cortx-ha:7a5d356, cortx-rgw-integration:17125bc, cortx-csm_agent:b827ce0d, cortx-rgw:g2694eb6c018, cortx-re:a81201a)

http://cortx-storage.colo.seagate.com/releases/cortx/github/integration-custom-ci/rockylinux-8.4/custom-build-6930/

CORTX images are available at,
cortx-docker.colo.seagate.com/seagate/cortx-rgw:2.0.0-6930-custom-ci
cortx-docker.colo.seagate.com/seagate/cortx-all:2.0.0-6930-custom-ci
cortx-docker.colo.seagate.com/seagate/cortx-data:2.0.0-6930-custom-ci
cortx-docker.colo.seagate.com/seagate/cortx-control:2.0.0-6930-custom-ci

CFT Sanity was successful on custom build #6930; please refer to CORTX-32425 for more details.
[~520414] can you please use build #6930 for your experiments and also share the steps you performed before ending up in the balloc assert.

@stale

stale bot commented Jul 7, 2022

This issue/pull request has been marked as needs attention as it has been left pending without new activity for 4 days. Tagging @nkommuri @mehjoshi @huanghua78 for appropriate assignment. Sorry for the delay & Thank you for contributing to CORTX. We will get back to you as soon as possible.

@stale stale bot added the needs-attention label Jul 7, 2022
Copy link

Mukund Bala Kanekar commented in Jira Server:

As part of a clean-up activity, the [https://github.com/Seagate/cortx-motr/tree/CORTX-31907_balloc_debug] branch has been moved to the forked repo [https://github.com/mukundkanekar/cortx-motr/tree/CORTX-31907_balloc_debug].

Copy link

Mukund Bala Kanekar commented in Jira Server:

So far there has been no luck reproducing the issue on the custom build.

[~530903] can you please share the Jira or support bundle details if you have seen it recently?

Copy link

Chandradhar Raval commented in Jira Server:

A similar panic was also observed in a happy-path Type-1 IO-stability run with build #869.

System details :
Master : ssc-vm-g2-rhev4-3338
Client : ssc-vm-g3-rhev4-1104

Logs : Support bundle location: /root/deploy-scripts/k8_cortx_cloud/logs-cortx-cloud-2022-07-27_21-59.tar

cc : [~522123], [~520428], [~522059], [~531171]

Copy link

Gaurav Kumar Gaur commented in Jira Server:

Sorry, [~530903], [~522123].
I started a redeploy and missed taking a backup of logs-cortx-cloud-2022-07-27_21-59.tar. As the deployment removes the whole directory, the tar was also deleted.

Copy link

Mukund Bala Kanekar commented in Jira Server:

Found the below core dumps on the 'ssc-vm-rhev4-2478' VM:
{code:java}
root@ssc-vm-rhev4-2478 ~]# ls -lrt /mnt/fs-local-volume/local-path-provisioner/pvc-555b588e-6fa1-4d0f-9c28-ae26cb9cd280_cortx_cortx-data-fs-local-pvc-ssc-vm-rhev4-2478/motr/m0d-0x7200000000000001:0x4
total 10812296
drwx------ 3 root root 4096 Jul 20 01:00 db
drwx------ 3 root root 4096 Jul 20 01:00 db-log
drwx------ 3 root root 4096 Jul 20 01:00 stobs
-rw------- 1 root root 2545922048 Jul 27 19:02 core.1658970137.42
-rw------- 1 root root 1708761088 Jul 31 18:25 core.1659313526.42
-rw------- 1 root root 1027809280 Jul 31 18:25 core.1659313544.42
-rw------- 1 root root 1108475904 Jul 31 18:30 core.1659313803.42
-rw------- 1 root root 1599795200 Jul 31 18:48 core.1659314904.42
-rw------- 1 root root 994058240 Jul 31 18:48 core.1659314921.42
-rw------- 1 root root 1039974400 Jul 31 18:49 core.1659314947.42
-rw------- 1 root root 1148772352 Jul 31 18:52 core.1659315125.43
-rw------- 1 root root 1571958784 Jul 31 19:02 core.1659315731.42
-rw------- 1 root root 1511800832 Jul 31 19:06 core.1659316017.42
-rw------- 1 root root 1220747264 Jul 31 19:07 core.1659316048.43
-rw------- 1 root root 1340735488 Jul 31 19:13 core.1659316427.42
-rw------- 1 root root 1102131200 Jul 31 19:18 core.1659316682.42
-rw------- 1 root root 1187303424 Jul 31 19:21 core.1659316865.42
-rw------- 1 root root 1005858816 Jul 31 19:24 core.1659317098.42
-rw------- 1 root root 1164447744 Jul 31 19:33 core.1659317606.43
-rw------- 1 root root 1073897472 Jul 31 19:39 core.1659317977.42
-rw------- 1 root root 1336893440 Jul 31 19:49 core.1659318567.42
-rw------- 1 root root 1100767232 Jul 31 19:54 core.1659318889.42
-rw------- 1 root root 1060384768 Jul 31 20:00 core.1659319209.42{code}

Analyzed 'Jul 31 18:25 core.1659313526.42' further using gdb and build #869.
{code:java}
(gdb) bt
#0 0x00007fea7286938f in raise () from /lib64/libc.so.6
#1 0x00007fea72853dc5 in abort () from /lib64/libc.so.6
#2 0x00007fea74987ff3 in m0_arch_panic (c=c@entry=0x7fea74d94680 <__pctx.12641>, ap=ap@entry=0x7fea6df080c0) at lib/user_space/uassert.c:131
#3 0x00007fea74976a1d in m0_panic (ctx=ctx@entry=0x7fea74d94680 <__pctx.12641>) at lib/assert.c:52
#4 0x00007fea74881f10 in m0_balloc_load_extents (cb=, grp=grp@entry=0x559bccf2a9e0) at balloc/balloc.c:1456
#5 0x00007fea74882cdf in balloc_regular_allocator (bac=0x7fea6df08850) at balloc/balloc.c:2556
#6 balloc_allocate_internal (req=0x7fea6df087f0, tx=, ctx=0x400000126400) at balloc/balloc.c:2711
#7 balloc_alloc (ballroom=0x400000126528, tx=, count=, out=0x7fea6df08c00, alloc_zone=) at balloc/balloc.c:2976
#8 0x00007fea74a4cb5e in stob_ad_balloc (adom=0x40000012b000, adom=0x40000012b000, alloc_type=, out=0x7fea6df08c00, count=1, tx=)
at stob/ad.c:1118
#9 stob_ad_write_prepare (map=0x7fea6df08a80, src=0x7fea6df08ab0, adom=0x40000012b000, io=0x7fea5957d4f8) at stob/ad.c:1979
#10 stob_ad_io_launch_prepare (io=0x7fea5957d4f8) at stob/ad.c:2083
#11 0x00007fea74a4ff6f in m0_stob_io_prepare (io=io@entry=0x7fea5957d4f8, obj=obj@entry=0x7fea5907b430, tx=tx@entry=0x7fe936d98678, scope=scope@entry=0x0) at stob/io.c:178
#12 0x00007fea74a50415 in m0_stob_io_prepare_and_launch (io=io@entry=0x7fea5957d4f8, obj=0x7fea5907b430, tx=tx@entry=0x7fe936d98678, scope=scope@entry=0x0) at stob/io.c:226
#13 0x00007fea74960df9 in io_launch (fom=0x7fe936d985b0) at ioservice/io_foms.c:1837
#14 0x00007fea7495e4d3 in m0_io_fom_cob_rw_tick (fom=0x7fe936d985b0) at ioservice/io_foms.c:2333
#15 0x00007fea749472d8 in fom_exec (fom=0x7fe936d985b0) at fop/fom.c:808
#16 loc_handler_thread (th=0x559bcc468f40) at fop/fom.c:948
#17 0x00007fea7497d56e in m0_thread_trampoline (arg=arg@entry=0x559bcc468f48) at lib/thread.c:117
#18 0x00007fea74988cd1 in uthread_trampoline (arg=0x559bcc468f48) at lib/user_space/uthread.c:98
#19 0x00007fea7402915a in start_thread () from /lib64/libpthread.so.0
#20 0x00007fea7292edd3 in clone () from /lib64/libc.so.6
(gdb) f 4
#4 0x00007fea74881f10 in m0_balloc_load_extents (cb=, grp=grp@entry=0x559bccf2a9e0) at balloc/balloc.c:1456
1456 M0_ASSERT(false);{code}
 

The 2nd extent/fragment (e_start = 0xa0000, e_end = 0xc0000) being processed belongs to the next group's (group 5) range.
{code:java}
(gdb) p/x *ex
$1 = {le_is_alloc = 0x0, le_link = {ll_next = 0x0, ll_prev = 0x0}, le_ext = {e_header = {hd_magic = 0x33011ca5e511de77, hd_bits = 0x1001100000020}, e_start = 0xa0000,
e_end = 0xc0000, e_footer = {ft_magic = 0x33f007e7f007e777, ft_checksum = 0x7d783919093e7f49}}}
(gdb) p/x *grp
$2 = {bgi_state = 0x1, bgi_groupno = 0x4, bgi_normal = {bzp_type = 0x2000, bzp_range = {e_header = {hd_magic = 0x0, hd_bits = 0x0}, e_start = 0x80000, e_end = 0xa0000,
e_footer = {ft_magic = 0x0, ft_checksum = 0x0}}, bzp_freeblocks = 0x1fd00, bzp_fragments = 0x2, bzp_maxchunk = 0x1fd00, bzp_extents = {l_head = 0x7fea41bf59d8,
l_tail = 0x7fea41bf59d8}}, bgi_spare = {bzp_type = 0x1000, bzp_range = {e_header = {hd_magic = 0x0, hd_bits = 0x0}, e_start = 0xa0000, e_end = 0xa0000, e_footer = {
ft_magic = 0x0, ft_checksum = 0x0}}, bzp_freeblocks = 0x0, bzp_fragments = 0x0, bzp_maxchunk = 0x0, bzp_extents = {l_head = 0x559bccf2aaa0,
l_tail = 0x559bccf2aaa0}}, bgi_extents = 0x7fea41bf59d0, bgi_mutex = {bm_u = {mutex = {m_arch = {m_impl = {__data = {__lock = 0x1, __count = 0x0, __owner = 0x4b,
__nusers = 0x1, __kind = 0x0, __spins = 0x0, __elision = 0x0, __list = {__prev = 0x0, __next = 0x0}}, __size = {0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x4b,
0x0, 0x0, 0x0, 0x1, 0x0 <repeats 27 times>}, __align = 0x1}}, m_owner = 0x559bcc468f48, m_addb2 = 0x0}, pad = {0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x4b,
0x0, 0x0, 0x0, 0x1, 0x0 <repeats 27 times>, 0x48, 0x8f, 0x46, 0xcc, 0x9b, 0x55, 0x0 <repeats 122 times>}}}}
(gdb) p/x *(grp+1)
$3 = {bgi_state = 0x1, bgi_groupno = 0x5, bgi_normal = {bzp_type = 0x2000, bzp_range = {e_header = {hd_magic = 0x0, hd_bits = 0x0}, e_start = 0xa0000, e_end = 0xc0000,
e_footer = {ft_magic = 0x0, ft_checksum = 0x0}}, bzp_freeblocks = 0x20000, bzp_fragments = 0x1, bzp_maxchunk = 0x20000, bzp_extents = {l_head = 0x559bccf2abc0,
l_tail = 0x559bccf2abc0}}, bgi_spare = {bzp_type = 0x1000, bzp_range = {e_header = {hd_magic = 0x0, hd_bits = 0x0}, e_start = 0xc0000, e_end = 0xc0000, e_footer = {
ft_magic = 0x0, ft_checksum = 0x0}}, bzp_freeblocks = 0x0, bzp_fragments = 0x0, bzp_maxchunk = 0x0, bzp_extents = {l_head = 0x559bccf2ac20,
l_tail = 0x559bccf2ac20}}, bgi_extents = 0x0, bgi_mutex = {bm_u = {mutex = {m_arch = {m_impl = {__data = {__lock = 0x0, __count = 0x0, __owner = 0x0, __nusers = 0x0,
__kind = 0x0, __spins = 0x0, __elision = 0x0, __list = {__prev = 0x0, __next = 0x0}}, __size = {0x0 <repeats 40 times>}, __align = 0x0}}, m_owner = 0x0,
m_addb2 = 0x0}, pad = {0x0 <repeats 168 times>}}}}{code}
The 1st fragment of group 4 has e_start = 0x80300, e_end = 0xa0000 (length = 0x1fd00).
{code:java}
(gdb) p/x grp->bgi_normal.bzp_extents
$4 = {l_head = 0x7fea41bf59d8, l_tail = 0x7fea41bf59d8}
(gdb) p sizeof(struct m0_list_link)
$5 = 16
(gdb) p/x (struct m0_ext)(0x7fea41bf59d8+16)
$6 = {e_header = {hd_magic = 0x33011ca5e511de77, hd_bits = 0x1001100000020}, e_start = 0x80300, e_end = 0xa0000, e_footer = {ft_magic = 0x33f007e7f007e777,
ft_checksum = 0x7132dfb6e48d980a}}
(gdb) p i
$7 = 1
(gdb) p 0xa0000-0x80300
$8 = 130304
(gdb) p/x 0xa0000-0x80300
$9 = 0x1fd00
(gdb) p normal_frags
$10 = 1
(gdb){code}

The next step is to extract the group descriptor info for group 4 from the group descriptor btree.

Copy link

Mukund Bala Kanekar commented in Jira Server:

Extracted the 'struct m0_balloc' pointer from frame 5:
{code:java}
(gdb) f 5
#5 0x00007fea74882cdf in balloc_regular_allocator (bac=0x7fea6df08850) at balloc/balloc.c:2556
2556 rc = m0_balloc_load_extents(bac->bac_ctxt, grp);
(gdb) p/x bac->bac_ctxt
$12 = 0x400000126400{code}

Extracted the group descriptor btree info:
{code:java}
(gdb) p/x *((struct m0_balloc *)0x400000126400)->cb_db_group_desc
$13 = {t_type = 0x0, t_height = 0x2, t_desc = 0x559bccb3bf90}
(gdb) p/x ((struct td)0x559bccb3bf90)
$14 = {t_type = 0x0, t_lock = {rw_lock = {__data = {__readers = 0x1, __writers = 0x0, __wrphase_futex = 0x1, __writers_futex = 0x0, __pad3 = 0x0, __pad4 = 0x0,
__cur_writer = 0x0, __shared = 0x0, __rwelision = 0x0, __pad1 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, __pad2 = 0x0, __flags = 0x0}, __size = {0x1, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x1, 0x0 <repeats 47 times>}, __align = 0x1}}, t_root = 0x559bcc8e8960, t_height = 0x2, t_ref = 0x1, t_seg = 0x559bccb2a010, t_fid = {
f_container = 0x6200000000000003, f_key = 0x8}, t_keycmp = {rko_keycmp = 0x0}}
(gdb) p/x *((struct nd *)0x559bcc8e8960)
$15 = {n_addr = {as_core = 0x400000128400}, n_tree = 0x559bccb3bf90, n_type = 0x7fea74db16a0, n_seg = 0x0, n_linkage = {t_link = {ll_next = 0x559bcc8e6430,
ll_prev = 0x559bccd3bf00}}, n_magic = 0x33c1a551c1dea77, n_lock = {rw_lock = {__data = {__readers = 0x1, __writers = 0x0, __wrphase_futex = 0x1,
__writers_futex = 0x0, __pad3 = 0x0, __pad4 = 0x0, __cur_writer = 0x0, __shared = 0x0, __rwelision = 0x0, __pad1 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
__pad2 = 0x0, __flags = 0x0}, __size = {0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x0 <repeats 47 times>}, __align = 0x1}}, n_ref = 0x1, n_txref = 0x0,
n_seq = 0x1707026bd8b427bf, n_op = 0x0, n_size = 0x1000, n_be_node_valid = 0x1}
(gdb) p/x 0x400000128400+0x1000-8
$16 = 0x4000001293f8
(gdb) x/gx 0x4000001293f8
0x4000001293f8: 0x0000400000134050
(gdb) p/x *((struct ff_head *)0x0000400000134050)
$17 = {ff_fmt = {hd_magic = 0x33011ca5e511de77, hd_bits = 0x1000400000050}, ff_seg = {h_node_type = 0x1, h_tree_type = 0x0, h_crc_type = 0x0, h_gen = 0x1703776edc04cd94,
h_fid = {f_container = 0x6200000000000003, f_key = 0x8}, h_opaque = 0x0}, ff_used = 0x1b, ff_level = 0x0, ff_ksize = 0x8, ff_vsize = 0x40, ff_nsize = 0xfb0, ff_foot = {
ft_magic = 0x33f007e7f007e777, ft_checksum = 0x4db9c222d130d64c}, ff_opaque = 0x559bcc8f2a60}{code}

Extracted the keys array details from the group descriptor btree:
{code:java}
(gdb) p/x sizeof(struct ff_head)
$19 = 0x68
(gdb) x/32gx (0x0000400000134050+0x68)
0x4000001340b8: 0x0000000000000000 0x0100000000000000
0x4000001340c8: 0x0200000000000000 0x0300000000000000
0x4000001340d8: 0x0400000000000000 0x0500000000000000
0x4000001340e8: 0x0600000000000000 0x0700000000000000
0x4000001340f8: 0x0800000000000000 0x0900000000000000
0x400000134108: 0x0a00000000000000 0x0b00000000000000
0x400000134118: 0x0c00000000000000 0x0d00000000000000
0x400000134128: 0x0e00000000000000 0x0f00000000000000
0x400000134138: 0x1000000000000000 0x1100000000000000
0x400000134148: 0x1200000000000000 0x1300000000000000
0x400000134158: 0x1400000000000000 0x1500000000000000
0x400000134168: 0x1600000000000000 0x1700000000000000
0x400000134178: 0x1800000000000000 0x1900000000000000
0x400000134188: 0x1a00000000000000 0x0000000000000000
0x400000134198: 0x0000000000000000 0x0000000000000000
0x4000001341a8: 0x0000000000000000 0x0000000000000000{code}

Extracted the value (group descriptor info for group 4) from the group descriptor btree:
{code:java}
(gdb) p/x ((struct m0_balloc_group_desc*)(0x0000400000134050+0xfb0))[-5]
$20 = {bgd_header = {hd_magic = 0x33011ca5e511de77, hd_bits = 0x1001000000030}, bgd_groupno = 0x4, bgd_freeblocks = 0x1ff00, bgd_fragments = 0x2, bgd_maxchunk = 0x1fd00,
bgd_footer = {ft_magic = 0x33f007e7f007e777, ft_checksum = 0xb756443e48350dd4}}{code}
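For reference, the address arithmetic above can be read as follows, assuming (as the gdb session suggests) that the fixed-format leaf node lays keys out forward after its header and values backward from offset ff_nsize; this is an illustrative sketch, not the btree source:
{code:java}
/* Illustrative only: the layout is inferred from the gdb arithmetic above,
 * not taken from the btree sources; struct ff_head is the header printed
 * in the session ($17). */
void *ff_key_addr(void *node, const struct ff_head *h, int i)
{
        /* keys are laid out right after the header, h->ff_ksize bytes each */
        return (char *)node + sizeof *h + i * h->ff_ksize;
}

void *ff_val_addr(void *node, const struct ff_head *h, int i)
{
        /* values are packed downward from node + ff_nsize, so key index i
         * corresponds to value index -(i + 1); e.g. group 4 -> [-5] above */
        return (char *)node + h->ff_nsize - (i + 1) * h->ff_vsize;
}
{code}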
Thank you [~530902] for helping to gather this information from the new btree.

Summary (a cross-check sketch follows this list):
From the group(4) descriptor info,

  1. maxchunk = 0x1fd00; this matches the first extent processed for group 4 from the extent btree.
  2. The second extent, with length freeblocks (0x1ff00) - first maxchunk (0x1fd00) = 0x200, is missing from the extent btree.
  3. fragments = 2 also indicates that there must be a second extent.
    This observation is similar to the dump analyzed for the original assert reported in this bug.
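
A minimal sketch of this cross-check, using the structures shown in the gdb output (the helper itself is hypothetical):
{code:java}
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helper: cross-checks the group descriptor against the extents
 * actually loaded from the extent btree for the same group. The structures
 * m0_balloc_group_desc and m0_ext are the ones shown in the gdb output. */
static bool group_desc_matches_extents(const struct m0_balloc_group_desc *gd,
                                       const struct m0_ext *exts, int nr)
{
        uint64_t free = 0;
        int      i;

        for (i = 0; i < nr; i++)
                free += exts[i].e_end - exts[i].e_start;

        /* In the dump above: bgd_fragments == 2 but only 1 extent was loaded,
         * and bgd_freeblocks - free == 0x200, i.e. the second extent of
         * length 0x200 is missing from the extent btree. */
        return gd->bgd_fragments == nr && gd->bgd_freeblocks == free;
}
{code}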

The trace files for the period before the crash are not available; they could have helped in understanding the events that led to this situation.
{code:java}
[root@ssc-vm-rhev4-2478 ~]# ls -lrt /mnt/fs-local-volume/local-path-provisioner/pvc-555b588e-6fa1-4d0f-9c28-ae26cb9cd280_cortx_cortx-da ta-fs-local-pvc-ssc-vm-rhev4-2478/log/motr/3e55246af7f018aaadc5b8cf68307d20/trace/m0d-0x7200000000000001:0x4/
total 32896
-rwx------ 1 root root 16842752 Jul 31 19:59 m0trace.42.2022-08-01-01:59:56
-rwx------ 1 root root 16842752 Aug 2 00:55 m0trace.42.2022-08-01-02:05:19
[root@ssc-vm-rhev4-2478 ~]#{code}
 

Copy link

Rohan Kulkarni commented in Jira Server:

Observed a similar issue while testing the IO-stability Type-1 workload on custom build 7267 (DTM-enabled) with the IO-stability resource-limit changes.
{noformat}
2022-08-05 02:16:10,928 - INFO - motr[00042]:  c6b0  FATAL  [lib/assert.c:50:m0_panic]  panic: (0) at m0_balloc_load_extents() (balloc/balloc.c:1456)  [git: TagTest3-1480-533-gbc4ec3ca] /etc/cortx/log/motr/700aeb181d045f419c9f48527a5e4cf0/trace/m0d-0x7200000000000001:0x1/m0trace.42.2022-08-05-02:08:46
2022-08-05 02:16:10,928 - INFO - Motr panic: (0) at m0_balloc_load_extents() balloc/balloc.c:1456 (errno: 0) (last failed: none) [git: TagTest3-1480-533-gbc4ec3ca] pid: 42  /etc/cortx/log/motr/700aeb181d045f419c9f48527a5e4cf0/trace/m0d-0x7200000000000001:0x1/m0trace.42.2022-08-05-02:08:46
2022-08-05 02:16:10,932 - INFO - /lib64/libmotr.so.2(m0_arch_backtrace+0x33)[0x7f38724ccf73]
2022-08-05 02:16:10,932 - INFO - /lib64/libmotr.so.2(m0_arch_panic+0xe9)[0x7f38724cd149]
2022-08-05 02:16:10,932 - INFO - /lib64/libmotr.so.2(m0_panic+0x13d)[0x7f38724bbb2d]
2022-08-05 02:16:10,932 - INFO - /lib64/libmotr.so.2(m0_balloc_load_extents+0x670)[0x7f38723c7a90]
2022-08-05 02:16:10,932 - INFO - /lib64/libmotr.so.2(+0x2bef28)[0x7f38723c7f28]
2022-08-05 02:16:10,932 - INFO - /lib64/libmotr.so.2(+0x48687d)[0x7f387258f87d]
2022-08-05 02:16:10,932 - INFO - /lib64/libmotr.so.2(+0x486bfc)[0x7f387258fbfc]
2022-08-05 02:16:10,932 - INFO - /lib64/libmotr.so.2(m0_be_emap_paste+0x4aa)[0x7f38723d4c3a]
2022-08-05 02:16:10,932 - INFO - /lib64/libmotr.so.2(+0x48a01e)[0x7f387259301e]
2022-08-05 02:16:10,932 - INFO - /lib64/libmotr.so.2(+0x48a189)[0x7f3872593189]
2022-08-05 02:16:10,932 - INFO - /lib64/libmotr.so.2(m0_stob_punch+0xc0)[0x7f387259adc0]
2022-08-05 02:16:10,932 - INFO - /lib64/libmotr.so.2(+0x3a2469)[0x7f38724ab469]
2022-08-05 02:16:10,933 - INFO - /lib64/libmotr.so.2(+0x3a4087)[0x7f38724ad087]
2022-08-05 02:16:10,933 - INFO - /lib64/libmotr.so.2(+0x3833e8)[0x7f387248c3e8]
2022-08-05 02:16:10,933 - INFO - /lib64/libmotr.so.2(m0_thread_trampoline+0x5e)[0x7f38724c26ce]
2022-08-05 02:16:10,933 - INFO - /lib64/libmotr.so.2(+0x3c4e31)[0x7f38724cde31]
2022-08-05 02:16:10,933 - INFO - /lib64/libpthread.so.0(+0x815a)[0x7f3871b6f15a]
2022-08-05 02:16:10,933 - INFO - /lib64/libc.so.6(clone+0x43)[0x7f3870474dd3]
2022-08-05 02:16:12,048 - INFO - /usr/libexec/cortx-motr/motr-start: line 53:    42 Aborted                 (core dumped) /usr/libexec/cortx-motr/motr-server "$1"
2022-08-05 02:16:15,942 - INFO - motr[00042]:  a2a0  ERROR  [pool/pool_machine.c:632:m0_poolmach_state_transit]  <! rc=-22
2022-08-05 02:16:15,943 - INFO - motr[00042]:  a2a0  ERROR  [pool/pool_machine.c:632:m0_poolmach_state_transit]  <! rc=-22
2022-08-05 02:16:15,943 - INFO - motr[00042]:  a2a0  ERROR  [pool/pool_machine.c:632:m0_poolmach_state_transit]  <! rc=-22{noformat}
Log Location : http://cortx-storage.colo.seagate.com/logs1/6month-retention/CORTX-31907/

Copy link

Rohan Kulkarni commented in Jira Server:

Observed a similar panic in main build 869.

cortx-all:2.0.0-869 | service script version : v0.8.0 (Resource limits defined in F-99D) 

Setup Configuration: 
Cluster: 5 Worker Node + 1 Master Node
sns: 4+1+0
dix: 1+4+0

Workload executed for 315 hrs. More details can be found under https://seagate-systems.atlassian.net/wiki/spaces/PRIVATECOR/pages/1086750721/Test+Summary+PI8
{noformat}
2022-08-07 20:17:33,686 - INFO - motr[00042]: b010 FATAL [lib/assert.c:50:m0_panic] panic: (0) at m0_balloc_load_extents() (balloc/balloc.c:1456) [git: 2.0.0-859-19-g84fac5ca] /etc/cortx/log/motr/05b1077075a0511aeec88ec0f12b4acc/trace/m0d-0x7200000000000001:0x1/m0trace.42.2022-08-07-16:54:16
2022-08-07 20:17:33,686 - INFO - Motr panic: (0) at m0_balloc_load_extents() balloc/balloc.c:1456 (errno: 0) (last failed: none) [git: 2.0.0-859-19-g84fac5ca] pid: 42 /etc/cortx/log/motr/05b1077075a0511aeec88ec0f12b4acc/trace/m0d-0x7200000000000001:0x1/m0trace.42.2022-08-07-16:54:16
2022-08-07 20:17:33,732 - INFO - /lib64/libmotr.so.2(m0_arch_backtrace+0x33)[0x7fd3ec327e13]
2022-08-07 20:17:33,732 - INFO - /lib64/libmotr.so.2(m0_arch_panic+0xe9)[0x7fd3ec327fe9]
2022-08-07 20:17:33,732 - INFO - /lib64/libmotr.so.2(m0_panic+0x13d)[0x7fd3ec316a1d]
2022-08-07 20:17:33,732 - INFO - /lib64/libmotr.so.2(m0_balloc_load_extents+0x670)[0x7fd3ec221f10]
2022-08-07 20:17:33,732 - INFO - /lib64/libmotr.so.2(+0x2bfcdf)[0x7fd3ec222cdf]
2022-08-07 20:17:33,732 - INFO - /lib64/libmotr.so.2(+0x489b5e)[0x7fd3ec3ecb5e]
2022-08-07 20:17:33,732 - INFO - /lib64/libmotr.so.2(m0_stob_io_prepare+0x1bf)[0x7fd3ec3eff6f]
2022-08-07 20:17:33,732 - INFO - /lib64/libmotr.so.2(m0_stob_io_prepare_and_launch+0x65)[0x7fd3ec3f0415]
2022-08-07 20:17:33,732 - INFO - /lib64/libmotr.so.2(+0x39ddf9)[0x7fd3ec300df9]
2022-08-07 20:17:33,733 - INFO - /lib64/libmotr.so.2(+0x39b4d3)[0x7fd3ec2fe4d3]
2022-08-07 20:17:33,733 - INFO - /lib64/libmotr.so.2(+0x3842d8)[0x7fd3ec2e72d8]
2022-08-07 20:17:33,733 - INFO - /lib64/libmotr.so.2(m0_thread_trampoline+0x5e)[0x7fd3ec31d56e]
2022-08-07 20:17:33,733 - INFO - /lib64/libmotr.so.2(+0x3c5cd1)[0x7fd3ec328cd1]
2022-08-07 20:17:33,733 - INFO - /lib64/libpthread.so.0(+0x815a)[0x7fd3eb9c915a]
2022-08-07 20:17:33,733 - INFO - /lib64/libc.so.6(clone+0x43)[0x7fd3ea2cedd3]
2022-08-07 20:17:36,761 - INFO - /usr/libexec/cortx-motr/motr-start: line 53: 42 Aborted (core dumped) /usr/libexec/cortx-motr/motr-server "$1"
2022-08-07 20:17:38,018 - INFO - Logger configured for phase=start{noformat}
Log Location: [http://cortx-storage.colo.seagate.com/logs1/6month-retention/CORTX-31907/setup_2_869/]

Copy link

Mukund Bala Kanekar commented in Jira Server:

The assert occurs while trying to load extents, and a discrepancy is seen between the 2 balloc btrees, but there is not enough trace/debug info available to confirm how we ended up in this situation, as the inconsistency happened in the past.

Custom build [#7478|https://eos-jenkins.colo.seagate.com/job/GitHub-custom-ci-builds/job/generic/job/custom-ci/7478/] was created to assert early at the possible places which could lead to such discrepancies; this can help to confirm the RCA.

[~931947], [~522059] can we rerun the similar tests on one of the setups with custom build [#7478|https://eos-jenkins.colo.seagate.com/job/GitHub-custom-ci-builds/job/generic/job/custom-ci/7478/] to reproduce the issue?

CORTX images are available at,
cortx-docker.colo.seagate.com/seagate/cortx-rgw:2.0.0-7478-custom-ci
cortx-docker.colo.seagate.com/seagate/cortx-all:2.0.0-7478-custom-ci
cortx-docker.colo.seagate.com/seagate/cortx-data:2.0.0-7478-custom-ci
cortx-docker.colo.seagate.com/seagate/cortx-control:2.0.0-7478-custom-ci

Copy link

Priyanka Borawake commented in Jira Server:

https://jts.seagate.com/browse/CORTX-31907?focusedCommentId=2411889&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-2411889 

Had restarted the test after the panic mentioned in the above comment; now observing continuous container restarts due to the panic. Not able to continue further testing because of this.
{noformat}
[root@cortx-data-headless-svc-ssc-vm-rhev4-2385 m0d-0x7200000000000001:0x1]# ls -lrt
total 7633676
drwx------ 3 root root 4096 Jul 25 10:00 db
drwx------ 3 root root 4096 Jul 25 10:00 db-log
drwx------ 3 root root 4096 Jul 25 10:00 stobs
-rw------- 1 root root 1688387584 Aug 7 20:17 core.1659903453.42
-rw------- 1 root root 1433870336 Aug 7 20:18 core.1659903487.42
-rw------- 1 root root 1337831424 Aug 7 20:26 core.1659903996.43
-rw------- 1 root root 1042116608 Aug 7 20:29 core.1659904148.42
-rw------- 1 root root 1049464832 Aug 7 20:31 core.1659904277.42
-rw------- 1 root root 1044635648 Aug 7 20:33 core.1659904409.42
-rw------- 1 root root 1068785664 Aug 8 04:52 core.1659934366.42
-rw------- 1 root root 988745728 Aug 8 04:53 core.1659934401.42
-rw------- 1 root root 999981056 Aug 8 04:54 core.1659934458.42
-rw------- 1 root root 1146548224 Aug 8 05:06 core.1659935167.42
-rw------- 1 root root 1023770624 Aug 8 05:06 core.1659935188.42
-rw------- 1 root root 1083695104 Aug 8 07:59 core.1659945591.43
-rw------- 1 root root 1012453376 Aug 8 08:00 core.1659945606.42
-rw------- 1 root root 1089105920 Aug 8 08:02 core.1659945754.42
-rw------- 1 root root 1060470784 Aug 8 08:05 core.1659945909.42
-rw------- 1 root root 1048936448 Aug 8 08:07 core.1659946041.42
-rw------- 1 root root 1032396800 Aug 8 08:10 core.1659946211.43
-rw------- 1 root root 1149149184 Aug 8 08:15 core.1659946513.42{noformat}
{noformat}
[root@ssc-vm-rhev4-2381 k8_cortx_cloud]# kubectl get pods
NAME READY STATUS RESTARTS AGE
cortx-consul-client-4pf7f 1/1 Running 0 14d
cortx-consul-client-b8bgr 1/1 Running 0 14d
cortx-consul-client-jz84c 1/1 Running 0 14d
cortx-consul-client-plsf2 1/1 Running 0 14d
cortx-consul-client-x5t7b 1/1 Running 0 14d
cortx-consul-server-0 1/1 Running 0 14d
cortx-consul-server-1 1/1 Running 0 14d
cortx-consul-server-2 1/1 Running 0 14d
cortx-control-779d97c794-jpn8r 1/1 Running 0 14d
cortx-data-ssc-vm-rhev4-2382-78f7cb9b6d-jx696 3/3 Running 3 (15h ago) 14d
cortx-data-ssc-vm-rhev4-2383-58fdfdfffc-zzs5r 3/3 Running 2 (4d6h ago) 14d
cortx-data-ssc-vm-rhev4-2384-6c9594f6d6-42tq8 3/3 Running 3 (18h ago) 14d
cortx-data-ssc-vm-rhev4-2385-587f859595-zv4fr 3/3 Running 21 (106m ago) 14d
cortx-data-ssc-vm-rhev4-2386-77478f8c5b-xqlbn 3/3 Running 3 (14h ago) 14d
cortx-ha-74d8ff66db-zdwdr 3/3 Running 0 13d
cortx-kafka-0 1/1 Running 1 (14d ago) 14d
cortx-kafka-1 1/1 Running 0 14d
cortx-kafka-2 1/1 Running 0 14d
cortx-server-ssc-vm-rhev4-2382-69df9c4c4c-cbtg9 2/2 Running 0 14d
cortx-server-ssc-vm-rhev4-2383-8c5579c6f-cd5hx 2/2 Running 0 14d
cortx-server-ssc-vm-rhev4-2384-68749b8969-gtq7l 2/2 Running 0 14d
cortx-server-ssc-vm-rhev4-2385-56d49fb5d5-5zgjc 2/2 Running 0 14d
cortx-server-ssc-vm-rhev4-2386-bb6dfbbcc-24269 2/2 Running 0 14d
cortx-zookeeper-0 1/1 Running 0 14d
cortx-zookeeper-1 1/1 Running 0 14d
cortx-zookeeper-2 1/1 Running 0 14d
{noformat}
logs attached : [Index of /logs1/6month-retention/CORTX-31907/setup_2_869_rerun/ (seagate.com)|http://cortx-storage.colo.seagate.com/logs1/6month-retention/CORTX-31907/setup_2_869_rerun/]

Copy link

Priyanka Borawake commented in Jira Server:

Raising severity to 1 as we are not able to continue further testing on the setup.

 

Copy link

Mukund Bala Kanekar commented in Jira Server:

Can we restart the test on "Setup-2" with custom build #7478 to reproduce the issue?

Copy link

Mukund Bala Kanekar commented in Jira Server:

Created a new custom build [#7496|https://eos-jenkins.colo.seagate.com/job/GitHub-custom-ci-builds/job/generic/job/custom-ci/7496/].
The base is main build #890 (the last known build in use for IO-stability testing).

Deployment is started on setup 2 using Jenkins job https://eos-jenkins.colo.seagate.com/view/QA-R2/job/QA/job/K8s_Cortx_Continuous_Deployment_sns_config/1469/

Thank you [~535790] for providing details.

Copy link

Mukund Bala Kanekar commented in Jira Server:

Deployment of custom build #7496 succeeded on setup 2 with the Jenkins job below.
https://eos-jenkins.colo.seagate.com/view/QA-R2/job/QA/job/K8s_Cortx_Continuous_Deployment_sns_config/1470/

Copy link

Mukund Bala Kanekar commented in Jira Server:

Added an assert when there is an intermediate return in alloc/free_db_update() due to an error. This will prevent inconsistencies between the 2 balloc btrees from becoming persistent and will avoid multiple subsequent panics at m0_balloc_load_extents().
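
A rough sketch of the shape of that change, reusing the hypothetical names from the earlier sketch in this thread (not the actual diff):
{code:java}
/* Sketch only: the intermediate error returns are replaced by asserts, so a
 * half-applied update can never linger in memory and later become persistent. */
static void db_update_sketch(struct sketch_group *grp, void *tx,
                             void *cur_extent, void *merged_extent)
{
        int rc;

        rc = btree_delete_sync(grp->extent_btree, tx, cur_extent);
        M0_ASSERT(rc == 0);

        rc = btree_insert_sync(grp->extent_btree, tx, merged_extent);
        M0_ASSERT(rc == 0);              /* was: if (rc != 0) return rc; */

        rc = btree_update_sync(grp->desc_btree, tx, grp->desc);
        M0_ASSERT(rc == 0);
}
{code}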

Copy link

Gaurav Chaudhari commented in Jira Server:

{panel:bgColor=#c1c7d0}
h2. motr - main branch build pipeline SUCCESS
h3. Build Info:
h3. Image Location:

  • cortx-docker.colo.seagate.com/seagate/cortx-all:2.0.0-5888
  • cortx-docker.colo.seagate.com/seagate/cortx-rgw:2.0.0-5888
  • cortx-docker.colo.seagate.com/seagate/cortx-data:2.0.0-5888
  • cortx-docker.colo.seagate.com/seagate/cortx-control:2.0.0-5888
{panel}

Copy link

Mukund Bala Kanekar commented in Jira Server:

The IO-stability test has been running on custom build #7496 for the last 2 days 21 hrs (https://eos-jenkins.colo.seagate.com/job/QA/job/IOStabilityTestRuns/367/).

PR #2064 is merged; we will resolve the current Jira once we confirm that the multiple subsequent panics at m0_balloc_load_extents() are no longer seen.

Copy link

Mukund Bala Kanekar commented in Jira Server:

Custom build #7496 has been running fine for:
A) the last 6 days 21 hrs
https://eos-jenkins.colo.seagate.com/job/QA/job/IOStabilityTestRuns/367/
B) the last 7 days
https://eos-jenkins.colo.seagate.com/job/QA/job/IOStabilityTestRuns/365/

Copy link

Mukund Bala Kanekar commented in Jira Server:

Custom build #7496 has been running fine for the last 12+ days.
[https://eos-jenkins.colo.seagate.com/job/QA/job/IOStabilityTestRuns/367/]

Copy link

Mukund Bala Kanekar commented in Jira Server:

Custom build #7496 has been running fine for the last 13+ days.
[https://eos-jenkins.colo.seagate.com/job/QA/job/IOStabilityTestRuns/367/]

Closing for now; main build #895 and onwards has an assert when there is an intermediate return in alloc/free_db_update() due to an error.

So, in case the issue is seen again in the future, we will have additional data to debug further.

Copy link

Mukund Bala Kanekar commented in Jira Server:

An intermediate return in alloc/free_db_update() due to error -2 (ENOENT) was hit on build [2.0.0-916][v0.9.0].

Analysis:

  1. In the existing code, the decision about whether a group is loaded or not loaded is taken based on the grp->bgi_extents pointer value.
  2. On restart, if the free fragment count is 0, then bgi_extents remains NULL.
  3. Later, when any fragment is freed up, the corresponding entry in the extent list and extent btree is added by creating a new extent using the lext_create() API, i.e. le_is_alloc = true.
  4. The subsequent call to m0_balloc_load_extents() finds bgi_extents == NULL and tries to load an extent allocated using M0_ALLOC_ARR, where le_is_alloc = false, but with the same start/end values.
  5. This ends up creating an overlapping entry in the extent list and a mismatch between the fragment count in the group info and the actual total number of entries in the list.
  6. If one of the overlapping extents goes out of sync with the value stored in the btree due to subsequent free/alloc operations, then the extent btree update operation fails with the -2 (ENOENT) error.

Fix:
Added a separate flag to track the group loaded vs. not-loaded status, which avoids overlapping entries in the extents list.
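
A minimal sketch of the idea, with a hypothetical flag name (see the PR referenced below for the actual change):
{code:java}
/* Sketch only: 'bgi_extents_loaded' is a hypothetical flag. The idea is that
 * "extents loaded" is no longer inferred from the bgi_extents pointer, so a
 * group whose free fragment count was 0 at startup is not loaded again on
 * top of extents that were added to the list later via lext_create(). */
int m0_balloc_load_extents_sketch(struct m0_balloc *cb,
                                  struct m0_balloc_group_info *grp)
{
        if (grp->bgi_extents_loaded)        /* hypothetical flag */
                return 0;                   /* nothing to (re)load */

        /* ... load this group's extents from the extent btree into
         *     grp->bgi_normal.bzp_extents / grp->bgi_extents ... */

        grp->bgi_extents_loaded = true;
        return 0;
}
{code}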

For more details please refer,

  1. Jira - https://jts.seagate.com/browse/CORTX-34042
  2. PR - [https://github.com/CORTX-34042 - Fix load_extents logic in balloc module. #2120]
