This repository has been archived by the owner on May 3, 2024. It is now read-only.

Client crashes on too many failures #1838

Closed
andriytk opened this issue May 30, 2022 · 9 comments · Fixed by #1856
Labels
Status: L2 Triage, Triage: DevTeam (Triage owner is on the dev team)

Comments


andriytk commented May 30, 2022

motr[00001]:  12a0  ERROR  [io_req.c:1551:device_check]  <! rc=-5 [0x558882659000] too many failures: nodes=1 + svcs=1 + devs=0, allowed: nodes=1 or svcs=1 or devs=2
motr[00001]:  14a0  ERROR  [io_req.c:549:ioreq_iosm_handle_executed]  iro_dgmode_write() failed, rc=-5
<7600000000000001:b8>: nr_failures:5 max_failures:2 event_index:10 event_state:3
motr[00001]:  3da0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7600000000000001:f5>: nr_failures:3 max_failures:2 event_index:5 event_state:3
motr[00001]:  3da0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:6>: nr_failures:5 max_failures:2 event_index:8 event_state:3
motr[00001]:  3da0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:11>: nr_failures:5 max_failures:2 event_index:8 event_state:3
motr[00001]:  3da0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:a>: nr_failures:3 max_failures:2 event_index:8 event_state:3
motr[00001]:  3da0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:3>: nr_failures:5 max_failures:2 event_index:8 event_state:3
motr[00001]:  3da0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7600000000000001:b8>: nr_failures:6 max_failures:2 event_index:11 event_state:3
motr[00001]:  3da0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:6>: nr_failures:6 max_failures:2 event_index:9 event_state:3
motr[00001]:  3da0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:11>: nr_failures:6 max_failures:2 event_index:9 event_state:3
motr[00001]:  3da0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:a>: nr_failures:4 max_failures:2 event_index:9 event_state:3
motr[00001]:  3da0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:3>: nr_failures:6 max_failures:2 event_index:9 event_state:3
motr[00001]:  ada0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7600000000000001:b8>: nr_failures:5 max_failures:2 event_index:4 event_state:1
motr[00001]:  ada0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:11>: nr_failures:5 max_failures:2 event_index:4 event_state:1
motr[00001]:  ada0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:14>: nr_failures:3 max_failures:2 event_index:4 event_state:1
motr[00001]:  ada0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:3>: nr_failures:5 max_failures:2 event_index:2 event_state:1
motr[00001]:  ada0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:6>: nr_failures:5 max_failures:2 event_index:2 event_state:1
motr[00001]:  ada0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7600000000000001:b8>: nr_failures:4 max_failures:2 event_index:5 event_state:1
motr[00001]:  ada0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:11>: nr_failures:4 max_failures:2 event_index:5 event_state:1
motr[00001]:  ada0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:3>: nr_failures:4 max_failures:2 event_index:3 event_state:1
motr[00001]:  ada0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:6>: nr_failures:4 max_failures:2 event_index:3 event_state:1
motr[00001]:  3da0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7600000000000001:b8>: nr_failures:5 max_failures:2 event_index:4 event_state:1
motr[00001]:  3da0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:6>: nr_failures:5 max_failures:2 event_index:2 event_state:1
motr[00001]:  3da0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:11>: nr_failures:5 max_failures:2 event_index:4 event_state:1
motr[00001]:  3da0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:14>: nr_failures:3 max_failures:2 event_index:4 event_state:1
motr[00001]:  3da0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:3>: nr_failures:5 max_failures:2 event_index:2 event_state:1
motr[00001]:  3da0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7600000000000001:b8>: nr_failures:4 max_failures:2 event_index:5 event_state:1
motr[00001]:  3da0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:6>: nr_failures:4 max_failures:2 event_index:3 event_state:1
motr[00001]:  3da0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:11>: nr_failures:4 max_failures:2 event_index:5 event_state:1
motr[00001]:  3da0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:3>: nr_failures:4 max_failures:2 event_index:3 event_state:1
motr[00001]:  dda0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7600000000000001:b8>: nr_failures:5 max_failures:2 event_index:4 event_state:1
motr[00001]:  dda0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:11>: nr_failures:5 max_failures:2 event_index:4 event_state:1
motr[00001]:  dda0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:14>: nr_failures:3 max_failures:2 event_index:4 event_state:1
motr[00001]:  dda0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:3>: nr_failures:5 max_failures:2 event_index:2 event_state:1
motr[00001]:  dda0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:6>: nr_failures:5 max_failures:2 event_index:2 event_state:1
motr[00001]:  dda0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7600000000000001:b8>: nr_failures:4 max_failures:2 event_index:5 event_state:1
motr[00001]:  dda0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:11>: nr_failures:4 max_failures:2 event_index:5 event_state:1
motr[00001]:  dda0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:3>: nr_failures:4 max_failures:2 event_index:3 event_state:1
motr[00001]:  dda0  ERROR  [pool/pool_machine.c:783:m0_poolmach_state_transit]  <7680000000000002:6>: nr_failures:4 max_failures:2 event_index:3 event_state:1
motr[00001]:  6010  ERROR  [rpc/rpc.c:119:m0_rpc__post_locked]  <! rc=-107
motr[00001]:  6540  ERROR  [cas/client.c:556:cas_req_failure_ast]  <! rc=-107
motr[00001]:  9010  ERROR  [rpc/rpc.c:119:m0_rpc__post_locked]  <! rc=-107
motr[00001]:  9540  ERROR  [cas/client.c:556:cas_req_failure_ast]  <! rc=-107
motr[00001]:  f010  ERROR  [rpc/rpc.c:119:m0_rpc__post_locked]  <! rc=-107
motr[00001]:  f540  ERROR  [cas/client.c:556:cas_req_failure_ast]  <! rc=-107
motr[00001]:  9010  ERROR  [rpc/rpc.c:119:m0_rpc__post_locked]  <! rc=-107
motr[00001]:  9540  ERROR  [cas/client.c:556:cas_req_failure_ast]  <! rc=-107
motr[00001]:  5610  ERROR  [fd/fd.c:425:tolerance_check]  <! rc=-22
motr[00001]:  5740  FATAL  [lib/assert.c:50:m0_panic]  panic: (({ unsigned __nr = (depth); unsigned i; for (i = 0; i < __nr && ({ children_nr[i] != 0 ; }); ++i) ; i == __nr; })) at pool_width_calc() (fd/fd.c:482)  [git: 2.0.0-790-9-g662e7a18] /etc/cortx/log/rgw/dbcf46ecb8524a26b17c207373397162/motr_trace_files/m0trace.1.2022-05-30-11:05:05
Motr panic: (({ unsigned __nr = (depth); unsigned i; for (i = 0; i < __nr && ({ children_nr[i] != 0 ; }); ++i) ; i == __nr; })) at pool_width_calc() fd/fd.c:482 (errno: 11) (last failed: none) [git: 2.0.0-790-9-g662e7a18] pid: 1  /etc/cortx/log/rgw/dbcf46ecb8524a26b17c207373397162/motr_trace_files/m0trace.1.2022-05-30-11:05:05
/lib64/libmotr.so.2(m0_arch_backtrace+0x33)[0x7f77f382f1a3]
/lib64/libmotr.so.2(m0_arch_panic+0xe9)[0x7f77f382f379]
/lib64/libmotr.so.2(m0_panic+0x13d)[0x7f77f381de2d]
/lib64/libmotr.so.2(m0_fd__tile_init+0x1ae)[0x7f77f37dc24e]
/lib64/libmotr.so.2(m0_fd_tile_build+0xad)[0x7f77f37dc5cd]
/lib64/libmotr.so.2(m0_pool_version_init_by_conf+0x1a7)[0x7f77f3890c07]
/lib64/libmotr.so.2(m0_pool_version_append+0xf8)[0x7f77f3892118]
/lib64/libmotr.so.2(+0x4010ff)[0x7f77f38980ff]
/lib64/libmotr.so.2(m0_pool_version_get+0x18c)[0x7f77f3890a0c]
/lib64/libmotr.so.2(m0_layout_find_by_objsz+0x34)[0x7f77f38187e4]
/lib64/libradosgw.so.2(_ZN3rgw3sal10MotrObject11create_mobjEPK18DoutPrefixProviderm+0x42e)[0x7f77f993627e]
/lib64/libradosgw.so.2(_ZN3rgw3sal16MotrAtomicWriter5writeEv+0x980)[0x7f77f9948890]
/lib64/libradosgw.so.2(_ZN9RGWPutObj7executeE14optional_yield+0xd5d)[0x7f77f96a6f5d]
/lib64/libradosgw.so.2(_Z25rgw_process_authenticatedP15RGWHandler_RESTRP5RGWOpP10RGWRequestP9req_state14optional_yieldPN3rgw3sal5StoreEb+0xb3f)[0x7f77f92fb7df]
/lib64/libradosgw.so.2(_Z15process_requestPN3rgw3sal5StoreEP7RGWRESTP10RGWRequestRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS_4auth16StrategyRegistryEP12RGWRestfulIOP10OpsLogSink14optional_yieldPNS_7dmclock9SchedulerEPSC_PNSt6chrono8durationImSt5ratioILl1ELl1000000000EEEESt10shared_ptrI11RateLimiterEPi+0x25c6)[0x7f77f92fe6f6]
/lib64/libradosgw.so.2(+0x4cd0aa)[0x7f77f926b0aa]
/lib64/libradosgw.so.2(+0x4ce751)[0x7f77f926c751]
/lib64/libradosgw.so.2(+0x4ce8cc)[0x7f77f926c8cc]
/lib64/libradosgw.so.2(make_fcontext+0x2f)[0x7f77f9b8a65f]
*** Caught signal (Aborted) **
 in thread 7f77d92a4700 thread_name:radosgw
 ceph version 17.0.0-10334-gbdae4dbc0c9 (bdae4dbc0c9a5ccd3d2d3cb430f4d0085802cef4) quincy (dev)
 1: /lib64/libpthread.so.0(+0x12b30) [0x7f77f67b8b30]
 2: gsignal()
 3: abort()
 4: /lib64/libmotr.so.2(+0x398383) [0x7f77f382f383]
 5: m0_panic()
 6: m0_fd__tile_init()
 7: m0_fd_tile_build()
 8: m0_pool_version_init_by_conf()
 9: m0_pool_version_append()
 10: /lib64/libmotr.so.2(+0x4010ff) [0x7f77f38980ff]
 11: m0_pool_version_get()
 12: m0_layout_find_by_objsz()
 13: (rgw::sal::MotrObject::create_mobj(DoutPrefixProvider const*, unsigned long)+0x42e) [0x7f77f993627e]
 14: (rgw::sal::MotrAtomicWriter::write()+0x980) [0x7f77f9948890]
 15: (RGWPutObj::execute(optional_yield)+0xd5d) [0x7f77f96a6f5d]
 16: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, rgw::sal::Store*, bool)+0xb3f) [0x7f77f92fb7df]
 17: (process_request(rgw::sal::Store*, RGWREST*, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSink*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, std::shared_ptr<RateLimiter>, int*)+0x25c6) [0x7f77f92fe6f6]
 18: /lib64/libradosgw.so.2(+0x4cd0aa) [0x7f77f926b0aa]
 19: /lib64/libradosgw.so.2(+0x4ce751) [0x7f77f926c751]
 20: /lib64/libradosgw.so.2(+0x4ce8cc) [0x7f77f926c8cc]
 21: make_fcontext()
2022-05-30T16:10:39.967+0000 7f77d92a4700 -1 *** Caught signal (Aborted) **
 in thread 7f77d92a4700 thread_name:radosgw

Versions:

[root@ssc-vm-g4-rhev4-1490 ~]# kubectl exec -it cortx-server-ssc-vm-g4-rhev4-1490-59469c57cd-9xdz6 -c cortx-rgw -- rpm -qa | grep -E cortx\|radosgw
cortx-rgw-integration-2.0.0-5068_765d062.noarch
cortx-motr-2.0.0-5155_git662e7a18.el8.x86_64
cortx-provisioner-2.0.0-5038_0aa6ce08.noarch
cortx-py-utils-2.0.0-5043_a2e13c4.noarch
ceph-radosgw-17.0.0-10334.gbdae4dbc0c9.el8.x86_64
cortx-hare-2.0.0-5229_git5443f9c.el8.x86_64

Container image versions from the solution.yaml file:

    cortxcontrol: cortx-docker.colo.seagate.com/seagate/cortx-all:2.0.0-5518
    cortxdata: cortx-docker.colo.seagate.com/seagate/cortx-data:2.0.0-5518
    cortxserver: cortx-docker.colo.seagate.com/seagate/cortx-rgw:2.0.0-5518
    cortxha: cortx-docker.colo.seagate.com/seagate/cortx-all:2.0.0-5518
    cortxclient: cortx-docker.colo.seagate.com/seagate/cortx-data:2.0.0-5518

Script to reproduce the issue:

#!/bin/bash

dd if=/dev/urandom of=/tmp/200m bs=1M count=200
dd if=/tmp/200m of=/tmp/196m bs=1M count=196

pods=($(kubectl get pods | grep x-data | awk '{print $1}'))

put='aws s3api put-object --bucket test-bucket --key 200m --body /tmp/200m --endpoint-url http://192.168.60.187:30080'
get='aws s3api get-object --bucket test-bucket --key 200m --range bytes=0-205520895 /tmp/196m.check --endpoint-url http://192.168.60.187:30080'

# kills random m0d-ios
kill()
{
  n=${#pods[@]}
  i=$(($RANDOM % $n))
  c=$(($RANDOM % 2 + 1))
  kubectl exec -it ${pods[$i]} -c cortx-motr-io-00$c -- /bin/pkill -9 m0d
}

$put || exit 1
rc=$?

while [[ $rc -eq 0 ]] && $get && cmp /tmp/196m{,.check}; do
  { $put; rc=$?; } &
  sleep 3
  $(kill)
  wait
  kubectl get pods
done

To see the motr errors from rgw, run this command:

kubectl get pods | grep x-server | awk '{print $1}' | while read p; do kubectl logs $p -c cortx-rgw -f & done

For the convenience of the Seagate development team, this issue has been mirrored in a private Seagate Jira Server: https://jts.seagate.com/browse/CORTX-31844. Note that community members will not be able to access that Jira server but that is not a problem since all activity in that Jira mirror will be copied into this GitHub issue.


siningwuseagate commented Jun 6, 2022

After analysing the m0trace files collected from the crash, the code path leading to the crash can be explained in more detail as follows:

  1. m0_layout_find_by_objsz() --> m0_pool_version_get() --> pool->po_pver_policy->pp_ops->ppo_get()
    ppo_get is the function pointer to pver_first_available_get(), which does the following work:
    it tries to get a clean pool version from the cache by calling m0_pool_clean_pver_find(). In our case, as we observed,
    the cached pool version is dirty (we may need to dig further to understand how this happens), so a new pool version
    is created by calling m0_conf_pver_get() and appended to the pool using m0_pool_version_append().

  2. m0_pool_version_append() --> m0_pool_version_init_by_conf() --> m0_fd_tile_build()
    m0_fd_tile_build() calls symm_tree_attr_get() to get the failure domain tree's attributes, one of which is the
    minimum number of children at each level.

    In our case, the tree for the pool version has 4 levels
    (root, M0_CONF_PVER_LVL_ENCLS, M0_CONF_PVER_LVL_CTRLS, M0_CONF_PVER_LVL_DRIVES), and the minimum
    number of children at the top 3 levels is 3, 1, 0. Although symm_tree_attr_get() calls tolerance_check()
    at the end to check the failure settings, tolerance_check() only checks the top 2 levels and ignores the 3rd level
    (M0_CONF_PVER_LVL_CTRLS).

  3. After the above checks, m0_fd__tile_init() is called and it calls pool_width_calc(), which asserts that the minimum
    number of children at each of the top 3 levels (root, M0_CONF_PVER_LVL_ENCLS, M0_CONF_PVER_LVL_CTRLS) is not 0.
    Since the number at M0_CONF_PVER_LVL_CTRLS is 0, that leads to the panic (see the sketch below).
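
The assertion reported in the panic message reduces to the check sketched below; this is a paraphrase of the macro visible in the trace, with illustrative names rather than the actual motr source:

#include <stdbool.h>
#include <stdint.h>

/*
 * Paraphrase of the condition asserted in pool_width_calc() as shown in the
 * panic message: walk the per-level children counts and require every level
 * up to `depth` to be non-zero.  A 0 at any level makes the check fail and
 * triggers m0_panic().
 */
static bool children_nr_all_nonzero(const uint64_t *children_nr, unsigned depth)
{
	unsigned i;

	for (i = 0; i < depth && children_nr[i] != 0; ++i)
		;
	return i == depth; /* false as soon as some level reports 0 children */
}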

To avoid the panic, the fix adds a check in symm_tree_attr_get() to ensure that the minimum number of children at each level is greater than 0; otherwise -EINVAL is returned.
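
For illustration only, here is a minimal sketch of the kind of validation the fix adds in symm_tree_attr_get(); the function, parameter, and array names below are assumptions, not the actual motr identifiers (see PR #1856 for the real change):

#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical sketch of the extra validation: reject the tree attributes if
 * any level of the failure-domain tree ends up with a minimum of 0 children,
 * so the caller gets -EINVAL instead of asserting later in pool_width_calc().
 */
static int children_min_nr_check(const uint64_t *children_min_nr, unsigned depth)
{
	unsigned i;

	for (i = 0; i < depth; ++i) {
		if (children_min_nr[i] == 0)
			return -EINVAL; /* fail gracefully instead of panicking */
	}
	return 0;
}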

PR for the fix: #1856

siningwuseagate added a commit to siningwuseagate/cortx-motr that referenced this issue Jun 6, 2022
Problem: as described in issue Seagate#1838, there exists a case in
which the failure domain tree built for a pool version has 4 levels
(root, M0_CONF_PVER_LVL_ENCLS, M0_CONF_PVER_LVL_CTRLS,
M0_CONF_PVER_LVL_DRIVES), and the minimum number of children at the
top 3 levels is 3, 1, 0.

Although symm_tree_attr_get() calls tolerance_check() at the end to
check the failure settings, tolerance_check() only checks the top 2
levels and ignores the 3rd level (M0_CONF_PVER_LVL_CTRLS).

After the above checks, m0_fd__tile_init() is called and it calls
pool_width_calc(), which asserts that the minimum number of children
at each of the top 3 levels (root, M0_CONF_PVER_LVL_ENCLS,
M0_CONF_PVER_LVL_CTRLS) is not 0. Since the number at
M0_CONF_PVER_LVL_CTRLS is 0, that leads to the panic.

Solution: to avoid the panic, add a check in symm_tree_attr_get() to
ensure the minimum number of children at each level is greater than 0;
otherwise -EINVAL is returned.

Signed-off-by: Sining Wu <sining.wu@seagate.com>
@siningwuseagate

RGW fix to avoid the panic by Andriy: Seagate/cortx-rgw@12d90d3

@r-wambui added the Triage: DevTeam and Status: L2 Triage labels Jun 8, 2022

stale bot commented Jun 13, 2022

This issue/pull request has been marked as needs attention as it has been left pending without new activity for 4 days. Tagging @nkommuri @mehjoshi @huanghua78 for appropriate assignment. Sorry for the delay & Thank you for contributing to CORTX. We will get back to you as soon as possible.

@chandradharraval

Hi @andriytk,
I see the RGW fix is integrated, based on the above comments from @siningwuseagate. Is any further work pending for this issue?

stale bot removed the needs-attention label Jun 13, 2022
@chandradharraval

Hi @andriytk,
Are we good to close this based on the above comment?

@andriytk
Contributor Author

No, the fix has not landed yet - #1856.

mehjoshi pushed a commit that referenced this issue Jun 17, 2022
Problem: as described in issue #1838, there exists a case in which
the failure domain tree built for a pool version has 4 levels
(root, M0_CONF_PVER_LVL_ENCLS, M0_CONF_PVER_LVL_CTRLS,
M0_CONF_PVER_LVL_DRIVES), and the minimum number of children at the
top 3 levels is 3, 1, 0.

Although symm_tree_attr_get() calls tolerance_check() at the end to
check the failure settings, tolerance_check() only checks the top 2
levels and ignores the 3rd level (M0_CONF_PVER_LVL_CTRLS).

After the above checks, m0_fd__tile_init() is called and it calls
pool_width_calc(), which asserts that the minimum number of children
at each of the top 3 levels (root, M0_CONF_PVER_LVL_ENCLS,
M0_CONF_PVER_LVL_CTRLS) is not 0. Since the number at
M0_CONF_PVER_LVL_CTRLS is 0, that leads to the panic.

Solution: to avoid the panic, add a check in symm_tree_attr_get() to
ensure the minimum number of children at each level is greater than 0;
otherwise -EINVAL is returned.

* conf: check pvs_tolerance is greater than 0 before decreasing it

Signed-off-by: Sining Wu <sining.wu@seagate.com>

Gaurav Chaudhari commented in Jira Server:

motr - main branch build pipeline SUCCESS

Build Info:

Image Location:

  • cortx-docker.colo.seagate.com/seagate/cortx-all:2.0.0-5638
  • cortx-docker.colo.seagate.com/seagate/cortx-rgw:2.0.0-5638
  • cortx-docker.colo.seagate.com/seagate/cortx-data:2.0.0-5638
  • cortx-docker.colo.seagate.com/seagate/cortx-control:2.0.0-5638


Chandradhar Raval commented in Jira Server:

Marking this issue Closed; the corresponding PR #1856 is merged.

mehjoshi pushed a commit to mehjoshi/cortx-motr that referenced this issue Jul 18, 2022
…ate#1856)

