Client crashes on too many failures #1838
Comments
For the convenience of the Seagate development team, this issue has been mirrored in a private Seagate Jira Server: https://jts.seagate.com/browse/CORTX-31844. Note that community members will not be able to access that Jira server, but that is not a problem since all activity in that Jira mirror will be copied into this GitHub issue.
Analysis of the m0trace files collected from the crash shows the code path leading to it, which is explained in more detail below:
To avoid the panic, a check is added in symm_tree_attr_get() to ensure that the minimum number of children at each level is greater than 0; otherwise -EINVAL is returned. PR for the fix: #1856
Problem: as described in issue Seagate#1838, there is a case in which the failure domain tree built for a pool version has 4 levels (root, M0_CONF_PVER_LVL_ENCLS, M0_CONF_PVER_LVL_CTRLS, M0_CONF_PVER_LVL_DRIVES) and the minimum numbers of children at the top 3 levels are 3, 1 and 0. At its end, symm_tree_attr_get() calls tolerance_check() to check the failure settings, but tolerance_check() only checks the top 2 levels and ignores the 3rd level (M0_CONF_PVER_LVL_CTRLS). After these checks, m0_fd__tile_init() is called; it calls pool_width_calc(), which asserts that the minimum number of children at each of the top 3 levels (root, M0_CONF_PVER_LVL_ENCLS, M0_CONF_PVER_LVL_CTRLS) is not 0. Because the number at M0_CONF_PVER_LVL_CTRLS is 0, the assertion fails and causes the panic.
Solution: to avoid the panic, add a check in symm_tree_attr_get() to ensure that the minimum number of children at each level is greater than 0; otherwise -EINVAL is returned.
Signed-off-by: Sining Wu <sining.wu@seagate.com>
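The check described in the commit message can be illustrated with a minimal, self-contained sketch. This is not the code from PR #1856: the function name sketch_min_children_check() and its array-based interface are hypothetical, and the real symm_tree_attr_get() works on Motr's own failure domain tree structures. The sketch only shows the idea of rejecting a tree early when any non-leaf level has a minimum child count of 0.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of Motr): children_nr[i] holds the
 * minimum number of children over all nodes at non-leaf level i,
 * with level 0 being the root. nr_levels is the number of non-leaf
 * levels.
 *
 * Returns 0 if every level has at least one child, -EINVAL otherwise,
 * so that the caller can fail gracefully instead of reaching a later
 * assertion.
 */
static int sketch_min_children_check(const uint64_t *children_nr,
                                     size_t nr_levels)
{
        size_t level;

        if (children_nr == NULL || nr_levels == 0)
                return -EINVAL;

        for (level = 0; level < nr_levels; ++level) {
                /* A level whose minimum child count is 0 is rejected. */
                if (children_nr[level] == 0)
                        return -EINVAL;
        }
        return 0;
}
```

With the values from this issue, an array holding 3, 1, 0 for the top 3 levels would be rejected with -EINVAL before any width calculation runs.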
RGW fix by Andriy to avoid the panic: Seagate/cortx-rgw@12d90d3
This issue/pull request has been marked as |
Hi @andriytk , |
HI @andriytk , |
No, the fix has not landed yet - #1856.
Problem: as described in issue #1838, there is a case in which the failure domain tree built for a pool version has 4 levels (root, M0_CONF_PVER_LVL_ENCLS, M0_CONF_PVER_LVL_CTRLS, M0_CONF_PVER_LVL_DRIVES) and the minimum numbers of children at the top 3 levels are 3, 1 and 0. At its end, symm_tree_attr_get() calls tolerance_check() to check the failure settings, but tolerance_check() only checks the top 2 levels and ignores the 3rd level (M0_CONF_PVER_LVL_CTRLS). After these checks, m0_fd__tile_init() is called; it calls pool_width_calc(), which asserts that the minimum number of children at each of the top 3 levels (root, M0_CONF_PVER_LVL_ENCLS, M0_CONF_PVER_LVL_CTRLS) is not 0. Because the number at M0_CONF_PVER_LVL_CTRLS is 0, the assertion fails and causes the panic.
Solution: to avoid the panic, add a check in symm_tree_attr_get() to ensure that the minimum number of children at each level is greater than 0; otherwise -EINVAL is returned.
* conf: check pvs_tolerance is greater than 0 before decreasing it
Signed-off-by: Sining Wu <sining.wu@seagate.com>
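The extra item in this commit, checking that a tolerance value is greater than 0 before decreasing it, guards against unsigned wraparound. The sketch below assumes the counter is an unsigned 32-bit value; the function name sketch_tolerance_dec() and the choice to skip the decrement (rather than report an error) are illustrative assumptions, not taken from the PR.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of Motr): decrements *tolerance only
 * when it is greater than 0. Without the guard, decrementing an
 * unsigned 0 would wrap around to UINT32_MAX.
 *
 * Returns true if the value was decremented, false if it was left
 * untouched.
 */
static bool sketch_tolerance_dec(uint32_t *tolerance)
{
        if (tolerance == NULL || *tolerance == 0)
                return false;
        --*tolerance;
        return true;
}
```

A caller could use the return value to decide whether a zero tolerance at that level needs to be reported.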
motr - main branch build pipeline SUCCESS
Image Location:
Marking this issue as Closed; the corresponding PR [https://github.com//pull/1856] is merged.
Versions:
Container image versions from the solution.yaml file:
Script to reproduce the issue:
To see the motr errors from rgw, run this command: