panic at m0_balloc_load_extents() (balloc/balloc.c:1347) #1845
For the convenience of the Seagate development team, this issue has been mirrored in a private Seagate Jira Server: https://jts.seagate.com/browse/CORTX-31906. Note that community members will not be able to access that Jira server, but that is not a problem, since all activity in that Jira mirror will be copied into this GitHub issue.
For the convenience of the Seagate development team, this issue has been mirrored in a private Seagate Jira Server: https://jts.seagate.com/browse/CORTX-31907. Note that community members will not be able to access that Jira server, but that is not a problem, since all activity in that Jira mirror will be copied into this GitHub issue.
Last m0trace records:

From gdb:
Looks like there are no more free extents in the balloc group?
Found a series of crash dumps on ssc-vm-g4-rhev4-1491 in the timeframe May 31 09:16 to May 31 10:20 (MDT time zone). Further analysis is from core.1654010193.35.
From the core dump, the details of the group_extents and group_desc btrees are as follows:
Further debugging shows that the root node in the Group Descriptor btree holds the group 2 descriptor at kv array index 2.
Please use this patch to dump the ex before checking it, so we can see what the 'ex' is.
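The attached patch itself is not inlined in this thread. As a minimal self-contained illustration of the idea (all names here are stand-ins, not the actual patch: fprintf/assert stand in for motr's M0_LOG/M0_ASSERT, and the local struct mirrors lib/ext.h's m0_ext):

```c
/* Sketch of the suggested debug change: dump the extent *before* the
 * validity check, so the failing values land in the trace log even
 * when the assertion fires. */
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

struct m0_ext {                 /* extent over [e_start, e_end) */
	uint64_t e_start;
	uint64_t e_end;
};

static int ext_is_valid(const struct m0_ext *ex)
{
	return ex->e_start < ex->e_end;
}

static void dump_then_check(const struct m0_ext *ex)
{
	/* Dump first, then assert. */
	fprintf(stderr, "ex=[0x%" PRIx64 ", 0x%" PRIx64 ")\n",
		ex->e_start, ex->e_end);
	assert(ext_is_valid(ex));
}

int main(void)
{
	struct m0_ext ex = { .e_start = 0x80000, .e_end = 0xa0000 };

	dump_then_check(&ex);
	return 0;
}
```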
[~520428], so far we are doing static analysis with the available core dump. Also, for group 2 we were able to extract extent details from the dump; please refer to my previous comments.
[~522123], Oh, I see.
We need to figure out if:
The other extent's length 0x100 is derived from the group 2 descriptor in the btree (the total free blocks are 0x300 and the first extent is of length 0x200), so there could be 2 possibilities.
Looking at the code so far, #1 looks like the more accurate case, as the btree update order is: the extent btree is updated first, then the group descriptor btree is updated later. A worked derivation is sketched below.
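As a tiny self-contained sketch of the derivation, using only the values quoted above (nothing motr-specific):

```c
/* Deriving the expected length of the remaining extent(s) in group 2
 * from the group descriptor values observed in the core dump. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t freeblocks = 0x300; /* total free blocks per the group 2 descriptor */
	uint64_t first_ext  = 0x200; /* length of the first extent in the extent btree */

	/* Whatever extents follow must account for the rest: 0x100. */
	printf("expected remaining length: 0x%" PRIx64 "\n",
	       freeblocks - first_ext);
	return 0;
}
```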
The Balloc Extent btree and the Balloc Group Desc btree can go out of sync with the following sequence of events:
To confirm that this sequence really occurs, a debug build was created with:
The custom build details are as follows; CORTX images are available at the location below. [~520414], can you please use the above build for your experiments and also share the steps you took before ending up in the balloc assert?
[~522123], where is the patch (to build custom-build-6911)?
Sure, I have attached the patch "balloc_debug_0623.txt".
There was a regression issue with the previous custom build #6911 (due to recent changes in rgw). Created new custom build #6930 with the other components' commit #s from the last sanity-passed build #837 ([https://github.com/seagate/cortx/pkgs/container/cortx-rgw/26506950?tag=2.0.0-837]). CORTX images are available at the location below. CFT Sanity is successful on custom build #6930; please refer to CORTX-32425 for more details.
As part of the clean-up activity, the [https://github.com/Seagate/cortx-motr/tree/CORTX-31907_balloc_debug] branch has been moved to the forked repo [https://github.com/mukundkanekar/cortx-motr/tree/CORTX-31907_balloc_debug].
So far there has been no luck trying to reproduce this on the custom build. [~530903], can you please share the Jira or support bundle details if this has been seen recently?
A similar panic is also observed in the happy-path Type-1 IO stability run with build #869.
System details:
Logs:
Support bundle location: /root/deploy-scripts/k8_cortx_cloud/logs-cortx-cloud-2022-07-27_21-59.tar
cc: [~522123], [~520428], [~522059], [~531171]
Sorry, [~530903], [~522123]
Found the below core dumps on the 'sc-vm-rhev4-2478' VM. Analyzed 'Jul 31 18:25 core.1659313526.42' further using gdb and build #869. The 2nd extent/fragment (e_start = 0xa0000, e_end = 0xc0000) under process belongs to the next group's (group 5) range. The next step is to extract the group descriptor info for group 4 from the group descriptor btree.
Extracted the 'struct m0_balloc' pointer from frame 5. Extracted the group descriptor btree info, the keys array details from the group descriptor btree, and the value (the group descriptor info for group 4) from the group descriptor btree. Summary (a sketch of the failing group-range check follows):
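In numbers, the cross-check that fails looks like the following self-contained sketch. The group size of 0x20000 blocks is an assumption, chosen only because it is consistent with e_start = 0xa0000 belonging to group 5 rather than group 4; the real check lives in m0_balloc_load_extents() in balloc/balloc.c.

```c
/* Sketch of the group-range cross-check: an extent loaded for group 4
 * must lie inside group 4's block range. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t group_size = 0x20000;  /* assumed blocks per group */
	uint64_t groupno    = 4;        /* group being loaded */
	uint64_t e_start    = 0xa0000;  /* 2nd fragment from the dump */
	uint64_t e_end      = 0xc0000;

	uint64_t lo = groupno * group_size;  /* group 4 spans [lo, hi) */
	uint64_t hi = lo + group_size;

	if (e_start < lo || e_end > hi)
		printf("extent [0x%" PRIx64 ", 0x%" PRIx64 ") lies outside "
		       "group %" PRIu64 " [0x%" PRIx64 ", 0x%" PRIx64 "); "
		       "it belongs to group %" PRIu64 "\n",
		       e_start, e_end, groupno, lo, hi,
		       e_start / group_size);
	return 0;
}
```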
The trace files for the period before the crash are not available; they could have helped us understand the events that led to this situation.
Observed a similar issue while testing the IO stability type-1 workload on custom build 7267 (DTM-enabled, with the IO stability resource limit changes).
Observed a similar panic in main build 869: cortx-all:2.0.0-869, service script version v0.8.0 (resource limits defined in F-99D). Setup configuration: the workload executed for 315 hrs. More details can be found under https://seagate-systems.atlassian.net/wiki/spaces/PRIVATECOR/pages/1086750721/Test+Summary+PI8
The assert occurs while trying to load extents, and a discrepancy is seen between the 2 balloc btrees, but there is not enough trace/debug info available to confirm how we ended up in this situation, as the inconsistency happened in the past. Custom build [#7478|https://eos-jenkins.colo.seagate.com/job/GitHub-custom-ci-builds/job/generic/job/custom-ci/7478/] was created to assert early in the places that could lead to such discrepancies; this can help to confirm the RCA. [~931947], [~522059], can we rerun the similar tests on one of the setups with custom build [#7478|https://eos-jenkins.colo.seagate.com/job/GitHub-custom-ci-builds/job/generic/job/custom-ci/7478/] to reproduce the issue? CORTX images are available at the location below.
Restarted the test after the panic mentioned in the above comment; observing continuous container restarts due to the panic. Not able to continue further testing because of this.
Raising severity to 1, as we are not able to continue further testing on the setup.
Can we restart the test on "Setup-2" with custom build #7478 to reproduce the issue?
Created new custom build [#7496|https://eos-jenkins.colo.seagate.com/job/GitHub-custom-ci-builds/job/generic/job/custom-ci/7496/]. Deployment was started on setup 2 using the Jenkins job https://eos-jenkins.colo.seagate.com/view/QA-R2/job/QA/job/K8s_Cortx_Continuous_Deployment_sns_config/1469/. Thank you, [~535790], for providing the details.
Deployment of custom build #7496 succeeded on setup 2 with the below Jenkins job.
Added an assert for when there is an intermediate return in alloc/free_db_update() due to an error. This will prevent inconsistencies between the 2 balloc btrees from becoming persistent and will avoid multiple subsequent panics at m0_balloc_load_extents(). A sketch of the idea follows.
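Conceptually, the change has roughly the following shape. This is a sketch with illustrative stand-in names, not motr's actual alloc/free_db_update() code:

```c
/* Sketch of the hardening: fail fast instead of returning part-way
 * through the two-btree update, so a half-done update (extent btree
 * changed, group descriptor btree not) can never become persistent.
 * All names below are stand-ins, not motr's API. */
#include <assert.h>

struct tx { int dummy; };                         /* stand-in transaction */

static int extent_btree_update(struct tx *tx)     { (void)tx; return 0; }
static int group_desc_btree_update(struct tx *tx) { (void)tx; return 0; }

static int alloc_db_update(struct tx *tx)
{
	int rc;

	rc = extent_btree_update(tx);
	/* Previously an error here caused an intermediate return,
	 * leaving the two btrees out of sync; now we assert. */
	assert(rc == 0);

	rc = group_desc_btree_update(tx);
	assert(rc == 0);
	return rc;
}

int main(void)
{
	struct tx tx = { 0 };

	return alloc_db_update(&tx);
}
```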
motr - main branch build pipeline SUCCESS
Image Location:
The IO stability test has been running on custom build #7496 for the last 2 days 21 hrs (https://eos-jenkins.colo.seagate.com/job/QA/job/IOStabilityTestRuns/367/). PR #2064 is merged; we will resolve the current JIRA once we confirm that multiple subsequent panics at m0_balloc_load_extents() are no longer seen.
Custom build #7496 is running fine for
Custom build #7496 has been running fine for the last 12+ days.
Custom build #7496 has been running fine for the last 13+ days. Closing for now: main build #895 and onwards has the assert for when there is an intermediate return in alloc/free_db_update() due to an error, so if the issue is seen again in the future we will have additional data to debug further.
An intermediate return in alloc/free_db_update() due to error -2 (ENOENT) was hit on build [2.0.0-916][v0.9.0]. Analysis:
Fix: for more details please refer to:
pods:
Such a situation and state was reproduced while trying to reproduce issue #1838.