This repository has been archived by the owner on May 3, 2024. It is now read-only.

CORTX-29713: m0_conf_pver_status() now returns CRITICAL if max failures reached at any level #1571

Merged
mehjoshi merged 4 commits into Seagate:main on Apr 6, 2022

Conversation

AbhishekSahaSeagate
Contributor

When the allowance at some level was set to less than K and that number of failures
was reached, the bytecount had become critical, but the old logic still marked it as
degraded because the total number of failures was less than K.

For example, in a 3-node cluster with SNS 4+2+0, at most 1 node failure is supported.
So when a node failed, the data should have become critical, but the old logic marked
it as degraded because the number of failures was 1, which is < K.

The logic is updated to check whether the failures at any level have reached that
level's maximum. If they have, the pool version is marked as CRITICAL even if failures < K.
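The sketch below is a minimal, hypothetical illustration of this rule, not the actual conf/pvers.c code: level_info, pool_state and pver_state_of() are made-up names, and the only point it demonstrates is that hitting the per-level allowance forces CRITICAL even while total failures < K.

```c
/*
 * Hypothetical sketch only -- not the real Motr implementation.
 * level_info, pool_state and pver_state_of() are illustrative names.
 */
#include <stdbool.h>
#include <stdio.h>

enum pool_state { POOL_HEALTHY, POOL_DEGRADED, POOL_CRITICAL };

struct level_info {
	unsigned allowance;  /* max failures tolerated at this level */
	unsigned failures;   /* failures currently observed at this level */
};

static enum pool_state pver_state_of(const struct level_info *levels,
				     unsigned nr_levels, unsigned k)
{
	unsigned total = 0;
	bool     max_reached = false;
	unsigned i;

	for (i = 0; i < nr_levels; ++i) {
		total += levels[i].failures;
		if (levels[i].allowance > 0 &&
		    levels[i].failures >= levels[i].allowance)
			max_reached = true;
	}
	if (total == 0)
		return POOL_HEALTHY;
	/*
	 * New rule: hitting the allowance at any level is CRITICAL,
	 * even when the total failure count is still below K.
	 */
	if (max_reached || total >= k)
		return POOL_CRITICAL;
	return POOL_DEGRADED;
}

int main(void)
{
	/* 3-node cluster, SNS 4+2+0 (K = 2): the node level tolerates 1 failure. */
	struct level_info levels[] = {
		{ .allowance = 1, .failures = 1 },  /* node level: 1 of 1 failed */
	};
	enum pool_state st = pver_state_of(levels, 1, 2);

	printf("%s\n", st == POOL_CRITICAL ? "CRITICAL" :
		       st == POOL_DEGRADED ? "DEGRADED" : "HEALTHY");
	return 0;
}
```

With only the old total >= K rule, the same input would print DEGRADED instead of CRITICAL.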

Signed-off-by: Abhishek Saha abhishek.saha@seagate.com

Problem Statement

  • Problem statement

Design

  • For Bug, Describe the fix here.
  • For Feature, Post the link for design

Coding

Checklist for Author

  • Coding conventions are followed and code is consistent

Testing

Checklist for Author

  • Unit and System Tests are added
  • Test Cases cover Happy Path, Non-Happy Path and Scalability
  • Testing was performed with RPM

Impact Analysis

Checklist for Author/Reviewer/GateKeeper

  • Interface change (if any) are documented
  • Side effects on other features (deployment/upgrade)
  • Dependencies on other component(s)

Review Checklist

Checklist for Author

  • JIRA number/GitHub Issue added to PR
  • PR is self reviewed
  • Jira and state/status is updated and JIRA is updated with PR link
  • Check if the description is clear and explained

Documentation

Checklist for Author

  • Changes done to WIKI / Confluence page / Quick Start Guide

CORTX-29713: m0_conf_pver_status() now returns CRITICAL if max failures reached at any level

When the allowance at some level was set to less than K and that number of failures
was reached, the bytecount had become critical, but the old logic still marked it as
degraded because the total number of failures was less than K.

For example, in a 3-node cluster with SNS 4+2+0, at most 1 node failure is supported.
So when a node failed, the data should have become critical, but the old logic marked
it as degraded because the number of failures was 1, which is < K.

The logic is updated to check whether the failures at any level have reached that
level's maximum. If they have, the pool version is marked as CRITICAL even if failures < K.

Signed-off-by: Abhishek Saha <abhishek.saha@seagate.com>
conf/pvers.c (review comment: outdated, resolved)
@cortx-admin

Jenkins CI Result : Motr#1134

Motr Test Summary

Test Result | Count | Info

❌ Failed | 1

01motr-single-node/00userspace-tests

🏁 Skipped | 32

01motr-single-node/28sys-kvs
01motr-single-node/35m0singlenode
01motr-single-node/04initscripts
01motr-single-node/37protocol
02motr-single-node/51kem
02motr-single-node/20rpc-session-cancel
02motr-single-node/10pver-assign
02motr-single-node/21fsync-single-node
02motr-single-node/13dgmode-io
02motr-single-node/14poolmach
02motr-single-node/11m0t1fs
02motr-single-node/26motr-user-kernel-tests
02motr-single-node/08spiel
03motr-single-node/06conf
03motr-single-node/36spare-reservation
04motr-single-node/34sns-repair-1n-1f
04motr-single-node/08spiel-sns-repair-quiesce
04motr-single-node/28sys-kvs-kernel
04motr-single-node/11m0t1fs-rconfc-fail
04motr-single-node/08spiel-sns-repair
04motr-single-node/19sns-repair-abort
04motr-single-node/22sns-repair-ios-fail
05motr-single-node/18sns-repair-quiesce
05motr-single-node/12fwait
05motr-single-node/16sns-repair-multi
05motr-single-node/07mount-fail
05motr-single-node/15sns-repair-single
05motr-single-node/23sns-abort-quiesce
05motr-single-node/17sns-repair-concurrent-io
05motr-single-node/07mount
05motr-single-node/07mount-multiple
05motr-single-node/12fsync

✔️ Passed | 40

01motr-single-node/43m0crate
01motr-single-node/05confgen
01motr-single-node/06hagen
01motr-single-node/52motr-singlenode-sanity
01motr-single-node/01net
01motr-single-node/01kernel-tests
01motr-single-node/03console
01motr-single-node/02rpcping
02motr-single-node/07m0d-fatal
02motr-single-node/67fdmi-plugin-multi-filters
02motr-single-node/53clusterusage-alert
02motr-single-node/41motr-conf-update
03motr-single-node/61sns-repair-motr-1n-1f
03motr-single-node/08spiel-multi-confd
03motr-single-node/69sns-repair-motr-quiesce
03motr-single-node/62sns-repair-motr-mf
03motr-single-node/70sns-failure-after-repair-quiesce
03motr-single-node/63sns-repair-motr-1k-1f
03motr-single-node/60sns-repair-motr-1f
03motr-single-node/66sns-repair-motr-abort-quiesce
03motr-single-node/24motr-dix-repair-lookup-insert-spiel
03motr-single-node/68sns-repair-motr-shutdown
03motr-single-node/64sns-repair-motr-ios-fail
03motr-single-node/24motr-dix-repair-lookup-insert-m0repair
03motr-single-node/04sss
03motr-single-node/65sns-repair-motr-abort
04motr-single-node/48motr-raid0-io
04motr-single-node/49motr-rpc-cancel
04motr-single-node/25m0kv
04motr-single-node/44motr-rm-lock-cc-io
04motr-single-node/45motr-rmw
05motr-single-node/23dix-repair-m0repair
05motr-single-node/43motr-sync-replication
05motr-single-node/42motr-utils
05motr-single-node/45motr-sns-repair-N-1
05motr-single-node/40motr-dgmode
05motr-single-node/23dix-repair-quiesce-m0repair
05motr-single-node/23spiel-dix-repair-quiesce
05motr-single-node/44motr-sns-repair
05motr-single-node/23spiel-dix-repair

Total | 73

CppCheck Summary

   Cppcheck: No new warnings found 👍

Now, instead of returning whether any level has reached its maximum failures or not,
it returns an integer marking MAX_FAILURE_NOT_REACHED, MAX_FAILURE_REACHED, or
MAX_FAILURE_EXCEEDED. This further helps in marking the DEGRADED, CRITICAL and
DAMAGED states more accurately.

Changed the logic of m0_conf_pver_status() accordingly.
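A hedged sketch of the three-valued check described above, assuming hypothetical names (level_info, level_failure_check()) apart from the MAX_FAILURE_* values mentioned in the commit message:

```c
/* Illustrative sketch; level_info and level_failure_check() are hypothetical. */
enum max_failure_level {
	MAX_FAILURE_NOT_REACHED,  /* every level is below its allowance     */
	MAX_FAILURE_REACHED,      /* some level is exactly at its allowance */
	MAX_FAILURE_EXCEEDED      /* some level has gone past its allowance */
};

struct level_info {
	unsigned allowance;  /* max failures tolerated at this level */
	unsigned failures;   /* failures currently observed          */
};

enum max_failure_level
level_failure_check(const struct level_info *levels, unsigned nr_levels)
{
	enum max_failure_level res = MAX_FAILURE_NOT_REACHED;
	unsigned               i;

	for (i = 0; i < nr_levels; ++i) {
		if (levels[i].allowance == 0)
			continue;
		if (levels[i].failures > levels[i].allowance)
			return MAX_FAILURE_EXCEEDED;  /* worst case, stop early */
		if (levels[i].failures == levels[i].allowance)
			res = MAX_FAILURE_REACHED;
	}
	return res;
}
```

m0_conf_pver_status() can then use the result to mark the pool version, for example treating MAX_FAILURE_REACHED as CRITICAL and MAX_FAILURE_EXCEEDED as DAMAGED, in line with the description above.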

Signed-off-by: Abhishek Saha <abhishek.saha@seagate.com>
conf/pvers.c (review comment: resolved)
@mehjoshi mehjoshi requested a review from madhavemuri April 1, 2022 10:08
@mehjoshi

mehjoshi commented Apr 4, 2022

@madhavemuri, @yeshpal-jain-seagate, the code was updated after Madhav's last review and approval.
Could you re-review the code changes?

Signed-off-by: Abhishek Saha <abhishek.saha@seagate.com>
@mehjoshi mehjoshi merged commit 2d8a769 into Seagate:main Apr 6, 2022
6 participants