This repository has been archived by the owner on May 3, 2024. It is now read-only.

CORTX-29713: m0_conf_pver_status() now returns CRITICAL if max failures reached at any level #1571

Merged
mehjoshi merged 4 commits into Seagate:main on Apr 6, 2022

Conversation

AbhishekSahaSeagate
Contributor

When the allowance at some level was set to less than K and that number of failures
was reached, the bytecount had become critical, but the old logic still marked it as
degraded because the total number of failures was less than K.

For example, in a 3-node cluster with SNS 4+2+0, at most 1 node failure is supported.
So when a node failed, the data should have become critical, but the old logic marked
it as degraded because the number of failures was 1, which is < K.

The logic is updated to check whether the failures at any level have reached that
level's maximum. If they have, the pool version is marked as CRITICAL even if failures < K.
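The sketch below is a minimal, hypothetical illustration of this rule, not the actual conf/pvers.c code: level_info, pool_state and pver_state_of() are made-up names, and the only point it demonstrates is that hitting the per-level allowance forces CRITICAL even while total failures < K.

```c
/*
 * Hypothetical sketch only -- not the real Motr implementation.
 * level_info, pool_state and pver_state_of() are illustrative names.
 */
#include <stdbool.h>
#include <stdio.h>

enum pool_state { POOL_HEALTHY, POOL_DEGRADED, POOL_CRITICAL };

struct level_info {
	unsigned allowance;  /* max failures tolerated at this level */
	unsigned failures;   /* failures currently observed at this level */
};

static enum pool_state pver_state_of(const struct level_info *levels,
				     unsigned nr_levels, unsigned k)
{
	unsigned total = 0;
	bool     max_reached = false;
	unsigned i;

	for (i = 0; i < nr_levels; ++i) {
		total += levels[i].failures;
		if (levels[i].allowance > 0 &&
		    levels[i].failures >= levels[i].allowance)
			max_reached = true;
	}
	if (total == 0)
		return POOL_HEALTHY;
	/*
	 * New rule: hitting the allowance at any level is CRITICAL,
	 * even when the total failure count is still below K.
	 */
	if (max_reached || total >= k)
		return POOL_CRITICAL;
	return POOL_DEGRADED;
}

int main(void)
{
	/* 3-node cluster, SNS 4+2+0 (K = 2): the node level tolerates 1 failure. */
	struct level_info levels[] = {
		{ .allowance = 1, .failures = 1 },  /* node level: 1 of 1 failed */
	};
	enum pool_state st = pver_state_of(levels, 1, 2);

	printf("%s\n", st == POOL_CRITICAL ? "CRITICAL" :
		       st == POOL_DEGRADED ? "DEGRADED" : "HEALTHY");
	return 0;
}
```

With only the old total >= K rule, the same input would print DEGRADED instead of CRITICAL.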

Signed-off-by: Abhishek Saha abhishek.saha@seagate.com

Problem Statement

  • Problem statement

Design

  • For Bug, Describe the fix here.
  • For Feature, Post the link for design

Coding

Checklist for Author

  • Coding conventions are followed and code is consistent

Testing

Checklist for Author

  • Unit and System Tests are added
  • Test Cases cover Happy Path, Non-Happy Path and Scalability
  • Testing was performed with RPM

Impact Analysis

Checklist for Author/Reviewer/GateKeeper

  • Interface change (if any) are documented
  • Side effects on other features (deployment/upgrade)
  • Dependencies on other component(s)

Review Checklist

Checklist for Author

  • JIRA number/GitHub Issue added to PR
  • PR is self reviewed
  • Jira and state/status is updated and JIRA is updated with PR link
  • Check if the description is clear and explained

Documentation

Checklist for Author

  • Changes done to WIKI / Confluence page / Quick Start Guide

CORTX-29713: m0_conf_pver_status() now returns CRITICAL if max failures reached at any level

When the allowance at some level was set to less than K and that number of failures
was reached, the bytecount had become critical, but the old logic still marked it as
degraded because the total number of failures was less than K.

For example, in a 3-node cluster with SNS 4+2+0, at most 1 node failure is supported.
So when a node failed, the data should have become critical, but the old logic marked
it as degraded because the number of failures was 1, which is < K.

The logic is updated to check whether the failures at any level have reached that
level's maximum. If they have, the pool version is marked as CRITICAL even if failures < K.

Signed-off-by: Abhishek Saha <abhishek.saha@seagate.com>
conf/pvers.c (review comment: outdated, resolved)
@cortx-admin

Jenkins CI Result : Motr#1134

Motr Test Summary

Test Result | Count | Info

❌ Failed | 1

01motr-single-node/00userspace-tests

🏁 Skipped | 32

01motr-single-node/28sys-kvs
01motr-single-node/35m0singlenode
01motr-single-node/04initscripts
01motr-single-node/37protocol
02motr-single-node/51kem
02motr-single-node/20rpc-session-cancel
02motr-single-node/10pver-assign
02motr-single-node/21fsync-single-node
02motr-single-node/13dgmode-io
02motr-single-node/14poolmach
02motr-single-node/11m0t1fs
02motr-single-node/26motr-user-kernel-tests
02motr-single-node/08spiel
03motr-single-node/06conf
03motr-single-node/36spare-reservation
04motr-single-node/34sns-repair-1n-1f
04motr-single-node/08spiel-sns-repair-quiesce
04motr-single-node/28sys-kvs-kernel
04motr-single-node/11m0t1fs-rconfc-fail
04motr-single-node/08spiel-sns-repair
04motr-single-node/19sns-repair-abort
04motr-single-node/22sns-repair-ios-fail
05motr-single-node/18sns-repair-quiesce
05motr-single-node/12fwait
05motr-single-node/16sns-repair-multi
05motr-single-node/07mount-fail
05motr-single-node/15sns-repair-single
05motr-single-node/23sns-abort-quiesce
05motr-single-node/17sns-repair-concurrent-io
05motr-single-node/07mount
05motr-single-node/07mount-multiple
05motr-single-node/12fsync

✔️ Passed | 40

01motr-single-node/43m0crate
01motr-single-node/05confgen
01motr-single-node/06hagen
01motr-single-node/52motr-singlenode-sanity
01motr-single-node/01net
01motr-single-node/01kernel-tests
01motr-single-node/03console
01motr-single-node/02rpcping
02motr-single-node/07m0d-fatal
02motr-single-node/67fdmi-plugin-multi-filters
02motr-single-node/53clusterusage-alert
02motr-single-node/41motr-conf-update
03motr-single-node/61sns-repair-motr-1n-1f
03motr-single-node/08spiel-multi-confd
03motr-single-node/69sns-repair-motr-quiesce
03motr-single-node/62sns-repair-motr-mf
03motr-single-node/70sns-failure-after-repair-quiesce
03motr-single-node/63sns-repair-motr-1k-1f
03motr-single-node/60sns-repair-motr-1f
03motr-single-node/66sns-repair-motr-abort-quiesce
03motr-single-node/24motr-dix-repair-lookup-insert-spiel
03motr-single-node/68sns-repair-motr-shutdown
03motr-single-node/64sns-repair-motr-ios-fail
03motr-single-node/24motr-dix-repair-lookup-insert-m0repair
03motr-single-node/04sss
03motr-single-node/65sns-repair-motr-abort
04motr-single-node/48motr-raid0-io
04motr-single-node/49motr-rpc-cancel
04motr-single-node/25m0kv
04motr-single-node/44motr-rm-lock-cc-io
04motr-single-node/45motr-rmw
05motr-single-node/23dix-repair-m0repair
05motr-single-node/43motr-sync-replication
05motr-single-node/42motr-utils
05motr-single-node/45motr-sns-repair-N-1
05motr-single-node/40motr-dgmode
05motr-single-node/23dix-repair-quiesce-m0repair
05motr-single-node/23spiel-dix-repair-quiesce
05motr-single-node/44motr-sns-repair
05motr-single-node/23spiel-dix-repair

Total | 73

CppCheck Summary

   Cppcheck: No new warnings found 👍

Now, instead of returning whether any level has reached its maximum failures or not,
it returns an integer marking MAX_FAILURE_NOT_REACHED, MAX_FAILURE_REACHED, or
MAX_FAILURE_EXCEEDED. This further helps in marking the DEGRADED, CRITICAL and
DAMAGED states more accurately.

Changed the logic of m0_conf_pver_status() accordingly.
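A hedged sketch of the three-valued check described above, assuming hypothetical names (level_info, level_failure_check()) apart from the MAX_FAILURE_* values mentioned in the commit message:

```c
/* Illustrative sketch; level_info and level_failure_check() are hypothetical. */
enum max_failure_level {
	MAX_FAILURE_NOT_REACHED,  /* every level is below its allowance     */
	MAX_FAILURE_REACHED,      /* some level is exactly at its allowance */
	MAX_FAILURE_EXCEEDED      /* some level has gone past its allowance */
};

struct level_info {
	unsigned allowance;  /* max failures tolerated at this level */
	unsigned failures;   /* failures currently observed          */
};

enum max_failure_level
level_failure_check(const struct level_info *levels, unsigned nr_levels)
{
	enum max_failure_level res = MAX_FAILURE_NOT_REACHED;
	unsigned               i;

	for (i = 0; i < nr_levels; ++i) {
		if (levels[i].allowance == 0)
			continue;
		if (levels[i].failures > levels[i].allowance)
			return MAX_FAILURE_EXCEEDED;  /* worst case, stop early */
		if (levels[i].failures == levels[i].allowance)
			res = MAX_FAILURE_REACHED;
	}
	return res;
}
```

m0_conf_pver_status() can then use the result to mark the pool version, for example treating MAX_FAILURE_REACHED as CRITICAL and MAX_FAILURE_EXCEEDED as DAMAGED, in line with the description above.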

Signed-off-by: Abhishek Saha <abhishek.saha@seagate.com>
conf/pvers.c (review comment: resolved)
@mehjoshi mehjoshi requested a review from madhavemuri April 1, 2022 10:08
@mehjoshi

mehjoshi commented Apr 4, 2022

@madhavemuri, @yeshpal-jain-seagate, the code was updated after Madhav's last review and approval.
Could you re-review the code changes?

Signed-off-by: Abhishek Saha <abhishek.saha@seagate.com>
@mehjoshi mehjoshi merged commit 2d8a769 into Seagate:main Apr 6, 2022
6 participants