
Automation of BZ#2305677 - Ceph mgr crashed after a mgr failover with the message mgr operator() Failed to run module in active mode #4077

Open
Wants to merge 1 commit into base: master from wip_rados_auto

Conversation

SrinivasaBharath (Contributor)

Automation of the customer BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2305677
Jira task: https://issues.redhat.com/browse/RHCEPHQE-15808

Description

Please follow the automation development guidelines. Source of test case: New Feature / Regression Test / Close loop of customer BZs.

Checklist
  • Create a test case in Polarion and get it reviewed and approved.
  • Create a design/automation approach doc (optional for tests with similar tests already automated).
  • Review the automation design.
  • Implement the test script and perform test runs.
  • Submit the PR for code review and approval.
  • Update the Polarion test with automation script details and update the automation fields.
  • If the automation is part of Close loop, set the BZ flag qe-test_coverage to “+” and link the Polarion test.

pdhiran (Contributor) commented Sep 10, 2024

Comments provided for the test in Meet.

Please add the test as part of the brownfield suite in 8.x.
Add flags to optionally recreate the hotfix bug scenario as needed.

pdhiran requested review from harshkumarRH, pdhiran and tintumathew10 and removed the request for harshkumarRH and pdhiran on September 19, 2024 05:59
@pdhiran (Contributor) left a comment

Changes needed


openshift-ci bot commented Sep 25, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: SrinivasaBharath

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

SrinivasaBharath (Contributor, Author) commented Sep 25, 2024

> Changes needed

The new logic is implemented and added into a separate file.

osd_list = []

for node in ceph_nodes:
    cmd_host_chk = f"ceph orch host ls --host_pattern {node.hostname}"
Contributor

This check is not required here; all it does is check whether the host exists. After test execution, we should make sure we leave the cluster in the same state it was in before the tests started.

This check can be removed.

```
HOST                                     ADDR         LABELS                                                        STATUS
ceph-pdhiran-3az-11h4o2-node1-installer  10.0.59.134  _admin,alertmanager,grafana,mgr,prometheus,osd,installer,mon
1 hosts in cluster whose hostname matched ceph-pdhiran-3az-11h4o2-node1-installer
```

if node.role == "osd":
    node_osds = rados_obj.collect_osd_daemon_ids(node)
    osd_list = osd_list + node_osds
osd_weight_chk = check_set_reweight(rados_obj, osd_list)
Contributor

With the removal failure in the earlier test, it is expected that the weights are 0.

Our test should verify the removal after the failure, plus the upgrade to the fixed builds.

After the upgrade, we should try to remove the same host that we were trying to remove earlier, whatever state it is in.

This should not be a reason for test failure; this is what we are verifying here.

Contributor Author

I removed the code.

@@ -0,0 +1,206 @@
"""
@pdhiran (Contributor) commented Oct 1, 2024

Workflow:

1. Deploy a cluster on a build where the bug exists.
   1.1. Write data into the cluster and create a few pools; this ensures the drain takes some time to complete.
2. Select one host for drain (host-1) and perform an mgr failover. We should see the issue now.
   2.1. Check the logs and check for crashes here (these are the OSDs that were being drained).
   2.2. Make sure that the ceph orch commands are stuck.
   2.3. In the finally block, apply the workaround.
   2.4. Try to reproduce the issue; at this point the issue should be reproduced.
        -> At this point, make sure the ceph orch commands work again, so that we can proceed to the next steps without issues.
3. Proceed to upgrade the cluster.
4. Post upgrade, start the removal of the same host (host-1) that failed earlier.
5. With the upgrade, the host removal should work as expected, since the bug is fixed.
6. Add back the host (host-1) to complete the lifecycle of removal and addition of the host.

A rough sketch of the reproduction part of this flow is shown below.
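For illustration only, here is a minimal sketch of that reproduction sequence using plain `ceph` CLI calls instead of the cephci helper objects used by the actual test; the host name, the 60-second probe timeout, and the timeout-based "stuck" detection are assumptions, not the implementation in this PR.

```python
# Minimal sketch only: drain a host, fail the mgr, then probe whether
# `ceph orch` commands hang (the symptom described in the BZ).
import subprocess


def sh(cmd, timeout=None):
    """Run a shell command and return stdout; raises on non-zero exit."""
    return subprocess.run(
        cmd, shell=True, check=True, capture_output=True, text=True, timeout=timeout
    ).stdout


def reproduce_bug(drain_host="host-1"):
    """Drain a host, fail the mgr, and report whether `ceph orch` got stuck."""
    sh(f"ceph orch host drain {drain_host}")   # step 2: start the drain
    sh("ceph mgr fail")                        # step 2: mgr failover

    try:
        sh("ceph orch ls", timeout=60)         # step 2.2: probe the orchestrator
        return False                           # orch responsive, bug not hit
    except subprocess.TimeoutExpired:
        return True                            # orch stuck, bug reproduced


if __name__ == "__main__":
    print("bug reproduced:", reproduce_bug())
```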

Contributor Author

Code is added with the steps mentioned above.

    bug_exists = True
elif int(major) == 18 and int(minor) == 2 and int(patch) < 1:
    bug_exists = True
elif int(major) == 18 and int(minor) == 2 and int(patch) == 1 and int(build) <= 194:
Contributor

Please change the comparison to the build number just before the fixed build.

Contributor Author

I changed the comparison according to the fixed build.
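For reference, a hedged sketch of the kind of version gate being discussed; the build cut-offs in the PR changed between revisions (194, later 234), and the fixed-build value of 235 used below is an assumption for illustration, not a value taken from the PR.

```python
# Illustrative only: decide whether a "major.minor.patch-build" Ceph version
# string predates the fix for the mgr crash.
def bug_exists_in(version, fixed_build=235):
    """Return True when the given version predates the (assumed) fixed build."""
    release, _, build = version.partition("-")
    major, minor, patch = (int(x) for x in release.split("."))
    if (major, minor, patch) < (18, 2, 1):
        return True                         # any release older than 18.2.1
    if (major, minor, patch) == (18, 2, 1):
        return int(build or 0) < fixed_build   # builds before the fixed build
    return False


if __name__ == "__main__":
    print(bug_exists_in("18.2.1-234"))  # True: last affected build per the diff
    print(bug_exists_in("18.2.1-235"))  # False: assumed fixed build
```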

SrinivasaBharath added the DNM (Do Not Merge) label on Oct 10, 2024
SrinivasaBharath (Contributor, Author)

The code was reviewed by Pawan and Harsh, who provided the following comments. The code was modified according to the review comments:

1. Log the start and end time of the tests.
   - Included in the code.
2. There should be a single finally block; move the workaround code from the finally block to the exception block.
   - Moved the code from the finally block to the exception block.
3. After every mgr fail, check the active mgr (a sketch of such a check is shown after this list).
   - Included the logic and added a method in mgr_workflows.py.
4. Check the config-key before and after the workaround.
   - Code implemented.
5. Remove the "--rm-crush-entry" check from the code.
   - Removed from the test case and the "remove_custom_host" method.
6. Check the OSD count before draining the host and after adding the host.
   - Logic is included.
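For item 3, a minimal hedged sketch of what an active-mgr check could look like; `ceph mgr stat` is a standard command, but the helper name, polling interval, and JSON handling below are assumptions rather than the method actually added to mgr_workflows.py.

```python
# Illustrative only: confirm a new active mgr is elected after `ceph mgr fail`.
import json
import subprocess
import time


def wait_for_active_mgr(timeout=120, interval=5):
    """Poll `ceph mgr stat` until an active mgr daemon is reported."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.run(
            "ceph mgr stat -f json", shell=True, check=True,
            capture_output=True, text=True,
        ).stdout
        active = json.loads(out).get("active_name", "")
        if active:
            return active
        time.sleep(interval)
    raise TimeoutError("no active mgr elected within the timeout")
```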

# Printing the hosts in cluster
cmd_host_ls = "ceph orch host ls"
out = rados_obj.run_ceph_command(cmd=cmd_host_ls)
log.info(f"The hosts in the cluster before starting the test are - {out}")
Contributor

This can be at debug level.

Contributor Author

Changed to debug level.

    bug_exists = True
elif int(major) == 18 and int(minor) == 2 and int(patch) == 1 and int(build) <= 234:
    bug_exists = True

Contributor

Print whether the bug exists in the cluster and whether the repro is possible, as a debug log.

Contributor Author

Added the debug statement.

drain_host = None
for node in out:
    if "_no_schedule" in node["labels"]:
        drain_host = node["hostname"]
Contributor

Break once the host with the label is found.

Contributor Author

I added the break statement.
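For reference, the loop after the suggested change could look like this (structure taken from the hunk above; only the break is new):

```python
drain_host = None
for node in out:
    if "_no_schedule" in node["labels"]:
        drain_host = node["hostname"]
        break  # stop scanning once the host carrying the _no_schedule label is found
```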

time.sleep(300)
end_time, _ = installer.exec_command(cmd="sudo date '+%Y-%m-%d %H:%M:%S'")
log.info(f"The test execution ends at - {end_time}")
if not verify_mgr_traceback_log(
Contributor

Copy this block to the exception block as well.

In the exception block, the error logs should be seen; in the try block, we should not see the error logs.

Contributor Author

Added the check in the exception block.
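For context, a minimal sketch of the try/except shape being requested, with the traceback check run in both branches under opposite expectations; `verify_mgr_traceback_log` is stubbed here and its arguments are assumptions, not the helper's real signature.

```python
# Sketch only: on a fixed build (try path) the mgr traceback must NOT appear;
# on a buggy build (except path) it SHOULD appear.
import logging

log = logging.getLogger(__name__)


def verify_mgr_traceback_log(start_time, end_time):
    """Stub for the test helper that greps the mgr log between two timestamps."""
    return False


def run_scenario(start_time, end_time, bug_expected):
    try:
        if bug_expected:
            raise RuntimeError("host drain stuck; reproducing the bug")
        # Fixed build: the error logs should NOT be seen in the try block.
        if verify_mgr_traceback_log(start_time, end_time):
            log.error("mgr traceback found on a build where the bug is fixed")
            return 1
    except Exception as err:
        # Buggy build: the error logs SHOULD be seen in the exception block.
        log.info("hit expected failure: %s", err)
        if not verify_mgr_traceback_log(start_time, end_time):
            log.error("expected mgr traceback was not found in the logs")
            return 1
    return 0
```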

)
return 1
try:
    service_obj.add_new_hosts(add_nodes=[drain_host], deploy_osd=False)
Contributor

Add a comment explaining why the deploy_osd flag is false.

Contributor Author

Added the log message.

@pdhiran (Contributor) left a comment

Looks good with minor changes.


SrinivasaBharath removed the DNM (Do Not Merge) label on Oct 10, 2024
SrinivasaBharath force-pushed the wip_rados_auto branch 3 times, most recently from 296f5f1 to 3839bd4, on October 14, 2024 07:34
SrinivasaBharath (Contributor, Author) commented Oct 14, 2024

… the message mgr operator() Failed to run module in active mode

Signed-off-by: Srinivasa Bharath Kanta <skanta@redhat.com>