
Automation of BZ#2305677 - Ceph mgr crashed after a mgr failover with the message mgr operator() Failed to run module in active mode #4077

Open
Wants to merge 1 commit into base: master from wip_rados_auto

Conversation

SrinivasaBharath (Contributor)

Automation of the customer BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2305677
Jira task: https://issues.redhat.com/browse/RHCEPHQE-15808

Description

Please follow the automation development guidelines. Source of test case: New Feature / Regression Test / Close loop of customer BZs.

Checklist
  • Create a test case in Polarion and get it reviewed and approved.
  • Create a design/automation approach doc (optional for tests with similar tests already automated).
  • Review the automation design.
  • Implement the test script and perform test runs.
  • Submit the PR for code review and approval.
  • Update the Polarion test with automation script details and update the automation fields.
  • If the automation is part of Close loop, set the BZ flag qe-test_coverage to “+” and link the Polarion test.

pdhiran (Contributor) commented Sep 10, 2024

Comments provided for the test in Meet.

Please add the test as part of the brownfield suite in 8.x.
Add flags to optionally recreate the hotfix bug scenario as needed.

pdhiran requested review from harshkumarRH, pdhiran and tintumathew10 and removed the request for harshkumarRH and pdhiran on September 19, 2024 05:59
@pdhiran (Contributor) left a comment

Changes needed


openshift-ci bot commented Sep 25, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: SrinivasaBharath

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

SrinivasaBharath (Contributor, Author) commented Sep 25, 2024

> Changes needed

The new logic is implemented and added into a separate file.

osd_list = []

for node in ceph_nodes:
    cmd_host_chk = f"ceph orch host ls --host_pattern {node.hostname}"
Contributor

This check is not required here; all it does is check whether the host exists. After test execution, we should make sure we leave the cluster in the same state it was in before the tests started.

This check can be removed.

```
HOST                                     ADDR         LABELS                                                        STATUS
ceph-pdhiran-3az-11h4o2-node1-installer  10.0.59.134  _admin,alertmanager,grafana,mgr,prometheus,osd,installer,mon
1 hosts in cluster whose hostname matched ceph-pdhiran-3az-11h4o2-node1-installer
```

if node.role == "osd":
    node_osds = rados_obj.collect_osd_daemon_ids(node)
    osd_list = osd_list + node_osds
osd_weight_chk = check_set_reweight(rados_obj, osd_list)
Contributor

With the removal failure in the earlier test, it is expected that the weights are 0.

Our test should verify the removal after the failure, plus the upgrade to the fixed builds.

After the upgrade, we should try to remove the same host that we were trying to remove earlier, whatever state it is in.

This should not be a reason for test failure; this is what we are verifying here.

Contributor Author

I removed the code.

@@ -0,0 +1,206 @@
"""
@pdhiran (Contributor) commented Oct 1, 2024

Workflow:

1. Deploy a cluster on a build where the bug exists.
   1.1. Write data into the cluster and create a few pools; this ensures the drain takes some time to complete.
2. Select one host for drain (host-1) and perform an mgr failover. We should see the issue now.
   2.1. Check the logs and check for crashes here (these are the OSDs that were being drained).
   2.2. Make sure that the ceph orch commands are stuck.
   2.3. In the finally block, apply the workaround.
   2.4. Try to reproduce the issue; at this point the issue should be reproduced.
        -> At this point, make sure the ceph orch commands work again, so that we can proceed to the next steps without issues.
3. Proceed to upgrade the cluster.
4. Post upgrade, start the removal of the same host (host-1) that failed earlier.
5. With the upgrade, the host removal should work as expected, since the bug is fixed.
6. Add back the host (host-1) to complete the lifecycle of removal and addition of the host.

A rough sketch of the reproduction part of this flow is shown below.
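For illustration only, here is a minimal sketch of that reproduction sequence using plain `ceph` CLI calls instead of the cephci helper objects used by the actual test; the host name, the 60-second probe timeout, and the timeout-based "stuck" detection are assumptions, not the implementation in this PR.

```python
# Minimal sketch only: drain a host, fail the mgr, then probe whether
# `ceph orch` commands hang (the symptom described in the BZ).
import subprocess


def sh(cmd, timeout=None):
    """Run a shell command and return stdout; raises on non-zero exit."""
    return subprocess.run(
        cmd, shell=True, check=True, capture_output=True, text=True, timeout=timeout
    ).stdout


def reproduce_bug(drain_host="host-1"):
    """Drain a host, fail the mgr, and report whether `ceph orch` got stuck."""
    sh(f"ceph orch host drain {drain_host}")   # step 2: start the drain
    sh("ceph mgr fail")                        # step 2: mgr failover

    try:
        sh("ceph orch ls", timeout=60)         # step 2.2: probe the orchestrator
        return False                           # orch responsive, bug not hit
    except subprocess.TimeoutExpired:
        return True                            # orch stuck, bug reproduced


if __name__ == "__main__":
    print("bug reproduced:", reproduce_bug())
```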

Contributor Author

Code is added with the steps mentioned above.

    bug_exists = True
elif int(major) == 18 and int(minor) == 2 and int(patch) < 1:
    bug_exists = True
elif int(major) == 18 and int(minor) == 2 and int(patch) == 1 and int(build) <= 194:
Contributor

Please change the comparison to the build number just before the fixed build.

Contributor Author

I changed the comparison according to the fixed build.
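For reference, a hedged sketch of the kind of version gate being discussed; the build cut-offs in the PR changed between revisions (194, later 234), and the fixed-build value of 235 used below is an assumption for illustration, not a value taken from the PR.

```python
# Illustrative only: decide whether a "major.minor.patch-build" Ceph version
# string predates the fix for the mgr crash.
def bug_exists_in(version, fixed_build=235):
    """Return True when the given version predates the (assumed) fixed build."""
    release, _, build = version.partition("-")
    major, minor, patch = (int(x) for x in release.split("."))
    if (major, minor, patch) < (18, 2, 1):
        return True                         # any release older than 18.2.1
    if (major, minor, patch) == (18, 2, 1):
        return int(build or 0) < fixed_build   # builds before the fixed build
    return False


if __name__ == "__main__":
    print(bug_exists_in("18.2.1-234"))  # True: last affected build per the diff
    print(bug_exists_in("18.2.1-235"))  # False: assumed fixed build
```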

SrinivasaBharath added the DNM (Do Not Merge) label on Oct 10, 2024
SrinivasaBharath (Contributor, Author)

The code was reviewed by Pawan and Harsh, who provided the following comments. The code was modified according to the review comments:

1. Log the start and end time of the tests.
   - Included in the code.
2. There should be a single finally block; move the workaround code from the finally block to the exception block.
   - Moved the code from the finally block to the exception block.
3. After every mgr fail, check the active mgr (a sketch of such a check is shown after this list).
   - Included the logic and added a method in mgr_workflows.py.
4. Check the config-key before and after the workaround.
   - Code implemented.
5. Remove the "--rm-crush-entry" check from the code.
   - Removed from the test case and the "remove_custom_host" method.
6. Check the OSD count before draining the host and after adding the host.
   - Logic is included.
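For item 3, a minimal hedged sketch of what an active-mgr check could look like; `ceph mgr stat` is a standard command, but the helper name, polling interval, and JSON handling below are assumptions rather than the method actually added to mgr_workflows.py.

```python
# Illustrative only: confirm a new active mgr is elected after `ceph mgr fail`.
import json
import subprocess
import time


def wait_for_active_mgr(timeout=120, interval=5):
    """Poll `ceph mgr stat` until an active mgr daemon is reported."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.run(
            "ceph mgr stat -f json", shell=True, check=True,
            capture_output=True, text=True,
        ).stdout
        active = json.loads(out).get("active_name", "")
        if active:
            return active
        time.sleep(interval)
    raise TimeoutError("no active mgr elected within the timeout")
```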

# Printing the hosts in cluster
cmd_host_ls = "ceph orch host ls"
out = rados_obj.run_ceph_command(cmd=cmd_host_ls)
log.info(f"The hosts in the cluster before starting the test are - {out}")
Contributor

This can be at debug level.

Contributor Author

Changed to debug level.

    bug_exists = True
elif int(major) == 18 and int(minor) == 2 and int(patch) == 1 and int(build) <= 234:
    bug_exists = True

Contributor

Print whether the bug exists in the cluster and whether the repro is possible, as a debug log.

Contributor Author

Added the debug statement.

drain_host = None
for node in out:
    if "_no_schedule" in node["labels"]:
        drain_host = node["hostname"]
Contributor

Break once the host with the label is found.

Contributor Author

I added the break statement.
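For reference, the loop after the suggested change could look like this (structure taken from the hunk above; only the break is new):

```python
drain_host = None
for node in out:
    if "_no_schedule" in node["labels"]:
        drain_host = node["hostname"]
        break  # stop scanning once the host carrying the _no_schedule label is found
```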

time.sleep(300)
end_time, _ = installer.exec_command(cmd="sudo date '+%Y-%m-%d %H:%M:%S'")
log.info(f"The test execution ends at - {end_time}")
if not verify_mgr_traceback_log(
Contributor

Copy this block to the exception block as well.

In the exception block, the error logs should be seen; in the try block, we should not see the error logs.

Contributor Author

Added the check in the exception block.
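For context, a minimal sketch of the try/except shape being requested, with the traceback check run in both branches under opposite expectations; `verify_mgr_traceback_log` is stubbed here and its arguments are assumptions, not the helper's real signature.

```python
# Sketch only: on a fixed build (try path) the mgr traceback must NOT appear;
# on a buggy build (except path) it SHOULD appear.
import logging

log = logging.getLogger(__name__)


def verify_mgr_traceback_log(start_time, end_time):
    """Stub for the test helper that greps the mgr log between two timestamps."""
    return False


def run_scenario(start_time, end_time, bug_expected):
    try:
        if bug_expected:
            raise RuntimeError("host drain stuck; reproducing the bug")
        # Fixed build: the error logs should NOT be seen in the try block.
        if verify_mgr_traceback_log(start_time, end_time):
            log.error("mgr traceback found on a build where the bug is fixed")
            return 1
    except Exception as err:
        # Buggy build: the error logs SHOULD be seen in the exception block.
        log.info("hit expected failure: %s", err)
        if not verify_mgr_traceback_log(start_time, end_time):
            log.error("expected mgr traceback was not found in the logs")
            return 1
    return 0
```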

)
return 1
try:
    service_obj.add_new_hosts(add_nodes=[drain_host], deploy_osd=False)
Contributor

Add a comment explaining why the deploy_osd flag is false.

Contributor Author

Added the log message.

@pdhiran (Contributor) left a comment

Looks good with minor changes.


SrinivasaBharath removed the DNM (Do Not Merge) label on Oct 10, 2024
SrinivasaBharath force-pushed the wip_rados_auto branch 3 times, most recently from 296f5f1 to 3839bd4, on October 14, 2024 07:34
SrinivasaBharath (Contributor, Author) commented Oct 14, 2024

… the message mgr operator() Failed to run module in active mode

Signed-off-by: Srinivasa Bharath Kanta <skanta@redhat.com>