Automation of BZ#2305677 - Ceph mgr crashed after a mgr failover with the message "mgr operator() Failed to run module in active mode" #4077
base: master
Conversation
Comments provided for the test in Meet. Please add the test as part of the brownfield suite in 8.x.
Changes needed
Force-pushed from cc28c43 to 028e6c7.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: SrinivasaBharath. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing
Force-pushed from 08bf5bb to 386e2f6.
The new logic is implemented and added into a separate file.
```python
osd_list = []

for node in ceph_nodes:
    cmd_host_chk = f"ceph orch host ls --host_pattern {node.hostname}"
```
This check is not required here; all it does is check whether the host exists. After test execution, we should make sure that we leave the cluster in the same state it was in before the tests started. This check can be removed.
```
HOST                                     ADDR         LABELS                                                        STATUS
ceph-pdhiran-3az-11h4o2-node1-installer  10.0.59.134  _admin,alertmanager,grafana,mgr,prometheus,osd,installer,mon
1 hosts in cluster whose hostname matched ceph-pdhiran-3az-11h4o2-node1-installer
```
```python
if node.role == "osd":
    node_osds = rados_obj.collect_osd_daemon_ids(node)
    osd_list = osd_list + node_osds
osd_weight_chk = check_set_reweight(rados_obj, osd_list)
```
With the removal failure in the earlier test, it is expected that the weights be 0. Our test should verify the removal after the failure plus the upgrade to fixed builds. After the upgrade, we should try to remove the same host that we were trying to remove earlier, whatever state it is in. This should not be a reason for test failure; this is what we are verifying here.
I removed the code.
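For context on the reviewer's point that drained OSDs are expected to sit at weight 0: a check like the one removed here could read the `ceph osd tree -f json` output and list zero-weight OSDs. This is only a sketch; the dict shape and helper name are assumptions for illustration, not the cephci API.

```python
# Sketch: list OSDs whose reweight dropped to 0 (expected after a failed
# drain). The JSON shape below mimics `ceph osd tree -f json` output;
# field names are assumptions for illustration.
def zero_weight_osds(osd_tree: dict) -> list:
    """Return IDs of OSD nodes whose reweight is 0."""
    return [
        node["id"]
        for node in osd_tree.get("nodes", [])
        if node.get("type") == "osd" and float(node.get("reweight", 1)) == 0
    ]

sample_tree = {
    "nodes": [
        {"id": -1, "type": "root", "name": "default"},
        {"id": 0, "type": "osd", "name": "osd.0", "reweight": 0.0},
        {"id": 1, "type": "osd", "name": "osd.1", "reweight": 1.0},
    ]
}
print(zero_weight_osds(sample_tree))  # -> [0]
```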
```
@@ -0,0 +1,206 @@
"""
```
Workflow:
1. Deploy a cluster on a build where the bug exists.
   1.1. Write data into the cluster and create a few pools; this ensures the drain takes some time to complete.
2. Select one host for drain (host-1) and perform a mgr failover. We should see the issue now.
   2.1. Check the logs and check for crashes here (these are the OSDs that were being drained).
   2.2. Make sure that the ceph orch commands are stuck.
   2.3. In the finally block, apply the workaround.
   2.4. Try to reproduce the issue; at this point, the issue would be reproduced.
        -> At this point, make sure the ceph orch commands work now, so that we can proceed to the next steps without issues.
3. Proceed to upgrade the cluster.
4. Post upgrade, start the removal of the same host (host-1) that failed earlier.
5. With the upgrade, the host removal should work as expected, since the bug is fixed.
6. Add back the host (host-1) to complete the lifecycle of host removal and addition.
Code is added with the steps mentioned above.
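The requested workflow can be sketched as a small test driver. Every helper below is a hypothetical stand-in for the cephci utilities, recorded as plain step names so the ordering (workaround applied before the upgrade, host added back last) is visible:

```python
# Sketch of the review workflow above; all steps are stand-ins, not the
# actual cephci calls.
executed = []

def step(name):
    executed.append(name)

def run_test():
    step("deploy cluster with buggy build")
    step("write data / create pools")      # makes the drain slow enough
    try:
        step("start drain of host-1")
        step("mgr failover")               # triggers BZ#2305677
        step("check mgr logs for traceback/crash")
        step("verify ceph orch commands are stuck")
    finally:
        step("apply workaround so orch works again")
    step("upgrade cluster to fixed build")
    step("remove host-1 again")            # should now succeed
    step("add host-1 back")                # complete the host lifecycle

run_test()
print(executed)
```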
```python
    bug_exists = True
elif int(major) == 18 and int(minor) == 2 and int(patch) < 1:
    bug_exists = True
elif int(major) == 18 and int(minor) == 2 and int(patch) == 1 and int(build) <= 194:
```
Please change the comparison to the build number just before the fixed build.
I changed the comparison according to the fixed build.
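A version gate like the fragment above can also be written as one tuple comparison instead of chained elifs, which makes "just before the fixed build" explicit. The fixed build `18.2.1-194` here is an assumption for illustration only:

```python
# Sketch: tuple comparison against an assumed first fixed build.
# Builds strictly older than FIXED are treated as buggy.
FIXED = (18, 2, 1, 194)  # assumed fixed build, for illustration

def bug_exists(version: str) -> bool:
    """version like '18.2.1-178'; True for builds before the fix."""
    release, _, build = version.partition("-")
    major, minor, patch = (int(x) for x in release.split("."))
    return (major, minor, patch, int(build or 0)) < FIXED

print(bug_exists("18.2.1-178"))  # True: older than the fixed build
print(bug_exists("18.2.1-194"))  # False: the fixed build itself
```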
The code was reviewed by Pawan and Harsh, who provided the following comments; the code was modified accordingly:
1. Log the initial and end time of the tests (included in the code).
2. Move the code from the finally block to the exception block (included the logic and added a method in mgr_workflows.py; code implemented).
3. Removed from the test case and the "remove_custom_host" method.
6. Check the OSD count before draining the host and after adding the host.
```python
# Printing the hosts in cluster
cmd_host_ls = "ceph orch host ls"
out = rados_obj.run_ceph_command(cmd=cmd_host_ls)
log.info(f"The hosts in the cluster before starting the test are - {out}")
```
This can be at debug level.
Debug level changed.
```python
    bug_exists = True
elif int(major) == 18 and int(minor) == 2 and int(patch) == 1 and int(build) <= 234:
    bug_exists = True
```
Print whether the bug exists in the cluster and whether a repro is possible, as a debug log.
Added the debug statement.
```python
drain_host = None
for node in out:
    if "_no_schedule" in node["labels"]:
        drain_host = node["hostname"]
```
Break once the host with the label is found.
I added the break statement.
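With the requested change applied, the lookup stops at the first match. A self-contained sketch, with `hosts` mimicking the JSON shape of `ceph orch host ls -f json` (field names assumed):

```python
# Sketch: pick the first host carrying the _no_schedule label, then stop
# scanning, per the review comment above.
hosts = [
    {"hostname": "node1", "labels": ["_admin", "mon"]},
    {"hostname": "node2", "labels": ["osd", "_no_schedule"]},
    {"hostname": "node3", "labels": ["osd", "_no_schedule"]},
]

drain_host = None
for node in hosts:
    if "_no_schedule" in node["labels"]:
        drain_host = node["hostname"]
        break  # first match is enough; no need to scan remaining hosts

print(drain_host)  # -> node2
```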
```python
time.sleep(300)
end_time, _ = installer.exec_command(cmd="sudo date '+%Y-%m-%d %H:%M:%S'")
log.info(f"The test execution ends at - {end_time}")
if not verify_mgr_traceback_log(
```
Copy this block to the exception handler as well. In the exception handler, the error logs should be seen; in the try block, we should not see the error logs.
Added the check in the exception block.
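The pattern the reviewer asks for runs the same traceback check on both paths with inverted expectations: present in the exception handler, absent in the try block. A minimal sketch with a stand-in check (the real test uses `verify_mgr_traceback_log` on the mgr logs):

```python
# Sketch: same log check in both paths, expectation flipped.
# `logs` stands in for the collected mgr log text.
def check_logs(logs: str, workflow) -> int:
    def traceback_seen() -> bool:
        return "Failed to run module in active mode" in logs

    try:
        workflow()
        if traceback_seen():       # happy path: no traceback allowed
            return 1
    except Exception:
        if not traceback_seen():   # failure path: traceback expected
            return 1
    return 0

def failing_workflow():
    raise RuntimeError("drain stuck")

# A failing workflow passes only when the traceback is in the logs.
print(check_logs("mgr operator() Failed to run module in active mode",
                 failing_workflow))          # -> 0
print(check_logs("clean logs", lambda: None))  # -> 0
```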
```python
    )
    return 1
try:
    service_obj.add_new_hosts(add_nodes=[drain_host], deploy_osd=False)
```
Add a comment explaining why the deploy_osd flag is false.
Added the log message.
looks good with minor changes
Force-pushed from 296f5f1 to 3839bd4.
Latest Reef log: http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-887U7T/
Force-pushed from efde761 to 8127e92.
… the message mgr operator() Failed to run module in active mode
Signed-off-by: Srinivasa Bharath Kanta <skanta@redhat.com>
Force-pushed from 8dff885 to e3816d7.
Automation of the customer BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2305677
Jira task: https://issues.redhat.com/browse/RHCEPHQE-15808
Description
Source of test case: close loop of a customer BZ.