Recover from pod restarts during cluster creation during setup #499
Conversation
Wondering if this might cause recovery from complete outage to be triggered in other cases where it would be destructive, but I have no specific case in mind.
…ng and current unit is offline
)

logger.info("Deleting pod")
delete_pod(ops_test, leader_unit)
This is a method to add to the common helpers repo (if not already)
Will add in a follow-up PR.
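If it helps when lifting this into the shared helpers repo, here is a minimal sketch of what such a `delete_pod` helper could look like, assuming lightkube, pytest-operator's `ops_test` fixture, and that the unit exposes a `.name` like `mysql-k8s/0` (all assumptions; not necessarily how the helper is implemented in this test):

```python
# Hypothetical sketch only: assumes lightkube is installed and that the
# Juju unit "app/0" is backed by pod "app-0" in a namespace named after
# the model.
from lightkube import Client
from lightkube.resources.core_v1 import Pod


def delete_pod(ops_test, unit) -> None:
    """Delete the Kubernetes pod backing a Juju unit (test helper sketch)."""
    pod_name = unit.name.replace("/", "-")
    Client().delete(Pod, name=pod_name, namespace=ops_test.model.info.name)
```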
@@ -619,6 +619,18 @@ def cluster_initialized(self) -> bool:

        return False

    @property
Don't forget the libpatch bump
updated libpatch - will make sure that these changes get propagated to the libs in the vm repo and published correctly
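For context, a charm library declares its version in module-level constants near the top of the file, and `charmcraft fetch-lib` only picks up a new revision after `LIBPATCH` is incremented. A representative snippet (illustrative values, not the actual numbers in this PR):

```python
# Near the top of the shared charm lib module (illustrative values only).
LIBID = "..."   # unique ID assigned when the lib was first registered
LIBAPI = 0      # major API version; part of the import path (v0)
LIBPATCH = 43   # bump on every change so `charmcraft fetch-lib` pulls the update
```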
        return 0

        total_cluster_nodes = 0
        for unit in self.app_units:
I think this method name is misleading.
Iterating over all units can lead to instances being counted more than once if more than one unit has been added to the cluster.
I understand the usage, though: if the sum is one, that case can be identified.
Since we already have a method for getting the cluster node count, how about changing this to a boolean-returning method that tests for cluster metadata on only a single unit?
Renamed and refactored to be more precise in f0daa96 (with the property renamed in a following commit).
🤞
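To illustrate the shape of the suggested check, here is a hedged, self-contained sketch; the class name, property name, and the injected per-unit query are placeholders, not the actual identifiers introduced in f0daa96:

```python
# Illustrative sketch of a boolean "does cluster metadata exist?" check that
# queries one unit at a time instead of summing node counts across all units.
class ClusterStateSketch:
    def __init__(self, app_units, node_count_for_unit):
        self.app_units = app_units
        # Callable returning the node count reported by a single unit; a
        # stand-in for the charm lib's real per-unit metadata query.
        self._node_count_for_unit = node_count_for_unit

    @property
    def cluster_metadata_exists(self) -> bool:
        """Return True as soon as any single unit reports cluster metadata."""
        for unit in self.app_units:
            try:
                if self._node_count_for_unit(unit) > 0:
                    return True
            except Exception:
                # Unit unreachable or the query failed; try the next unit.
                continue
        return False
```

Returning on the first unit that reports metadata avoids the double counting that summing across every unit can produce once several units have joined the cluster.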
Issue
During testing of Issue 476, we encountered a case where the leader unit pod that is creating the cluster exits with code 137 before it can successfully create the cluster. The pod then gets rescheduled, but it is unable to recover from this condition, as full cluster crash recovery is only performed when all units are offline (which is not the case here, because there are 2 units in waiting to join cluster).

Solution
Attempt to reboot_cluster_from_complete_outage in the above scenario. If the reboot from complete outage fails, we should be safe to drop_metadata_schema() on the unit whose pod was rescheduled and recreate the cluster.

Added an integration test.
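In rough pseudocode, the recovery path described above looks like this (the `reboot_cluster_from_complete_outage` and `drop_metadata_schema()` names come from the description; the wrapper function, error handling, and `create_cluster()` call are assumptions, not the charm's actual code):

```python
# Sketch of the recovery flow described above, not the charm's implementation.
def recover_after_pod_reschedule(mysql) -> None:
    """Recover a unit whose pod was rescheduled mid cluster creation."""
    try:
        # First attempt the normal full-outage recovery path.
        mysql.reboot_cluster_from_complete_outage()
    except Exception:
        # Per the description, if the reboot fails in this scenario it is
        # safe to drop the metadata schema on the rescheduled unit and
        # recreate the cluster.
        mysql.drop_metadata_schema()
        mysql.create_cluster()
```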
TODO
Update the shared mysql charm lib code in the vm repo