
Recover from pod restarts during cluster creation during setup #499

Merged: 10 commits merged into main on Sep 16, 2024

Conversation

shayancanonical (Contributor)

Issue

During testing of Issue 476, we encountered a case where the leader unit pod that is creating the cluster exits with code 137 before it can successfully create the cluster. The pod then gets rescheduled, but is unable to recover from this state, since full cluster crash recovery is only performed when all units are offline (which is not the case here, because two units are still waiting to join the cluster).

Solution

Attempt to reboot_cluster_from_complete_outage in the above scenario. If the reboot from complete outage fails, it should be safe to drop_metadata_schema() on the unit whose pod was rescheduled and recreate the cluster (a sketch of this flow follows below).
Added an integration test.
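A minimal sketch of the recovery path described above, as it might sit on the charm class. Only reboot_cluster_from_complete_outage and drop_metadata_schema are named in this PR; the exception class, create_cluster call, and the self._mysql attribute are assumptions for illustration, not the charm's confirmed API.

def _recover_cluster_after_pod_reschedule(self) -> None:
    """Attempt recovery after the pod creating the cluster was rescheduled."""
    try:
        # Least destructive option first: reboot the cluster from a complete
        # outage, in case the cluster metadata is still intact.
        self._mysql.reboot_cluster_from_complete_outage()
    except MySQLRebootFromCompleteOutageError:  # assumed exception name
        # The reboot failed, meaning the cluster was never fully created.
        # Drop the stale metadata schema on this unit and recreate the cluster.
        self._mysql.drop_metadata_schema()
        self._mysql.create_cluster(self.unit_label)  # assumed helper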

TODO

Update the shared mysql charm lib code in the vm repo

@carlcsaposs-canonical left a comment (Contributor)

Wondering if this might cause recovery from complete outage to be triggered in other cases where it would be destructive, but I have no specific case in mind.

)

logger.info("Deleting pod")
delete_pod(ops_test, leader_unit)
Contributor

This is a method to add to the common helpers repo (if not already)

Contributor Author

Will add in a follow-up PR.
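For reference, a hypothetical sketch of what a shared delete_pod helper could look like, assuming lightkube, that the Juju model name is the Kubernetes namespace, and the usual Juju pod-naming convention; none of these details are confirmed by this PR.

import lightkube
from lightkube.resources.core_v1 import Pod
from juju.unit import Unit
from pytest_operator.plugin import OpsTest


def delete_pod(ops_test: OpsTest, unit: Unit) -> None:
    """Delete the Kubernetes pod backing a Juju unit so it gets rescheduled."""
    client = lightkube.Client()
    # Juju on Kubernetes names the pod after the unit, with "/" replaced by "-",
    # e.g. mysql-k8s/0 -> mysql-k8s-0 (assumption for this sketch).
    pod_name = unit.name.replace("/", "-")
    client.delete(Pod, name=pod_name, namespace=ops_test.model_name)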

@@ -619,6 +619,18 @@ def cluster_initialized(self) -> bool:

return False

@property
Contributor

Don't forget the libpatch bump

Contributor Author

Updated libpatch; will make sure these changes get propagated to the libs in the vm repo and published correctly.

return 0

total_cluster_nodes = 0
for unit in self.app_units:
Contributor

I think this method name is misleading. Iterating over all units can lead to instances being counted more than once if more than one unit has been added to the cluster.

I understand the usage, though: if the sum is one, it is possible to detect this case.

Since we already have a method for getting the cluster node count, how about changing this to a boolean-returning method that tests for cluster metadata on only a single unit?

Contributor Author

Renamed and refactored to be more precise in f0daa96 (with the property renamed in a following commit).
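For context, an illustrative sketch of the kind of boolean-returning check suggested above. The actual refactor lives in commit f0daa96; the property name, get_cluster_node_count signature, and get_unit_address helper below are assumptions, not the charm's real API.

@property
def only_one_cluster_node_active(self) -> bool:
    """Whether exactly one node across all units reports cluster metadata.

    Detects the case where the leader's pod was rescheduled while the
    remaining units are still waiting to join the cluster.
    """
    total_cluster_nodes = 0
    for unit in self.app_units:
        # Sum the node count reported by each unit; a unit that never joined
        # the cluster reports zero.
        total_cluster_nodes += self._mysql.get_cluster_node_count(
            from_instance=self.get_unit_address(unit)
        )
    return total_cluster_nodes == 1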

@taurus-forever left a comment (Contributor)

🤞

shayancanonical merged commit 2009919 into main on Sep 16, 2024 (93 checks passed).
shayancanonical deleted the feature/crash_during_cluster_setup branch on September 16, 2024 at 18:14.