
Recover from pod restarts during cluster creation during setup #499

Merged: 10 commits merged into main on Sep 16, 2024

Conversation

shayancanonical (Contributor)

Issue

During testing of Issue 476, we encountered a case where the leader unit pod that is creating the cluster exits with code 137 before it can successfully create the cluster. The pod then gets rescheduled, but is unable to recover from this state, since full cluster crash recovery is only performed when all units are offline (which is not the case here, because two units are still waiting to join the cluster).

Solution

Attempt to reboot_cluster_from_complete_outage in the above scenario. If the reboot from complete outage fails, it should be safe to drop_metadata_schema() on the unit whose pod was rescheduled and recreate the cluster (a sketch of this flow follows below).
Added an integration test.
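A minimal sketch of the recovery path described above, as it might sit on the charm class. Only reboot_cluster_from_complete_outage and drop_metadata_schema are named in this PR; the exception class, create_cluster call, and the self._mysql attribute are assumptions for illustration, not the charm's confirmed API.

def _recover_cluster_after_pod_reschedule(self) -> None:
    """Attempt recovery after the pod creating the cluster was rescheduled."""
    try:
        # Least destructive option first: reboot the cluster from a complete
        # outage, in case the cluster metadata is still intact.
        self._mysql.reboot_cluster_from_complete_outage()
    except MySQLRebootFromCompleteOutageError:  # assumed exception name
        # The reboot failed, meaning the cluster was never fully created.
        # Drop the stale metadata schema on this unit and recreate the cluster.
        self._mysql.drop_metadata_schema()
        self._mysql.create_cluster(self.unit_label)  # assumed helper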

TODO

Update the shared mysql charm lib code in the vm repo

@carlcsaposs-canonical left a comment (Contributor)

Wondering if this might cause recovery from complete outage to be triggered in other cases where it would be destructive, but I have no specific case in mind.

)

logger.info("Deleting pod")
delete_pod(ops_test, leader_unit)
Contributor

This is a method to add to the common helpers repo (if not already)

Contributor Author

Will add in a follow-up PR.
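For reference, a hypothetical sketch of what a shared delete_pod helper could look like, assuming lightkube, that the Juju model name is the Kubernetes namespace, and the usual Juju pod-naming convention; none of these details are confirmed by this PR.

import lightkube
from lightkube.resources.core_v1 import Pod
from juju.unit import Unit
from pytest_operator.plugin import OpsTest


def delete_pod(ops_test: OpsTest, unit: Unit) -> None:
    """Delete the Kubernetes pod backing a Juju unit so it gets rescheduled."""
    client = lightkube.Client()
    # Juju on Kubernetes names the pod after the unit, with "/" replaced by "-",
    # e.g. mysql-k8s/0 -> mysql-k8s-0 (assumption for this sketch).
    pod_name = unit.name.replace("/", "-")
    client.delete(Pod, name=pod_name, namespace=ops_test.model_name)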

@@ -619,6 +619,18 @@ def cluster_initialized(self) -> bool:

return False

@property
Contributor

Don't forget the libpatch bump

Contributor Author

Updated libpatch; will make sure these changes get propagated to the libs in the vm repo and published correctly.

return 0

total_cluster_nodes = 0
for unit in self.app_units:
Contributor

I think this method name is misleading. Iterating over all units can lead to instances being counted more than once if more than one unit has been added to the cluster.

I understand the usage, though: if the sum is one, it is possible to detect this case.

Since we already have a method for getting the cluster node count, how about changing this to a boolean-returning method that tests for cluster metadata on only a single unit?

Contributor Author

Renamed and refactored to be more precise in f0daa96 (with the property renamed in a following commit).
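For context, an illustrative sketch of the kind of boolean-returning check suggested above. The actual refactor lives in commit f0daa96; the property name, get_cluster_node_count signature, and get_unit_address helper below are assumptions, not the charm's real API.

@property
def only_one_cluster_node_active(self) -> bool:
    """Whether exactly one node across all units reports cluster metadata.

    Detects the case where the leader's pod was rescheduled while the
    remaining units are still waiting to join the cluster.
    """
    total_cluster_nodes = 0
    for unit in self.app_units:
        # Sum the node count reported by each unit; a unit that never joined
        # the cluster reports zero.
        total_cluster_nodes += self._mysql.get_cluster_node_count(
            from_instance=self.get_unit_address(unit)
        )
    return total_cluster_nodes == 1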

@taurus-forever left a comment (Contributor)

🤞

shayancanonical merged commit 2009919 into main on Sep 16, 2024 (93 checks passed).
shayancanonical deleted the feature/crash_during_cluster_setup branch on September 16, 2024 at 18:14.