The MySQL cluster is not recovered automatically under certain scenarios. #485
Comments
Thanks for the detailed report @peterctl, it will be useful to create a test.
Facing this bug too.
for documentation: if the pod is killed while …
@peterctl I have tried, unsuccessfully, to replicate the issue locally (with rev 153 of mysql-k8s). I deployed a cluster of 3 nodes and tried at least 15 times to delete all pods in the cluster without facing the issue. Also, if you run into the above issue again, please collect the logs if you are able (in particular the debug logs, pebble logs, and mysqld logs in …).
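For reference, a minimal sketch of commands that could collect those logs, assuming a default deployment where the application is named `mysql-k8s` and the workload container is `mysql` (the mysqld error log path is also an assumption):

```bash
# Sketch only; unit names, container name, namespace, and log paths are assumptions.
juju debug-log --replay --include mysql-k8s/0 > mysql-k8s-0-debug.log                         # charm debug logs
kubectl logs -n <model-name> mysql-k8s-0 -c mysql > mysql-k8s-0-container.log                 # container/Pebble stdout
juju ssh --container mysql mysql-k8s/0 cat /var/log/mysql/error.log > mysql-k8s-0-mysqld.log  # mysqld logs
```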
@shayancanonical as I mentioned, we're also facing this bug; it's in an ongoing deployment and the workaround doesn't help. I would like to raise a field-critical. In my case the charms are restarted/failed after the initial redeployment within the next day or two.
@natalytvinova can you please share your Juju version? Thank you!
Just got an internal reply. It should be Juju 3.5.1.
Since there are two issues being discussed in this issue, please discuss the OOM kill issue in #485. Update: we saw the original issue in this thread (the …
Killing multiple MySQL pods at the same time without waiting for them to fully come online will put the cluster in an unhealthy state, but it will not trigger the reboot cluster from complete outage flow to recover the cluster.
This will also happen during a networking outage, which in this case was simulated by taking down all NICs on all the microk8s machines in a single AZ.
Steps to reproduce
Scenario 1 (pod outage):
a. Kill multiple mysql pods at the same time (see the sketch after these steps).

Scenario 2 (network outage):
a. Take down all NICs of the microk8s machines.
b. Wait 15 minutes for the network outage to affect mysql, then reboot all the machines in the AZ to bring the network back online.
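For scenario 1, a minimal sketch of the pod deletion step, assuming a default 3-unit deployment whose pods are named mysql-k8s-0/1/2 in the model's namespace:

```bash
# Sketch only; pod names and namespace are assumptions for a default 3-unit deployment.
# Delete all MySQL pods at once, without waiting for them to come back online.
kubectl delete pod -n <model-name> mysql-k8s-0 mysql-k8s-1 mysql-k8s-2 --wait=false
```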
Expected behavior
The cluster will go offline and the reboot cluster from complete outage flow will be triggered to recover the cluster.
Actual behavior
The leader unit will not go offline and the reboot cluster flow will not be triggered, leaving the cluster in an inconsistent state.
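One way to observe the inconsistent state is to query the cluster status from the charm; the action name below is an assumption based on the charm's published actions and may differ:

```bash
# Sketch only; assumes the charm exposes a get-cluster-status action (Juju 3.x action syntax).
juju run mysql-k8s/leader get-cluster-status
```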
Versions
Operating system: Ubuntu 22.04 Jammy Jellyfish
Juju CLI: 3.4.3
Juju agent: 3.4.3
Charm revision: 153 (channel 8.0/stable)
microk8s: 1.28
Additional context
To recover the cluster, the `mysqld_safe` Pebble service needs to be restarted inside the leader unit:
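A minimal sketch of that restart, assuming the workload container is named `mysql` and the usual Pebble socket path inside the workload container; exact paths and environment may vary by Juju version:

```bash
# Sketch only; container name and Pebble socket path are assumptions.
juju ssh --container mysql mysql-k8s/leader      # open a shell in the workload container
export PEBBLE_SOCKET=/charm/container/pebble.socket
/charm/bin/pebble restart mysqld_safe
```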