Data Loss and Cluster Failure in Kubernetes StatefulSet Due to Missing Disk for MySQL Replica-0

A node replacement in my Kubernetes cluster failed unexpectedly and caused a critical problem with the MySQL StatefulSet. The disk for the MySQL replica with index 0 was lost, so that replica could not start. The other two replicas had up-to-date data, but they could not start either: the StatefulSet's startup hung on the first replica, the one that had lost its data. (With the default OrderedReady pod management policy, the controller creates pods in ordinal order and waits for each one to be Running and Ready before starting the next.)

To address such issues, I propose leveraging the Kubernetes v1.24 alpha feature .spec.updateStrategy.rollingUpdate.maxUnavailable. It could be set equal to the number of replicas in the StatefulSet, for example maxUnavailable = 3 for three replicas. That way, the remaining replicas with valid data might be able to start successfully.
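A minimal sketch of the proposed setting, assuming a three-replica MySQL StatefulSet. The object name, labels, image tag, and storage size are hypothetical placeholders, and the MySQL configuration itself (credentials, replication) is omitted. Because the field is alpha in v1.24, it is honored only when the MaxUnavailableStatefulSet feature gate is enabled on the control plane.

```yaml
# Sketch only: names, image, and sizes below are hypothetical placeholders.
# maxUnavailable is alpha in v1.24 and requires the
# MaxUnavailableStatefulSet feature gate to be enabled.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql                # hypothetical name
spec:
  serviceName: mysql         # hypothetical headless Service
  replicas: 3
  selector:
    matchLabels:
      app: mysql
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 3      # equal to the replica count, as proposed
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
        - name: mysql
          image: mysql:8.0   # hypothetical image tag
          volumeMounts:
            - name: data
              mountPath: /var/lib/mysql
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi    # hypothetical size
```

The documented scope of maxUnavailable is rolling updates, so whether it also unblocks ordered startup after a lost disk is part of what this proposal asks to be considered.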

In this situation I found no apparent way to use the data from the other replicas to recover from the failure. I had to restore from a backup, which meant additional downtime and administrative effort.

I believe adopting the suggested feature could significantly improve the reliability and fault tolerance of StatefulSets in similar scenarios and help prevent data loss and cluster failures.

Feature State: Kubernetes v1.24 [alpha]

Thank you for considering this proposal. 
Best regards

Data Loss and Cluster Failure in Kubernetes StatefulSet Due to Missing Disk for MySQL Replica-0 #898
