Data Loss and Cluster Failure in Kubernetes StatefulSet Due to Missing Disk for MySQL Replica-0 #898

Open
@tebaly

Description

I encountered an unexpected failure while replacing a node in my Kubernetes cluster, which caused a critical problem with the MySQL StatefulSet. The disk for the MySQL replica with index 0 was lost, so that replica could not start. The other two replicas had up-to-date data, but they could not start either, because the StatefulSet was stuck waiting for replica 0, the one that had lost its data.

To address such issues, I propose using the new Kubernetes v1.24 feature .spec.updateStrategy.rollingUpdate.maxUnavailable. It could be set equal to the number of replicas in the StatefulSet, for example maxUnavailable = 3 for three replicas. This way, the remaining replicas with valid data might be able to start successfully.
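
A minimal sketch of the setting I have in mind, not a tested configuration. The names (mysql, data), the image tag, and the storage size are hypothetical placeholders; only the updateStrategy block is the point. As far as I understand, the field is alpha in v1.24 and is only honored when the MaxUnavailableStatefulSet feature gate is enabled, and the documentation describes it as applying during rolling updates, so whether it also helps in the startup scenario above would need to be verified.

```yaml
# Sketch only: names, image, and sizes below are hypothetical placeholders.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: mysql
  replicas: 3
  selector:
    matchLabels:
      app: mysql
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      # Proposed value: equal to the number of replicas.
      # Alpha in v1.24; assumes the MaxUnavailableStatefulSet
      # feature gate is enabled on the control plane.
      maxUnavailable: 3
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
        - name: mysql
          image: mysql:8.0
          volumeMounts:
            - name: data
              mountPath: /var/lib/mysql
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

On a cluster without the feature gate, I would expect the maxUnavailable field to be dropped and the default of one pod at a time to apply.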

This left me with no apparent way to use the data on the other replicas to recover from the failure. I had to restore from a backup, which meant additional downtime and administrative effort.

I believe adopting the suggested feature could significantly enhance the reliability and fault-tolerance of StatefulSets in similar scenarios, preventing potential data loss and cluster failures.

Feature State: Kubernetes v1.24 [alpha]

Thank you for considering this proposal.
Best regards
