My cluster ended up in a state where one Cassandra pod wouldn't start because the data volume was corrupted (or something like that). In this case, the management API tries over and over again to restart Cassandra and the pod gets stuck with the "Starting" cass-operator label.
One "logical" thing to do then is to use the `replaceNodes` setting of cass-operator to replace the faulty pod with a new one (including a new PV), and bootstrap it safely by replacing the previous instance of that node. Sadly, cass-operator prevents that from happening and the node never gets replaced.
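For reference, such a replacement attempt would look roughly like this, assuming a CassandraDatacenter named `dc1` in namespace `cass-operator` and a stuck pod named `cluster1-dc1-default-sts-2` (all names here are placeholders for illustration):

```sh
# Ask cass-operator to replace the stuck pod with a freshly bootstrapped node.
# spec.replaceNodes is a list of pod names the operator should replace.
kubectl patch cassandradatacenter dc1 -n cass-operator --type merge \
  -p '{"spec":{"replaceNodes":["cluster1-dc1-default-sts-2"]}}'
```

Today this has no effect for a pod that never left "Starting": the replacement is simply never carried out.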
The manual fix wasn't really easy, and involved the following (a rough shell sketch follows the list):
- removing the node from the cluster through `nodetool removenode`
- deleting the PV and PVC
- then deleting the pod
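For completeness, the manual workaround amounted to something like this (namespace, pod, PVC, and host ID below are placeholders, and the exact commands may vary per cluster):

```sh
# 1. Drop the dead node from the ring, using its Host ID as reported by
#    `nodetool status` run from a healthy pod.
kubectl exec -n cass-operator cluster1-dc1-default-sts-0 -c cassandra -- \
  nodetool removenode <host-id-of-stuck-node>

# 2. Delete the PVC backing the corrupted data volume (and its PV, if the
#    reclaim policy does not remove it automatically).
kubectl delete pvc -n cass-operator server-data-cluster1-dc1-default-sts-2

# 3. Delete the stuck pod so the StatefulSet recreates it with a fresh volume.
kubectl delete pod -n cass-operator cluster1-dc1-default-sts-2
```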
The additional streaming session and token movements triggered by the node removal, as well as the follow-up cleanup operation, could be avoided entirely if cass-operator allowed such replacements.
┆Issue is synchronized with this Jira Task by Unito
┆friendlyId: K8SSAND-1475
┆priority: Medium