K8SSAND-1698 ⁃ cass-operator can stop several nodes at the same time during a rolling restart #382
Comments
Are you sure the pods are actually getting restarted correctly? The logs indicate the event: And this isn't a very fast operation; that kill reason requires the -sts-1 pod's cassandra container to have been terminated for 10 minutes. What is preventing the pod from restarting once the cassandra container has died? One of the containers is still alive after the cassandra container was killed. Was it medusa or busybox (jmx-credentials)?
What do you mean by that? Everything starts with a rolling restart where
Could be medusa indeed, it is deployed on this cluster.
That's not what the logs you pasted said. They do not say anything about restarting -sts-1; it is not the rolling restart process that caused -sts-1 to be restarted in this case. It is triggering this code for -sts-1:
And that means Kubernetes has reported the -sts-1 has had
What happened?
After requesting a rolling restart on a datacenter with 3 Cassandra nodes, cass-operator restarts the -sts-2 pod and sometimes, a few seconds later, -sts-1 gets terminated by cass-operator, making two replicas unavailable in the rack and lowering availability.
Did you expect to see something different?
cass-operator should delay pod restarts so that it does not react too eagerly, and should take other down nodes into account when deciding what can be done safely.
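The kind of guard requested here could look roughly like the sketch below. It is an illustration only, not cass-operator code: `PodState` and `canTerminate` are hypothetical names, and readiness is reduced to a single boolean per pod.

```go
package main

import "fmt"

// PodState is a hypothetical, simplified view of one Cassandra pod in a rack.
type PodState struct {
	Name  string
	Ready bool
}

// canTerminate reports whether target may be stopped without taking a
// second pod down: it refuses when any other pod in the rack is not ready.
func canTerminate(target string, rack []PodState) bool {
	for _, p := range rack {
		if p.Name != target && !p.Ready {
			return false
		}
	}
	return true
}

func main() {
	rack := []PodState{
		{Name: "sts-0", Ready: true},
		{Name: "sts-1", Ready: true},
		{Name: "sts-2", Ready: false}, // currently being restarted
	}

	// sts-2 is still down, so stopping sts-1 now would leave two pods unavailable.
	fmt.Println(canTerminate("sts-1", rack)) // prints "false"
}
```

With a check like this in front of every pod termination, -sts-1 would be left alone until -sts-2 reports ready again.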
How to reproduce it (as minimally and precisely as possible):
Request a rolling restart on a cluster. This doesn't happen every time, though.
Environment
* Cass Operator version: v1.12.0
* Kubernetes version information: kubectl version
* Kubernetes cluster kind: GKE
Manifests:
Anything else we need to know?:
┆Issue is synchronized with this Jira Task by Unito
┆friendlyId: K8SSAND-1698
┆priority: Medium