K8SSAND-1698 ⁃ cass-operator can stop several nodes at the same time during a rolling restart #382

Closed
adejanovski opened this issue Jul 26, 2022 · 3 comments · Fixed by #647 or #654
Labels: bug (Something isn't working), done (Issues in the state 'done')

Comments

@adejanovski
Contributor

adejanovski commented Jul 26, 2022

What happened?
After requesting a rolling restart on a datacenter with 3 Cassandra nodes, cass-operator restarts the -sts-2 pod and, sometimes, terminates -sts-1 a few seconds later. This makes two replicas unavailable in the rack at the same time and lowers availability.

Did you expect to see something different?
cass-operator should delay pod restarts so that it is less sensitive to transient pod states, and it should take other down nodes into account when deciding what can safely be done.

How to reproduce it (as minimally and precisely as possible):
Request a rolling restart on a cluster. This doesn't happen every time though.

Environment

  • Cass Operator version: v1.12.0
  • Kubernetes version information: kubectl version
  • Kubernetes cluster kind: GKE
  • Manifests:

apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  annotations:
    k8ssandra.io/resource-hash: A8dzZIjuvAAoVvW4lKyLYxNPuFWlDc4xdbx8F4+0IB4=
  creationTimestamp: '2022-02-17T14:09:22Z'
  finalizers:
    - finalizer.cassandra.datastax.com
  generation: 101
  labels:
    app.kubernetes.io/component: cassandra
    app.kubernetes.io/created-by: k8ssandracluster-controller
    app.kubernetes.io/name: k8ssandra-operator
    app.kubernetes.io/part-of: k8ssandra
    k8ssandra.io/cluster-name: dogfood
    k8ssandra.io/cluster-namespace: k8ssandra-operator
  name: dc2
  namespace: k8ssandra-operator
  resourceVersion: '416324174'
  uid: b2cd8c7c-fbcf-45c6-be36-1943cc3f1f91
  selfLink: >-
    /apis/cassandra.datastax.com/v1beta1/namespaces/k8ssandra-operator/cassandradatacenters/dc2
status:
  cassandraOperatorProgress: Ready
  conditions:
    - lastTransitionTime: '2022-05-17T15:37:15Z'
      message: ''
      reason: ''
      status: 'False'
      type: Stopped
    - lastTransitionTime: '2022-02-17T14:26:55Z'
      message: ''
      reason: ''
      status: 'False'
      type: ReplacingNodes
    - lastTransitionTime: '2022-07-22T07:29:39Z'
      message: ''
      reason: ''
      status: 'False'
      type: Updating
    - lastTransitionTime: '2022-07-26T08:34:53Z'
      message: ''
      reason: ''
      status: 'False'
      type: RollingRestart
    - lastTransitionTime: '2022-05-17T15:44:07Z'
      message: ''
      reason: ''
      status: 'False'
      type: Resuming
    - lastTransitionTime: '2022-02-17T14:26:55Z'
      message: ''
      reason: ''
      status: 'False'
      type: ScalingDown
    - lastTransitionTime: '2022-02-17T14:26:55Z'
      message: ''
      reason: ''
      status: 'True'
      type: Valid
    - lastTransitionTime: '2022-02-17T14:26:56Z'
      message: ''
      reason: ''
      status: 'True'
      type: Initialized
    - lastTransitionTime: '2022-05-17T15:44:08Z'
      message: ''
      reason: ''
      status: 'True'
      type: Ready
    - lastTransitionTime: '2022-07-21T12:00:19Z'
      message: ''
      reason: ''
      status: 'True'
      type: Healthy
  lastRollingRestart: '2022-07-26T08:29:59Z'
  lastServerNodeStarted: '2022-07-26T08:34:13Z'
  nodeStatuses:
    dogfood-dc2-default-sts-0:
      hostID: 6adc7220-4067-4bb4-9612-71c0fc0b52c8
    dogfood-dc2-default-sts-1:
      hostID: 7ea10675-aead-44a1-990f-281b17e24e13
    dogfood-dc2-default-sts-2:
      hostID: 555fbf43-a7a8-44ed-9799-da1108f5f782
  observedGeneration: 99
  quietPeriod: '2022-07-26T14:51:18Z'
  superUserUpserted: '2022-07-26T14:51:13Z'
  usersUpserted: '2022-07-26T14:51:13Z'
spec:
  additionalServiceConfig:
    additionalSeedService: {}
    allpodsService: {}
    dcService: {}
    nodePortService: {}
    seedService: {}
  clusterName: dogfood
  config:
    cassandra-env-sh:
      additional-jvm-opts:
        - '-Dcassandra.allow_alter_rf_during_range_movement=true'
        - '-Dcassandra.system_distributed_replication=dc1:3,dc2:3'
        - '-Dcom.sun.management.jmxremote.authenticate=true'
    cassandra-yaml:
      authenticator: PasswordAuthenticator
      authorizer: CassandraAuthorizer
      num_tokens: 16
      role_manager: CassandraRoleManager
    jvm-server-options:
      initial_heap_size: 524288000
      max_heap_size: 524288000
  configBuilderResources: {}
  managementApiAuth: {}
  podTemplateSpec:
    metadata: {}
    spec:
      containers:
        - env:
            - name: LOCAL_JMX
              value: 'no'
            - name: METRIC_FILTERS
              value: >-
                deny:org.apache.cassandra.metrics.Table
                deny:org.apache.cassandra.metrics.table
                allow:org.apache.cassandra.metrics.table.live_ss_table_count
                allow:org.apache.cassandra.metrics.Table.LiveSSTableCount
                allow:org.apache.cassandra.metrics.table.live_disk_space_used
                allow:org.apache.cassandra.metrics.table.LiveDiskSpaceUsed
                allow:org.apache.cassandra.metrics.Table.Pending
                allow:org.apache.cassandra.metrics.Table.Memtable
                allow:org.apache.cassandra.metrics.Table.Compaction
                allow:org.apache.cassandra.metrics.table.read
                allow:org.apache.cassandra.metrics.table.write
                allow:org.apache.cassandra.metrics.table.range
                allow:org.apache.cassandra.metrics.table.coordinator
                allow:org.apache.cassandra.metrics.table.dropped_mutations
            - name: MANAGEMENT_API_HEAP_SIZE
              value: '67108864'
          name: cassandra
          resources: {}
        - env:
            - name: MEDUSA_MODE
              value: GRPC
            - name: MEDUSA_TMP_DIR
              value: /var/lib/cassandra
            - name: CQL_USERNAME
              valueFrom:
                secretKeyRef:
                  key: username
                  name: dogfood-reaper-secret
            - name: CQL_PASSWORD
              valueFrom:
                secretKeyRef:
                  key: password
                  name: dogfood-reaper-secret
          image: docker.io/k8ssandra/medusa:0.13.4
          imagePullPolicy: IfNotPresent
          name: medusa
          ports:
            - containerPort: 50051
              name: grpc
              protocol: TCP
          resources:
            limits:
              memory: 8Gi
            requests:
              cpu: 100m
              memory: 100Mi
          volumeMounts:
            - mountPath: /etc/cassandra
              name: server-config
            - mountPath: /var/lib/cassandra
              name: server-data
            - mountPath: /etc/medusa
              name: dogfood-medusa
            - mountPath: /etc/podinfo
              name: podinfo
            - mountPath: /etc/medusa-secrets
              name: medusa-bucket-key
      initContainers:
        - args:
            - /bin/sh
            - '-c'
            - >-
              echo "$SUPERUSER_JMX_USERNAME $SUPERUSER_JMX_PASSWORD" >>
              /config/jmxremote.password && echo "$REAPER_JMX_USERNAME
              $REAPER_JMX_PASSWORD" >> /config/jmxremote.password
          env:
            - name: SUPERUSER_JMX_USERNAME
              valueFrom:
                secretKeyRef:
                  key: username
                  name: dogfood-superuser-secret
            - name: SUPERUSER_JMX_PASSWORD
              valueFrom:
                secretKeyRef:
                  key: password
                  name: dogfood-superuser-secret
            - name: REAPER_JMX_USERNAME
              valueFrom:
                secretKeyRef:
                  key: username
                  name: dogfood-reaper-jmx-secret
            - name: REAPER_JMX_PASSWORD
              valueFrom:
                secretKeyRef:
                  key: password
                  name: dogfood-reaper-jmx-secret
          image: docker.io/library/busybox:1.34.1
          imagePullPolicy: IfNotPresent
          name: jmx-credentials
          resources: {}
          volumeMounts:
            - mountPath: /config
              name: server-config
        - name: server-config-init
          resources: {}
        - env:
            - name: MEDUSA_MODE
              value: RESTORE
            - name: MEDUSA_TMP_DIR
              value: /var/lib/cassandra
            - name: CQL_USERNAME
              valueFrom:
                secretKeyRef:
                  key: username
                  name: dogfood-reaper-secret
            - name: CQL_PASSWORD
              valueFrom:
                secretKeyRef:
                  key: password
                  name: dogfood-reaper-secret
            - name: BACKUP_NAME
              value: medusa-backup-20220517-1
            - name: RESTORE_KEY
              value: 61be3cb6-f8d3-47c1-a5e2-169823c0f9f2
          image: docker.io/k8ssandra/medusa:0.13.4
          imagePullPolicy: IfNotPresent
          name: medusa-restore
          resources:
            limits:
              memory: 8Gi
            requests:
              cpu: 100m
              memory: 100Mi
          volumeMounts:
            - mountPath: /etc/cassandra
              name: server-config
            - mountPath: /var/lib/cassandra
              name: server-data
            - mountPath: /etc/medusa
              name: dogfood-medusa
            - mountPath: /etc/podinfo
              name: podinfo
            - mountPath: /etc/medusa-secrets
              name: medusa-bucket-key
      volumes:
        - configMap:
            name: dogfood-medusa
          name: dogfood-medusa
        - name: medusa-bucket-key
          secret:
            secretName: medusa-bucket-key
        - downwardAPI:
            items:
              - fieldRef:
                  fieldPath: metadata.labels
                path: labels
          name: podinfo
  resources:
    requests:
      memory: 2Gi
  serverType: cassandra
  serverVersion: 4.0.3
  size: 3
  storageConfig:
    cassandraDataVolumeClaimSpec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 50Gi
      storageClassName: standard
  superuserSecretName: dogfood-superuser-secret
  systemLoggerResources: {}
  tolerations:
    - effect: NoSchedule
      key: k8ssandra-version
      operator: Equal
      value: 2.x
  users:
    - secretName: dogfood-reaper-secret
      superuser: true
    - secretName: dogfood-reaper-secret
      superuser: true
  • Cass Operator Logs:
1.6588242052190342e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	Restarting Cassandra for pod dogfood-dc2-default-sts-2	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "reason": "RestartingCassandra", "eventType": "Normal"}
1.6588242052191195e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	calling Management API drain node - POST /api/v0/ops/node/drain	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "pod": "dogfood-dc2-default-sts-2"}
1.6588242052191548e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	client::callNodeMgmtEndpoint	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator"}
1.6588242052195036e+09	DEBUG	events	Normal	{"object": {"kind":"CassandraDatacenter","namespace":"k8ssandra-operator","name":"dc2","uid":"b2cd8c7c-fbcf-45c6-be36-1943cc3f1f91","apiVersion":"cassandra.datastax.com/v1beta1","resourceVersion":"416000148"}, "reason": "RestartingCassandra", "message": "Restarting Cassandra for pod dogfood-dc2-default-sts-2"}
1.6588242161249676e+09	INFO	controllers.CassandraDatacenter	Reconcile loop completed	{"cassandradatacenter": "k8ssandra-operator/dc2", "requestNamespace": "k8ssandra-operator", "requestName": "dc2", "loopID": "79913910-38ef-4b25-810f-12005d1bbd31", "duration": 10.956039281}
1.6588242161250792e+09	INFO	controllers.CassandraDatacenter	======== handler::Reconcile has been called	{"cassandradatacenter": "k8ssandra-operator/dc2", "requestNamespace": "k8ssandra-operator", "requestName": "dc2", "loopID": "9a0d915f-d4e3-4d1e-a17e-b4ed92333406"}
1.6588242161251044e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	handler::CreateReconciliationContext	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator"}
1.6588242161255727e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	handler::calculateReconciliationActions	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161256244e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_services::ReconcileHeadlessServices	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161262634e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_endpoints::CheckAdditionalSeedEndpoints	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161262882e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_racks::calculateRackInformation	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161263132e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconciliationContext::reconcileAllRacks	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161263268e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_racks::listPods	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161268623e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	requesting Cassandra metadata endpoints from Node Management API	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "pod": "dogfood-dc2-default-sts-2"}
1.6588242161268892e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	client::callNodeMgmtEndpoint	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator"}
1.658824216134607e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_racks::CheckConfigSecret	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161346457e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_racks::CheckRackCreation	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161346512e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_racks::getStatefulSetForRack	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.658824216134793e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_racks::CheckRackLabels	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161350152e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_racks::CheckSuperuserSecretCreation	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161350894e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_racks::CheckInternodeCredentialCreation	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161351364e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	starting CheckRackForceUpgrade()	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161351483e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_racks::CheckRackScale	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.658824216135153e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_racks::CheckPodsReady	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161351576e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_racks::findStartedNotReadyNodes	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162215562e+09	INFO	controllers.CassandraDatacenter	Reconcile loop completed	{"cassandradatacenter": "k8ssandra-operator/dc2", "requestNamespace": "k8ssandra-operator", "requestName": "dc2", "loopID": "9a0d915f-d4e3-4d1e-a17e-b4ed92333406", "duration": 0.096493093}
1.658824216222452e+09	INFO	controllers.CassandraDatacenter	======== handler::Reconcile has been called	{"cassandradatacenter": "k8ssandra-operator/dc2", "requestNamespace": "k8ssandra-operator", "requestName": "dc2", "loopID": "7eb8f2e1-d802-4dc8-a81e-5fff8d72fae8"}
1.6588242162224867e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	handler::CreateReconciliationContext	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator"}
1.6588242162232192e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	handler::calculateReconciliationActions	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162232454e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_services::ReconcileHeadlessServices	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.658824216223856e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_endpoints::CheckAdditionalSeedEndpoints	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162238789e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_racks::calculateRackInformation	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162238858e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconciliationContext::reconcileAllRacks	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162238944e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_racks::listPods	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.658824216224577e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	requesting Cassandra metadata endpoints from Node Management API	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "pod": "dogfood-dc2-default-sts-2"}
1.6588242162246015e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	client::callNodeMgmtEndpoint	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator"}
1.6588242162314386e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_racks::CheckConfigSecret	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.658824216231472e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_racks::CheckRackCreation	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162314782e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_racks::getStatefulSetForRack	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162325387e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_racks::CheckRackLabels	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162330978e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_racks::CheckSuperuserSecretCreation	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162331574e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_racks::CheckInternodeCredentialCreation	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162336335e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	starting CheckRackForceUpgrade()	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162336566e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_racks::CheckRackScale	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162336626e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_racks::CheckPodsReady	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162336743e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_racks::findStartedNotReadyNodes	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162336845e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	reconcile_racks::deleteStuckNodes	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162336988e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	Deleting stuck pod: dogfood-dc2-default-sts-1. Reason: Pod got stuck after Cassandra container terminated	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162337089e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	Pod got stuck after Cassandra container terminated	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "reason": "DeletingStuckPod", "eventType": "Warning"}
1.6588242162366488e+09	DEBUG	events	Warning	{"object": {"kind":"CassandraDatacenter","namespace":"k8ssandra-operator","name":"dc2","uid":"b2cd8c7c-fbcf-45c6-be36-1943cc3f1f91","apiVersion":"cassandra.datastax.com/v1beta1","resourceVersion":"416000148"}, "reason": "DeletingStuckPod", "message": "Pod got stuck after Cassandra container terminated"}
1.6588242162792397e+09	ERROR	controllers.CassandraDatacenter	calculateReconciliationActions returned an error	{"cassandradatacenter": "k8ssandra-operator/dc2", "requestNamespace": "k8ssandra-operator", "requestName": "dc2", "loopID": "7eb8f2e1-d802-4dc8-a81e-5fff8d72fae8", "error": "pods \"dogfood-dc2-default-sts-1\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227
1.6588242163423305e+09	INFO	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	pods "dogfood-dc2-default-sts-1" not found	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "reason": "ReconcileFailed", "eventType": "Warning"}
1.6588242163423748e+09	INFO	controllers.CassandraDatacenter	Reconcile loop completed	{"cassandradatacenter": "k8ssandra-operator/dc2", "requestNamespace": "k8ssandra-operator", "requestName": "dc2", "loopID": "7eb8f2e1-d802-4dc8-a81e-5fff8d72fae8", "duration": 0.11994769}
1.6588242163424113e+09	ERROR	controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller	Reconciler error	{"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "error": "pods \"dogfood-dc2-default-sts-1\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227
1.6588242163424983e+09	INFO	controllers.CassandraDatacenter	======== handler::Reconcile has been called	{"cassandradatacenter": "k8ssandra-operator/dc2", "requestNamespace": "k8ssandra-operator", "requestName": "dc2", "loopID": "f7d49596-dd9e-4c88-bf95-1b268da56449"}

Anything else we need to know?:

┆Issue is synchronized with this Jira Task by Unito
┆friendlyId: K8SSAND-1698
┆priority: Medium

@adejanovski adejanovski added the bug Something isn't working label Jul 26, 2022
@sync-by-unito sync-by-unito bot changed the title cass-operator can stop several nodes at the same time during a rolling restart K8SSAND-1698 ⁃ cass-operator can stop several nodes at the same time during a rolling restart Jul 26, 2022
@burmanm
Contributor

burmanm commented Jul 26, 2022

Are you sure the pods are actually getting restarted correctly? The logs indicate the event: Deleting stuck pod: dogfood-dc2-default-sts-1. Reason: Pod got stuck after Cassandra container terminated.

And this isn't a very fast operation: that kill reason requires the -sts-1 pod's cassandra container to have been terminated for 10 minutes.

What is preventing the pod from restarting once cassandra container has died? One of the containers is still alive after cassandra container was killed, was it medusa or busybox (jmx-credentials) ?

@adejanovski
Contributor Author

> Are you sure the pods are actually getting restarted correctly?

What do you mean by that? Everything starts with a rolling restart where -sts-2 gets restarted, but followed too quickly by -sts-1. I can assure you that only a few seconds have passed between these restarts.

> What is preventing the pod from restarting once cassandra container has died? One of the containers is still alive after cassandra container was killed, was it medusa or busybox (jmx-credentials) ?

Could be medusa indeed, it is deployed on this cluster.

@burmanm
Contributor

burmanm commented Jul 27, 2022

> What do you mean by that? Everything starts with a rolling restart where -sts-2 gets restarted, but followed too quickly by -sts-1. I can assure you that only a few seconds have passed between these restarts.

That's not what the logs you pasted say. They do not say anything about restarting -sts-1; it's not the rolling restart process that caused -sts-1 to be restarted in this case.

It is triggering this code for -sts-1:

if isNodeStuckAfterTerminating(pod) {

That means Kubernetes has reported that the cassandra container in -sts-1 has been dead for 10 minutes. The actual rolling restart logs a different line, which is not what your logs show (indicating either that the pod was never restarted by cass-operator, or that the log is not the entire log but a snippet telling an incomplete story).

"Restarting Cassandra for pod %s", pod.Name is an event it would create when rolling restart process is triggered. But we only see that for -sts-2 in the logs, -sts-1 and -sts-0 were never part of that process in that log.

@adejanovski adejanovski moved this to To Groom in K8ssandra Nov 8, 2022
@burmanm burmanm moved this to Assess/Investigate in K8ssandra Mar 5, 2024
@adejanovski adejanovski added the assess Issues in the state 'assess' label Mar 5, 2024
@burmanm burmanm moved this from Assess/Investigate to Ready For Review in K8ssandra May 17, 2024
@adejanovski adejanovski added ready-for-review Issues in the state 'ready-for-review' and removed assess Issues in the state 'assess' labels May 17, 2024
@github-project-automation github-project-automation bot moved this from Ready For Review to Done in K8ssandra May 30, 2024
@adejanovski adejanovski added done Issues in the state 'done' and removed ready-for-review Issues in the state 'ready-for-review' labels May 30, 2024