Scaling down fails if RpcAddress is removed before decommission is done #261

burmanm · 2022-01-17T17:21:50Z

What happened?
The decommission failed during scale-down because the pod was removed before the decommission had finished. The decommission status check matches the node by the RPC_ADDRESS field of its EndpointState, and that field may be removed before the node has finished decommissioning.

The pod is then terminated because it has been removed from the StatefulSet. I have only seen this happen with 4.0.1, but it could presumably happen with other versions as well.

Did you expect to see something different?
The correct decommission status should be detected before proceeding.

Data from a decommissioned node and a live node (TOKENS removed for readability):

    {
      "DC": "dc2",
      "ENDPOINT_IP": "10.244.1.18",
      "HOST_ID": "403ef37a-4179-44e6-ab3e-1fe20e1f6ec3",
      "INTERNAL_ADDRESS_AND_PORT": "10.244.1.18:7000",
      "IS_ALIVE": "true",
      "LOAD": "70371.0",
      "NATIVE_ADDRESS_AND_PORT": "10.244.1.18:9042",
      "NET_VERSION": "12",
      "RACK": "r1",
      "RELEASE_VERSION": "4.0.1",
      "RPC_READY": "true",
      "SCHEMA": "2207c2a9-f598-3971-986b-2926e09e239d",
      "SSTABLE_VERSIONS": "big-nb",
      "STATUS_WITH_PORT": "LEFT,-8028340446623319586,1642697214009"
    },
    {
      "DC": "dc1",
      "ENDPOINT_IP": "10.244.2.17",
      "HOST_ID": "6e84945d-6203-416e-9d2f-b85b5a279d4a",
      "INTERNAL_ADDRESS_AND_PORT": "10.244.2.17:7000",
      "INTERNAL_IP": "10.244.2.17",
      "IS_ALIVE": "true",
      "LOAD": "80522.0",
      "NATIVE_ADDRESS_AND_PORT": "10.244.2.17:9042",
      "NET_VERSION": "12",
      "RACK": "r1",
      "RELEASE_VERSION": "4.0.1",
      "RPC_ADDRESS": "10.244.2.17",
      "RPC_READY": "true",
      "SCHEMA": "2207c2a9-f598-3971-986b-2926e09e239d",
      "SSTABLE_VERSIONS": "big-nb",
      "STATUS": "NORMAL,4638307320372049348",
      "STATUS_WITH_PORT": "NORMAL,4638307320372049348"
    }
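
To make the difference between the two entries concrete, here is a minimal Go sketch that reads the state name from an entry and checks whether RPC_ADDRESS is still present. It is not the operator's code: the `EndpointState` type and the `nodeStatus` helper are hypothetical names, and it assumes that the state name is everything before the first comma in STATUS_WITH_PORT (or STATUS).

```go
package main

import (
	"fmt"
	"strings"
)

// EndpointState is a hypothetical representation of one entry from the
// data above: a flat map of field name to string value.
type EndpointState map[string]string

// nodeStatus returns the state name (for example "LEFT" or "NORMAL").
// Assumption: the name is the text before the first comma of
// STATUS_WITH_PORT, with STATUS used when that field is absent.
func nodeStatus(e EndpointState) string {
	raw, ok := e["STATUS_WITH_PORT"]
	if !ok {
		raw = e["STATUS"]
	}
	return strings.SplitN(raw, ",", 2)[0]
}

func main() {
	// Fields copied from the decommissioned node in the issue.
	left := EndpointState{
		"ENDPOINT_IP":      "10.244.1.18",
		"HOST_ID":          "403ef37a-4179-44e6-ab3e-1fe20e1f6ec3",
		"STATUS_WITH_PORT": "LEFT,-8028340446623319586,1642697214009",
	}
	// Fields copied from the live node in the issue.
	normal := EndpointState{
		"ENDPOINT_IP":      "10.244.2.17",
		"HOST_ID":          "6e84945d-6203-416e-9d2f-b85b5a279d4a",
		"RPC_ADDRESS":      "10.244.2.17",
		"STATUS_WITH_PORT": "NORMAL,4638307320372049348",
	}

	for _, e := range []EndpointState{left, normal} {
		_, hasRPC := e["RPC_ADDRESS"]
		fmt.Println(nodeStatus(e), hasRPC)
	}
	// Prints:
	// LEFT false
	// NORMAL true
}
```

The decommissioned node no longer carries RPC_ADDRESS, so a lookup keyed only on that field cannot find it, even though ENDPOINT_IP and HOST_ID are still there.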

burmanm added the bug label Jan 17, 2022

burmanm self-assigned this Jan 17, 2022

burmanm added a commit to burmanm/cass-operator that referenced this issue Jan 17, 2022

Decommission status checks will match the pod based on the Endpoint IP if RPC address is removed and with HostID if nothing else helps. Fixes k8ssandra#261

cfd6afa
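
The fallback order named in the commit message (RPC address first, then endpoint IP, then host ID) could be sketched roughly as follows. This is an illustration only, not the code from the referenced commit: `EndpointState`, `podIdentity`, and `findEndpoint` are hypothetical names, and the field names are taken from the data in this issue.

```go
package main

import "fmt"

// EndpointState is a hypothetical representation of one endpoint entry:
// a flat map of field name to string value.
type EndpointState map[string]string

// podIdentity holds the identifiers assumed to be known for the pod
// that is being decommissioned (hypothetical type).
type podIdentity struct {
	IP     string
	HostID string
}

// findEndpoint looks for the pod's entry, trying RPC_ADDRESS, then
// ENDPOINT_IP, then HOST_ID. It returns the entry and the name of the
// field that matched, or nil and "" when nothing matches.
func findEndpoint(states []EndpointState, pod podIdentity) (EndpointState, string) {
	attempts := []struct{ field, want string }{
		{"RPC_ADDRESS", pod.IP},
		{"ENDPOINT_IP", pod.IP},
		{"HOST_ID", pod.HostID},
	}
	for _, a := range attempts {
		if a.want == "" {
			continue // identifier unknown, skip this attempt
		}
		for _, s := range states {
			if s[a.field] == a.want {
				return s, a.field
			}
		}
	}
	return nil, ""
}

func main() {
	states := []EndpointState{
		{ // decommissioned node: RPC_ADDRESS is already gone
			"ENDPOINT_IP":      "10.244.1.18",
			"HOST_ID":          "403ef37a-4179-44e6-ab3e-1fe20e1f6ec3",
			"STATUS_WITH_PORT": "LEFT,-8028340446623319586,1642697214009",
		},
		{ // live node
			"ENDPOINT_IP":      "10.244.2.17",
			"HOST_ID":          "6e84945d-6203-416e-9d2f-b85b5a279d4a",
			"RPC_ADDRESS":      "10.244.2.17",
			"STATUS_WITH_PORT": "NORMAL,4638307320372049348",
		},
	}
	pod := podIdentity{IP: "10.244.1.18", HostID: "403ef37a-4179-44e6-ab3e-1fe20e1f6ec3"}

	entry, matchedBy := findEndpoint(states, pod)
	fmt.Println(matchedBy, entry["STATUS_WITH_PORT"])
	// Prints: ENDPOINT_IP LEFT,-8028340446623319586,1642697214009
}
```

With this order the decommissioned node is still found through ENDPOINT_IP after RPC_ADDRESS has disappeared, and HOST_ID remains as a last resort.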

burmanm closed this as completed in 9d1c58a Jan 24, 2022
