[BUG] scale-in ops puts a pod of an etcd cluster into CrashLoopBackOff #8699

Closed
tianyue86 opened this issue Dec 24, 2024 · 1 comment · Fixed by apecloud/kubeblocks-addons#1367
Labels: kind/bug Something isn't working


@tianyue86

Describe the env
Kubernetes: v1.31.1-aliyun.1
KubeBlocks: 1.0.0-beta.19
kbcli: 1.0.0-beta.7

To Reproduce
Steps to reproduce the behavior:

  1. Create an etcd cluster
k get cluster -A
NAMESPACE   NAME                     CLUSTER-DEFINITION   TERMINATION-POLICY   STATUS     AGE
default     etcd-hitvrr                                   WipeOut              Creating   57s
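For anyone reproducing this, a minimal Cluster manifest along these lines should produce the 3-replica etcd cluster above. This is only a sketch: the API version, the clusterDef/componentDef reference (omitted here), and the storage size are assumptions about the etcd addon and may need adjusting for your KubeBlocks version.

apiVersion: apps.kubeblocks.io/v1
kind: Cluster
metadata:
  name: etcd-hitvrr
  namespace: default
spec:
  terminationPolicy: WipeOut        # matches the TERMINATION-POLICY shown above
  componentSpecs:
  - name: etcd                      # component name referenced by the OpsRequest below
    replicas: 3
    volumeClaimTemplates:
    - name: data                    # yields PVCs like data-etcd-hitvrr-etcd-2
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi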
  2. Run a scaleIn OpsRequest; the ops succeeds
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
  generateName: etcd-hitvrr-hscaleoffinstance-
  labels:
    app.kubernetes.io/instance: etcd-hitvrr
    app.kubernetes.io/managed-by: kubeblocks
  namespace: default
spec:
  type: HorizontalScaling
  clusterName: etcd-hitvrr
  force: true
  horizontalScaling:
  - componentName: etcd
    scaleIn:
      replicaChanges: 1
kubectl create -f etcdsi.yaml
opsrequest.operations.kubeblocks.io/etcd-hitvrr-hscaleoffinstance-57lgq created
kbcli cluster list-ops etcd-hitvrr --status all --namespace default
NAME                                  NAMESPACE   TYPE                CLUSTER       COMPONENT   STATUS    PROGRESS   CREATED-TIME                 
etcd-hitvrr-hscaleoffinstance-57lgq   default     HorizontalScaling   etcd-hitvrr   etcd        Succeed   1/1        Dec 24,2024 11:54 UTC+0800
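After the ops succeeds, the desired replica count on the Cluster spec should already be 2, while three pods still exist (next step). One way to confirm the spec change (the jsonpath below assumes the standard componentSpecs layout):

kubectl get cluster etcd-hitvrr -n default \
  -o jsonpath='{.spec.componentSpecs[?(@.name=="etcd")].replicas}'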
  3. Check the pods
k get pod|grep etcd
etcd-hitvrr-etcd-0               2/2     Running            0               46m
etcd-hitvrr-etcd-1               2/2     Running            0               45m
etcd-hitvrr-etcd-2               1/2     CrashLoopBackOff   7 (4m27s ago)   15m


k describe pod etcd-hitvrr-etcd-2
Events:
  Type     Reason                  Age                 From                     Message
  ----     ------                  ----                ----                     -------
  Warning  FailedScheduling        16m                 default-scheduler        0/6 nodes are available: persistentvolumeclaim "data-etcd-hitvrr-etcd-2" not found. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.,
  Normal   Scheduled               16m                 default-scheduler        Successfully assigned default/etcd-hitvrr-etcd-2 to cn-zhangjiakou.10.0.0.128
  Normal   SuccessfulAttachVolume  16m                 attachdetach-controller  AttachVolume.Attach succeeded for volume "d-8vbdqy8pql46etehv90r"
  Normal   AllocIPSucceed          16m                 terway-daemon            Alloc IP 10.0.0.224/24 took 32.372902ms
  Normal   Pulled                  16m                 kubelet                  Container image "apecloud-registry.cn-zhangjiakou.cr.aliyuncs.com/apecloud/bash-busybox:1.37.0-musl" already present on machine
  Normal   Created                 16m                 kubelet                  Created container inject-bash
  Normal   Started                 16m                 kubelet                  Started container inject-bash
  Normal   Pulled                  16m                 kubelet                  Container image "apecloud-registry.cn-zhangjiakou.cr.aliyuncs.com/apecloud/kubeblocks-tools:1.0.0-beta.19" already present on machine
  Normal   Created                 16m                 kubelet                  Created container init-kbagent
  Normal   Started                 16m                 kubelet                  Started container init-kbagent
  Normal   Pulled                  16m                 kubelet                  Container image "apecloud-registry.cn-zhangjiakou.cr.aliyuncs.com/apecloud/etcd:v3.5.15" already present on machine
  Normal   Created                 16m                 kubelet                  Created container kbagent-worker
  Normal   Started                 16m                 kubelet                  Started container kbagent-worker
  Normal   Pulled                  16m                 kubelet                  Container image "apecloud-registry.cn-zhangjiakou.cr.aliyuncs.com/apecloud/etcd:v3.5.15" already present on machine
  Normal   Created                 16m                 kubelet                  Created container kbagent
  Normal   Started                 16m                 kubelet                  Started container kbagent
  Normal   Pulled                  15m (x3 over 16m)   kubelet                  Container image "apecloud-registry.cn-zhangjiakou.cr.aliyuncs.com/apecloud/etcd:v3.5.15" already present on machine
  Normal   Created                 15m (x3 over 16m)   kubelet                  Created container etcd
  Normal   Started                 15m (x3 over 16m)   kubelet                  Started container etcd
  Warning  BackOff                 78s (x72 over 16m)  kubelet                  Back-off restarting failed container etcd in pod etcd-hitvrr-etcd-2_default(21bbd0a2-bcc2-4496-a50f-b71396492a08)
  Normal   roleProbe               18s (x16 over 15m)  kbagent                  {"instance":"etcd-hitvrr-etcd","probe":"roleProbe","code":-1,"message":"exit code: 1, stderr: grep: /var/run/etcd/etcd.conf: No such file or directory\nERROR: bad etcdctl args: clientProtocol:, endpoints:127.0.0.1:2379, tlsDir:/etc/pki/tls, please check!\nbad role, please check!\n: failed"}
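The roleProbe message above indicates that /var/run/etcd/etcd.conf was never rendered in the crashing pod. A quick way to confirm (paths taken from the probe's own error output; the healthy pod is used for comparison):

kubectl exec etcd-hitvrr-etcd-2 -n default -c etcd -- ls -l /var/run/etcd/
kubectl exec etcd-hitvrr-etcd-0 -n default -c etcd -- head -n 5 /var/run/etcd/etcd.conf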
  4. Check the etcd container log
kubectl logs etcd-hitvrr-etcd-2 -n default -c etcd > podlogs

[2024-12-24 04:00:17] start to rebuild etcd configuration...
Error: Failed to get current pod: etcd-hitvrr-etcd-2 fqdn from peer fqdn list: etcd-hitvrr-etcd-0.etcd-hitvrr-etcd-headless.default.svc.cluster.local,etcd-hitvrr-etcd-1.etcd-hitvrr-etcd-headless.default.svc.cluster.local. Exiting.
[2024-12-24 04:00:17] Failed to get my endpoint. Exiting.
[2024-12-24 04:00:17] Failed to rebuild etcd configuration.
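From the message, the rebuild script seems to derive the peer FQDN list from the post-scale-in replica count (only pods 0 and 1) and then looks for the current pod's own FQDN in that list, so the recreated etcd-hitvrr-etcd-2 can never find itself and exits. A rough shell sketch of that check (illustrative only, not the addon's actual script):

PEER_FQDNS="etcd-hitvrr-etcd-0.etcd-hitvrr-etcd-headless.default.svc.cluster.local,etcd-hitvrr-etcd-1.etcd-hitvrr-etcd-headless.default.svc.cluster.local"
MY_POD="etcd-hitvrr-etcd-2"

# After the scale-in the list only contains pods 0 and 1, so this grep fails,
# the configuration rebuild aborts, and the container crashes:
echo "$PEER_FQDNS" | tr ',' '\n' | grep -q "^${MY_POD}\." || {
  echo "Error: Failed to get current pod: ${MY_POD} fqdn from peer fqdn list: ${PEER_FQDNS}. Exiting."
  exit 1
}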

However, the Cluster resource still reports Running:
k get cluster -A
NAMESPACE   NAME                     CLUSTER-DEFINITION   TERMINATION-POLICY   STATUS    AGE
default     etcd-hitvrr                                   WipeOut              Running   50m

Expected behavior
After the scale-in ops succeeds, the remaining etcd pods should keep running normally; no pod should enter CrashLoopBackOff.


@shanshanying (Contributor) commented:

kbagent logs

2024/12/25 02:01:44 The incoming connection cannot be served, because 8 concurrent connections are served. Try increasing Server.Concurrency
2024-12-25T02:02:07Z	INFO	send probe event	{"probe": "roleProbe", "probe": "roleProbe", "code": -1, "output": "", "message": "exit code: 1, stderr: grep: /var/run/etcd/etcd.conf: No such file or directory\nERROR: bad etcdctl args: clientProtocol:, endpoints:127.0.0.1:2379, tlsDir:/etc/pki/tls, please check!\nbad role, please check!\n: failed"}
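For reference, these lines can be pulled from the kbagent sidecar of the crashing pod (container name as listed in the pod events above):

kubectl logs etcd-hitvrr-etcd-2 -n default -c kbagent --tail=50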
