[BUG] scale-in ops puts a pod of an etcd cluster into CrashLoopBackOff #8699

Closed
tianyue86 opened this issue Dec 24, 2024 · 1 comment · Fixed by apecloud/kubeblocks-addons#1367
Labels: kind/bug Something isn't working


@tianyue86

Describe the env
Kubernetes: v1.31.1-aliyun.1
KubeBlocks: 1.0.0-beta.19
kbcli: 1.0.0-beta.7

To Reproduce
Steps to reproduce the behavior:

  1. Create an etcd cluster
k get cluster -A
NAMESPACE   NAME                     CLUSTER-DEFINITION   TERMINATION-POLICY   STATUS     AGE
default     etcd-hitvrr                                   WipeOut              Creating   57s
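For anyone reproducing this, a minimal Cluster manifest along these lines should produce the 3-replica etcd cluster above. This is only a sketch: the API version, the clusterDef/componentDef reference (omitted here), and the storage size are assumptions about the etcd addon and may need adjusting for your KubeBlocks version.

apiVersion: apps.kubeblocks.io/v1
kind: Cluster
metadata:
  name: etcd-hitvrr
  namespace: default
spec:
  terminationPolicy: WipeOut        # matches the TERMINATION-POLICY shown above
  componentSpecs:
  - name: etcd                      # component name referenced by the OpsRequest below
    replicas: 3
    volumeClaimTemplates:
    - name: data                    # yields PVCs like data-etcd-hitvrr-etcd-2
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi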
  2. Run a scaleIn OpsRequest; the ops succeeds
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
  generateName: etcd-hitvrr-hscaleoffinstance-
  labels:
    app.kubernetes.io/instance: etcd-hitvrr
    app.kubernetes.io/managed-by: kubeblocks
  namespace: default
spec:
  type: HorizontalScaling
  clusterName: etcd-hitvrr
  force: true
  horizontalScaling:
  - componentName: etcd
    scaleIn:
      replicaChanges: 1
kubectl create -f etcdsi.yaml
opsrequest.operations.kubeblocks.io/etcd-hitvrr-hscaleoffinstance-57lgq created
kbcli cluster list-ops etcd-hitvrr --status all --namespace default
NAME                                  NAMESPACE   TYPE                CLUSTER       COMPONENT   STATUS    PROGRESS   CREATED-TIME                 
etcd-hitvrr-hscaleoffinstance-57lgq   default     HorizontalScaling   etcd-hitvrr   etcd        Succeed   1/1        Dec 24,2024 11:54 UTC+0800
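After the ops succeeds, the desired replica count on the Cluster spec should already be 2, while three pods still exist (next step). One way to confirm the spec change (the jsonpath below assumes the standard componentSpecs layout):

kubectl get cluster etcd-hitvrr -n default \
  -o jsonpath='{.spec.componentSpecs[?(@.name=="etcd")].replicas}'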
  3. Check the pods
k get pod|grep etcd
etcd-hitvrr-etcd-0               2/2     Running            0               46m
etcd-hitvrr-etcd-1               2/2     Running            0               45m
etcd-hitvrr-etcd-2               1/2     CrashLoopBackOff   7 (4m27s ago)   15m


k describe pod etcd-hitvrr-etcd-2
Events:
  Type     Reason                  Age                 From                     Message
  ----     ------                  ----                ----                     -------
  Warning  FailedScheduling        16m                 default-scheduler        0/6 nodes are available: persistentvolumeclaim "data-etcd-hitvrr-etcd-2" not found. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.,
  Normal   Scheduled               16m                 default-scheduler        Successfully assigned default/etcd-hitvrr-etcd-2 to cn-zhangjiakou.10.0.0.128
  Normal   SuccessfulAttachVolume  16m                 attachdetach-controller  AttachVolume.Attach succeeded for volume "d-8vbdqy8pql46etehv90r"
  Normal   AllocIPSucceed          16m                 terway-daemon            Alloc IP 10.0.0.224/24 took 32.372902ms
  Normal   Pulled                  16m                 kubelet                  Container image "apecloud-registry.cn-zhangjiakou.cr.aliyuncs.com/apecloud/bash-busybox:1.37.0-musl" already present on machine
  Normal   Created                 16m                 kubelet                  Created container inject-bash
  Normal   Started                 16m                 kubelet                  Started container inject-bash
  Normal   Pulled                  16m                 kubelet                  Container image "apecloud-registry.cn-zhangjiakou.cr.aliyuncs.com/apecloud/kubeblocks-tools:1.0.0-beta.19" already present on machine
  Normal   Created                 16m                 kubelet                  Created container init-kbagent
  Normal   Started                 16m                 kubelet                  Started container init-kbagent
  Normal   Pulled                  16m                 kubelet                  Container image "apecloud-registry.cn-zhangjiakou.cr.aliyuncs.com/apecloud/etcd:v3.5.15" already present on machine
  Normal   Created                 16m                 kubelet                  Created container kbagent-worker
  Normal   Started                 16m                 kubelet                  Started container kbagent-worker
  Normal   Pulled                  16m                 kubelet                  Container image "apecloud-registry.cn-zhangjiakou.cr.aliyuncs.com/apecloud/etcd:v3.5.15" already present on machine
  Normal   Created                 16m                 kubelet                  Created container kbagent
  Normal   Started                 16m                 kubelet                  Started container kbagent
  Normal   Pulled                  15m (x3 over 16m)   kubelet                  Container image "apecloud-registry.cn-zhangjiakou.cr.aliyuncs.com/apecloud/etcd:v3.5.15" already present on machine
  Normal   Created                 15m (x3 over 16m)   kubelet                  Created container etcd
  Normal   Started                 15m (x3 over 16m)   kubelet                  Started container etcd
  Warning  BackOff                 78s (x72 over 16m)  kubelet                  Back-off restarting failed container etcd in pod etcd-hitvrr-etcd-2_default(21bbd0a2-bcc2-4496-a50f-b71396492a08)
  Normal   roleProbe               18s (x16 over 15m)  kbagent                  {"instance":"etcd-hitvrr-etcd","probe":"roleProbe","code":-1,"message":"exit code: 1, stderr: grep: /var/run/etcd/etcd.conf: No such file or directory\nERROR: bad etcdctl args: clientProtocol:, endpoints:127.0.0.1:2379, tlsDir:/etc/pki/tls, please check!\nbad role, please check!\n: failed"}
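The roleProbe message above indicates that /var/run/etcd/etcd.conf was never rendered in the crashing pod. A quick way to confirm (paths taken from the probe's own error output; the healthy pod is used for comparison):

kubectl exec etcd-hitvrr-etcd-2 -n default -c etcd -- ls -l /var/run/etcd/
kubectl exec etcd-hitvrr-etcd-0 -n default -c etcd -- head -n 5 /var/run/etcd/etcd.conf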
  4. Check the etcd container log
kubectl logs etcd-hitvrr-etcd-2 -n default -c etcd > podlogs

[2024-12-24 04:00:17] start to rebuild etcd configuration...
Error: Failed to get current pod: etcd-hitvrr-etcd-2 fqdn from peer fqdn list: etcd-hitvrr-etcd-0.etcd-hitvrr-etcd-headless.default.svc.cluster.local,etcd-hitvrr-etcd-1.etcd-hitvrr-etcd-headless.default.svc.cluster.local. Exiting.
[2024-12-24 04:00:17] Failed to get my endpoint. Exiting.
[2024-12-24 04:00:17] Failed to rebuild etcd configuration.
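From the message, the rebuild script seems to derive the peer FQDN list from the post-scale-in replica count (only pods 0 and 1) and then looks for the current pod's own FQDN in that list, so the recreated etcd-hitvrr-etcd-2 can never find itself and exits. A rough shell sketch of that check (illustrative only, not the addon's actual script):

PEER_FQDNS="etcd-hitvrr-etcd-0.etcd-hitvrr-etcd-headless.default.svc.cluster.local,etcd-hitvrr-etcd-1.etcd-hitvrr-etcd-headless.default.svc.cluster.local"
MY_POD="etcd-hitvrr-etcd-2"

# After the scale-in the list only contains pods 0 and 1, so this grep fails,
# the configuration rebuild aborts, and the container crashes:
echo "$PEER_FQDNS" | tr ',' '\n' | grep -q "^${MY_POD}\." || {
  echo "Error: Failed to get current pod: ${MY_POD} fqdn from peer fqdn list: ${PEER_FQDNS}. Exiting."
  exit 1
}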

However, the Cluster resource still reports Running:
k get cluster -A
NAMESPACE   NAME                     CLUSTER-DEFINITION   TERMINATION-POLICY   STATUS    AGE
default     etcd-hitvrr                                   WipeOut              Running   50m

Expected behavior
After the scale-in ops succeeds, the remaining etcd pods should keep running normally; no pod should enter CrashLoopBackOff.


@shanshanying (Contributor) commented:

kbagent logs

2024/12/25 02:01:44 The incoming connection cannot be served, because 8 concurrent connections are served. Try increasing Server.Concurrency
2024-12-25T02:02:07Z	INFO	send probe event	{"probe": "roleProbe", "probe": "roleProbe", "code": -1, "output": "", "message": "exit code: 1, stderr: grep: /var/run/etcd/etcd.conf: No such file or directory\nERROR: bad etcdctl args: clientProtocol:, endpoints:127.0.0.1:2379, tlsDir:/etc/pki/tls, please check!\nbad role, please check!\n: failed"}
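For reference, these lines can be pulled from the kbagent sidecar of the crashing pod (container name as listed in the pod events above):

kubectl logs etcd-hitvrr-etcd-2 -n default -c kbagent --tail=50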
