
static etcd container failed to start in kubeadm based k8s deployment #17772

Closed · faszhang opened this issue Apr 11, 2024 · 9 comments


faszhang commented Apr 11, 2024

Bug report criteria

What happened?

/data$ sudo crictl logs 16a103b0bf94e

bytes":2147483648,"max-request-bytes":1572864,"max-concurrent-streams":4294967295,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"0s","compact-check-time-enabled":false,"compact-check-time-interval":"1m0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
panic: assertion failed: Page expected to be: 476, but self identifies as 0

goroutine 197 [running]:
go.etcd.io/bbolt._assert(...)
go.etcd.io/bbolt@v1.3.8/db.go:1387
go.etcd.io/bbolt.(*page).fastCheck(0x7f1c4bb87000, 0x1dc)
go.etcd.io/bbolt@v1.3.8/page.go:57 +0x1df
go.etcd.io/bbolt.(*Tx).page(0x0?, 0xc00010f4b8?)
go.etcd.io/bbolt@v1.3.8/tx.go:534 +0x8a
go.etcd.io/bbolt.(*Tx).forEachPageInternal(0x0?, {0xc000042140?, 0x1, 0xa}, 0xc00010f5b0)
go.etcd.io/bbolt@v1.3.8/tx.go:546 +0x65
go.etcd.io/bbolt.(*Tx).forEachPage(...)
go.etcd.io/bbolt@v1.3.8/tx.go:542
go.etcd.io/bbolt.(*Tx).checkBucket(0xc00040e540, 0xc00040e558, 0xc00010f778, 0xc00010f748, {0x12ec138?, 0x1ab5fa8}, 0xc0004523c0)
go.etcd.io/bbolt@v1.3.8/tx_check.go:83 +0x126
go.etcd.io/bbolt.(*DB).freepages(0x114a511?)
go.etcd.io/bbolt@v1.3.8/db.go:1205 +0x229
go.etcd.io/bbolt.(*DB).loadFreelist.func1()
go.etcd.io/bbolt@v1.3.8/db.go:417 +0xd1
sync.(*Once).doSlow(0x40df67?, 0x9d94c0?)
sync/once.go:74 +0xc2
sync.(*Once).Do(...)
sync/once.go:65
go.etcd.io/bbolt.(*DB).loadFreelist(0xc0000f06c0?)
go.etcd.io/bbolt@v1.3.8/db.go:413 +0x47
go.etcd.io/bbolt.Open({0xc000044ca0, 0x19}, 0x44f8f2?, 0xc0000f2c00)
go.etcd.io/bbolt@v1.3.8/db.go:295 +0x44f
go.etcd.io/etcd/server/v3/mvcc/backend.newBackend({{0xc000044ca0, 0x19}, 0x5f5e100, 0x2710, {0x114a511, 0x7}, 0x280000000, 0xc0001185a0, 0x0, 0x0, ...})
go.etcd.io/etcd/server/v3/mvcc/backend/backend.go:187 +0x226
go.etcd.io/etcd/server/v3/mvcc/backend.New(...)
go.etcd.io/etcd/server/v3/mvcc/backend/backend.go:163
go.etcd.io/etcd/server/v3/etcdserver.newBackend({{0x7ffe83987e2e, 0x13}, {0x0, 0x0}, {0x0, 0x0}, {0xc0001f7b00, 0x1, 0x1}, {0xc0001f7d40, ...}, ...}, ...)
go.etcd.io/etcd/server/v3/etcdserver/backend.go:55 +0x399
go.etcd.io/etcd/server/v3/etcdserver.openBackend.func1()
go.etcd.io/etcd/server/v3/etcdserver/backend.go:76 +0x78
created by go.etcd.io/etcd/server/v3/etcdserver.openBackend
go.etcd.io/etcd/server/v3/etcdserver/backend.go:75 +0x18a
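
For context on the assertion: bbolt stores each page's id in the page header itself, and `page.fastCheck` asserts that this header id matches the id the page was looked up by. Below is a minimal sketch of that invariant, with simplified types that are illustrative rather than bbolt's actual source; "self identifies as 0" means page 476 read back as zeroed bytes.

```go
package main

import "fmt"

// page mirrors the idea of bbolt's on-disk page header: every page
// records its own id, which must match the id it was looked up by.
type page struct {
	id uint64
}

// fastCheck is a simplified stand-in for bbolt's page.fastCheck.
func (p *page) fastCheck(id uint64) {
	if p.id != id {
		panic(fmt.Sprintf("assertion failed: Page expected to be: %d, but self identifies as %d", id, p.id))
	}
}

func main() {
	p := &page{id: 0} // a page whose bytes came back zeroed from disk
	p.fastCheck(476)  // panics exactly like the log above
}
```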

What did you expect to happen?

etcd restarts successfully, or surfaces a specific, actionable error.

How can we reproduce it (as minimally and precisely as possible)?

DRBD-based HA; the primary server was powered off. The switchover happened, but etcd cannot start because of the error above.

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
3.5.12

$ etcdctl version
# (output not provided)

Etcd configuration (command line flags or environment variables)

(the static pod manifest is posted in a comment below)

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# (output not provided)

$ etcdctl --endpoints=<member list> endpoint status -w table
# (output not provided)

Relevant log output

No response

ahrtr (Member) commented Apr 11, 2024

It seems that the bbolt db file has been corrupted. Would you mind sharing the db file, if it isn't from a production env and doesn't contain any sensitive data?

Also, please run `go env` and share the output. And let us know what the filesystem is (e.g. ext4, xfs). Thanks.

faszhang (Author)

ext4 filesystem

```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubeadm.kubernetes.io/etcd.advertise-client-urls: https://x.x.x.x:2379
  creationTimestamp: null
  labels:
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://127.0.0.1:2379
    - --cert-file=/data/kubernetes/pki/etcd/server.crt
    - --cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
    - --client-cert-auth=true
    - --data-dir=/data/etcd
    - --experimental-initial-corrupt-check=true
    - --experimental-watch-progress-notify-interval=5s
    - --initial-advertise-peer-urls=https://127.0.0.1:2380
    - --initial-cluster=psi1-cm-primary=https://127.0.0.1:2380
    - --key-file=/data/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://x.x.x.x:2379
    - --listen-metrics-urls=http://0.0.0.0:2381
    - --listen-peer-urls=https://127.0.0.1:2380
    - --name=psi1-cm-primary
    - --peer-cert-file=/data/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/data/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/data/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/data/kubernetes/pki/etcd/ca.crt
    image: registry.k8s.io/etcd:3.5.10-0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 0.0.0.0
        path: /health?exclude=NOSPACE&serializable=true
        port: 2381
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: etcd
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 0.0.0.0
        path: /health?serializable=false
        port: 2381
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /data/etcd
      name: etcd-data
    - mountPath: /data/kubernetes/pki/etcd
      name: etcd-certs
  hostNetwork: true
  priority: 2000001000
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /data/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs
  - hostPath:
      path: /data/etcd
      type: DirectoryOrCreate
    name: etcd-data
status: {}
```

faszhang (Author) commented Apr 11, 2024

db.txt

This is the DB file.

ahrtr (Member) commented Apr 12, 2024

> db.txt
>
> This is the DB file.

Unfortunately, I've confirmed that the db is corrupted.
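
For anyone who wants to run the same kind of verification, the sketch below uses bbolt's public API and is roughly what the `bbolt check` CLI does; the path `./db` is a placeholder for a copy of the affected file. Note that on a file corrupted like this one, `bolt.Open` itself can panic while loading the freelist (the same failure mode as in the report), so run it against a copy.

```go
package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// "./db" is a placeholder for a copy of the affected db file.
	db, err := bolt.Open("./db", 0o400, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatalf("open: %v", err)
	}
	defer db.Close()

	tx, err := db.Begin(false) // read-only transaction
	if err != nil {
		log.Fatalf("begin: %v", err)
	}
	defer tx.Rollback()

	// Tx.Check walks every page reachable from the root buckets and
	// reports inconsistencies (unreachable, doubly-referenced, or
	// out-of-range pages) on the returned channel.
	for cerr := range tx.Check() {
		fmt.Println("consistency error:", cerr)
	}
}
```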

geotransformer commented Apr 14, 2024

> Unfortunately, I've confirmed that the db is corrupted.

Thanks for checking, @ahrtr. Any idea how it got corrupted? The HA is DRBD-based; the primary was powered off to trigger the switchover.

Kernel: 5.4.0-174-generic
FS: ext4
K8s version: 1.28, with etcd 3.5.10.

We tried downgrading to 3.5.9 and upgrading to 3.5.12; we still hit the issue.

On a setup with k8s 1.25 and etcd 3.5.9 we don't see this issue, even though the kernel version is 5.4 as well.

geotransformer commented Apr 14, 2024

@ahrtr I saw you opened an issue for it: etcd-io/bbolt#581

Other related issues:
etcd-io/bbolt#705
etcd-io/bbolt#562

Related containerd issue:
containerd/containerd#9929

ahrtr (Member) commented Apr 14, 2024

> Thanks for checking, @ahrtr. Any idea how it got corrupted? The HA is DRBD-based; the primary was powered off to trigger the switchover.

Most likely it's a file system issue. The data wasn't successfully synced to disk when the power went off, yet syscall.Fdatasync returned no error. Eventually you can end up in a situation where some page data was lost but the meta page was updated successfully, so a page may point to a corrupted or lost page.

We could do a strict check on each TXN, but that would definitely have a big performance impact. I don't see an easy way forward for now.
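
To make the failure mechanism concrete, here is a minimal sketch of the general two-phase write ordering such a store depends on (an illustration under simplified assumptions, not bbolt's actual code), and why an fdatasync that reports success without durability breaks it:

```go
package main

import (
	"os"
	"syscall"
)

// commit sketches the ordering a bbolt-style store relies on: data
// pages must be durable *before* the meta page that references them
// is written. If Fdatasync returns nil without the data actually
// reaching stable storage, a power loss can leave a valid meta page
// pointing at pages that were never written, which reads back as the
// "self identifies as 0" panic above.
func commit(f *os.File, dataPages, metaPage []byte, dataOff, metaOff int64) error {
	if _, err := f.WriteAt(dataPages, dataOff); err != nil {
		return err
	}
	// Phase 1: data pages must hit disk first.
	if err := syscall.Fdatasync(int(f.Fd())); err != nil {
		return err
	}
	if _, err := f.WriteAt(metaPage, metaOff); err != nil {
		return err
	}
	// Phase 2: the meta page is the atomic commit point.
	return syscall.Fdatasync(int(f.Fd()))
}

func main() {} // illustration only; commit would run once per write TXN
```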

jmhbnz (Member) commented May 9, 2024

Discussed during the sig-etcd triage meeting. @ahrtr, can we close this now, given it was a corrupt db file?

jmhbnz (Member) commented May 23, 2024

Discussed during the sig-etcd triage meeting. Confirmed this is a boltdb issue; closing, as we can't do anything from the etcd main repo side.

jmhbnz closed this as not planned on May 23, 2024.