static etcd container failed to start in kubeadm based k8s deployment #17772

faszhang · 2024-04-11T05:10:23Z

Bug report criteria

This bug report is not security related, security issues should be disclosed privately via etcd maintainers.
This is not a support request or question, support requests or questions should be raised in the etcd discussion forums.
You have read the etcd bug reporting guidelines.
Existing open issues along with etcd frequently asked questions have been checked and this is not a duplicate.

What happened?

/data$ sudo crictl logs 16a103b0bf94e

bytes":2147483648,"max-request-bytes":1572864,"max-concurrent-streams":4294967295,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"0s","compact-check-time-enabled":false,"compact-check-time-interval":"1m0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
panic: assertion failed: Page expected to be: 476, but self identifies as 0

goroutine 197 [running]:
go.etcd.io/bbolt._assert(...)
go.etcd.io/bbolt@v1.3.8/db.go:1387
go.etcd.io/bbolt.(*page).fastCheck(0x7f1c4bb87000, 0x1dc)
go.etcd.io/bbolt@v1.3.8/page.go:57 +0x1df
go.etcd.io/bbolt.(*Tx).page(0x0?, 0xc00010f4b8?)
go.etcd.io/bbolt@v1.3.8/tx.go:534 +0x8a
go.etcd.io/bbolt.(*Tx).forEachPageInternal(0x0?, {0xc000042140?, 0x1, 0xa}, 0xc00010f5b0)
go.etcd.io/bbolt@v1.3.8/tx.go:546 +0x65
go.etcd.io/bbolt.(*Tx).forEachPage(...)
go.etcd.io/bbolt@v1.3.8/tx.go:542
go.etcd.io/bbolt.(*Tx).checkBucket(0xc00040e540, 0xc00040e558, 0xc00010f778, 0xc00010f748, {0x12ec138?, 0x1ab5fa8}, 0xc0004523c0)
go.etcd.io/bbolt@v1.3.8/tx_check.go:83 +0x126
go.etcd.io/bbolt.(*DB).freepages(0x114a511?)
go.etcd.io/bbolt@v1.3.8/db.go:1205 +0x229
go.etcd.io/bbolt.(*DB).loadFreelist.func1()
go.etcd.io/bbolt@v1.3.8/db.go:417 +0xd1
sync.(*Once).doSlow(0x40df67?, 0x9d94c0?)
sync/once.go:74 +0xc2
sync.(*Once).Do(...)
sync/once.go:65
go.etcd.io/bbolt.(*DB).loadFreelist(0xc0000f06c0?)
go.etcd.io/bbolt@v1.3.8/db.go:413 +0x47
go.etcd.io/bbolt.Open({0xc000044ca0, 0x19}, 0x44f8f2?, 0xc0000f2c00)
go.etcd.io/bbolt@v1.3.8/db.go:295 +0x44f
go.etcd.io/etcd/server/v3/mvcc/backend.newBackend({{0xc000044ca0, 0x19}, 0x5f5e100, 0x2710, {0x114a511, 0x7}, 0x280000000, 0xc0001185a0, 0x0, 0x0, ...})
go.etcd.io/etcd/server/v3/mvcc/backend/backend.go:187 +0x226
go.etcd.io/etcd/server/v3/mvcc/backend.New(...)
go.etcd.io/etcd/server/v3/mvcc/backend/backend.go:163
go.etcd.io/etcd/server/v3/etcdserver.newBackend({{0x7ffe83987e2e, 0x13}, {0x0, 0x0}, {0x0, 0x0}, {0xc0001f7b00, 0x1, 0x1}, {0xc0001f7d40, ...}, ...}, ...)
go.etcd.io/etcd/server/v3/etcdserver/backend.go:55 +0x399
go.etcd.io/etcd/server/v3/etcdserver.openBackend.func1()
go.etcd.io/etcd/server/v3/etcdserver/backend.go:76 +0x78
created by go.etcd.io/etcd/server/v3/etcdserver.openBackend
go.etcd.io/etcd/server/v3/etcdserver/backend.go:75 +0x18a

What did you expect to happen?

etcd should either restart successfully or report a specific error.

How can we reproduce it (as minimally and precisely as possible)?

DRBD-based HA setup: power off the primary server. The switchover happens, but etcd cannot start on the new primary because of the error above.

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
# paste output here
3.5.12
$ etcdctl version
# paste output here

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response

ahrtr · 2024-04-11T07:36:14Z

It seems that the bbolt db file is corrupted. Would you mind sharing the db file if it isn't a production environment and there isn't any sensitive data in it?

Also, please run go env and share the output. Thanks. And please let us know which filesystem you are using, e.g. ext4, xfs, etc.

faszhang · 2024-04-11T13:58:52Z

ext4 filesystem

`apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubeadm.kubernetes.io/etcd.advertise-client-urls: https://x.x.x.x:2379
  creationTimestamp: null
  labels:
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://127.0.0.1:2379
    - --cert-file=/data/kubernetes/pki/etcd/server.crt
    - --cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
    - --client-cert-auth=true
    - --data-dir=/data/etcd
    - --experimental-initial-corrupt-check=true
    - --experimental-watch-progress-notify-interval=5s
    - --initial-advertise-peer-urls=https://127.0.0.1:2380
    - --initial-cluster=psi1-cm-primary=https://127.0.0.1:2380
    - --key-file=/data/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://x.x.x.x:2379
    - --listen-metrics-urls=http://0.0.0.0:2381
    - --listen-peer-urls=https://127.0.0.1:2380
    - --name=psi1-cm-primary
    - --peer-cert-file=/data/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/data/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/data/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/data/kubernetes/pki/etcd/ca.crt
    image: registry.k8s.io/etcd:3.5.10-0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 0.0.0.0
        path: /health?exclude=NOSPACE&serializable=true
        port: 2381
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: etcd
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 0.0.0.0
        path: /health?serializable=false
        port: 2381
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /data/etcd
      name: etcd-data
    - mountPath: /data/kubernetes/pki/etcd
      name: etcd-certs
  hostNetwork: true
  priority: 2000001000
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /data/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs
  - hostPath:
      path: /data/etcd
      type: DirectoryOrCreate
    name: etcd-data
status: {}`

faszhang · 2024-04-11T14:00:26Z

db.txt

This is the DB file.

ahrtr · 2024-04-12T20:56:46Z

> db.txt
>
> This is the DB file.

Unfortunately, confirmed that the db is corrupted.

geotransformer · 2024-04-14T02:52:44Z

> Unfortunately, confirmed that the db is corrupted.

Thanks for checking, @ahrtr. Any idea how it got corrupted? The HA setup is DRBD based, and we power off the primary to trigger the switchover.

Kernel: 5.4.0-174-generic
FS: ext4
k8s version: 1.28 with etcd 3.5.10.

We tried downgrading to 3.5.9 and upgrading to 3.5.12, but still hit the issue.

On a setup with k8s 1.25 and etcd 3.5.9 we don't see this issue, though. The kernel version there is 5.4 as well.

geotransformer · 2024-04-14T02:58:31Z

@ahrtr, I saw you opened an issue for it:
etcd-io/bbolt#581

Other issues opened
etcd-io/bbolt#705
etcd-io/bbolt#562

Containerd issue
containerd/containerd#9929

ahrtr · 2024-04-14T18:19:09Z

> Thanks for checking @ahrtr Any idea how it is corrupted? The HA is DRBD based. Power off primary to trigger the switch over.

Most likely it's a file system issue. The data wasn't successfully synced to disk when the machine was powered off, even though syscall.Fdatasync returned no error. This can eventually lead to a situation where some page data was lost but the meta page was updated successfully, so a page may point to a corrupted or lost page.

We could do a strict check on each TXN, but that would definitely have a big impact on performance. I don't see an easy way for now.
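To illustrate the failure mode described here, the following is a minimal Python sketch, not bbolt code. It assumes a 16-byte page header laid out as a little-endian `id, flags, count, overflow` tuple and a 4096-byte page size; `make_page` and `check_page` are hypothetical helpers introduced only for this example. A page whose write never reached the disk is modelled as all zeros, so its header reports ID 0, which matches the shape of the panic message in the report.

```python
import struct

PAGE_SIZE = 4096  # assumption: bbolt normally uses the OS page size
# Assumed header layout (little-endian, as on amd64): id, flags, count, overflow.
PAGE_HEADER = struct.Struct("<QHHI")


def make_page(page_id, flags=0x02, count=0):
    """Build one page whose header records its own page ID (hypothetical helper)."""
    header = PAGE_HEADER.pack(page_id, flags, count, 0)
    return header + bytes(PAGE_SIZE - PAGE_HEADER.size)


def check_page(db_image, expected_id):
    """Return an error string if the page at expected_id does not identify as it."""
    offset = expected_id * PAGE_SIZE
    self_id, _flags, _count, _overflow = PAGE_HEADER.unpack_from(db_image, offset)
    if self_id != expected_id:
        return f"Page expected to be: {expected_id}, but self identifies as {self_id}"
    return None


# Toy file of four pages. Pretend the write of page 3 was lost at power-off:
# the file still holds zeros there, although a parent page refers to page 3.
pages = [make_page(i) for i in range(4)]
pages[3] = bytes(PAGE_SIZE)
db_image = b"".join(pages)

print(check_page(db_image, 2))  # intact page: None
print(check_page(db_image, 3))  # lost page: reports self-identified ID 0
```

The check itself is cheap for a single page; the difficulty raised in this thread is that the stale page is only discovered when something actually follows the pointer to it.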
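A rough way to see why a strict check per transaction would be costly: it has to visit every reachable page, while a typical write touches only about one root-to-leaf path. The sketch below is a toy model in Python, not bbolt's actual check; `build_tree` and `strict_check` are hypothetical names, and pages are modelled as plain dictionaries rather than on-disk bytes.

```python
def build_tree(fanout, depth):
    """Build a toy page tree: page id -> {"id": self-reported id, "children": [...]}."""
    pages = {0: {"id": 0, "children": []}}
    frontier, next_id = [0], 1
    for _ in range(depth):
        new_frontier = []
        for parent in frontier:
            for _ in range(fanout):
                pages[next_id] = {"id": next_id, "children": []}
                pages[parent]["children"].append(next_id)
                new_frontier.append(next_id)
                next_id += 1
        frontier = new_frontier
    return pages


def strict_check(pages, root_id):
    """Walk every page reachable from root_id; return (pages visited, mismatches)."""
    visited, errors, stack = 0, [], [root_id]
    while stack:
        expected = stack.pop()
        page = pages[expected]
        visited += 1
        if page["id"] != expected:
            # The page does not identify as the ID its parent points to.
            errors.append((expected, page["id"]))
            continue
        stack.extend(page["children"])
    return visited, errors


pages = build_tree(fanout=10, depth=3)       # 1 + 10 + 100 + 1000 pages
visited, errors = strict_check(pages, 0)     # healthy tree

pages[476]["id"] = 0                         # model a lost write: page reads as zeros
visited_bad, errors_bad = strict_check(pages, 0)

print(len(pages), visited, errors)
print(visited_bad, errors_bad)
```

In this toy tree a write path is 4 pages long, whereas the full walk visits all 1111 pages, and the gap grows with the size of the database.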

jmhbnz · 2024-05-09T18:08:43Z

Discussed during the sig-etcd triage meeting. @ahrtr, can we close this now, since this was a corrupt db file?

jmhbnz · 2024-05-23T18:07:19Z

Discussed during sig-etcd triage meeting. Confirmed this is a boltdb issue, closing as we can't do anything from the etcd main repo side.

faszhang added the type/bug label Apr 11, 2024

ahrtr added the area/bbolt label Apr 11, 2024

ahrtr added the data/corruption label May 9, 2024

jmhbnz closed this as not planned May 23, 2024
