static etcd container failed to start in kubeadm based k8s deployment #17772
Comments
It seems that the bbolt db file is corrupted. Would you mind sharing the db file if it isn't from a production environment and doesn't contain any sensitive data? Also please run |
ext4 filesystem `apiVersion: v1
|
This is the DB file. |
Unfortunately, confirmed that the db is corrupted. |
Thanks for checking, @ahrtr. Any idea how it got corrupted? The HA setup is DRBD-based, and we power off the primary to trigger the switchover. The kernel is 5.4.0-174-generic. We tried downgrading to 3.5.9 and upgrading to 3.5.12, and still hit the issue. On the setup with k8s 1.25 and etcd 3.5.9 we don't see this issue, though. The kernel version there is 5.4 as well. |
@ahrtr I saw you opened an issue for it. Other issues opened: Containerd issue |
Most likely it's a file system issue. The data wasn't successfully synced to disk when the machine was powered off, even though syscall.Fdatasync returned no error. That can lead to a situation where some page data is lost but the meta page was updated successfully, so a page may point to a corrupted or lost page. We could do a strict check on each TXN, but that would definitely have a big impact on performance. I don't see an easy way for now. |
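To make the failure mode above concrete, here is a minimal sketch of it using a toy in-memory "disk". It is not bbolt's real commit path: the layout (page 0 as the meta page, the root pointer at bytes 8..16, each page storing its own id in its first 8 bytes, 4096-byte pages) and the names `commit`, `writePage` and `verifyRoot` are assumptions made up for illustration. It only shows how a lost data-page write combined with a successful meta-page write leaves a dangling pointer.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const pageSize = 4096 // assumed page size for this toy layout

// writePage stores a page that records its own id in its first 8 bytes.
func writePage(disk []byte, id uint64, payload string) {
	p := disk[id*pageSize : (id+1)*pageSize]
	binary.LittleEndian.PutUint64(p[0:8], id)
	copy(p[8:], payload)
}

// commit simulates one transaction: the data page is written first, then the
// meta page (page 0 in this toy layout) is updated to point at it.
// With dropData set, the data page write is "lost", as if the sync had
// reported success without the page ever reaching stable storage.
func commit(disk []byte, root uint64, payload string, dropData bool) {
	if !dropData {
		writePage(disk, root, payload)
	}
	binary.LittleEndian.PutUint64(disk[8:16], root)
}

// verifyRoot follows the meta page to the root page and checks that the
// root page identifies as the page the meta page claims it is.
func verifyRoot(disk []byte) error {
	root := binary.LittleEndian.Uint64(disk[8:16])
	got := binary.LittleEndian.Uint64(disk[root*pageSize : root*pageSize+8])
	if got != root {
		return fmt.Errorf("meta points at page %d, but that page identifies as %d", root, got)
	}
	return nil
}

func main() {
	healthy := make([]byte, 4*pageSize)
	commit(healthy, 2, "k=v", false)
	fmt.Println("healthy:", verifyRoot(healthy))

	torn := make([]byte, 4*pageSize)
	commit(torn, 3, "k=v", true) // data page lost, meta page updated
	fmt.Println("torn:", verifyRoot(torn))
}
```

In the torn case nothing fails at commit time. The inconsistency only shows up later, when a reader follows the pointer, which is why it surfaces at the next startup rather than at the power-off.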
Discussed during the sig-etcd triage meeting. @ahrtr, can we close this now, given that this was a corrupt db file? |
Discussed during sig-etcd triage meeting. Confirmed this is a boltdb issue, closing as we can't do anything from the etcd main repo side. |
Bug report criteria
What happened?
/data$ sudo crictl logs 16a103b0bf94e
bytes":2147483648,"max-request-bytes":1572864,"max-concurrent-streams":4294967295,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"0s","compact-check-time-enabled":false,"compact-check-time-interval":"1m0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
panic: assertion failed: Page expected to be: 476, but self identifies as 0
goroutine 197 [running]:
go.etcd.io/bbolt._assert(...)
go.etcd.io/bbolt@v1.3.8/db.go:1387
go.etcd.io/bbolt.(*page).fastCheck(0x7f1c4bb87000, 0x1dc)
go.etcd.io/bbolt@v1.3.8/page.go:57 +0x1df
go.etcd.io/bbolt.(*Tx).page(0x0?, 0xc00010f4b8?)
go.etcd.io/bbolt@v1.3.8/tx.go:534 +0x8a
go.etcd.io/bbolt.(*Tx).forEachPageInternal(0x0?, {0xc000042140?, 0x1, 0xa}, 0xc00010f5b0)
go.etcd.io/bbolt@v1.3.8/tx.go:546 +0x65
go.etcd.io/bbolt.(*Tx).forEachPage(...)
go.etcd.io/bbolt@v1.3.8/tx.go:542
go.etcd.io/bbolt.(*Tx).checkBucket(0xc00040e540, 0xc00040e558, 0xc00010f778, 0xc00010f748, {0x12ec138?, 0x1ab5fa8}, 0xc0004523c0)
go.etcd.io/bbolt@v1.3.8/tx_check.go:83 +0x126
go.etcd.io/bbolt.(*DB).freepages(0x114a511?)
go.etcd.io/bbolt@v1.3.8/db.go:1205 +0x229
go.etcd.io/bbolt.(*DB).loadFreelist.func1()
go.etcd.io/bbolt@v1.3.8/db.go:417 +0xd1
sync.(*Once).doSlow(0x40df67?, 0x9d94c0?)
sync/once.go:74 +0xc2
sync.(*Once).Do(...)
sync/once.go:65
go.etcd.io/bbolt.(*DB).loadFreelist(0xc0000f06c0?)
go.etcd.io/bbolt@v1.3.8/db.go:413 +0x47
go.etcd.io/bbolt.Open({0xc000044ca0, 0x19}, 0x44f8f2?, 0xc0000f2c00)
go.etcd.io/bbolt@v1.3.8/db.go:295 +0x44f
go.etcd.io/etcd/server/v3/mvcc/backend.newBackend({{0xc000044ca0, 0x19}, 0x5f5e100, 0x2710, {0x114a511, 0x7}, 0x280000000, 0xc0001185a0, 0x0, 0x0, ...})
go.etcd.io/etcd/server/v3/mvcc/backend/backend.go:187 +0x226
go.etcd.io/etcd/server/v3/mvcc/backend.New(...)
go.etcd.io/etcd/server/v3/mvcc/backend/backend.go:163
go.etcd.io/etcd/server/v3/etcdserver.newBackend({{0x7ffe83987e2e, 0x13}, {0x0, 0x0}, {0x0, 0x0}, {0xc0001f7b00, 0x1, 0x1}, {0xc0001f7d40, ...}, ...}, ...)
go.etcd.io/etcd/server/v3/etcdserver/backend.go:55 +0x399
go.etcd.io/etcd/server/v3/etcdserver.openBackend.func1()
go.etcd.io/etcd/server/v3/etcdserver/backend.go:76 +0x78
created by go.etcd.io/etcd/server/v3/etcdserver.openBackend
go.etcd.io/etcd/server/v3/etcdserver/backend.go:75 +0x18a
What did you expect to happen?
etcd should restart, or report the specific problem.
How can we reproduce it (as minimally and precisely as possible)?
DRBD-based HA: power off the primary server. The switchover happens, but etcd cannot start because of the error above.
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
No response