DB cannot be opened after node hard reset #562
There are only 2754 lines, but there are 23157 pages (each page has one line). Did you manually remove some lines from the output?

Just to double check:
- Can you easily reproduce this issue?
- What's the OS and Arch?
- Could you share your test case?
- Does your test case write any huge object (e.g. an object > 10M)?
No, that's the full output.
Yes.
Linux (Debian), kernel 5.10.
All objects were less than or equal to 128 KiB.
Thanks @fyfyrchik for the feedback. Could you share the db file?

Just to double check, do you mean you intentionally & manually power off/on the Linux server on which your test was running?
Yes, I can: the file is about 100 MB, where should I send it?
Yes, we power off the storage node; the loader is on a different one.
Is it possible for you to put the file somewhere that I can download? If yes, please send the link to wachao@vmware.com. If not, could you try to transfer the file to me via Slack? Please reach out in the K8s workspace (Benjamin Wang).
Thanks for the info. So you mount the storage node to the loader, and the loader accesses the storage node over NFS?
Thanks @fyfyrchik for sharing the db file offline. The db file is corrupted. Unfortunately the existing check command can't provide any useful info, so I will enhance it.
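For reference, here is a minimal sketch (not the planned enhancement itself) of how the same consistency check can be run programmatically through bbolt's Tx.Check API, assuming a Go program that takes the database path as its first argument:

```go
package main

import (
	"fmt"
	"log"
	"os"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Open the suspect database read-only so the check cannot modify it.
	db, err := bolt.Open(os.Args[1], 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		// If the meta pages themselves are damaged, Open fails here before
		// any page-level check can run.
		log.Fatalf("open: %v", err)
	}
	defer db.Close()

	if err := db.View(func(tx *bolt.Tx) error {
		// Tx.Check walks every page reachable from the meta page and streams
		// any inconsistencies it finds on the returned channel.
		for cerr := range tx.Check() {
			fmt.Println(cerr)
		}
		return nil
	}); err != nil {
		log.Fatal(err)
	}
}
```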
Hi @fyfyrchik, for the
Hi @fuweid, |
@fyfyrchik thanks! I will try to reproduce this with dm-flakey.
Updated. @fyfyrchik since my local env is using a v6.4 kernel, I built a v5.10 kernel and ran it with QEMU.
I can reproduce this issue with the main branch.
I am not sure it's related to fast-commit, because I didn't find a related patch for this. cc @ahrtr
Found a related patch about Fast Commit: https://lore.kernel.org/linux-ext4/20211223032337.5198-3-yinxin.x@bytedance.com/. I confirmed with that patch's author that fast commit in v5.10 has some corner cases which can lose data even if fsync/fdatasync returns successfully... @fyfyrchik I would suggest upgrading the kernel or disabling the fast_commit option 😂
I can't reproduce this issue on a newer kernel with the fast_commit option, or on v5.10 without the fast_commit option. I also talked to the patch's author, and I believe this is a kernel bug rather than a bbolt bug. @fyfyrchik thanks for reporting it!
Great finding. Just as we discussed earlier today, we received around 18 data corruption cases in the past 2 years. Most likely fast commit wasn't enabled at all in most cases.
Also, let's please add a known issue to the README to clarify this.
Will do, and I'll think about how to submit it as code in the repo. Assign this issue to me. /assign
@fuweid thanks for your research!
Thanks @fyfyrchik! Yeah, fast commit is a new feature, and when we enable it we need to run failpoint test cases against it. This affects not just bbolt, but also other applications which care about data persistence.
For anyone's reference: from the application's (bbolt in this case) perspective, the root cause of the issue is that the fsync/fdatasync system call returned success, but the kernel did not actually persist the data durably.
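To make that concrete, here is a small, hypothetical Go sketch (not bbolt's actual code path) of the durability contract in question: once Sync, which maps to fsync/fdatasync, returns without error, the application assumes the written bytes will survive a hard reset.

```go
package main

import (
	"log"
	"os"
)

func main() {
	f, err := os.OpenFile("data.db", os.O_RDWR|os.O_CREATE, 0600)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Write some bytes at a fixed offset, as a database would when updating a page.
	if _, err := f.WriteAt([]byte("payload"), 0); err != nil {
		log.Fatal(err)
	}

	// Sync maps to fsync/fdatasync. Once it returns nil, the application
	// assumes the bytes are durable even if the machine is hard-reset
	// immediately afterwards; the v5.10 fast_commit bug broke that assumption.
	if err := f.Sync(); err != nil {
		log.Fatal(err)
	}
}
```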
Hello! I am not sure whether this is a bug in bolt, but I think you might find this interesting: after a hard reset of the node, the DB cannot be opened on ext4 with the fast_commit feature enabled (we use a 5.10 kernel). bbolt check reports errors, and the last 64 pages of the file seem to be filled with zeroes. When ext4 fast_commit is disabled, our tests pass and the DB can be opened. I have reproduced this on both 1.3.6 and 1.3.7.

Here is the output for bbolt pages: https://gist.github.com/fyfyrchik/4aafec23d9dfc487fb4a4cd7f5560730

Meta pages:
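For context, a small hypothetical helper (not part of bbolt) can confirm the observation about zero-filled trailing pages by reading the tail of the file, assuming the default 4 KiB page size and the database path as the first argument:

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"os"
)

func main() {
	const pageSize = 4096 // assumed default bbolt page size (the OS page size)
	const tailPages = 64

	f, err := os.Open(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		log.Fatal(err)
	}

	// Read the last 64 pages of the file.
	buf := make([]byte, pageSize*tailPages)
	if _, err := f.ReadAt(buf, fi.Size()-int64(len(buf))); err != nil {
		log.Fatal(err)
	}

	// Report which of the trailing pages are entirely zero.
	zero := make([]byte, pageSize)
	for i := 0; i < tailPages; i++ {
		page := buf[i*pageSize : (i+1)*pageSize]
		fmt.Printf("page %2d from the end is all zero: %v\n", tailPages-i, bytes.Equal(page, zero))
	}
}
```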