DB cannot be opened after node hard reset #562
There are only 2754 lines, but there are 23157 pages (each page has one line). Did you manually remove some lines from the output?

Just to double check:
- Can you easily reproduce this issue?
- What's the OS and Arch?
- Could you share your test case?
- Does your test case write any huge object (e.g. an object > 10M)?
No, that's the full output.
Yes.
Linux (Debian), kernel 5.10.
All objects were less than or equal to 128 KiB.
Thanks @fyfyrchik for the feedback. Could you share the db file?

Just to double check, do you mean you intentionally & manually power off/on the Linux server on which your test was running?
Yes, I can: the file is about 100 MB, where should I send it?
Yes, we power off the storage node; the loader is on a different one.
Is it possible for you to put the file somewhere that I can download? If yes, please send the link to wachao@vmware.com. If not, could you try to transfer the file to me via Slack? Please reach out in the K8s workspace (Benjamin Wang).
Thanks for the info. So you mount the storage node to the loader, and the loader accesses the storage node over NFS?
Thanks @fyfyrchik for sharing the db file offline. The db file is corrupted. Unfortunately the existing check command can't provide any useful info, so I will enhance it.
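For reference, here is a minimal sketch (not the planned enhancement itself) of how the same consistency check can be run programmatically through bbolt's Tx.Check API, assuming a Go program that takes the database path as its first argument:

```go
package main

import (
	"fmt"
	"log"
	"os"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Open the suspect database read-only so the check cannot modify it.
	db, err := bolt.Open(os.Args[1], 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		// If the meta pages themselves are damaged, Open fails here before
		// any page-level check can run.
		log.Fatalf("open: %v", err)
	}
	defer db.Close()

	if err := db.View(func(tx *bolt.Tx) error {
		// Tx.Check walks every page reachable from the meta page and streams
		// any inconsistencies it finds on the returned channel.
		for cerr := range tx.Check() {
			fmt.Println(cerr)
		}
		return nil
	}); err != nil {
		log.Fatal(err)
	}
}
```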
Hi @fyfyrchik, for the
Hi @fuweid, |
@fyfyrchik thanks! I will try to reproduce this with dm-flakey.
Updated. @fyfyrchik since my local env is using a v6.4 kernel, I built a v5.10 kernel and ran it with QEMU.
I can reproduce this issue with the main branch.
I am not sure it's related to fast-commit, because I didn't find a related patch for this. cc @ahrtr
Found a related patch about Fast Commit: https://lore.kernel.org/linux-ext4/20211223032337.5198-3-yinxin.x@bytedance.com/. I confirmed with that patch's author that fast commit in v5.10 has some corner cases which can lose data even if fsync/fdatasync returns successfully... @fyfyrchik I would suggest upgrading the kernel or disabling the fast_commit option 😂
I can't reproduce this issue on a newer kernel with the fast_commit option, or on v5.10 without the fast_commit option. I also talked to the patch's author, and I believe this is a kernel bug rather than a bbolt bug. @fyfyrchik thanks for reporting it!
Great finding. Just as we discussed earlier today, we received around 18 data corruption cases in the past 2 years. Most likely fast commit wasn't enabled at all in most cases.
Also, let's please add a known issue to the README to clarify this.
Will do, and I'll think about how to submit it as code in the repo. Assign this issue to me. /assign
@fuweid thanks for your research!
Thanks @fyfyrchik! Yeah, fast commit is a new feature, and when we enable it we need to run failpoint test cases against it. This affects not just bbolt, but also other applications which care about data persistence.
For anyone's reference: from the application's (bbolt in this case) perspective, the root cause of the issue is that the fsync/fdatasync system call returned success, but the kernel did not actually persist the data durably.
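To make that concrete, here is a small, hypothetical Go sketch (not bbolt's actual code path) of the durability contract in question: once Sync, which maps to fsync/fdatasync, returns without error, the application assumes the written bytes will survive a hard reset.

```go
package main

import (
	"log"
	"os"
)

func main() {
	f, err := os.OpenFile("data.db", os.O_RDWR|os.O_CREATE, 0600)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Write some bytes at a fixed offset, as a database would when updating a page.
	if _, err := f.WriteAt([]byte("payload"), 0); err != nil {
		log.Fatal(err)
	}

	// Sync maps to fsync/fdatasync. Once it returns nil, the application
	// assumes the bytes are durable even if the machine is hard-reset
	// immediately afterwards; the v5.10 fast_commit bug broke that assumption.
	if err := f.Sync(); err != nil {
		log.Fatal(err)
	}
}
```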
Hello! I am not sure whether this is a bug in bolt, but I think you might find this interesting: after a hard reset of the node, the DB cannot be opened on ext4 with the fast_commit feature enabled (we use a 5.10 kernel). bbolt check reports errors, and the last 64 pages of the file seem to be filled with zeroes. When ext4 fast_commit is disabled, our tests pass and the DB can be opened. I have reproduced this on both 1.3.6 and 1.3.7.

Here is the output for bbolt pages: https://gist.github.com/fyfyrchik/4aafec23d9dfc487fb4a4cd7f5560730

Meta pages:
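For context, a small hypothetical helper (not part of bbolt) can confirm the observation about zero-filled trailing pages by reading the tail of the file, assuming the default 4 KiB page size and the database path as the first argument:

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"os"
)

func main() {
	const pageSize = 4096 // assumed default bbolt page size (the OS page size)
	const tailPages = 64

	f, err := os.Open(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		log.Fatal(err)
	}

	// Read the last 64 pages of the file.
	buf := make([]byte, pageSize*tailPages)
	if _, err := f.ReadAt(buf, fi.Size()-int64(len(buf))); err != nil {
		log.Fatal(err)
	}

	// Report which of the trailing pages are entirely zero.
	zero := make([]byte, pageSize)
	for i := 0; i < tailPages; i++ {
		page := buf[i*pageSize : (i+1)*pageSize]
		fmt.Printf("page %2d from the end is all zero: %v\n", tailPages-i, bytes.Equal(page, zero))
	}
}
```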