missing stratis pool after update to Fedora 39: thin_repair out of metadata space #3520
The failure to set up the pool is logged here:
These errors appear only in the most recent journalctl boot log (after upgrading to Fedora 39). Prior to this boot, the stratis pool was set up successfully:
Can the error for
There is very little info about this that I can find on the web:
And none of it is in the context of repairing a volume under Stratis management. Any information is appreciated.
@erickj Thanks for the information. Can you report the results of running
@erickj What is happening is that the
@erickj Again, thanks. Please install the stratisd-tools package. This should make the tool stratis-dumpmetadata available. Run this tool on
@mulkieran thanks for following up on this with me. Here is the output
Thanks again. Unfortunately, I think the first idea will not work because /dev/sdc is fully allocated. I ran a script to summarize the output, and it looks like the following:
Notice how "5860524032" recurs. It looks like everything is allocated from /dev/sdc to the cap device, and then everything available in the cap device has been redistributed to the thinpool. I would like to verify that, though. Please post the following:
That will also show the signature buffer of the device, which will contain the precise size as understood by Stratis. Also please post the output of
If it turns out that there is no space to expand and that the thin meta device is set up properly (which is likely), then I propose that we try, first, to simply restart and bring the pool up, and if that fails, to repair the thin meta device using the thin tools. But it would be best if we confirm what the situation is before we move further.
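To make the recurring figure concrete: it is a sector count, and converting it to bytes (assuming the 512-byte sectors that device-mapper tables use) shows it is the full size of a roughly 3 TB disk, consistent with /dev/sdc being fully allocated:

```shell
# 5860524032 sectors at 512 bytes per sector (the unit dmsetup tables use)
sectors=5860524032
bytes=$((sectors * 512))
echo "${bytes} bytes"                        # 3000588304384 bytes
echo "$((bytes / 1024 / 1024 / 1024)) GiB"   # 2794 GiB, i.e. a ~3 TB disk
```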
I've pasted the output below (edit: the below output was run for /dev/sdb, instead of sdc, copied from your comment above. At first I thought you asked for this for comparison, but upon rereading your comment I think sdb was a typo and I should have collected the data for sdc... I will rerun the command again when I get home from work later today)
You are correct, sorry about the typo.
is the table for the thin meta device, and it certainly looks correctly set up, so that is likely not what is causing thin_check to report an error.
Absolutely not a problem. I'm very happy for the support and to help debug. Here is the updated output for sdc:
@erickj would you also be able to provide dmsetup status?
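For readers following along: per the kernel's thin-provisioning documentation, the thin-pool line of dmsetup status reports used/total metadata blocks and used/total data blocks as the fifth and sixth fields. A sketch of extracting the metadata usage; the sample status line here is illustrative, not output from this pool:

```shell
# Illustrative thin-pool status line; real output comes from `dmsetup status`.
status='0 5860524032 thin-pool 1 523264/524288 100000/2861584 - rw no_discard_passdown queue_if_no_space - 1024'

# Field 5 is <used>/<total> metadata blocks; field 6 is the same for data blocks.
echo "$status" | awk '{split($5, m, "/"); printf "metadata used: %d%%\n", 100 * m[1] / m[2]}'
```

A metadata figure near 100% here would point at the "out of metadata space" failure mode.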
@drckeefe thank you for taking a look. The following is the output:
One thing that I realized is that the thin-pool device affected here won't be created, most likely because of running out of metadata space. But why did it run out? Can you search for kernel events (i.e.: Here is a quick example of the log event on a test system, where I created a pool on a small test device, and it resulted in a "low water mark" event:
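A sketch of the kind of search meant here, scanning the kernel log of recent boots; the message substrings are assumptions based on typical dm-thin log lines, not output from this system:

```shell
# Search kernel messages from the last few boots for dm-thin space events.
# The grep pattern is an assumption based on typical dm-thin messages,
# e.g. "reached low water mark for metadata device".
for boot in 0 -1 -2 -3; do
  journalctl -k -b "$boot" --no-pager 2>/dev/null \
    | grep -iE 'reached low water mark|out of (data|metadata) space|thin:' || true
done
```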
Thanks @bgurney-rh, I've copied the output of the last 4 boots (spanning back to the upgrade to F39); I see no logs similar to what you've called out.
@erickj Thanks for the further information. I've confirmed that the pool is fully allocated, so there isn't any space to allocate more room for the thin meta spare. Since we have confirmed that the meta device is set up properly, and we know the values used in the device mapper table, you should try to see if this is just a transient problem. To do that, you can try to stop the pool and restart it again.
Thanks @mulkieran. I've done as you've suggested here. Running
Starting the pool again fails with the same thin_repair error I've reported above:
Stratis report shows the pool still in the partially constructed state. Is there any documentation on how to re-set up the stratis pool using
All output copied below:
@erickj That it is stoppable is good. You could set up the pool with dmsetup commands, and that is generally what I want to try. But I don't want to be precipitate and ruin the thin-meta device by moving too fast. Please do the following one more time:
I know this seems redundant, but I want to make absolutely sure the device numbers are matching up properly in the output of dmsetup and of lsblk before any operations are performed.
No worries about redundancies; I'm still very appreciative of the help. It was unclear to me here if you wanted to see
Order of output:
output:
@erickj Thanks. That's perfect. Everything checks out, right up until the thin_repair failure. It will require many fewer steps for you if we provide you with a stratisd rpm on COPR containing a patched version of stratisd which exposes the thin_check output when you use stratisd to start the stopped pool. Once we've gathered that information, it may be possible for us to give you another patched version of stratisd which would allow you to set up this particular pool. Please let me know if you need instructions for doing a COPR installation or have any other concerns. We'll let you know when the COPR rpm is ready.
@erickj The rpm is ready: https://dashboard.packit.dev/results/copr-builds/1240598
Thanks @mulkieran - before I install the patched build from copr I just have a question about restoring the system state to the mainline package. Will
I went ahead and just installed the package; I assume reverting to the previous version won't be a problem. I've stopped/started the pool again and now see the following warnings in the journalctl logs:
If there are further steps to enable more detailed debug logging, or any other details are required, please let me know.
@erickj The COPR package is masquerading as a pre-release of
@erickj And, if you succeed, could you run the command again with XML format specified,
and store that information in a safe place for further processing?
I've added 2 files to the attached tarball:
@mulkieran thanks for this. For some background on the update path here, which may or may not prove useful: I had been previously running Fedora 37, about 1 month past its EOL. On the evening of Jan 1, I updated to Fedora 38 and then to Fedora 39. I quickly verified the update to F38 (this includes the journalctl logs above, which log the successful setup of the pool). Immediately after verifying the F38 upgrade, I upgraded straight away to F39. The dnf history of the 2 back-to-back system-upgrade events below shows the
I've double-checked the journalctl logs again, and have verified that the first instance of the log message indicating a failure with thin_check/thin_repair is on the first boot immediately after upgrading to F39. The boot logs on F38 and prior all indicate the pool is set up successfully.
@erickj Thanks, that is quite helpful.
@erickj We are continuing to work on this. My current plan is that we provide you with the ability to update your Stratis pool-level metadata to carve out space for a thin meta spare. If we can demonstrate that that will work, it will be easiest for you: the pool ought to come up, and if thin_check ever turns up a problem again, stratisd will be able to run thin_repair on that pool with an appropriately sized target device. But we need to put in some more work so that we know we are able to fix up the thin pool metadata even if that does not work.
@mulkieran thank you for the continued communication on this issue. It's very much appreciated.
@erickj Can you run thin_metadata_pack on the device? That will allow the structure of the metadata to be better inspected to try to understand the cause of the thin_check error.
And thank you for your continued patience.
Thank you for the follow-up @mulkieran: thinmeta.pack.tar.gz
Additionally, as an aside... I see something odd with the
Thanks for uploading the data. I filed a PR upstream to address the missing long options; thanks for mentioning that.
@erickj It seems like my initial plan of adjusting the Stratis metadata may not work. There is a step that you can proceed with now that is recommended and ought to gain some ground toward the goal of getting the pool set up. The step is to run thin_repair, taking the set-up metadata device as your input and choosing an external device not on the pool as your output. The external device should be at least as large as the thin meta device. We expect this step to succeed. When that step is done, you should run thin_check on the external device. You should also expect thin_check to succeed without errors on the repaired metadata. Please let me know how that goes.
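A hedged sketch of that step; the device paths are placeholders rather than names from this pool, and the size check mirrors the requirement that the external device be at least as large as the thin meta device:

```shell
# Placeholder device names -- substitute the activated thinmeta device
# (visible under /dev/mapper) and a spare disk that is not part of the pool.
META=/dev/mapper/stratis-thinmeta-example
SPARE=/dev/sdX

# The external device must be at least as large as the thin meta device.
meta_size=$(blockdev --getsize64 "$META" 2>/dev/null || echo 0)
spare_size=$(blockdev --getsize64 "$SPARE" 2>/dev/null || echo 0)

if [ "$meta_size" -gt 0 ] && [ "$spare_size" -ge "$meta_size" ]; then
    thin_repair -i "$META" -o "$SPARE"   # write repaired metadata to the spare
    thin_check "$SPARE"                  # expected to pass on the repaired copy
else
    echo "check device paths and sizes before repairing" >&2
fi
```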
Hi @erickj, I've checked the thinmeta you provided. Basically there's no problem with the metadata, except for some non-zero bytes in the unused region of the index block, which is unexpected, so the new thin_check v1.0.6 treats it as an error. It's unusual, although not harmful, so I would like to know how it happened:
It's still okay to adopt the original plan, since only the first half of the thinmeta is currently in use, so thin_repair will work in this case (I've confirmed locally). Furthermore, since we have the thinmeta backup in both packed and XML form, we can even manually rebuild the thinmeta onto the adjusted volumes, without needing to run thin_repair through stratisd.
@mingnus thanks for the further information!
@mingnus thanks for the questions, I hope the answers here provide some benefit:
The pool was created ~5 years ago on the same workstation as it currently runs. At the time the installed OS was Fedora ~29 (or whatever the mainline Fedora server version was in January 2019). The pool has been upgraded from Fedora ~29 to Fedora 39 through each major Fedora release over the last 5 years without issue until the F39 upgrade.
The
edit: the following dnf history shows the exact versions the initial pool creation was made with. If there are any other packages of interest, please let me know.
@mulkieran thank you very much for the follow-up; unfortunately, I'm not quite sure of the first step to take here with regard to this comment and the follow-up from @mingnus above (re: adopting the original plan). Would you mind further clarifying the action to take?
IGNORE THIS COMMENT
You have two choices:
Pros:
Cons:
Basically, this is a somewhat fragile solution, because your pool ends up not quite right, and refinements of the solution are also a bit fragile.
Pros:
Cons:
Let me know what you think, and feel free to ask me for any clarification or make a request. There are a bunch of potential refinements, but that's the basics.
@erickj I'm sorry, somehow I didn't properly tag you on the previous comment. But I came up with a better plan in the interim, so you should ignore the previous comment. At present the pool is partially set up and the thin meta device is available. The error that caused the thin_check failure is innocuous. So, you should be able to proceed directly to fix the metadata on the thin meta device. The procedure should be straightforward:
I doubt that the
There is a possibility that the thin_restore operation will not work because the thin meta device appears to be in use. If that is the case, then I will provide you with another version of stratisd that will just bring up the meta device. Let me know how the above steps go, first, though. If the above succeeds, then your pool will be relatively stable for a while. But there is still the problem of the too-small thin metadata spare, which could cause you a problem in the future. We would fix that by overwriting your Stratis pool-level metadata. However, working that out will take longer to write and test, so you may prefer to do the steps above right now. [1] https://github.com/jthornber/thin-provisioning-tools
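For illustration, the restore step might look roughly like this; all paths are placeholders, and the grep is only a quick sanity check that the file looks like a thin_dump XML document (whose root element is a superblock):

```shell
# Placeholders: the saved XML metadata dump and the activated thinmeta device.
XML=/root/thinmeta.xml
META=/dev/mapper/stratis-thinmeta-example

# A thin_dump XML document is rooted at a <superblock> element; check that
# before overwriting anything on the meta device.
if [ -f "$XML" ] && grep -q '<superblock' "$XML"; then
    thin_restore -i "$XML" -o "$META"   # rebuild metadata from the XML dump
    thin_check "$META"                  # verify the restored metadata
fi
```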
Thank you for the detailed information!
That's unexpected, since the kernel driver must have zero-initialized every block, yet only one pool is having issues.
The formal releases of
I think it could be some sort of in-memory data degradation, and since it's not harmful, maybe I'll allow those junk bytes in future releases.
Agreed that the above is the simplest and easiest way. Considering that the small metadata spare could cause inconveniences in the future, personally I would prefer overwriting the Stratis pool-level metadata to enlarge the metadata spare volume. The affected pool is just 250069680 sectors (if I'm not mistaken), so reducing the size of the thin metadata to 50% (1.4 GB) is still sufficient. Anyway, we're open to both options. Applying the
Without missing mappings, the
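As a quick sanity check on the pool-size figure quoted above (assuming 512-byte sectors):

```shell
# Pool size quoted above, in 512-byte sectors.
pool_sectors=250069680
pool_bytes=$((pool_sectors * 512))
echo "pool: ${pool_bytes} bytes"                      # 128035676160 bytes
echo "pool: $((pool_bytes / 1024 / 1024 / 1024)) GiB" # 119 GiB
```

So even a halved ~1.4 GB metadata area is being weighed against a roughly 119 GiB pool, which is why it is judged sufficient here.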
@erickj Could you let us know what your status is? |
@mulkieran apologies for the late reply, I was unavailable yesterday. Good news is that your suggestions above seem to have worked.
Remounting the filesystem has succeeded and the drive is accessible again. Thank you very very much for the help with this issue 🙏 Just a few remaining questions:
@mingnus re:
No, AFAIR no other tools have been used to manipulate the filesystem.
@erickj I'm pleased to hear that your pool is back up. Regarding question (2), I believe that it will be safe for you to reinstall the current version of stratisd. Regarding question (1), there are really three issues that affected you, in sequence. The first was that there were stray non-zero bytes in a particular region of the thin metadata on this pool only. The second was that the new version of
I expect we will close this issue in about a week, assuming your pool continues well. I've opened a new issue[1] for the remediation task. Thanks for your patience and clear communication around all of this.
@erickj @mulkieran yes, I'll remove the constraint from thin_check.
Previously, we assumed unused entries in the index block were all zero-initialized, leading to issues when loading a block with unexpected bytes and a valid checksum [1]. The updated approach loads index entries based on the actual size information from the superblock, and therefore improves compatibility. [1] stratis-storage/stratisd#3520
So I seem to have hit a very similar issue on stratis 3.7.3, specifically because I apparently exhausted all my free space (again!) from, uhh, trying to upgrade to F41 with only 7 GiB of space left in the pool. However, I couldn't run any repair tools manually because I also got the
The fix was to boot into the initramfs and do a
(This isn't entirely relevant to the original issue, but this is basically the only Google search result for
Hello, I've just updated to Fedora 39 and have noticed that 1 of 3 stratis pools has disappeared from stratis management. How can I recover this data?
Running stratis version:
The missing pool is the net.ejjohnson.home pool listed below in the stratis report result under partially_constructed_pools. Looking at the block devices, the missing pool should be on sdc: