PANIC: blkptr at ffff881f18601440 has invalid CHECKSUM 0 #6414
Comments
We've used the (truly terrifying) zfs_revert script (from https://gist.github.com/jshoward/5685757). Removing one transaction was not sufficient -- we still got the panic -- but after removing five transactions, we can mount again! The filesystem seems to be functional at that point -- we're able to create 100,000 files and write a GB of data -- and we've brought Lustre back up. We still have the original dd copies of the raw corrupted disks, so we can still try to learn what happened here and how to "properly" recover.
That is a scary script! I'm glad you were able to use it to roll back to a point where you could import the pool and mount the filesystems.

As for the issue itself, we saw something very similar on a test system while testing Lustre 2.10 prior to its official release. Based on the pool damage we observed, our best guess is that Lustre somehow overwrote some memory containing block pointers; valid checksums were then generated for the garbage block pointers, and the result was written to disk.

To my knowledge we haven't had any reports of this sort of thing happening when using ZFS without Lustre. Perhaps @adilger knows if any issues like this were recently resolved in Lustre?
When I went looking for this error in the code:
...what I found suggested that the checksum "type" is an invalid value, and that it therefore didn't even know which algorithm to use to validate the checksum? Or have I misinterpreted?
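For reference, the check being discussed lives in zfs_blkptr_verify() in module/zfs/zio.c; abridged from the 0.7-era source (approximate, trimmed for clarity), it looks like this:

```c
/* Abridged from zfs_blkptr_verify() in module/zfs/zio.c (0.7-era,
 * approximate). ZIO_CHECKSUM_FUNCTIONS is the number of known checksum
 * algorithms; a "type" of 0 (ZIO_CHECKSUM_INHERIT) is never valid in an
 * on-disk block pointer, so this block cannot be validated at all. */
if (BP_GET_CHECKSUM(bp) >= ZIO_CHECKSUM_FUNCTIONS ||
    BP_GET_CHECKSUM(bp) <= ZIO_CHECKSUM_ON) {
	zfs_panic_recover("blkptr at %p has invalid CHECKSUM %llu",
	    bp, (longlong_t)BP_GET_CHECKSUM(bp));
}
```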
You've got it exactly right. The other warnings you reported show variations on similar forms of damage (unknown compression types, impossible DVAs, etc.), all of which suggest the block pointer somehow got stomped on.
Then I don't know how to square that with your previous statement:
How can the block checksum be valid, if it doesn't know which algorithm to use to validate?
Exactly! ZFS inherently trusts that if the checksum for a block is good -- which it was -- the contents of that block will be sane. Somehow blocks were written to the pool with valid checksums and nonsense values. That should be impossible; the only plausible way it can happen that I'm aware of is memory corruption: the blocks containing block pointers get damaged in memory after being created but before they're checksummed and written.
I'm still confused about how it can know that the checksum is valid if it doesn't even know which checksum algorithm was used -- but I trust you. :)
The checksum for the damaged block, and its type, are stored in the parent. The block pointer it's complaining about is for one of its children. :)
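To make the parent/child relationship concrete, here is a simplified sketch of a block pointer; the real definition is blkptr_t in include/sys/spa.h, and this omits several fields:

```c
/* Simplified sketch of a ZFS block pointer (the real blkptr_t lives in
 * include/sys/spa.h and has more fields). A parent block embeds one of
 * these per child, so the child's checksum *and* the algorithm used to
 * compute it are both covered by the parent's own checksum, one level up. */
typedef struct blkptr_sketch {
	dva_t		blk_dva[3];	/* on-disk locations of the child's copies */
	uint64_t	blk_prop;	/* packed child properties: checksum type,
					 * compression type, logical/physical size */
	uint64_t	blk_birth;	/* txg in which the child was written */
	zio_cksum_t	blk_cksum;	/* 256-bit checksum of the child's contents */
} blkptr_sketch_t;
```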
Ah ha! You've drawn the veil from my eyes. I'm now very slightly less ignorant. Thank you!
Thanks for the insight. How do you want the pool? It's currently 3 x 10T raw disk images (dd of the raw device). I suspect only ~2.1T of each disk is really in use.
I've just read this again and wondered why, when this corruption is identified, the previous txg isn't used -- continually dropping back until a sane txg is found. Surely almost any issue should either cause the FS not to mount, allowing the user to take alternative action, or drop back (as requested by the user) to a sane block?
My guess would be that no one anticipated that this could happen, and the code to handle it isn't present.
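What's being suggested above would look conceptually like the sketch below. To be clear, this loop is hypothetical and nothing like it exists in the 0.7 import path; the manual equivalent is picking an older uberblock with zdb -ul and importing with zpool import -T <txg>:

```c
/* Hypothetical sketch of the automatic txg fallback suggested above.
 * spa_try_import_at_txg() is an invented name for illustration: attempt
 * the import as of an older uberblock, failing cleanly on a damaged
 * block pointer instead of panicking. */
static int
import_with_rewind(spa_t *spa, uint64_t newest_txg, int max_rewind)
{
	for (int i = 0; i < max_rewind; i++) {
		if (spa_try_import_at_txg(spa, newest_txg - i) == 0)
			return (0);	/* found a sane txg */
		/* damaged blkptr or other failure: drop back one txg */
	}
	return (SET_ERROR(EIO));	/* nothing sane within the window */
}
```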
@phik we think we've identified the root cause here and opened #6439 with a proposed fix for review. My suggestion for the moment is to set
Yes, this is exactly how the
Innnteresting. I definitely don't understand the internals well enough to comment on the patch, but your description would certainly seem to fit the symptom. Thanks very much for the prompt followup! I'll discuss your suggestion with the team ASAP.
FWIW we just got this panic again on a 0.7.0-rc3 system:

[375528.898326] PANIC: blkptr at ffff881785cfc440 has invalid CHECKSUM 0

The machine was still up and running... hopefully it comes back painlessly after reboot...
Rebooted just fine.
@sdm900 the exact issue reported here was introduced in 0.7.0-rc5 and will be fixed in 0.7.1.
That concerns me a little.
When performing concurrent object allocations using the new multi-threaded allocator and large dnodes, it's possible to allocate overlapping large dnodes.

This case should have been handled by detecting an error returned by dnode_hold_impl(). But that logic only checked that the returned dnp was not NULL, and the dnp variable was not reset to NULL when retrying. Resolve this issue by properly checking the return value of dnode_hold_impl().

Additionally, it was possible that dnode_hold_impl() would misreport a dnode as free when it was in fact in use. This could occur for two reasons:

* The per-slot zrl_lock must be held over the entire critical section, which includes the alloc/free until the new dnode is assigned to children_dnodes. Additionally, all of the zrl_locks in the range must be held to protect moving dnodes.

* The dn->dn_ot_type cannot be solely relied upon to check the type. When allocating a new dnode its type will be DMU_OT_NONE after dnode_create(); only later, when dnode_allocate() is called, will it transition to the new type. This means there's a window during allocation when it can be mistaken for a free dnode.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6414
Closes #6439
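A sketch of the retry bug described in the second paragraph (simplified, not the literal ZFS source; the real caller is dmu_object_alloc_dnsize() in module/zfs/dmu_object.c):

```c
/* Simplified sketch of the bug pattern described above; not the literal
 * ZFS source. dnode_hold_impl() returns an error code and fills in *dn
 * only on success. */
dnode_t *dn = NULL;
int err;

/* Buggy: only the pointer was tested, and dn was not reset to NULL
 * before retrying, so a stale pointer from a previous iteration could
 * make a failed hold look like a successful one. */
err = dnode_hold_impl(os, object, DNODE_MUST_BE_FREE, slots, FTAG, &dn);
if (dn != NULL) {
	/* ...proceeds as if the dnode slots were really free... */
}

/* Fixed: trust the return value (and reset dn on each retry). */
dn = NULL;
err = dnode_hold_impl(os, object, DNODE_MUST_BE_FREE, slots, FTAG, &dn);
if (err == 0) {
	/* the slot range is genuinely free and now held */
}
```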
This issue was resolved in the v0.7.1 release tagged on August 8th.
I have a few lingering questions, if I may...
It reproduced the issue 100% of the time in about 5 seconds. After applying the fix I left it running in a loop for about 8 hours without issue. We were able to reproduce the issue on one of our Lustre test systems too, although it took longer; since applying the fix a week ago we haven't seen it.
I agree, that is very troubling. The PANIC only indicates that a damaged block pointer was encountered; it doesn't reveal how it ended up damaged. The issue fixed here was caused by two dnodes stomping on each other and is understood. Unfortunately, the root cause of @sdm900's PANIC must be different.
Yes. We've talked about doing this development work for quite a while but I don't think we've ever opened an issue for it. It's a fairly large development effort.
Yes, it would be great to open an issue for that too. We're going to need to investigate where exactly
I'd love to scrub the file system (and I'll set it going)... but I don't think it will ever finish. It is on a production file system that is heavily used, and we will no doubt lose power before it finishes :(
As far as I can tell, zfs scrub is useless...
and this file system has gone quiet over the weekend. I expect the rate to drop to almost 0 when people are in the office.
@sdm900 for filesystems with lots of small files or heavy fragmentation I agree it can be unreasonably slow. There are patches under development in #6256 to improve the situation. But it sounds like you may want to just cancel the scrub (zpool scrub -s <pool>) or pause it for now if it's not going to complete in a reasonable amount of time.
Refactor dmu_object_alloc_dnsize() and dnode_hold_impl() to simplify the code, fix errors introduced by commit dbeb879 (PR #6117) interacting badly with large dnodes, and improve performance.

* When allocating a new dnode in dmu_object_alloc_dnsize(), update the percpu object ID for the core's metadnode chunk immediately. This eliminates most lock contention when taking the hold and creating the dnode.

* Correct detection of the chunk boundary to work properly with large dnodes.

* Separate the dmu_hold_impl() code for the FREE case from the code for the ALLOCATED case to make it easier to read.

* Fully populate the dnode handle array immediately after reading a block of the metadnode from disk. Subsequently the dnode handle array provides enough information to determine which dnode slots are in use and which are free.

* Add several kstats to allow the behavior of the code to be examined.

* Verify dnode packing in large_dnode_008_pos.ksh. Since the test is purely creates, it should leave very few holes in the metadnode.

* Add test large_dnode_009_pos.ksh, which performs concurrent creates and deletes, to complement the existing test which does only creates.

With the above fixes, there is very little contention in a test of about 200,000 racing dnode allocations produced by tests 'large_dnode_008_pos' and 'large_dnode_009_pos'.

name                          type data
dnode_hold_dbuf_hold          4    0
dnode_hold_dbuf_read          4    0
dnode_hold_alloc_hits         4    3804690
dnode_hold_alloc_misses       4    216
dnode_hold_alloc_interior     4    3
dnode_hold_alloc_lock_retry   4    0
dnode_hold_alloc_lock_misses  4    0
dnode_hold_alloc_type_none    4    0
dnode_hold_free_hits          4    203105
dnode_hold_free_misses        4    4
dnode_hold_free_lock_misses   4    0
dnode_hold_free_lock_retry    4    0
dnode_hold_free_overflow      4    0
dnode_hold_free_refcount      4    57
dnode_hold_free_txg           4    0
dnode_allocate                4    203154
dnode_reallocate              4    0
dnode_buf_evict               4    23918
dnode_alloc_next_chunk        4    4887
dnode_alloc_race              4    0
dnode_alloc_next_block        4    18

The performance is slightly improved for concurrent creates with 16+ threads, and unchanged for low thread counts.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5396
Closes #6522
Closes #6414
Closes #6564
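The first bullet is the main performance change; conceptually it looks something like this (hypothetical names, not the literal patch):

```c
/* Conceptual sketch of the per-CPU object-ID cursor described in the
 * first bullet above; the names here are invented for illustration.
 * Each CPU hands out object IDs from its own chunk of the metadnode,
 * so concurrent allocators rarely touch the same dnode block. */
#define NCPUS 64			/* illustration only */

static uint64_t percpu_cursor[NCPUS];	/* next object ID per CPU */

static uint64_t
next_object_id(int cpu)
{
	/* Advance the cursor immediately, before taking the dnode hold,
	 * so two threads on the same CPU never race for one object ID. */
	return (percpu_cursor[cpu]++);
}
```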
@behlendorf Sorry to ask here, but... is there currently any tool to fix the

I'm not sure, but the error could have been caused by something related to the kernel bug reported at #7723 (https://bugzilla.redhat.com/show_bug.cgi?id=1598462).

Sadly, the host panicked a few seconds after loading the zfs driver, so I could never even try to run
Describe the problem you're observing
This particular ZFS filesystem is the backing store for a Lustre MDS. After a Lustre LBUG:
...and subsequent reboot, any attempt to import the pool now gives this panic:
We've tried every suggested workaround we could find in other issues and mailing list threads, to no avail, including:

* swapping the disks into a different chassis (because hope springs eternal in the human breast, I suppose) -- panic
* importing the pool with just a single disk -- panic
* import -FXn and just import -FX -- panic
* the same but with zfs_recover=1 -- panic
* listing the uberblocks with zdb -ul, getting the txg for the last uberblock (by date), and trying a zpool import -T txg -o readonly. This at least did something, which was to read the disks at 2 MB/s. Unfortunately at that rate we were looking at 23 days to read the whole thing, so we gave up after 4 hours.

At one point -- I'm not sure if it was import -FX or zfs_recover=1 -- something changed the nature of the failure from an instant panic at the first "invalid CHECKSUM" error into a series of warnings followed by an assertion. We get the same set of warnings from zdb -AAA -F -e.

I guess there are two paths forward from here:

1. A fairly benign kernel panic in a largely-unrelated component somehow leads to an unusable file system? This is terrifying!
2. Having reached this stage, is there anything at all that can be done to recover from it? It seems like the data must all be there -- after all, the system went from fully operational to unusable in an instant -- we just can't get to it.
I'd be eternally grateful for any advice!