VERIFY3(0 == dmu_object_claim_dnsize()) failed (0 == 28) #7151
One more thing, how should I proceed in order to recover my system?
This looks like it could potentially be a large dnode corner case involving ZIL replay. The VERIFY implies it's attempting to write a large dnode into a slot which is too small for it. If you're OK with losing the last 5 or so seconds of data written to the pool, you can skip the log replay step when mounting the filesystem. This should allow you to mount it, at which point I'd suggest scrubbing the pool. One way to do this is sketched below.
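A minimal sketch of what skipping the replay could look like, assuming the pool is named `tank` and the affected filesystem is `tank/home` (adjust the names to your layout); `zil_replay_disable` is the module parameter that skips ZIL replay at mount time:

```sh
# WARNING: this discards any un-replayed ZIL records (the last few seconds
# of synchronous writes) instead of replaying them when the dataset mounts.
echo 1 > /sys/module/zfs/parameters/zil_replay_disable

zpool import tank          # hypothetical pool name
zfs mount tank/home        # hypothetical dataset; should now mount without the VERIFY panic

# Turn replay back on once the filesystem is up, then verify the pool.
echo 0 > /sys/module/zfs/parameters/zil_replay_disable
zpool scrub tank
```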
This is filesystem specific, so it may only impact the one filesystem. It's not pool wide.
As long as you're running 0.7.2 or newer on the release branch you shouldn't have problems. On the master branch there's obviously more churn and thus risk, but all the issues we're aware of have been addressed. If you absolutely want to play it safe set the dnodesize back to legacy.
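For reference, reverting to legacy dnodes is just a property change; a sketch with hypothetical pool/dataset names. Note it only affects dnodes allocated from that point on; existing large dnodes remain until the files are rewritten.

```sh
zfs get -r dnodesize tank              # see which datasets have large dnodes enabled
zfs set dnodesize=legacy tank/home     # newly created files go back to legacy 512-byte dnodes
```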
If you've enabled large dnodes then they will be preserved when sending to a new pool.
If you still have the pool available and can consistently get it to crash like this, could I ask you to run the following command?
The file system now mounts properly after a reboot. The only change was that a two-week snapshot got destroyed in the meanwhile, though I'm not sure if that's what fixed it. EDIT: Hmm, it's almost as if it got rolled back to the previous snapshot.
As mentioned in #7147 (comment), I no longer have a way to reproduce this issue. I'll close it, but feel free to reopen if you want to investigate the error further.
The bug is still there. I'm seeing the same failure on master. Actually, I'm using master + the TRIM branch. Can this be TRIM related?
I'm not using TRIM. Did you enable encryption or large dnodes?
I'm not using encryption, but large_dnode is active
I think I've seen #7147 happen too
@yshui Do you have any steps to reproduce the bug(s)? And what commit of the code are you running?
@tcaputi I'm on e086e71 (with the TRIM branch merged on top of that). And I have no idea how to reproduce this bug. Before today, I would get occasional kernel general protection faults, same as the one described in #7147. I would just force a reboot, and everything would be fine. But today, after I rebooted from the same kernel fault, I got the VERIFY failure described in this issue.
Hmmmm. Can you tell me a little about your workload? What kinds of files do you have in your zpool? Are you doing a lot of sends / receives?
@tcaputi It's just a normal desktop workload. No send/receive, no snapshot.
Are you using zfs as your root filesystem?
Yes
Ok. I will look into reproducing this on Monday.
Does anyone have any more information on this bug? I have been unable to reproduce it....
I suppose it's still an issue, but it didn't happen again for me. Not that I miss it, unmountable pools aren't fun.
I haven't seen it again yet since I set |
I might have also done that, can't check right now.
No, I'm using
We've had two instances of the same issue (and I think #8910 is a duplicate as well?). Both of our cases had ZFS 0.8.x, which we originally suspected as the culprit, but that doesn't seem to be the cause given other folks' experiences (i.e., the versions in this issue). We're in the same scenario: we periodically have to hard-reset a given node due to unrelated problems, and in two cases fairly recently we had to disable the ZIL replay to get the pool back online after the reboot. Same issue, a panic because dmu_object_claim_dnsize returns ENOSPC. One node was 0.8.1 and the other was 0.8.0, on Ubuntu Bionic from jonathonf's PPA. Both are kernel 4.15, but different revisions, and we are using legacy dnodesize, so none of the prior advice seems to apply. We have fairly heavy snapshot use, and deduplication is enabled.

We managed to get a raw snapshot of the entire underlying disk before the fix was applied to the production machine, so I have a machine with the broken pool where I can test things with impunity, and through some nasty hacking I've determined that it's failing this particular check: https://github.com/zfsonlinux/zfs/blob/3b03ff22761da0f5fad9a781025facfc6e555522/module/zfs/dnode.c#L1468

That's unfortunately as far as I've gotten. I can reproduce the panic at will by simply mounting the broken dataset, but I don't know how to reproduce the situation that led to the broken dataset. I've tried checking things like
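For anyone else poking at a saved copy of a broken pool like this: a read-only import is a convenient way to inspect it, since (as far as I know) ZIL replay is skipped for read-only imports, so the panic isn't triggered. The pool and dataset names below are hypothetical:

```sh
# Import the cloned disk read-only; nothing is written and the ZIL is left intact.
zpool import -d /dev/disk/by-id -o readonly=on tank

zfs get -r dnodesize,mounted tank      # survey the datasets
zfs mount -o ro tank/broken-fs         # hypothetical name of the broken dataset
```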
I am on vacation this week. Do you mind if I ask you more about the issue when I return next Wednesday?
@tcaputi Not at all!
From the log of @fwaggle
The relevant entries in the ZIL are
So this has the same pattern as in my repro. Edit: I'm not sure why the timestamp is reversed between trace and panic; maybe the trace log is handled asynchronously?
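In case it helps anyone follow this analysis, zdb can dump the intent-log records for a dataset; a sketch with hypothetical names (`-i` prints ZIL entries, extra `-v` adds detail, and `-e` is only needed while the pool is exported):

```sh
# Dump the dataset's ZIL records, e.g. the TX_REMOVE / TX_CREATE entries discussed above.
zdb -e -ivv tank/broken-fs
```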
#9061 is unrelated to this issue. The repro I used was to test another issue but happened to trigger this issue under certain conditions.
Ahh, understood! Thanks!
@tuxoko I was going to make a post when I had done more confirmation, but this diagnosis matches what I was finding.
https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zfs_replay.c#L476 @behlendorf Is that change really needed? Because, in contrast to the commit message, we originally don't return ENOENT until the txg is synced.
Actually, it seems more complicated than that.
Make sure dnode_hold_impl returns EEXIST if the object hasn't been completely freed. Otherwise, if there's a TX_REMOVE followed by a TX_CREATE on the same object id, ZIL replay won't wait for the remove txg to complete and would panic when doing the create.

Closes openzfs#7151
Closes openzfs#8910
Closes openzfs#9123

Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
If TX_REMOVE is followed by TX_CREATE on the same object id, we need to make sure the object removal is completely finished before creation. The current implementation relies on dnode_hold_impl with DNODE_MUST_BE_ALLOCATED returning ENOENT. While this check seems to work fine before, in current version it does not guarantee the object removal is completed. We fix this by checking if DNODE_MUST_BE_FREE returns successful instead. Also add test and remove dead code in dnode_hold_impl.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7151
Closes #8910
Closes #9123
Closes #9145
I encountered this specific issue on Ubuntu 17.10, which has ZFS version 0.8.1. My pool had 22TB of data and would cause a ZFS panic upon import (the original cause was a hardware power failure). Because I'm search-skill-deficient, I didn't find the fix referenced here for a few hours. But as mentioned, the official fix for this issue is in ZFS releases 0.8.2 and 0.8.3; I updated to 0.8.3. You can also use the quick fix posted above by behlendorf if you don't need the last few transactions for the pool you are trying to import.
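One quick sanity check before and after upgrading is to confirm which version the running kernel module actually is (the fix landed in 0.8.2); the commands below assume ZFS 0.8.0 or newer:

```sh
zfs version                    # prints both the userland and kernel-module versions
cat /sys/module/zfs/version    # kernel module version only
```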
System information
Describe the problem you're observing
I have an unmountable filesystem that yields:
I had large_dnode enabled for a while, so this is probably related to the changes in #6864, which fixed #6366 for me. Referencing #7059 (comment) and #7147.

28 is ENOSPC, I think?

Describe how to reproduce the problem
Include any warning/errors/backtraces from the system logs