VERIFY3(0 == dmu_object_claim_dnsize()) failed (0 == 28) #7151

Closed
lnicola opened this issue Feb 9, 2018 · 43 comments · Fixed by #9145
Labels
Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments

lnicola commented Feb 9, 2018

System information

Type                  Version/Name
Distribution Name     Arch Linux
Distribution Version
Linux Kernel          4.15.1-2-ARCH
Architecture          x86-64
ZFS Version           0.7.0-283_g6d82b7969
SPL Version           0.7.0-24_g23602fd

Describe the problem you're observing

I have an unmountable filesystem that yields:

VERIFY3(0 == dmu_object_claim_dnsize(zfsvfs->z_os, obj, DMU_OT_PLAIN_FILE_CONTENTS, 0, obj_type, bonuslen, dnodesize, tx)) failed (0 == 28)
PANIC at zfs_znode.c:757:zfs_mknode()

I had large_dnode enabled for a while, so this is probably related to the changes in #6864, which fixed #6366 for me.

Referencing #7059 (comment) and #7147.

28 is ENOSPC, I think?

Describe how to reproduce the problem

Include any warning/errors/backtraces from the system logs

@lnicola changed the title from VERIFY3(0 == dmu_object_claim_dnsize()) to VERIFY3(0 == dmu_object_claim_dnsize()) failed (0 == 28) on Feb 9, 2018
lnicola commented Feb 9, 2018

One more thing, how should I proceed in order to recover my system?

  • does this affect the whole pool, or just the filesystem in question?
  • I used large dnodes on the other filesystems, should I also be worried about them?
  • will zfs send (or zfs send -Le) to a new dataset propagate the underlying issues?

@behlendorf replied:

One more thing, how should I proceed in order to recover my system?

This looks like it could potentially be a large dnode corner case involving ZIL replay. The VERIFY implies it's attempting to write a large dnode into a slot which is too small for it. If you're OK with losing the last 5 or so seconds of data written to the pool, you can skip the log replay step when mounting the filesystem. This should allow you to mount it, at which point I'd suggest scrubbing the pool. You'll need to do the following (a command-line sketch follows the list):

  1. Set zil_replay_disable=1 module option.
  2. zfs mount <dataset>
  3. Clear zil_replay_disable=0 module option.
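For example, assuming the affected dataset is tank/data (substitute your own pool and dataset names); zil_replay_disable can be toggled at runtime through /sys/module/zfs/parameters:

    # 1. temporarily disable ZIL replay via the module parameter
    echo 1 > /sys/module/zfs/parameters/zil_replay_disable

    # 2. mount the previously unmountable dataset; the pending log records are discarded
    zfs mount tank/data

    # 3. re-enable ZIL replay for normal operation
    echo 0 > /sys/module/zfs/parameters/zil_replay_disable

    # then scrub and check the pool
    zpool scrub tank
    zpool status -v tank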

does this affect the whole pool, or just the filesystem in question?

This is filesystem specific, so it may only impact the one filesystem. It's not pool wide.

I used large dnodes on the other filesystems, should I also be worried about them?

As long as you're running 0.7.2 or newer on the release branch you shouldn't have problems. On the master branch there's obviously more churn and thus risk, but all the issues we're aware of have been addressed. If you absolutely want to play it safe, set dnodesize back to legacy.
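For example (the dataset name is a placeholder; this only affects newly created files, existing large dnodes are not rewritten):

    zfs set dnodesize=legacy tank/data
    zfs get dnodesize tank/data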

will zfs send (or zfs send -Le) to a new dataset propagate the underlying issues?

If you've enabled large dnodes then they will be preserved when sending to a new pool.

tcaputi commented Feb 10, 2018

If you still have the pool available and can consistently get it to crash like this, could I ask you to run the following command?

    zdb -e -bccsiv <pool guid>

You can get the pool guid by running zpool import without specifying a dataset to simply list all importable pools. The guid is listed there as "id".
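For example (the guid value below is just a placeholder):

    # list importable pools without importing anything; the pool guid is shown as "id"
    zpool import

    # run zdb against the exported pool (-e) by guid, with the flags requested above
    zdb -e -bccsiv 1234567890123456789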

lnicola commented Feb 10, 2018

The file system now mounts properly after a reboot. The only change is that a two-week-old snapshot was destroyed in the meantime, though I'm not sure whether that's what fixed it.

EDIT: Hmm, it's almost as if it got rolled back to the previous snapshot.

lnicola commented Feb 28, 2018

As mentioned in #7147 (comment), I no longer have a way to reproduce this issue. I'll close it, but feel free to reopen if you want to investigate the error further.

@lnicola closed this as completed Feb 28, 2018
yshui commented Mar 3, 2018

The bug is still there. I'm seeing the same failure on master.

Actually, I'm using master plus the TRIM branch. Can this be TRIM-related?

@lnicola reopened this Mar 3, 2018
lnicola commented Mar 3, 2018

I'm not using TRIM. Did you enable encryption or large dnodes?

yshui commented Mar 3, 2018

I'm not using encryption, but large_dnode is active.

yshui commented Mar 3, 2018

I think I've seen #7147 happen too.

tcaputi commented Mar 3, 2018

@yshui Do you have any steps to reproduce the bug(s)? And what commit of the code are you running?

yshui commented Mar 3, 2018

@tcaputi I'm on e086e71 (and with TRIM branch merged on top of that).

And I have no idea how to reproduce this bug. Before today, I would get occasional kernel general protection faults, same as the one described in #7147. I would just force a reboot, and everything would be fine.

But today, after rebooting from the same kernel fault, I got the VERIFY failure described in this issue.

tcaputi commented Mar 3, 2018

Hmmmm. Can you tell me a little about your workload? What kinds of files do you have in your zpool? Are you doing a lot of sends / receives?

yshui commented Mar 4, 2018

@tcaputi It's just a normal desktop workload. No send/receive, no snapshot.

tcaputi commented Mar 4, 2018

Are you using zfs as your root filesystem?

yshui commented Mar 4, 2018

Yes

tcaputi commented Mar 4, 2018

Ok. I will look into reproducing this on Monday.

tcaputi commented Apr 17, 2018

Does anyone have any more information on this bug? I have been unable to reproduce it....

lnicola commented Apr 17, 2018

I suppose it's still an issue, but it didn't happen again for me. Not that I miss it; unmountable pools aren't fun.

yshui commented Apr 18, 2018

I haven't seen it again since I set dnodesize to legacy.

lnicola commented Apr 18, 2018

I might have also done that, can't check right now.

lnicola commented Apr 23, 2018

No, I'm using auto.

fwaggle commented Jul 17, 2019

We've had two instances of the same issue (and I think #8910 is a duplicate as well?). Both of our cases were on ZFS 0.8.x, which we originally suspected as the culprit, but that doesn't seem to be the cause given other folks' experiences (i.e. the versions reported in this issue). We're in the same scenario: we periodically have to hard-reset a given node due to unrelated problems, and in two cases fairly recently we had to disable ZIL replay to get the pool back online after the reboot. Same issue, a panic because dmu_object_claim_dnsize returns ENOSPC.

One node was 0.8.1 and the other was 0.8.0, on Ubuntu Bionic from jonathonf's PPA. Both are on kernel 4.15, but different revisions, and we are using legacy dnodesize, so none of the prior advice seems to apply. We have fairly heavy snapshot use, and deduplication is enabled.

We managed to get a raw snapshot of the entire underlying disk before the fix was applied to the production machine, so I have a machine with the broken pool where I can test things with impunity, and through some nasty hacking I've determined that it's failing this particular check: https://github.com/zfsonlinux/zfs/blob/3b03ff22761da0f5fad9a781025facfc6e555522/module/zfs/dnode.c#L1468

That's unfortunately as far as I've gotten.

I can reproduce the panic at will by simply mounting the broken dataset, but I don't know how to reproduce the situation that led to the broken dataset. I've tried checking things like zdb -i, and nothing looks out of place; there are no errors or anything. The duplicated machine is no longer production, so I can keep it around and broken for a couple of weeks if anyone has anything I can try that would help narrow down the cause (including patches to ZFS, as I've built the module from source on this machine for testing). I'm also available on freenode with the same username if that helps speed things up.
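For anyone else poking at a pool in the same state, this is roughly the kind of check I mean (pool/dataset names below are placeholders):

    # dump the intent log (ZIL) records of the unmountable dataset from the exported pool
    zdb -e -ivv tank/broken-fs

    # after a mount attempt, the panic and backtrace end up in the kernel log
    dmesg | grep -B2 -A20 dmu_object_claim_dnsize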

tcaputi commented Jul 17, 2019

I am on vacation this week. Do you mind if I ask you more about the issue when I return next Wednesday?

fwaggle commented Jul 17, 2019

@tcaputi Not at all!

@behlendorf added the Type: Defect (Incorrect behavior, e.g. crash, hang) label Jul 17, 2019
tuxoko commented Aug 8, 2019

From @fwaggle's log, this is the last entry in the trace:

txtype=1 len=120 txg=664335 seq=3089679

The relevant entries in the ZIL are:

                TX_CREATE           len    120, txg 664333, seq 3089560
                        tmpf8m9e4m
                        Sun Jul 14 00:39:01 2019
                        doid 26, foid 187988, slots 1, mode 100600
                        uid 1000000, gid 1000000, gen 664333, rdev 0x0

                TX_REMOVE           len     56, txg 664333, seq 3089561
                        doid 26, name tmpf8m9e4m

                TX_CREATE           len    120, txg 664335, seq 3089679
                        registry.new
                        Sun Jul 14 00:45:50 2019
                        doid 40562, foid 187988, slots 1, mode 100600
                        uid 1000000, gid 1000000, gen 664335, rdev 0x0

So this has the same pattern as in my repro: a remove followed by a create on the same object id. If, during replay, they happen to land in the same txg, it will trigger the error.

Edit: I'm not sure why the timestamps are reversed between the trace and the panic; maybe the trace log is handled asynchronously?
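For illustration only, the kind of userspace sequence that can leave this TX_CREATE / TX_REMOVE / TX_CREATE pattern in the ZIL looks roughly like the sketch below. It does not deterministically reproduce the panic, since object id reuse and txg boundaries can't be controlled from userspace; the paths are placeholders.

    # create and fsync a temporary file: its TX_CREATE is committed to the ZIL
    dd if=/dev/urandom of=/tank/fs/tmpfile bs=4k count=1 conv=fsync

    # remove it: a TX_REMOVE record is generated for the same object
    rm /tank/fs/tmpfile

    # create and fsync another file, which may reuse the freed object id
    dd if=/dev/urandom of=/tank/fs/registry.new bs=4k count=1 conv=fsync

    # a hard reset before the open txg syncs (destructive!) forces ZIL replay
    # on the next mount of the dataset
    echo b > /proc/sysrq-trigger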

fwaggle commented Aug 8, 2019

Gotcha - is #9061 supposed to fix that before it happens, or make the ZIL playback able to cope with it after the fact? Because I tried applying #9061 manually and I still got a panic trying to mount the affected dataset.

tuxoko commented Aug 8, 2019

#9061 is irrelevant to this issue. The repro I used was to test another issue, but it happened to trigger this issue under certain conditions.

fwaggle commented Aug 8, 2019

Ahh, understood! Thanks!

tcaputi commented Aug 8, 2019

@tuxoko I was going to make a post when I had done more confirmation, but this diagnosis matches what I was finding.

tuxoko commented Aug 8, 2019

https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zfs_replay.c#L476
So zfs_replay_create does check whether the object id is empty or not. However, this check was broken after 78e2139.

@behlendorf Is that change really needed? Because, in contrast to the commit message, originally we didn't return ENOENT until the txg was synced.

tuxoko commented Aug 8, 2019

Actually, it seems more complicated than that.
This block also changed in a way that would cause this issue as well:
https://github.com/zfsonlinux/zfs/blob/master/module/zfs/dnode.c#L1521

tuxoko pushed a commit to tuxoko/zfs that referenced this issue Aug 9, 2019
Make sure dnode_hold_impl return EEXIST if the object hasn't been
completely freed. Otherwise, if there's TX_REMOVE followed by TX_CREATE
on the same object id, zil replay won't wait for the remove txg to
complete and would panic when doing create.

Closes openzfs#7151
Closes openzfs#8910
Closes openzfs#9123
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
tuxoko pushed a commit to tuxoko/zfs that referenced this issue Aug 9, 2019
If TX_REMOVE is followed by TX_CREATE on the same object id, we need to
make sure the object removal is completely finished before creation. The
current implementation relies on dnode_hold_impl with
DNODE_MUST_BE_ALLOCATED returning ENOENT. While this check seems to work
fine before, in current version it does not guarantee the object removal
is completed. Instead, we check if DNODE_MUST_BE_FREE returns
successful.

Closes openzfs#7151
Closes openzfs#8910
Closes openzfs#9123
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
behlendorf pushed a commit that referenced this issue Aug 28, 2019
If TX_REMOVE is followed by TX_CREATE on the same object id, we need to
make sure the object removal is completely finished before creation. The
current implementation relies on dnode_hold_impl with
DNODE_MUST_BE_ALLOCATED returning ENOENT. While this check seems to work
fine before, in current version it does not guarantee the object removal
is completed.

We fix this by checking if DNODE_MUST_BE_FREE returns successful
instead. Also add test and remove dead code in dnode_hold_impl.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7151
Closes #8910
Closes #9123
Closes #9145
Hypocritus commented Jun 7, 2020

I encountered this specific issue on Ubuntu 17.10 with ZFS version 0.8.1. My pool had 22 TB of data and would cause a ZFS panic upon import (the original cause was a hardware power failure). Because I'm search-skill-deficient, I didn't find the fix referenced here for a few hours.

But as mentioned, the official fix for this issue is found in ZFS releases 0.8.2 and 0.8.3. I updated to 0.8.3. You can also use the quick fix posted above by behlendorf if you don't need the last few transactions for the pool you are trying to import.
