PANIC at zap.c:441:zap_create_leaf #16157
Perhaps related to #12366.
@alex-stetsenko I wonder if this issue could be related to #15888. Before zap_trunc() rolls back zap_freeblk, is there anything to free the zap_leaf_t structures associated with deleted leaves before the dbuf eviction code gets to them? zap_shrink() does dmu_free_range(), but I wonder whether that frees the associated user data buffers. Maybe we should do it explicitly.
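For context, here is a minimal sketch of the dbuf user-data pattern being discussed, reconstructed from the zap_create_leaf()/zap_leaf_evict_sync() code quoted in the diffs further down. It is simplified for illustration, not a patch; the zap_leaf_attach_userdata() helper is a hypothetical name, not a function in zap.c:

```c
/*
 * Sketch of the existing pattern: each in-memory zap_leaf_t is
 * registered as the userdata of the dbuf holding its on-disk block,
 * and is freed by the eviction callback when the dbuf is evicted.
 */
static void
zap_leaf_evict_sync(void *dbu)
{
	zap_leaf_t *l = dbu;

	rw_destroy(&l->l_rwlock);
	kmem_free(l, sizeof (zap_leaf_t));
}

/* Hypothetical helper name, for illustration only. */
static void
zap_leaf_attach_userdata(zap_leaf_t *l)
{
	/* Register the leaf and its eviction callback with the dbuf. */
	dmu_buf_init_user(&l->l_dbu, zap_leaf_evict_sync, NULL, &l->l_dbuf);

	/*
	 * dmu_buf_set_user() returns any userdata already attached.  The
	 * pre-fix zap_create_leaf() VERIFYs that this is NULL; if a freed
	 * leaf's dbuf is still sitting in the dbuf cache when its block is
	 * reallocated, the old zap_leaf_t is still attached and the VERIFY
	 * fires -- which is this panic.
	 */
	VERIFY3P(NULL, ==, dmu_buf_set_user(l->l_dbuf, &l->l_dbu));
}
```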
It looks suspicious. I don't have answers at the moment; it needs investigation. Should we back out the zap trimming code until it is fixed?
I don't think it should cause any pool corruption, and it is only in the master branch, so I'd vote towards just focusing on fixing it. If it can be fixed within, let's say, a week, I would not mess up the commit history with reverts.
A fresh one:
I can instrument the code with some
We have also hit this bug and are disabling zap shrinking to see if it works around the problem.
Don't know this code super well, but I've been reading it and @amotin seems right.
Intuitively, it doesn't seem like it's going to be usable again after this point, so we probably could just clear the dbu and free the zap_leaf_t:

diff --git module/zfs/zap.c module/zfs/zap.c
index 1b6b16fc6..33f529468 100644
--- module/zfs/zap.c
+++ module/zfs/zap.c
@@ -1660,7 +1660,9 @@ zap_shrink(zap_name_t *zn, zap_leaf_t *l, dmu_tx_t *tx)
(void) dmu_free_range(zap->zap_objset, zap->zap_object,
sl_blkid << bs, 1 << bs, tx);
+ dmu_buf_remove_user(sl->l_dbuf, &sl->l_dbu);
zap_put_leaf(sl);
+ zap_leaf_evict_sync(sl);
zap_f_phys(zap)->zap_num_leafs--;

though I'd probably stick a refcount verify on the dbuf in there, just to be sure. But maybe it's safer just to reuse the existing zap_leaf_t:

diff --git module/zfs/zap.c module/zfs/zap.c
index 1b6b16fc6..48333b545 100644
--- module/zfs/zap.c
+++ module/zfs/zap.c
@@ -425,20 +425,30 @@ zap_leaf_evict_sync(void *dbu)
static zap_leaf_t *
zap_create_leaf(zap_t *zap, dmu_tx_t *tx)
{
- zap_leaf_t *l = kmem_zalloc(sizeof (zap_leaf_t), KM_SLEEP);
-
ASSERT(RW_WRITE_HELD(&zap->zap_rwlock));
- rw_init(&l->l_rwlock, NULL, RW_NOLOCKDEP, NULL);
- rw_enter(&l->l_rwlock, RW_WRITER);
- l->l_blkid = zap_allocate_blocks(zap, 1);
- l->l_dbuf = NULL;
+ uint64_t blkid = zap_allocate_blocks(zap, 1);
+ dmu_buf_t *db = NULL;
VERIFY0(dmu_buf_hold_by_dnode(zap->zap_dnode,
- l->l_blkid << FZAP_BLOCK_SHIFT(zap), NULL, &l->l_dbuf,
+ blkid << FZAP_BLOCK_SHIFT(zap), NULL, &db,
DMU_READ_NO_PREFETCH));
- dmu_buf_init_user(&l->l_dbu, zap_leaf_evict_sync, NULL, &l->l_dbuf);
- VERIFY3P(NULL, ==, dmu_buf_set_user(l->l_dbuf, &l->l_dbu));
+
+ zap_leaf_t *l = dmu_buf_get_user(db);
+ if (l == NULL) {
+ l = kmem_zalloc(sizeof (zap_leaf_t), KM_SLEEP);
+ l->l_blkid = blkid;
+ l->l_dbuf = db;
+ rw_init(&l->l_rwlock, NULL, RW_NOLOCKDEP, NULL);
+ dmu_buf_init_user(&l->l_dbu, zap_leaf_evict_sync, NULL,
+ &l->l_dbuf);
+ dmu_buf_set_user(l->l_dbuf, &l->l_dbu);
+ } else {
+ ASSERT3U(l->l_blkid, ==, blkid);
+ ASSERT3P(l->l_dbuf, ==, db);
+ }
+
+ rw_enter(&l->l_rwlock, RW_WRITER);
dmu_buf_will_dirty(l->l_dbuf, tx);
zap_leaf_init(l, zap->zap_normflags != 0);

I don't have a repro or a good intuition about how to reproduce this directly, so I haven't tested the above diffs, and I might be way off. Feels not totally wrong though!
@robn those seem like good ideas. I was also wondering how important
@ahrens Having excessive amounts of holes makes prefetching more difficult. Also, it is not that we just don't shrink the dnode once, but that we grow it indefinitely with each grow/shrink cycle, since the code is unable to fill the holes, only append.
At this moment, I would suggest adding a module parameter to disable zap_trunc(), so that zap_shrink() could work without zap_trunc() until the problem is fixed.
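A minimal sketch of what such a kill switch could look like, following the usual ZFS_MODULE_PARAM pattern; the parameter name zap_shrink_enabled, the zap_maybe_shrink() wrapper, and the assumption that zap_shrink() returns a boolean_t are all illustrative, not taken from this issue:

```c
/*
 * Hypothetical kill switch (names illustrative): a module parameter
 * that lets ZAP shrinking be disabled at runtime while the bug is
 * investigated.
 */
static int zap_shrink_enabled = 1;

/*
 * Illustrative wrapper: call sites that currently invoke zap_shrink()
 * would go through this instead, so setting the parameter to zero makes
 * removals behave as if shrinking were not present.
 */
static boolean_t
zap_maybe_shrink(zap_name_t *zn, zap_leaf_t *l, dmu_tx_t *tx)
{
	if (!zap_shrink_enabled)
		return (B_FALSE);
	return (zap_shrink(zn, l, tx));
}

/* Exposed as a read-write tunable in the usual way. */
ZFS_MODULE_PARAM(zfs, , zap_shrink_enabled, INT, ZMOD_RW,
	"Enable ZAP shrinking");
```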
So we have two crashes with one container; I'm trying to reach out to our member to see whether I can work with them to devise a clean reproducer. In the meantime we've deployed a revert of the ZAP shrinking patch.
@alex-stetsenko @robn provided two possible solutions above, both of which look reasonable to me, while the first seems cleaner, not leaving a user buffer attached to a block we consider free. Do you see a problem with integrating one of them instead of introducing unneeded kill switches? I would have expected a PR to be created about 4 days ago. Do you want one of us to create the PR?
The first patch doesn't look correct.
@robn can you please post the second idea as a PR so it runs through the test suite? We could merge it into our staging right now, but that wouldn't tell us much, as the staging environment exerts only pretty low pressure on the code. I'd say if it passes the test suite, let's merge it and then wait and see whether more reports pop up or not :) (Also, if anyone could do a more comprehensive re-review of the codepaths, that would be awesome; I'll be sure to send a few beers to everyone involved :D)
@snajpa Will do today. @alex-stetsenko Can you elaborate on "doesn't look correct" please? (apart from the missing
I don't think you can unconditionally destroy a leaf. As far as I can see, a zap_cursor (fzap_cursor_retrieve()) can still reference it.
@alex-stetsenko mm, you might be right. That was partly why I said "maybe verify the refcount" there, mostly because I didn't want to think very hard for a drive-by. In any case, the second seems like a safer thing to do in terms of the timeline: we know at that point nothing is using it, or we wouldn't be trying to create it, and we also know that the userdata being NULL or not can't race, because of the dance on zap->zap_rwlock.

So here's #16204; please take a look and let's see what falls out. @snajpa sorry about the delay!
If a shrink or truncate had recently freed a portion of the ZAP, the dbuf could still be sitting on the dbuf cache waiting for eviction. If it is then allocated for a new leaf before it can be evicted, the zap_leaf_t is still attached as userdata, tripping the VERIFY. Instead, just check for the userdata, and if we find it, reuse it.

Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16157
Closes #16204
System information
Describe the problem you're observing
One of our nodes hit the following VERIFY3 assertion:
Describe how to reproduce the problem
Don't know at the moment. Nothing obvious stands out (nowhere near dataset quota or full pool, no device failures, no memory shortage).