Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid panic in case of pool errors and missing L2ARC #12392

Merged
merged 1 commit into from
Sep 16, 2021

Conversation

gamanakis
Copy link
Contributor

@gamanakis gamanakis commented Jul 19, 2021

Motivation and Context

Local testing hitting the panic:

VERIFY(HDR_EMPTY_OR_LOCKED(hdr)) failed
PANIC at arc.c:1628:arc_hdr_clear_flags()
Showing stack for process 1084883
CPU: 8 PID: 1084883 Comm: z_rd_int_0 Tainted: P           OE     5.12.15-arch1-1 #1
Call Trace:
 dump_stack+0x76/0x94
 spl_panic+0xef/0x117 [spl]
 ? abd_fletcher_4_impl+0x50/0x50 [zfs]
 ? _raw_spin_unlock+0xa/0x20
 ? _raw_spin_unlock+0xa/0x20
 arc_hdr_clear_flags+0x7b/0x80 [zfs]
 arc_hdr_destroy+0x393/0x3a0 [zfs]
 zio_done+0x416/0x18b0 [zfs]
 zio_execute+0xea/0x220 [zfs]
 taskq_thread+0x239/0x4a0 [spl]
 ? wake_up_q+0xa0/0xa0
 ? zio_execute_stack_check.constprop.0+0x10/0x10 [zfs]
 ? taskq_thread_spawn+0x50/0x50 [spl]
 kthread+0x133/0x150
 ? kthread_associate_blkcg+0xc0/0xc0
 ret_from_fork+0x22/0x30

Description

In case an ARC buffer is allocated only on L2ARC, and there are
underlying errors in a pool with the cache device in faulty state, a
panic can occur in arc_read_done()->arc_hdr_destroy()->
arc_hdr_l2arc_destroy()->arc_hdr_clear_flags() when trying to free
the ARC buffer.

Fix this by checking in arc_read_done() if the ARC buffer to be freed
is stored on L2ARC and not empty, and acquiring its hash_lock in this
case.

How Has This Been Tested?

Without this patch the following code panics:

echo 0 | tee /sys/module/zfs/parameters/l2arc_noprefetch
zpool create test /dev/sda cache /dev/sdb

# Fill L2ARC
fio --name=test --size=100M --nrfiles=1 --readwrite=read --directory=/test --time_based --runtime=1000

# Simulate missing device, this leaves the L2ARC buffers intact in ARC
zpool offline test /dev/sdb

# Simulate error on pool
zinject -t data -e checksum -f 100 -am /test/test.0.0 

# Panic
fio --name=test --size=100M --nrfiles=1 --readwrite=read --directory=/test

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@gamanakis gamanakis marked this pull request as ready for review July 19, 2021 18:45
@gamanakis
Copy link
Contributor Author

removal/removal_remap_deadlists fails in all TEST VMs. I do not think it is related.

@jumbi77
Copy link
Contributor

jumbi77 commented Jul 19, 2021

Does it make sense to write a ZTS test case since you already provide a reproducer?

@behlendorf
Copy link
Contributor

@gamanakis so it looks to me like we could perhaps handle this issue instead by discarding the buffers identify a little bit sooner in arc_hdr_destroy(). This would correctly satisfy the ASSERT without the need to take the hash lock. What do you think about this alternate (entirely untested) fix? It is a little icky that we again try and discard the buffer identify a few lines down in arc_hdr_destroy() but I think we could live with that.

diff --git a/module/zfs/arc.c b/module/zfs/arc.c
index c2557f33c..8c2c1e416 100644
--- a/module/zfs/arc.c
+++ b/module/zfs/arc.c
@@ -3781,8 +3781,13 @@ arc_hdr_destroy(arc_buf_hdr_t *hdr)
                 * to acquire the l2ad_mtx. If that happens, we don't
                 * want to re-destroy the header's L2 portion.
                 */
-               if (HDR_HAS_L2HDR(hdr))
+               if (HDR_HAS_L2HDR(hdr)) {
+
+                       if (!HDR_EMPTY(hdr))
+                               buf_discard_identity(hdr);
+
                        arc_hdr_l2hdr_destroy(hdr);
+               }
 
                if (!buflist_held)
                        mutex_exit(&dev->l2ad_mtx);

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Aug 30, 2021
@gamanakis
Copy link
Contributor Author

@behlendorf thank you for taking a look! Interesting approach, I will rebase to master and re-test with your proposal.

In case an ARC buffer is allocated only on L2ARC, and there are
underlying errors in a pool with the cache device in faulty state, a
panic can occur in arc_read_done()->arc_hdr_destroy()->
arc_hdr_l2arc_destroy()->arc_hdr_clear_flags() when trying to free
the ARC buffer.

Fix this by discarding the buffer's identity in arc_hdr_destroy(), in
case the buffer is not empty, before calling arc_hdr_l2hdr_destroy().

Signed-off-by: George Amanakis <gamanakis@gmail.com>
@gamanakis
Copy link
Contributor Author

67c2f8f: Rebased to master, follow approach suggested by @behlendorf.

@behlendorf behlendorf requested a review from amotin September 16, 2021 00:25
Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. It definitely makes no sense to take the hash_lock as was proposed originally, since we know the header is not there.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Sep 16, 2021
@behlendorf behlendorf merged commit 2a49ebb into openzfs:master Sep 16, 2021
rincebrain pushed a commit to rincebrain/zfs that referenced this pull request Sep 22, 2021
In case an ARC buffer is allocated only on L2ARC, and there are
underlying errors in a pool with the cache device in faulty state, a
panic can occur in arc_read_done()->arc_hdr_destroy()->
arc_hdr_l2arc_destroy()->arc_hdr_clear_flags() when trying to free
the ARC buffer.

Fix this by discarding the buffer's identity in arc_hdr_destroy(), in
case the buffer is not empty, before calling arc_hdr_l2hdr_destroy().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes openzfs#12392
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants