Fix: segmentation fault in write throttle during import #1
The issue
Async writes may be issued before the pool finishes initialization, which means vdev_queue_max_async_writes() can illegally dereference the NULL spa->spa_dsl_pool pointer. Typically this happens when a self-healing zio is issued by a mirror vdev while importing.
The analysis
A pool is always imported with the FREAD | FWRITE flags. If a ZIO error is hit on one of the multiple DVAs while opening the meta objset (either silent or "noisy" data corruption), the logical vdev with mirror ops will try to use the good copy it has in hand to repair the damaged children. When that self-healing write zio is issued to the leaf vdev, it runs into the write throttle, which needs the dirty-data statistics in spa->spa_dsl_pool->dp_dirty_total. Unfortunately, spa->spa_dsl_pool stays NULL until the meta objset has been opened successfully, so a segmentation fault comes up.
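For context, the throttle derives the number of active async writes from the pool's dirty-data total, roughly along the following lines (a paraphrase of vdev_queue_max_async_writes() in vdev_queue.c; exact tunable names and bounds may differ between versions):

```c
/* Paraphrased sketch of the pre-fix logic; details vary by version. */
static int
vdev_queue_max_async_writes(spa_t *spa)
{
	/*
	 * Crashes here during import: spa->spa_dsl_pool is still NULL
	 * because dsl_pool_init() has not finished opening the MOS.
	 */
	uint64_t dirty = spa->spa_dsl_pool->dp_dirty_total;
	uint64_t min_bytes = zfs_dirty_data_max *
	    zfs_vdev_async_write_active_min_dirty_percent / 100;
	uint64_t max_bytes = zfs_dirty_data_max *
	    zfs_vdev_async_write_active_max_dirty_percent / 100;

	if (dirty < min_bytes)
		return (zfs_vdev_async_write_min_active);
	if (dirty >= max_bytes)
		return (zfs_vdev_async_write_max_active);

	/* Linearly interpolate between the min and max active limits. */
	return (zfs_vdev_async_write_min_active +
	    (dirty - min_bytes) *
	    (zfs_vdev_async_write_max_active -
	    zfs_vdev_async_write_min_active) /
	    (max_bytes - min_bytes));
}
```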
The stack traces below should help illustrate the problem (they may differ from the latest version):
Thread 126 (Thread 4542):
#0 0x00007f5008fa016c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1 0x00007f500e069bae in cv_wait () from /version/lib/lib_naslfs_dll.so
#2 0x00007f500e19ab53 in zio_wait () from /version/lib/lib_naslfs_dll.so
#3 0x00007f500e1218b0 in arc_read () from /version/lib/lib_naslfs_dll.so
#4 0x00007f500e12fda9 in dmu_objset_open_impl () from /version/lib/lib_naslfs_dll.so
#5 0x00007f500e150762 in dsl_pool_init () from /version/lib/lib_naslfs_dll.so
#6 0x00007f500e16570f in spa_load_impl.clone.10 () from /version/lib/lib_naslfs_dll.so
#7 0x00007f500e1664bf in spa_load () from /version/lib/lib_naslfs_dll.so
#8 0x00007f500e166c77 in spa_load_best () from /version/lib/lib_naslfs_dll.so
#9 0x00007f500e167724 in spa_import () from /version/lib/lib_naslfs_dll.so
Thread 115 (Thread 4531):
#0 0x00007f5008fa3cd7 in do_sigwait () from /lib/libpthread.so.0
#1 0x00007f5008fa3d57 in sigwait () from /lib/libpthread.so.0
#2 0x00007f5009532d2c in Vos_SuspendTask () from /version/lib/lib_com_dll.so
#3 <signal handler called>
#4 0x00007f500e182a0e in vdev_queue_io_to_issue () from /version/lib/lib_naslfs_dll.so
#5 0x00007f500e1833a6 in vdev_queue_io () from /version/lib/lib_naslfs_dll.so
#6 0x00007f500e19c315 in zio_vdev_io_start () from /version/lib/lib_naslfs_dll.so
#7 0x00007f500e197cea in zio_execute () from /version/lib/lib_naslfs_dll.so
#8 0x00007f500e181914 in vdev_mirror_io_done () from /version/lib/lib_naslfs_dll.so
#9 0x00007f500e198475 in zio_vdev_io_done () from /version/lib/lib_naslfs_dll.so
#10 0x00007f500e197cea in zio_execute () from /version/lib/lib_naslfs_dll.so
#11 0x00007f500e19ca93 in zio_done () from /version/lib/lib_naslfs_dll.so
#12 0x00007f500e197cea in zio_execute () from /version/lib/lib_naslfs_dll.so
#13 0x00007f500e070f79 in taskq_thread () from /version/lib/lib_naslfs_dll.so
The patch
Do not consult the dirty-data statistics in vdev_queue_max_async_writes() before spa->spa_dsl_pool has been initialized. Instead, push data out as fast as possible to speed up the self-healing process triggered by the import.
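A minimal sketch of the change, assuming the fix takes the form of an early NULL check at the top of vdev_queue_max_async_writes() (comments and names are mine, not the exact patch):

```c
static int
vdev_queue_max_async_writes(spa_t *spa)
{
	dsl_pool_t *dp = spa->spa_dsl_pool;

	/*
	 * A self-healing write zio can reach the throttle while the pool
	 * is still importing and spa_dsl_pool has not been assigned yet.
	 * Without the dirty-data statistics there is nothing to throttle
	 * against, so issue writes at the maximum rate instead of
	 * dereferencing a NULL pointer.
	 */
	if (dp == NULL)
		return (zfs_vdev_async_write_max_active);

	/* ... existing dirty-data based throttling continues here ... */
}
```

Falling back to zfs_vdev_async_write_max_active is safe here because no dirty data exists to throttle against at this point; only the self-healing writes from the import path reach this code before the DSL pool is set up, and pushing them out as fast as possible is exactly what we want.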