Repeatably hang user processes, using memory-mapped I/O and fsync. #8751
Comments
The system is still up and seems perfectly stable apart from the hung processes mentioned above. Even the stuck dataset is available for reading, creating, and deleting files.
I note there are a few open "everything hangs" issues where the problem has been difficult to replicate, e.g. #8556, #8469, #7038, #7484, #8321, #7924. I have also had this machine hang under high I/O, and so far I have never found a cause; high I/O seems to be related. Notably, #7484 mentions fsync as a trigger.
Hoping to have found a smoking gun, I repeated the benchmarks and got the same result. First, I created an empty dataset ssd/benchmark2 and changed directory to it. I ran the first fio benchmark 10 times in a row with no issues. I then changed to mmap, and it died at about the 50% mark on the first attempt. The numbers all dropped to zero, and the fio output just says e.g.:
This is an example of what the output looks like once it has stalled:
This is the output of the final successful test:
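For reference, the sequence above amounts to roughly the following (a sketch; the dataset name is the one mentioned above and the fio options are those from the issue description below):

# Fresh dataset on the SSD pool, then 10 passes of the libaio+fsync job,
# then one pass with the mmap engine, which is the run that hangs.
zfs create ssd/benchmark2
cd /ssd/benchmark2    # assumes the pool is mounted at /ssd

for i in $(seq 10); do
    fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=randwrite \
        --size=512m --io_size=10g --blocksize=8k --ioengine=libaio --fsync=1 \
        --iodepth=1 --direct=1 --numjobs=8 --runtime=60 --group_reporting
done

# Identical job, mmap engine: this is the run that stalls and leaves fio in state "Ds".
fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=randwrite \
    --size=512m --io_size=10g --blocksize=8k --ioengine=mmap --fsync=1 \
    --iodepth=1 --direct=1 --numjobs=8 --runtime=60 --group_reporting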
Can confirm the hung I/O has been around for a long, long time. Had it happen on Thursday; the only "fix" is to reboot, which is maddening. What I still fail to understand is why things like the deadman timer do not kick in even after the entire volume hangs.
I'm curious to know, can anyone else reproduce the issue by running the fio benchmarks that I gave above?
From my experiences doing a zfs send from a server with 10 ssd's in raidz2
to a backup server that has 4 sata HDD's in raidz over a 10G network with
jumbo frames will crash the backup server unless you nice the sshd on the
backup server and ionice the zfs recv. I would imagine using mbuffer
instead of ssh would "lighten the load" also. The backup server has no
chance if it is running scrub at the same time as a recv. This is with
ubuntu 18.x through 19.04. I cannot get 19.04 to even complete a recv with
the 5.x kernel, had to drop to the 4.x.oem kernel.
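A sketch of the kind of tuning described, with hypothetical pool and dataset names (not the exact commands used):

# Deprioritize the receive side so it cannot starve the backup server.
# (The comment above nices sshd on the backup server; here the whole
#  receive pipeline is run under nice/ionice, which has a similar effect.)
zfs send -R tank/data@snap | \
    ssh backup 'nice -n 19 ionice -c3 zfs recv -F backup/data'

# Replacing ssh with mbuffer as the transport is the suggested way to
# lighten the load further.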
@darrenfreeman testing on Ubuntu 18.04 with the final 0.8.0 release, I wasn't able to reproduce the issue using your reproducer. Would it be possible for you to update to zfs-0.8.0 and rerun the test in your environment to confirm it is still an issue? Based on your stack traces it looks like a deadlock, so I may just not have been able to hit it in my environment.
Had to do a hard reset after lots of error messages about being unable to unmount the pool. Rebooted into zfs-0.8.0 (compiled from the git tagged release). Also a slightly newer kernel, 4.9.0-9-amd64. Can still reproduce the issue: 10x successful benchmarks, change to mmap, died about halfway in. Including some fresh kernel stack traces just in case there's anything new:
I am able to reproduce this problem 100% reliably in a matter of seconds using master from today and a pool consisting of a single 20GiB file on a tmpfs file system. I've not looked into it too deeply yet, but so far I've eliminated the recently added direct I/O support as a cause, both by removing the "direct=1" fio flag and by removing the direct I/O support from the ZFS module entirely, since the kernel's writeback mechanism will use it as a standard course of its operation if it exists.
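For anyone wanting to replicate that setup, something like the following should be equivalent (pool name and paths are illustrative, not the exact script used):

# 20 GiB sparse file on tmpfs as the only vdev of a throwaway pool.
mount -t tmpfs -o size=21g tmpfs /mnt/ram
truncate -s 20G /mnt/ram/vdev0
zpool create testpool /mnt/ram/vdev0
zfs create testpool/fio
cd /testpool/fio
# ...then run the fio jobs from the issue description (libaio pass, then mmap).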
I had a similar issue with my ZFS pool across multiple versions of the software on Ubuntu 18.04 LTS and 19.04. I was able to mount my volume to capture the data out of it using the following command.
Your mileage may vary, but I wanted to share what got mine working around this issue.
Updating this before I forget. The case I run into more often is ext4 zvols whose journals are corrupt or which are otherwise unavailable (device changes, for example). The mount will hang. Stack trace:
[<0>] io_schedule+0x16/0x40
[<0>] wait_on_page_bit_common+0x109/0x1c0
[<0>] __filemap_fdatawait_range+0x10a/0x180
[<0>] file_write_and_wait_range+0x6d/0xa0
[<0>] ext4_sync_file+0x114/0x370 [ext4]
[<0>] vfs_fsync_range+0x3f/0x80
[<0>] vfs_fsync+0x1c/0x20
[<0>] vn_fsync+0x17/0x20 [spl]
[<0>] vdev_file_io_start+0xae/0x120 [zfs]
[<0>] zio_vdev_io_start+0xc7/0x350 [zfs]
[<0>] zio_nowait+0xbf/0x150 [zfs]
[<0>] zio_flush+0x33/0x40 [zfs]
[<0>] zil_commit_writer+0x67b/0x760 [zfs]
[<0>] zil_commit.part.13+0x93/0x100 [zfs]
[<0>] zil_commit+0x17/0x20 [zfs]
[<0>] zvol_write+0x570/0x620 [zfs]
[<0>] zvol_request+0x22d/0x350 [zfs]
[<0>] generic_make_request+0x19a/0x3d0
[<0>] submit_bio+0x75/0x140
[<0>] submit_bh_wbc+0x16f/0x1a0
[<0>] __sync_dirty_buffer+0x70/0xd0
[<0>] ext4_commit_super+0x1df/0x2c0 [ext4]
[<0>] ext4_setup_super+0x158/0x1c0 [ext4]
[<0>] ext4_fill_super+0x2116/0x3c50 [ext4]
[<0>] mount_bdev+0x187/0x1c0
[<0>] ext4_mount+0x15/0x20 [ext4]
[<0>] mount_fs+0x3e/0x150
[<0>] vfs_kern_mount+0x67/0x130
[<0>] do_mount+0x1f0/0xca0
[<0>] ksys_mount+0x83/0xd0
[<0>] __x64_sys_mount+0x25/0x30
[<0>] do_syscall_64+0x60/0x190
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff
@darrenfreeman What is the last version of ZoL on which this test did not deadlock? After reviewing the stack traces I've been able to get and a short list of potential commits which may have caused this problem, I wasn't able to see any obvious problem, so I've done a bit of bisection and have been able to easily reproduce this problem as far back as 0f69f42, committed on Jul 27, 2017. The caveat is that I'm still using a tmpfs-backed single-vdev pool, which might actually be contributing to this problem.
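A rough sketch of how such a bisection can be driven (the good tag and the reproducer wrapper are assumptions, not the exact commands used):

# Bisect between current master (bad) and an assumed-good older tag,
# rebuilding the tree at each step; a hypothetical wrapper script runs the
# fio mmap+fsync job under a timeout so a deadlock shows up as a failure.
git bisect start master zfs-0.7.0
git bisect run sh -c '
    sh autogen.sh && ./configure && make -j"$(nproc)" || exit 125   # 125 = skip unbuildable commits
    timeout 180 ./reproducer.sh    # hypothetical script; a timeout (exit 124) marks the commit bad
'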
I don't know how far back it goes; I had been having my whole system hang on high SSD zpool I/O until a fairly recent commit. But after updating to the git version in my original post, I found the situation much improved, so I added the Optane NVMe SLOG and pushed it further.
After further investigation and research, I've discovered that reverting both 21a96fb and d958324 causes this test case to work. Here's the reproducer I've been using. I've also been using a similar script that uses a loopback device backed by an ext4 sparse file. Both typically deadlock within seconds of starting the random writes. With the two commits listed above reverted, the deadlock goes away.
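Not the original script, but the loopback variant described amounts to roughly the following (device, pool, and path names are hypothetical):

# Sparse file on ext4, attached as a loop device and used as the sole vdev.
truncate -s 20G /var/tmp/zfs-vdev.img
LOOPDEV=$(losetup -f --show /var/tmp/zfs-vdev.img)
zpool create testpool "$LOOPDEV"
zfs create testpool/fio
cd /testpool/fio
# Random-write fio job with mmap + fsync (as in the issue description);
# on an unpatched module this typically deadlocks within seconds.
# With 21a96fb and d958324 reverted (git revert 21a96fb d958324, then
# rebuild/reload the module), the same job completes.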
Pinging @behlendorf and @tuxoko to get their ideas on this.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
I don't believe this issue has been resolved, although I haven't repeated my tests against the latest version.
I have to rule this out as a hardware issue, unless there's something I am missing. I have 12 drives of various manufacturers and models. I swapped out the cables, backplane, and HBA. I also tried the motherboard's onboard SATA connections. The txg_sync() timeout remains, and each is preceded by a deadman event. The drive that triggers the deadman/txg_sync() hang is different each time, and it has occurred enough times that all 12 drives have triggered it at least once.
Edit (11/19/2020): a few days after this, I was able to update ZFS on this system, and it has not had a hang since (so over 2 months at this point). Previously, it would hang every few days, usually during a backup or scrub. I'm cautiously optimistic.
Edit (04/02/2021): still no hangs since September. Hardware, load, and general server usage have not changed.
I have the same issue. I can reproduce it using fio with fsync; it consumes all the available memory and all the swap space.
PS: I am running Ubuntu 20.04.2 LTS with zfs 0.8.3.
Potential fixes: #12284
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
I am not able to test on my original hardware setup, and the Optane has since failed. But it's serious enough to not close until it's proven fixed. @dweeezil has a reproducer above.
System information
Describe the problem you're observing
Complete failure to perform further FS I/O on the same dataset after running a benchmarking test. The benchmarking processes are defunct, and kill -9 has no effect on them.
Describe how to reproduce the problem
Create an unused dataset on a pool with 2x Samsung SSD 850 EVO 500GB configured as independent vdevs, plus a 32 GB Optane NVMe module as SLOG.
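For context, the pool layout is roughly equivalent to the following (device names are placeholders, not the actual ones used):

# Two SSDs as independent (striped, non-redundant) top-level vdevs, plus the
# Optane NVMe module as a separate log device (SLOG).
zpool create ssd /dev/sdX /dev/sdY log /dev/nvme0n1
zfs create ssd/benchmark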
I am attempting to simulate a worst-case workload from PostgreSQL 11. I believe it uses mmap for file I/O, but I may be mistaken. The page size is 8 KiB.
Run the following benchmark:
fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=randwrite --size=512m --io_size=10g --blocksize=8k --ioengine=libaio --fsync=1 --iodepth=1 --direct=1 --numjobs=8 --runtime=60 --group_reporting
This will succeed, with the SLOG getting up to a couple of GB used and nice high IOPS.
Now run it again with --ioengine=mmap. There is an initial burst of write I/O to the pool, then the numbers all drop to zero, and it seems very much stalled. Occasionally there is a burst of I/O, then nothing. (I think this is happening to another dataset on the same pool.) The fio job can't be killed with Ctrl-C, and the processes don't die with kill -9, but they can be made to stop outputting to the shell with kill -9. ps aux reveals a status of "Ds".
Now the previously successful fio test will also suffer the same fate if run again. However, it can be run on a different pool which also uses the Optane as SLOG, and it succeeds. I'm not game to try mmap again.
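In this state, the kernel stacks of the stuck fio workers can be inspected directly, e.g. (a sketch; requires root):

# Show the state and in-kernel stack of each stuck fio worker.
ps -o pid,stat,wchan:32,cmd -C fio
for pid in $(pgrep -x fio); do
    echo "=== $pid ==="
    cat /proc/$pid/stack
done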
The PostgreSQL DB on the SSD pool seems to still be running; it can generate reasonable-looking I/O numbers. It is in a different dataset from the one used for benchmarking. I think the throw-away benchmarking dataset could be hosed, though.
Since the host appears to still be up and providing services, I will leave it like this for a day or two, in case you have some tests you'd like me to perform in this state.
Include any warning/errors/backtraces from the system logs
Here are the first few kernel stack traces from journalctl; they are not identical even across the different fio workers.