
Deadlock with Postgres on 2.0.1 #11463

Closed
dkacar-oradian opened this issue Jan 14, 2021 · 12 comments
Labels
Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments

@dkacar-oradian

System information

Type Version/Name
Distribution Name CentOS
Distribution Version 7.9
Linux Kernel 3.10.0-1160.11.1
Architecture x86_64
ZFS Version 2.0.1
SPL Version 2.0.1

Describe the problem you're observing

I have an easily reproducible deadlock on CentOS 7.9 with OpenZFS 2.0.1 (kmod package from the official RPM repo) that doesn't happen with 2.0.0.

I have this in top:

top - 14:45:30 up 10 min,  2 users,  load average: 1.97, 1.68, 0.89
Tasks: 258 total,   4 running, 254 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.7 us,  1.0 sy,  0.0 ni, 97.7 id,  0.7 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.3 us,  0.7 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  0.3 us,  1.0 sy,  0.0 ni, 97.3 id,  1.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu3  :  0.0 us,100.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :  0.0 us,100.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  0.0 us,100.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 41185060 total, 36549828 free,  2478316 used,  2156916 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used. 37563100 avail Mem

At first only 2 CPUs are spinning at 100%, but after some time a third joins in. It stays that way and I have to power off the VM (running in VMware). I suppose those are deadlocked spinlocks in the kernel, or something like that.

Debug log shows:

# cat /proc/spl/kstat/zfs/dbgmsg
1610631466   spa_history.c:304:spa_history_log_sync(): txg 408988 snapshot pg_data/prod-db-c09_5433@rmf_6432 (id 645)  
1610631466   spa_history.c:329:spa_history_log_sync(): ioctl snapshot
1610631466   spa_history.c:296:spa_history_log_sync(): command: zfs snapshot pg_data/prod-db-c09_5433@rmf_6432
1610631466   spa_history.c:304:spa_history_log_sync(): txg 408989 clone pg_data/rmf_6432 (id 651) origin=pg_data/prod-db-c09_5433@rmf_6432 (645)
1610631466   spa_history.c:304:spa_history_log_sync(): txg 408991 set pg_data/rmf_6432 (id 651) redundant_metadata=1
1610631466   spa_history.c:304:spa_history_log_sync(): txg 408991 set pg_data/rmf_6432 (id 651) compression=16
1610631466   spa_history.c:304:spa_history_log_sync(): txg 408991 set pg_data/rmf_6432 (id 651) atime=0
1610631466   spa_history.c:304:spa_history_log_sync(): txg 408991 set pg_data/rmf_6432 (id 651) xattr=2
1610631466   spa_history.c:304:spa_history_log_sync(): txg 408991 set pg_data/rmf_6432 (id 651) mountpoint=/data/pg_data/snapshots/rmf_6432
1610631468   spa_history.c:329:spa_history_log_sync(): ioctl clone
1610631468   spa_history.c:296:spa_history_log_sync(): command: zfs clone -oredundant_metadata=most -ocompression=zstd -oatime=off -oxattr=sa -o mountpoint=/data/pg_data/snapshots/rmf_6432 pg_data/prod-db-c09_5433@rmf_6432 pg_data/rmf_6432

And nothing new appears.

Describe how to reproduce the problem

What I did was: create two Postgres slaves with ZFS 2.0.0, each on its own ZFS file system. Then I upgraded to 2.0.1, created a snapshot and a clone on one of them, and tried to start a primary Postgres instance on the clone. That applies a certain number of WAL files and then gets stuck as described above. Reproducible every time, so far.
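
For reference, here is a rough sketch of that sequence, assuming the dataset names and clone properties recorded in the dbgmsg output above; the pg_ctl invocation and data directory are only illustrative:

# zfs snapshot pg_data/prod-db-c09_5433@rmf_6432
# zfs clone -o redundant_metadata=most -o compression=zstd -o atime=off -o xattr=sa \
    -o mountpoint=/data/pg_data/snapshots/rmf_6432 \
    pg_data/prod-db-c09_5433@rmf_6432 pg_data/rmf_6432
# sudo -u postgres pg_ctl -D /data/pg_data/snapshots/rmf_6432 start

WAL replay then proceeds for a while and hangs, with the CPUs spinning as shown in top above.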

Include any warning/errors/backtraces from the system logs

However, the system logs don't contain anything particularly useful for debugging the issue. When I had similar problems with BTRFS there were stack traces for stuck processes obtainable with dmesg, thanks to:

# cat /proc/sys/kernel/hung_task_timeout_secs
120

But there's absolutely nothing after the boot messages with ZFS.

So how do I collect relevant data to debug this?

@dkacar-oradian added the Status: Triage Needed and Type: Defect labels Jan 14, 2021
@sempervictus
Contributor

Does this happen if you unmount the source of the clone and then re-mount it before spinning up the WAL writer threads on the clone itself?

@Do-while-Bacon

Do-while-Bacon commented Jan 17, 2021

This deadlock behavior is very similar to what I was seeing in #11476, also on CentOS 7.9, and also resolved by downgrading to 2.0.0.

@dkacar-oradian
Author

@sempervictus The WAL directory is moved to ext4 before Postgres is started, so WAL writers are not in the picture, I think.
I'll try unmounting and remounting the origin before starting Postgres.

@dkacar-oradian
Author

I ran git bisect on this and got:

58bc86c5cb47bb9d1886a0d0d3f47df90dc3955a is the first bad commit
commit 58bc86c5cb47bb9d1886a0d0d3f47df90dc3955a
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Fri Dec 18 08:48:26 2020 -0800

    Linux 5.10 compat: use iov_iter in uio structure
    

Unfortunately, that one is quite large.

Testing methodology:
good - starting Postgres on a clone passed three times without deadlock, and those clones were deleted without deadlock
bad - one of the three attempts deadlocked; in practice it was always the first one

The interesting thing is that with the full 2.0.1 release I first see 2 CPUs spinning at 100% and after some time a third joins them, but with this commit it's only one CPU at 100%; the others never join.
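
For anyone wanting to repeat the bisect, a rough sketch of the procedure (the zfs-2.0.0 and zfs-2.0.1 release tags and the autogen/configure/make steps are the standard OpenZFS source build; exact install/reload details depend on the setup):

# git clone https://github.com/openzfs/zfs && cd zfs
# git bisect start zfs-2.0.1 zfs-2.0.0
# sh autogen.sh && ./configure && make -s -j$(nproc) && make install
  (reload the modules or reboot, then run the clone + Postgres test three times)
# git bisect good      (or: git bisect bad, if any attempt deadlocked)

Repeat the build/test/mark cycle until git bisect reports the first bad commit.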

@sempervictus
Contributor

Skimming over the commit, the uiomove_iter setup looks a bit suspect in the "full pipe" case documented in the function.
Looks like we also gained some cycles in zpl_xattr_get_dir.
The whole thing's weird - it kind of looks like an attempt to unify access paths for a future when everything is great and everyone's on the same happy/stable UIO code paths, without taking into account that most humans are using older stable kernels likely containing distro backports from the very much in-flux Linux IO subsystem. I don't know whether this makes sense in 2.0.1 at all (the tip-kernel fanboy club is probably less important to support than release package consumers), or why the legacy implementation known to work was not kept ifdef'd for everyone before 5.10.

@arm32let

Courtesy of echo l > /proc/sysrq-trigger:

[ 1793.329066] CPU: 2 PID: 1315 Comm: postgres Tainted: P           OE  ------------   3.10.0-1160.11.1.el7.x86_64 #1
[ 1793.329066] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
[ 1793.329067] task: ffff8f8b3a208000 ti: ffff8f8b3a540000 task.ti: ffff8f8b3a540000
[ 1793.329067] RIP: 0010:[<ffffffffa2d97b80>]  [<ffffffffa2d97b80>] copy_user_handle_tail+0x70/0x90
[ 1793.329068] RSP: 0018:ffff8f8b3a543a68  EFLAGS: 00040206
[ 1793.329068] RAX: 0000000000003000 RBX: 0000000000003000 RCX: 0000000000000000
[ 1793.329069] RDX: 0000000000000318 RSI: ffff9bdc014b9ce9 RDI: ffff9bdc014b9ce8
[ 1793.329069] RBP: ffff8f8b3a543a68 R08: 0000000000000000 R09: 0000000000000000
[ 1793.329070] R10: 0000000000011000 R11: ffffdf5100eaed80 R12: ffff8f8b3a543d80
[ 1793.329070] R13: 0000000000001000 R14: 0000000000003000 R15: ffff8f8b3a543df8
[ 1793.329071] FS:  00007f8fc3b7b880(0000) GS:ffff8f8b3fd00000(0000) knlGS:0000000000000000
[ 1793.329071] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1793.329072] CR2: 00007f8fb5a5b000 CR3: 000000003a53c000 CR4: 0000000000360ee0
[ 1793.329072] Call Trace:
[ 1793.329073]  [<ffffffffc094b2d6>] uiomove_iov+0x1d6/0x2b0 [zfs]
[ 1793.329073]  [<ffffffffc094b554>] uiomove+0x14/0x30 [zfs]
[ 1793.329074]  [<ffffffffc07f969b>] dmu_write_uio_dnode+0x9b/0x200 [zfs]
[ 1793.329074]  [<ffffffffc07f9854>] dmu_write_uio_dbuf+0x54/0x70 [zfs]
[ 1793.329075]  [<ffffffffc09188f5>] zfs_write+0xe65/0x1180 [zfs]
[ 1793.329075]  [<ffffffffa2c71d94>] ? mntput+0x24/0x40
[ 1793.329076]  [<ffffffffa2c5e08d>] ? do_last+0x66d/0x1340
[ 1793.329076]  [<ffffffffc0967c3d>] zpl_aio_write+0x13d/0x1d0 [zfs]
[ 1793.329076]  [<ffffffffa2c4d243>] do_sync_write+0x93/0xe0
[ 1793.329077]  [<ffffffffa2c4dd30>] vfs_write+0xc0/0x1f0
[ 1793.329077]  [<ffffffffa2c4eb0f>] SyS_write+0x7f/0xf0
[ 1793.329078]  [<ffffffffa3193f92>] system_call_fastpath+0x25/0x2a

@arm32let

postgresql-server-9.2 running on top of ZFS.

wal_level = archive
fsync = on 
synchronous_commit = on
wal_sync_method = fsync

in /var/lib/pgsql/data/postgresql.conf if that even matters.

Reproduced by running sudo -u postgres psql -c 'create database test;'

@dkacar-oradian
Author

How do I build the kernel modules with enough debug information to decode hex offsets to line numbers? I've tried adding --enable-debuginfo, but that doesn't seem to do the trick. The decode_stacktrace.sh script wants to find a .debug_line ELF section.
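
One approach worth trying (a sketch, not verified against this particular build): if the module was built with --enable-debuginfo so the objects keep DWARF data, an offset from the trace can often be resolved with gdb directly against the installed module. The module path below is illustrative and depends on where the kmod package installs zfs.ko:

# gdb -batch -ex 'list *(uiomove_iov+0x1d6)' /lib/modules/$(uname -r)/extra/zfs/zfs.ko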

@behlendorf
Contributor

@dkacar-oradian thanks for narrowing this down, I'll see if I can reproduce it on CentOS 7 with the provided reproducer. While the patch is fairly large and some code was refactored to deal with the 5.10 kernel change, what the code does should be identical for CentOS 7. Clearly something is a little different, though; we just need to identify it.

The whole thing's weird - it kind of looks like an attempt to unify access paths for the future when everything is great and everyone's on the same happy/stable UIO code paths not taking into account that most humans are using older stable kernels

@sempervictus yes, there's more churn than I would have liked, but it was necessary to support both 3.10 and 5.10 kernels. I'm surprised we didn't encounter this issue in the master branch before backporting the change; we manually ran a ton of regression tests as part of making it. I considered the ifdef approach, but backporting the change unmodified from master with all the testing felt like the lowest-risk approach. The alternative would have been doing something custom for the 2.0 branch, which would have seen less testing and complicated any needed future compatibility backports.

@behlendorf
Contributor

I've opened PR #11484 with a proposed fix for this regression. Any additional testing would be welcome. Thus far I've only verified it resolves the provided Postgres test case on CentOS 7.9 with the 3.10.0-1160.11.1 kernel. @Do-while-Bacon I'd appreciate any help in verifying it resolves the Docker container issue as well (it should). The patch applies cleanly to the 2.0.1 tag.
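
For anyone who wants to help test, a rough sketch of one way to apply the proposed fix on top of the 2.0.1 tag (the .patch URL follows GitHub's usual pull-request patch convention; rebuild, reinstall, and reload the zfs modules afterwards before re-running the reproducer):

# git clone https://github.com/openzfs/zfs && cd zfs
# git checkout zfs-2.0.1
# curl -L https://github.com/openzfs/zfs/pull/11484.patch | git am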

@dkacar-oradian
Author

dkacar-oradian commented Jan 20, 2021

I've applied that patch on top of 2.0.1 and things work fine: starting Postgres on a clone (three times in a row) and deleting the clones. Reboot also works fine.

behlendorf added a commit to behlendorf/zfs that referenced this issue Jan 21, 2021
As part of commit 1c2358c the custom uio_prefaultpages() code
was removed in favor of using the generic kernel provided
iov_iter_fault_in_readable() interface.  Unfortunately, it
turns out that up until the Linux 4.7 kernel the function would
only ever fault in the first iovec of the iov_iter.  The result
being uiomove_iov() may hang waiting for the page.

This commit effectively restores the custom uio_prefaultpages()
pages code for Linux 4.9 and earlier kernels which contain the
troublesome version of iov_iter_fault_in_readable().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#11463 
Closes openzfs#11484
@behlendorf
Contributor

The fix has been merged to the master branch and I've opened #11493 to make sure it gets included in OpenZFS 2.0.2.

@behlendorf removed the Status: Triage Needed label Jan 21, 2021
behlendorf added a commit that referenced this issue Jan 22, 2021
jsai20 pushed a commit to jsai20/zfs that referenced this issue Mar 30, 2021
sempervictus pushed a commit to sempervictus/zfs that referenced this issue May 31, 2021