Accessing a file's inode can lead to a NULL dereference #10737

Closed
grwilson opened this issue Aug 18, 2020 · 3 comments · Fixed by #12299

grwilson commented Aug 18, 2020

System information

Type                  Version/Name
Distribution Name     Ubuntu
Distribution Version  18.04
Linux Kernel          5.3
Architecture          amd64
ZFS Version           master
SPL Version           master

Describe the problem you're observing

Periodic system crashes (Oops) when running our internal replication test. From our initial analysis we see that accessing a file's inode can lead to a NULL dereference.

Describe how to reproduce the problem

Right now the only way I've been able to reproduce this is by running an internal test suite in a loop.

Include any warning/errors/backtraces from the system logs

[37546.151522] BUG: kernel NULL pointer dereference, address: 0000000000000000
[37546.154688] #PF: supervisor read access in kernel mode
[37546.157069] #PF: error_code(0x0000) - not-present page
[37546.159429] PGD 155e9e067 P4D 155e9e067 PUD 1a70d0067 PMD 0
[37546.162017] Oops: 0000 [#1] SMP NOPTI
[37546.163799] CPU: 1 PID: 4643 Comm: java Kdump: loaded Tainted: P           OE     5.3.0-1030-aws #32~18.04.1-Ubuntu
[37546.171408] Hardware name: Amazon EC2 t3.large/, BIOS 1.0 10/16/2017
[37546.175745] RIP: 0010:vfs_writev+0x70/0x120
[37546.179336] Code: 00 00 00 48 89 45 d8 31 c0 48 8d 85 50 ff ff ff b9 08 00 00 00 48 89 85 20 ff ff ff e8 99 c2 23 00 48 85 c0 78 53 48 8b 53 20 <0f> b7 02 66 25 00 f0 66 3d 00 80 74 65 48 8d b5 28 ff ff ff 44 89
[37546.191995] RSP: 0018:ffffab8c00677dd0 EFLAGS: 00010202
[37546.195876] RAX: 000000000000aff8 RBX: ffff9b14b2f2db00 RCX: 0000000000000001
[37546.200566] RDX: 0000000000000000 RSI: 0000000000000005 RDI: ffffab8c00677de8
[37546.205266] RBP: ffffab8c00677ec0 R08: 000000000000aff8 R09: 000000007ffff000
[37546.209978] R10: ffffab8c00677e28 R11: ffff9b14b2f2db38 R12: ffffab8c00677ee0
[37546.214685] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
[37546.219328] FS:  00007f5b0bf8a700(0000) GS:ffff9b1579700000(0000) knlGS:0000000000000000
[37546.225934] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[37546.230028] CR2: 0000000000000000 CR3: 0000000095030004 CR4: 00000000007606e0
[37546.234754] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[37546.239405] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[37546.244081] PKRU: 55555554
[37546.246905] Call Trace:
[37546.249640]  ? __fget+0x32/0x80
[37546.252674]  do_writev+0xde/0x120
[37546.255738]  ? do_writev+0xde/0x120
[37546.258922]  __x64_sys_writev+0x1c/0x20
[37546.262254]  do_syscall_64+0x5a/0x130
[37546.265456]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[37546.269233] RIP: 0033:0x7f5b6c61a7e7
[37546.272383] Code: c3 66 90 41 54 55 41 89 d4 53 48 89 f5 89 fb 48 83 ec 10 e8 bb a0 01 00 44 89 e2 41 89 c0 48 89 ee 89 df b8 14 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 f4 a0 01 00 48
[37546.284997] RSP: 002b:00007f5b0bf896b0 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
[37546.291234] RAX: ffffffffffffffda RBX: 0000000000000240 RCX: 00007f5b6c61a7e7
[37546.296009] RDX: 0000000000000001 RSI: 00007f5b3c0508e0 RDI: 0000000000000240
[37546.300684] RBP: 00007f5b3c0508e0 R08: 0000000000000000 R09: 00000000e7843818
[37546.305324] R10: 0000000000006dfa R11: 0000000000000293 R12: 0000000000000001
[37546.309979] R13: 0000000000000001 R14: 00007f5b0bf89740 R15: 00007f5b4c161800
[37546.314661] Modules linked in: iscsi_target_mod target_core_mod binfmt_misc isst_if_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper intel_rapl_perf ppdev serio_raw parport_pc parport nfsd nf_tables auth_rpcgss nfnetlink nfs_acl lockd grace sch_fq_codel sunrpc ip_tables x_tables autofs4 zfs(POE) zunicode(POE) zlua(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear i2c_piix4 ena i2c_core
[37546.349568] CR2: 0000000000000000

Doing a bisect, I've narrowed it down to this commit:

commit da92d5cbb38cea3a860b8a6bb8ee21f9129e7d7c (HEAD -> bisect)
Author: Matthew Macy <mmacy@freebsd.org>
Date:   Thu Nov 21 09:32:57 2019 -0800

    Add zfs_file_* interface, remove vnodes

    Provide a common zfs_file_* interface which can be implemented on all
    platforms to perform normal file access from either the kernel module
    or the libzpool library.

    This allows all non-portable vnode_t usage in the common code to be
    replaced by the new portable zfs_file_t.  The associated vnode and
    kobj compatibility functions, types, and macros have been removed
    from the SPL.  Moving forward, vnodes should only be used in platform
    specific code when provided by the native operating system.

    Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Igor Kozhukhov <igor@dilos.org>
    Reviewed-by: Jorgen Lundman <lundman@lundman.net>
    Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
    Closes #9556

From the crash dumps I see that the file's inode pointer is either still NULL in the dump, or was NULL when the faulting instruction executed and had been populated by the time the dump was captured.

Here's an example of each case; both crashed at the same instruction:

0xffffffff824ced12 <vfs_writev+98>:     callq  0xffffffff8270afb0 <import_iovec>
0xffffffff824ced17 <vfs_writev+103>:    test   %rax,%rax
0xffffffff824ced1a <vfs_writev+106>:    js     0xffffffff824ced6f <vfs_writev+191>
0xffffffff824ced1c <vfs_writev+108>:    mov    0x20(%rbx),%rdx
0xffffffff824ced20 <vfs_writev+112>:    movzwl (%rdx),%eax <==== PANIC HERE
0xffffffff824ced23 <vfs_writev+115>:    and    $0xf000,%ax

This is equivalent to this code in the kernel:

static inline void file_start_write(struct file *file)
{
    /* PANIC here when we dereference f_inode */
    if (!S_ISREG(file_inode(file)->i_mode))
        return;
    __sb_start_write(file_inode(file)->i_sb, SB_FREEZE_WRITE, true);
}

static inline struct inode *file_inode(const struct file *f)
{
    return f->f_inode;
}
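
For reference, the `and $0xf000,%ax` at vfs_writev+115, together with the `66 3d 00 80` bytes that follow it in the Code: line (a 16-bit compare against 0x8000), looks like the S_ISREG() test on i_mode. Below is a minimal user-space sketch of that check, assuming the usual Linux mode bits. The helper names are made up for illustration and are not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Mirrors the two instructions: and $0xf000,%ax ; cmp $0x8000,%ax */
static bool is_regular_by_mask(mode_t mode)
{
	return (mode & 0xf000) == 0x8000;
}

/* The hand-written mask test should agree with the S_ISREG() macro. */
static bool mask_agrees_with_macro(mode_t mode)
{
	return is_regular_by_mask(mode) == (S_ISREG(mode) != 0);
}

/* A few spot checks on well-known mode values. */
static void check_examples(void)
{
	assert(is_regular_by_mask(0100644));	/* regular file, rw-r--r-- */
	assert(!is_regular_by_mask(0040755));	/* directory, rwxr-xr-x */
	assert(mask_agrees_with_macro(0100644));
	assert(mask_agrees_with_macro(0040755));
}
```

The point is only that the faulting instruction is the 16-bit read of i_mode feeding this test, so the fault happens before any file type decision is made.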

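So we know that %rbx holds our struct file *, the f_inode member is at offset 0x20, and %rdx receives the f_inode pointer that is then dereferenced. Looking at two crashes, we see both cases, one where f_inode is still NULL in the dump and one where it has been populated by then: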

  • NULL when executed and NULL in the crash dump:
sdb> echo 0xffff900db1fe3800 | cast struct file *  | member f_inode
(struct inode *)0x0
  • NULL when executed but accessible in the crash dump:
sdb> echo 0xffff9b14b2f2db00 | cast struct file * | member f_inode
(struct inode *)0xffff9b1554ac8048
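
To make the offset arithmetic concrete, here is a rough user-space mock of the layout. It is a sketch under assumptions: the real struct file has many more members, and I am only assuming the leading ones (a two-pointer union followed by a two-pointer path) from the 0x20 offset seen on this 64-bit build. All `mock_*` names and `inode_load_would_fault` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; only the leading members matter for the offset. */
struct mock_inode {
	unsigned short i_mode;	/* first member, read by movzwl (%rdx) */
};

struct mock_path {
	void *mnt;
	void *dentry;
};

struct mock_file {
	union {
		void *fu_llist;					/* one pointer */
		struct { void *next; void *func; } fu_rcuhead;	/* two pointers */
	} f_u;
	struct mock_path f_path;
	struct mock_inode *f_inode;	/* offset 0x20 with 8-byte pointers */
};

/* i_mode at offset 0 is why the fault address (CR2) is exactly 0. */
static_assert(offsetof(struct mock_inode, i_mode) == 0, "i_mode must be first");

/* True when the 16-bit read through f_inode would fault for this file. */
static bool inode_load_would_fault(const struct mock_file *f)
{
	return f->f_inode == NULL;
}
```

With 8-byte pointers, `offsetof(struct mock_file, f_inode)` comes out as 0x20, matching the `mov 0x20(%rbx),%rdx` in the disassembly.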

Sea-you commented Aug 20, 2020

@grwilson Do you think this issue could be related to this one? https://bugs.launchpad.net/ubuntu/+source/linux-signed-hwe/+bug/1885265

We're experiencing regular kernel panics, seemingly from NFS, but we suspect that this has something to do with ZFS (0.8.3 as of now); we never saw the same with 0.7.x versions.

As a first step I posted this issue on the NFS mailing list, and this reply backs up our hunch: https://marc.info/?l=linux-nfs&m=159775983016484&w=2


grwilson commented Aug 21, 2020

@Sea-you this looks pretty different. With the issue I'm seeing we're running with ZFS root. Are you also using ZFS root or is ZFS only being used as the backend for NFSv4 filesystems? Are you able to get a crash dump?

grwilson self-assigned this Jun 29, 2021

grwilson commented

We have seen some new panics and also soft lockups reported that all appear to be related to this issue. Here are some of the stacks:

Crashes:

#10 [ffffc33656557e10] common_file_perm at ffffffffaaaa43de
#11 [ffffc33656557e38] apparmor_file_permission at ffffffffaaaa45ca
#12 [ffffc33656557e48] security_file_permission at ffffffffaaa54873
#13 [ffffc33656557e80] rw_verify_area at ffffffffaa8da963
#14 [ffffc33656557ea0] vfs_read at ffffffffaa8dcd58
#15 [ffffc33656557ed8] ksys_read at ffffffffaa8dcf17
#16 [ffffc33656557f20] __x64_sys_read at ffffffffaa8dcf6a
#17 [ffffc33656557f30] do_syscall_64 at ffffffffaa604417
[ 7915.914608]  ? __fget+0x40/0x80
[ 7915.917928]  ? __fget_light+0x59/0x70
[ 7915.921383]  do_writev+0x6d/0x110
[ 7915.924629]  __x64_sys_writev+0x1c/0x20
[ 7915.928164]  do_syscall_64+0x57/0x190
[ 7915.931590]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Soft lockup:

[ 7915.914608]  ? __fget+0x40/0x80
[ 7915.917928]  ? __fget_light+0x59/0x70
[ 7915.921383]  do_writev+0x6d/0x110
[ 7915.924629]  __x64_sys_writev+0x1c/0x20
[ 7915.928164]  do_syscall_64+0x57/0x190
[ 7915.931590]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[93680.708058]  fget+0x18/0x20
[93680.708151]  zfs_file_get+0x9/0x10 [zfs]
[93680.708215]  zfs_ioc_send_new+0x10c/0x1c0 [zfs]
[93680.708220]  ? nvlist_alloc+0x27/0x30 [znvpair]
[93680.708280]  zfsdev_ioctl_common+0x1f6/0x640 [zfs]
[93680.708341]  zfsdev_ioctl+0x54/0xe0 [zfs]
[93680.708343]  do_vfs_ioctl+0xa9/0x640
[93680.708346]  ? __audit_syscall_entry+0xdd/0x130
[93680.708347]  ksys_ioctl+0x67/0x90
[93680.708348]  __x64_sys_ioctl+0x1a/0x20
[93680.708350]  do_syscall_64+0x57/0x190
[93680.708353]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

After some debugging, the problem was tracked down to the commit mentioned above, which can result in corrupted file reference counts.
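
As a toy illustration of why a corrupted reference count can surface as a NULL f_inode, here is a single-threaded user-space model. It is not the ZFS or kernel code, and it does not claim to reproduce the actual bug. The `toy_*` names are hypothetical, and teardown is simply modelled as clearing the inode pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_inode {
	unsigned short i_mode;
};

struct toy_file {
	int refcount;
	struct toy_inode *inode;
};

static void toy_get(struct toy_file *f)
{
	f->refcount++;
}

/* Dropping the last reference tears the file down (modelled as inode = NULL). */
static void toy_put(struct toy_file *f)
{
	assert(f->refcount > 0);
	if (--f->refcount == 0)
		f->inode = NULL;
}

/*
 * The opener holds one reference for its fd. A second code path takes a
 * reference and releases it; with extra_put set it releases one more than
 * it took. Returns true if the opener is left looking at a NULL inode.
 */
static bool opener_sees_null_inode(bool extra_put)
{
	struct toy_inode ino = { 0100644 };
	struct toy_file f = { .refcount = 1, .inode = &ino };

	toy_get(&f);		/* 1 -> 2 */
	toy_put(&f);		/* 2 -> 1 */
	if (extra_put)
		toy_put(&f);	/* 1 -> 0: the opener's reference is gone */

	return f.inode == NULL;
}
```

In this model a balanced get/put leaves the opener's file intact, while one unbalanced put tears it down underneath a holder that still expects a valid inode. In a real kernel the freed struct file could also be reused, which would be consistent with f_inode being populated again by the time of the dump.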
