Mergerfs Segfault #1004

Closed · Go0oSer opened this issue Jan 7, 2022 · 47 comments

@Go0oSer commented Jan 7, 2022

Describe the bug

  • MergerFS segfaults with error 4:
    Jan 07 08:36:38 beast kernel: [149142.660381] mergerfs[345]: segfault at 10 ip 000055dfac03ccb2 sp 00007fa1f77c53e0 error 4 in mergerfs[55dfabff3000+58000]
    Jan 07 08:36:38 beast kernel: [149142.660396] Code: 80 7b 02 00 0f 85 70 fe ff ff 49 8d 6f 60 31 db 48 89 ef e8 20 7d fb ff 4c 89 ee 4c 89 ff e8 f5 d6 ff ff 48 89 ef 48 8b 40 20 <4c> 8b 68 10 e8 d5 77 fb ff e9 42 fe ff ff 48 89 df e8 c8 77 fb ff
  • umount -f /mnt/storage (the mergerfs mountpoint) reports that the device is busy.
  • mount -o remount /mnt/storage fails with the error:
    fuse: bad mount point /mnt/storage "Input/output error"

To Reproduce

I'm not exactly sure how to reproduce the problem; it seems fairly random and I can't force it to happen. I'll log into a client system and notice, for example, that the SMB mount isn't working. I'll then log onto the server and run ls -l /mnt/storage (the mergerfs volume), and the permissions and other information for the mergerfs mount are peppered with ? characters.

Expected behavior

The mount point operates normally. Instead, it is becoming unavailable.

System information:

  • OS, kernel version (uname -a): Linux beast 5.11.0-44-generic #48~20.04.2-Ubuntu SMP Tue Dec 14 15:36:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • mergerfs version (mergerfs -V): 2.33.3
  • mergerfs settings:
    /mnt/disk* /mnt/storage fuse.mergerfs defaults,nonempty,allow_other,noforget,use_ino,inodecalc=path-hash,cache.files=off,moveonenospc=true,dropcacheonclose=true,minfreespace=200G,fsname=mergerfs 0 0
  • List of drives, filesystems, & sizes:
    • df -h
      Filesystem Size Used Avail Use% Mounted on
      udev 7.8G 0 7.8G 0% /dev
      tmpfs 1.6G 4.7M 1.6G 1% /run
      /dev/sda2 49G 24G 23G 52% /
      tmpfs 7.9G 0 7.9G 0% /dev/shm
      tmpfs 5.0M 4.0K 5.0M 1% /run/lock
      tmpfs 7.9G 0 7.9G 0% /sys/fs/cgroup
      mergerfs 29T 3.3T 25T 13% /mnt/storage
      /dev/loop0 66M 66M 0 100% /snap/gtk-common-themes/1515
      /dev/sda1 511M 5.3M 506M 2% /boot/efi
      /dev/sdi1 2.7T 168G 2.4T 7% /mnt/disk4
      /dev/sdb1 2.7T 174G 2.4T 7% /mnt/disk3
      /dev/sdd1 7.3T 1.5T 5.5T 22% /mnt/disk1
      /dev/sdc1 7.3T 1.5T 5.4T 22% /mnt/disk2
      /dev/sdh1 9.1T 21G 8.6T 1% /mnt/disk5
      /dev/sde1 13T 1.5T 11T 13% /mnt/parity1
      /dev/loop2 66M 66M 0 100% /snap/gtk-common-themes/1519
      /dev/loop3 9.2M 9.2M 0 100% /snap/canonical-livepatch/119
      /dev/loop1 128K 128K 0 100% /snap/bare/5
      /dev/loop5 9.2M 9.2M 0 100% /snap/canonical-livepatch/126
      /dev/loop4 56M 56M 0 100% /snap/core18/2253
      /dev/loop6 100M 100M 0 100% /snap/core/11798
      /dev/loop7 56M 56M 0 100% /snap/core18/2246
      /dev/loop8 219M 219M 0 100% /snap/gnome-3-34-1804/72
      /dev/loop9 248M 248M 0 100% /snap/gnome-3-38-2004/87
      /dev/loop10 43M 43M 0 100% /snap/snapd/14066
      /dev/loop11 44M 44M 0 100% /snap/snapd/14295
      /dev/loop12 55M 55M 0 100% /snap/snap-store/558
      /dev/loop13 62M 62M 0 100% /snap/core20/1270
      /dev/loop14 100M 100M 0 100% /snap/core/11993
      /dev/loop15 51M 51M 0 100% /snap/snap-store/547
      /dev/loop16 219M 219M 0 100% /snap/gnome-3-34-1804/77
      /dev/loop17 62M 62M 0 100% /snap/core20/1242
    • lsblk
      NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
      loop0 7:0 0 65.1M 1 loop /snap/gtk-common-themes/1515
      loop1 7:1 0 4K 1 loop /snap/bare/5
      loop2 7:2 0 65.2M 1 loop /snap/gtk-common-themes/1519
      loop3 7:3 0 9M 1 loop /snap/canonical-livepatch/119
      loop4 7:4 0 55.5M 1 loop /snap/core18/2253
      loop5 7:5 0 9M 1 loop /snap/canonical-livepatch/126
      loop6 7:6 0 99.5M 1 loop /snap/core/11798
      loop7 7:7 0 55.5M 1 loop /snap/core18/2246
      loop8 7:8 0 219M 1 loop /snap/gnome-3-34-1804/72
      loop9 7:9 0 247.9M 1 loop /snap/gnome-3-38-2004/87
      loop10 7:10 0 42.2M 1 loop /snap/snapd/14066
      loop11 7:11 0 43.3M 1 loop /snap/snapd/14295
      loop12 7:12 0 54.2M 1 loop /snap/snap-store/558
      loop13 7:13 0 61.9M 1 loop /snap/core20/1270
      loop14 7:14 0 99.4M 1 loop /snap/core/11993
      loop15 7:15 0 51M 1 loop /snap/snap-store/547
      loop16 7:16 0 219M 1 loop /snap/gnome-3-34-1804/77
      loop17 7:17 0 61.9M 1 loop /snap/core20/1242
      sda 8:0 0 50G 0 disk
      ├─sda1 8:1 0 512M 0 part /boot/efi
      └─sda2 8:2 0 49.5G 0 part /
      sdb 8:16 0 2.7T 0 disk
      └─sdb1 8:17 0 2.7T 0 part /mnt/disk3
      sdc 8:32 0 7.3T 0 disk
      └─sdc1 8:33 0 7.3T 0 part /mnt/disk2
      sdd 8:48 0 7.3T 0 disk
      └─sdd1 8:49 0 7.3T 0 part /mnt/disk1
      sde 8:64 0 12.8T 0 disk
      └─sde1 8:65 0 12.8T 0 part /mnt/parity1
      sdf 8:80 0 7.3T 0 disk
      ├─sdf1 8:81 0 7.3T 0 part
      └─sdf9 8:89 0 8M 0 part
      sdg 8:96 0 7.3T 0 disk
      ├─sdg1 8:97 0 7.3T 0 part
      └─sdg9 8:105 0 8M 0 part
      sdh 8:112 0 9.1T 0 disk
      └─sdh1 8:113 0 9.1T 0 part /mnt/disk5
      sdi 8:128 0 2.7T 0 disk
      └─sdi1 8:129 0 2.7T 0 part /mnt/disk4
      sr0 11:0 1 1024M 0 rom
  • A strace of the application having a problem:
    • strace -fvTtt -s 256 -o /tmp/app.strace.txt <cmd>
    • strace -fvTtt -s 256 -o /tmp/app.strace.txt -p <appPID>
  • strace of mergerfs while the app tried to do its thing:
    • strace -fvTtt -s 256 -p <mergerfsPID> -o /tmp/mergerfs.strace.txt

Additional context

This morning I rebooted the machine after the issue arose for the second time in about two weeks. If it happens again, I will work to get the strace data for you. Apologies for not being able to supply all of the requested data.

@Go0oSer (Author) commented Jan 7, 2022

Pulled the following trace from journalctl. Not sure how useful it will be, but I figured I would supply it.

Jan 07 09:00:42.836335 beast kernel: ------------[ cut here ]------------
Jan 07 09:00:42.836504 beast kernel: nfsd: non-standard errno: -107
Jan 07 09:00:42.836554 beast kernel: WARNING: CPU: 0 PID: 3147 at fs/nfsd/nfsproc.c:852 nfserrno+0x5a/0x70 [nfsd]
Jan 07 09:00:42.836782 beast kernel: Modules linked in: rfcomm xt_connmark xt_mark iptable_mangle xt_comment iptable_raw wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 li>
Jan 07 09:00:42.837036 beast kernel: usbserial btbcm ttm efi_pstore serio_raw btintel bluetooth drm_kms_helper snd_seq_device ecdh_generic joydev ecc snd_timer cec snd rc_core fb_sys_fops syscopyarea >
Jan 07 09:00:42.837094 beast kernel: CPU: 0 PID: 3147 Comm: nfsd Tainted: P O 5.11.0-44-generic #48~20.04.2-Ubuntu
Jan 07 09:00:42.837132 beast kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Jan 07 09:00:42.837167 beast kernel: RIP: 0010:nfserrno+0x5a/0x70 [nfsd]
Jan 07 09:00:42.837196 beast kernel: Code: 00 00 b8 00 00 00 05 74 01 c3 55 89 fe 48 c7 c7 8c 4d 59 c0 48 89 e5 48 83 ec 08 c6 05 02 f0 04 00 01 89 45 fc e8 0c f2 83 cf <0f> 0b 8b 45 fc c9 c3 66 66 2e >
Jan 07 09:00:42.837229 beast kernel: RSP: 0018:ffffc07b053b3de0 EFLAGS: 00010282
Jan 07 09:00:42.837260 beast kernel: RAX: 0000000000000000 RBX: ffff9f23c4530000 RCX: 0000000000000027
Jan 07 09:00:42.837290 beast kernel: RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff9f26efc18ac8
Jan 07 09:00:42.837312 beast kernel: RBP: ffffc07b053b3de8 R08: ffff9f26efc18ac0 R09: ffffc07b053b3ba0
Jan 07 09:00:42.837333 beast kernel: R10: 0000000000000001 R11: 0000000000000001 R12: ffff9f23c4520000
Jan 07 09:00:42.837353 beast kernel: R13: ffff9f23c4528000 R14: ffff9f23a1ac3000 R15: 0000000000000018
Jan 07 09:00:42.837384 beast kernel: FS: 0000000000000000(0000) GS:ffff9f26efc00000(0000) knlGS:0000000000000000
Jan 07 09:00:42.837417 beast kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 07 09:00:42.837448 beast kernel: CR2: 0000563d50c7db68 CR3: 00000001275dc000 CR4: 00000000000006f0
Jan 07 09:00:42.837471 beast kernel: Call Trace:
Jan 07 09:00:42.837503 beast kernel: nfsd3_proc_getattr+0xaf/0x110 [nfsd]
Jan 07 09:00:42.837548 beast kernel: nfsd_dispatch+0x117/0x200 [nfsd]
Jan 07 09:00:42.837592 beast kernel: svc_process_common+0x49f/0x730 [sunrpc]
Jan 07 09:00:42.837620 beast kernel: ? nfsd_svc+0x340/0x340 [nfsd]
Jan 07 09:00:42.837641 beast kernel: svc_process+0xbc/0xf0 [sunrpc]
Jan 07 09:00:42.837676 beast kernel: nfsd+0xed/0x150 [nfsd]
Jan 07 09:00:42.837712 beast kernel: ? nfsd_destroy+0x60/0x60 [nfsd]
Jan 07 09:00:42.837740 beast kernel: kthread+0x12b/0x150
Jan 07 09:00:42.837762 beast kernel: ? set_kthread_struct+0x40/0x40
Jan 07 09:00:42.837792 beast kernel: ret_from_fork+0x22/0x30
Jan 07 09:00:42.837819 beast kernel: ---[ end trace 4ab54518410b1377 ]---

@trapexit (Owner) commented Jan 7, 2022

That's about NFS. Are you serving the mergerfs mount via NFS, or is an NFS mount a branch in mergerfs?

@Go0oSer (Author) commented Jan 7, 2022

That's about NFS. Are you serving the mergerfs mount via NFS, or is an NFS mount a branch in mergerfs?

I am serving the mergerfs mount via NFS to other clients in my network.

@Go0oSer (Author) commented Jan 9, 2022

This morning there was another segfault waiting for me. I was able to capture strace data from the umount and mount commands:
mount.strace.txt
umount.strace.txt
sudo_umount.strace.txt
sudo_mount.strace.txt

Kernel logging:

[141401.892702] mergerfs[347]: segfault at 10 ip 0000561248636cb2 sp 00007fafdba733e0 error 4 in mergerfs[5612485ed000+58000]
[141401.892718] Code: 80 7b 02 00 0f 85 70 fe ff ff 49 8d 6f 60 31 db 48 89 ef e8 20 7d fb ff 4c 89 ee 4c 89 ff e8 f5 d6 ff ff 48 89 ef 48 8b 40 20 <4c> 8b 68 10 e8 d5 77 fb ff e9 42 fe ff ff 48 89 df e8 c8 77 fb ff
[141412.098744] ------------[ cut here ]------------

Output of the ls command on the mergerfs mountpoint:

gregory@beast:~/tmp$ ls -al /mnt/
ls: cannot access '/mnt/storage': Input/output error
total 37
drwxr-xr-x 11 root    root    4096 Mar 10  2021 .
drwxr-xr-x 20 root    root    4096 Mar  6  2021 ..
drwxr-xr-x  2 root    root    4096 Feb 12  2021 butter
drwxrwxr-x  4 root    root    4096 Jan  9 01:39 disk1
drwxrwxr-x  4 root    root    4096 Jan  9 01:39 disk2
drwxrwxr-x  6 root    root    4096 Nov 21 20:54 disk3
drwxrwxr-x  5 root    root    4096 Nov 23 14:52 disk4
drwx------  4 gregory gregory 4096 Nov 23 14:50 disk5
drwx------  3 gregory gregory 4096 Nov 21 20:06 parity1
d?????????  ?  ?       ?          ?            ? storage
drwxr-xr-x  4 root    root       4 Feb 17  2021 tank
I gather this may not be useful from a software developer's perspective; I only sort of understand the strace output. I'll leave the system in this state for a bit if you want me to do anything else.

@trapexit (Owner) commented Jan 9, 2022

Once it has crashed there isn't anything to gather from strace. In this case the usefulness would come from tracing mergerfs as it segfaults.

@Go0oSer (Author) commented Jan 9, 2022

OK, thanks. That will be difficult because these seem random at this point, but I'll see what I can do. To illustrate the randomness, here are the kernel messages for the three times it seems to have happened. It may have started with the previous or most recent version; I'll start backtracking any other changes I've made. The system mainly runs Docker containers and serves up file storage, so in terms of the OS and packages there is not all that much churn.

gregory@beast:~$ cat /var/log/syslog | grep mergerfs
Jan 9 00:27:14 beast kernel: [141401.892702] mergerfs[347]: segfault at 10 ip 0000561248636cb2 sp 00007fafdba733e0 error 4 in mergerfs[5612485ed000+58000]
gregory@beast:~$ journalctl -b -1 | grep mergerfs
Jan 07 08:36:38 beast kernel: mergerfs[345]: segfault at 10 ip 000055dfac03ccb2 sp 00007fa1f77c53e0 error 4 in mergerfs[55dfabff3000+58000]
gregory@beast:~$ journalctl -b -2 | grep mergerfs
Jan 05 15:09:16 beast kernel: mergerfs[343]: segfault at 10 ip 0000561e127aecb2 sp 00007fd702d2b3e0 error 4 in mergerfs[561e12765000+58000]

@trapexit (Owner) commented Jan 9, 2022

That's just saying that mergerfs died; not super useful by itself. You would have to rebuild from source with debugging enabled, and with core files enabled, to grab a core dump from the segfault.

@Go0oSer (Author) commented Jan 9, 2022

OK. I'll see if I can get that in place.

@Go0oSer (Author) commented Jan 12, 2022

To be sure, the build flag I should use is make DEBUG=1, correct?

@trapexit (Owner)

Yup. The executable will end up in the ./build directory.

@Go0oSer (Author) commented Jan 14, 2022

I compiled the software in DEBUG mode last night and ran it. It had crashed by this morning. I was unable to gather a core dump, as I'm not that familiar with how to capture one; based on the Ubuntu documentation I was anticipating that a file would be created in /var/crash. Other than replacing the executable with the newly compiled file, is there anything else I should be doing in order to gather the proper information for you?

That said, there was a crash file titled _usr_bin_mergerfs.0.crash from Jan 5 in /var/crash, which would have been from before I compiled with the DEBUG flag enabled. Would this be of any use?

@trapexit (Owner)

Sorry, I should have been more thorough.

https://www.ibm.com/support/pages/how-do-i-enable-core-dumps

You should check that core dumps are enabled: ulimit -c will show you the current limit. If it is 0, then you need to run ulimit -c unlimited before you run mergerfs, in the same terminal session: sudo to root, run ulimit, run mergerfs, and get it to crash.
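
For what it's worth, ulimit -c adjusts the per-process RLIMIT_CORE resource limit, which a program can also raise for itself via setrlimit(2). A minimal, self-contained C sketch of that mechanism (illustration only, not mergerfs code):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    /* Equivalent of `ulimit -c unlimited` (up to the hard limit):
       lift the soft core-file size limit for this process and for
       anything it later exec()s, e.g. mergerfs.  Raising the hard
       limit itself would require privilege. */
    struct rlimit rl;
    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("getrlimit(RLIMIT_CORE)");
        return 1;
    }
    rl.rlim_cur = rl.rlim_max;  /* raise soft limit to the hard limit */
    if (setrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("setrlimit(RLIMIT_CORE)");
        return 1;
    }
    printf("core file size limit now %llu\n",
           (unsigned long long)rl.rlim_cur);
    return 0;
}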

@Go0oSer (Author) commented Jan 21, 2022

I wanted to send an update on this issue. I found that the server I run mergerfs on (and use NFS to serve up the storage) was, for whatever reason, not running the rpc-statd daemon. I enabled the service and haven't had a segfault since.

I had been seeing locking messages in the kernel ring buffer, as well as the segfaults and the NFS stack traces. It seems that the locking with NFS may have been causing mergerfs to hang or crash.

I'm not sure I'm going to be able to gather the proper information to help with the segfaults, but I think they were stemming from another issue with my server in the first place.

@trapexit (Owner)

Curious. I'm not sure why that would cause segvs unless somehow it was leading mergerfs to misunderstand a situation.

We can keep this open for now and if I find time I can try to reproduce it.

@DomiStyle

I'm also getting segfaults with error 4 since switching to Fedora Server. rpc-statd is already running on Fedora by default (it's not on Ubuntu), so that's not the cause, but it seems like this started happening once I began using NFSv4, so it might really be related to locks. I have disabled my NFSv4 mounts for now and will see if it still happens.

Not sure if the logs are useful but just in case:

Logs
Feb 15 17:12:20 stor01 kernel: show_signal_msg: 20 callbacks suppressed
Feb 15 17:12:20 stor01 kernel: mergerfs[868]: segfault at 10 ip 000000000044fc7a sp 00007fbaab7f5360 error 4 in mergerfs[406000+55000]
Feb 15 17:12:20 stor01 kernel: Code: 80 7b 02 00 0f 85 41 fe ff ff 49 8d 6f 60 31 db 48 89 ef e8 08 6e fb ff 4c 89 ee 4c 89 ff e8 9d cb ff ff 48 89 ef 48 8b 40 20 <4c> 8b 68 10 e8 dd 68 fb ff e9 13 fe ff ff 0f 1f 84 00 00 00 00 00
Feb 15 17:12:20 stor01 systemd[1]: Created slice Slice /system/systemd-coredump.
Feb 15 17:12:20 stor01 audit: BPF prog-id=50 op=LOAD
Feb 15 17:12:20 stor01 audit: BPF prog-id=51 op=LOAD
Feb 15 17:12:20 stor01 audit: BPF prog-id=52 op=LOAD
Feb 15 17:12:20 stor01 systemd[1]: Started Process Core Dump (PID 1983/UID 0).
Feb 15 17:12:20 stor01 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-coredump@0-1983-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Feb 15 17:12:20 stor01 kernel: xfs filesystem being remounted at /run/systemd/unit-root/var/tmp supports timestamps until 2038 (0x7fffffff)
Feb 15 17:12:20 stor01 kernel: xfs filesystem being remounted at /run/systemd/unit-root/var/lib/systemd/coredump supports timestamps until 2038 (0x7fffffff)
Feb 15 17:12:20 stor01 kernel: xfs filesystem being remounted at /run/systemd/unit-root/var/lib/systemd/coredump supports timestamps until 2038 (0x7fffffff)
Feb 15 17:12:20 stor01 kernel: xfs filesystem being remounted at /run/systemd/unit-root/var/tmp supports timestamps until 2038 (0x7fffffff)
Feb 15 17:12:20 stor01 systemd-coredump[1984]: [🡕] Process 860 (mergerfs) of user 0 dumped core.
                                               
                                               Found module linux-vdso.so.1 with build-id: fb3d9d23e731a0e438355b1502ea5441c3ae2d6d
                                               Found module libnss_systemd.so.2 with build-id: babba3904ec3983f952d1a2e49984c3d375448d8
                                               Found module libnss_sss.so.2 with build-id: f507413541bb0cb10590bccb2ead1bed2599c74b
                                               Found module ld-linux-x86-64.so.2 with build-id: 4e2c368b1240116994daff7f0f248a01f6e53198
                                               Found module libc.so.6 with build-id: 881b4882a363c5c193cddf6e4991e1873f593364
                                               Found module libgcc_s.so.1 with build-id: 88564abce789aa42536da1247a57ff6062d61dcb
                                               Found module libm.so.6 with build-id: 47343f97c29c4e91c841e2bfd344329a264075ad
                                               Found module libstdc++.so.6 with build-id: 64dc8524cb0109c7f3d3c0ef3d4d6ec9979db75c
                                               Found module mergerfs with build-id: d170192b6cef5406aae729ad387b5bb5ab309fd3
                                               Stack trace of thread 868:
                                               #0  0x000000000044fc7a fuse_lib_lookup (mergerfs + 0x4fc7a)
                                               #1  0x0000000000453bd2 fuse_ll_process_buf (mergerfs + 0x53bd2)
                                               #2  0x0000000000451d95 fuse_do_work (mergerfs + 0x51d95)
                                               #3  0x00007fbab3a4cb1a start_thread (libc.so.6 + 0x8db1a)
                                               #4  0x00007fbab3ad1650 __clone3 (libc.so.6 + 0x112650)
                                               
                                               Stack trace of thread 862:
                                               #0  0x00007fbab3ac07ac read (libc.so.6 + 0x1017ac)
                                               #1  0x000000000045a080 fuse_kern_chan_receive (mergerfs + 0x5a080)
                                               #2  0x0000000000454849 fuse_ll_receive_buf (mergerfs + 0x54849)
                                               #3  0x0000000000451d65 fuse_do_work (mergerfs + 0x51d65)
                                               #4  0x00007fbab3a4cb1a start_thread (libc.so.6 + 0x8db1a)
                                               #5  0x00007fbab3ad1650 __clone3 (libc.so.6 + 0x112650)
                                               
                                               Stack trace of thread 864:
                                               #0  0x00007fbab3ac07ac read (libc.so.6 + 0x1017ac)
                                               #1  0x000000000045a080 fuse_kern_chan_receive (mergerfs + 0x5a080)
                                               #2  0x0000000000454849 fuse_ll_receive_buf (mergerfs + 0x54849)
                                               #3  0x0000000000451d65 fuse_do_work (mergerfs + 0x51d65)
                                               #4  0x00007fbab3a4cb1a start_thread (libc.so.6 + 0x8db1a)
                                               #5  0x00007fbab3ad1650 __clone3 (libc.so.6 + 0x112650)
                                               
                                               Stack trace of thread 867:
                                               #0  0x00007fbab3ac07ac read (libc.so.6 + 0x1017ac)
                                               #1  0x000000000045a080 fuse_kern_chan_receive (mergerfs + 0x5a080)
                                               #2  0x0000000000454849 fuse_ll_receive_buf (mergerfs + 0x54849)
                                               #3  0x0000000000451d65 fuse_do_work (mergerfs + 0x51d65)
                                               #4  0x00007fbab3a4cb1a start_thread (libc.so.6 + 0x8db1a)
                                               #5  0x00007fbab3ad1650 __clone3 (libc.so.6 + 0x112650)
                                               
                                               Stack trace of thread 863:
                                               #0  0x00007fbab3ac07ac read (libc.so.6 + 0x1017ac)
                                               #1  0x000000000045a080 fuse_kern_chan_receive (mergerfs + 0x5a080)
                                               #2  0x0000000000454849 fuse_ll_receive_buf (mergerfs + 0x54849)
                                               #3  0x0000000000451d65 fuse_do_work (mergerfs + 0x51d65)
                                               #4  0x00007fbab3a4cb1a start_thread (libc.so.6 + 0x8db1a)
                                               #5  0x00007fbab3ad1650 __clone3 (libc.so.6 + 0x112650)
                                               
                                               Stack trace of thread 865:
                                               #0  0x00007fbab3ac07ac read (libc.so.6 + 0x1017ac)
                                               #1  0x000000000045a080 fuse_kern_chan_receive (mergerfs + 0x5a080)
                                               #2  0x0000000000454849 fuse_ll_receive_buf (mergerfs + 0x54849)
                                               #3  0x0000000000451d65 fuse_do_work (mergerfs + 0x51d65)
                                               #4  0x00007fbab3a4cb1a start_thread (libc.so.6 + 0x8db1a)
                                               #5  0x00007fbab3ad1650 __clone3 (libc.so.6 + 0x112650)
                                               
                                               Stack trace of thread 869:
                                               #0  0x00007fbab3ac07ac read (libc.so.6 + 0x1017ac)
                                               #1  0x000000000045a080 fuse_kern_chan_receive (mergerfs + 0x5a080)
                                               #2  0x0000000000454849 fuse_ll_receive_buf (mergerfs + 0x54849)
                                               #3  0x0000000000451d65 fuse_do_work (mergerfs + 0x51d65)
                                               #4  0x00007fbab3a4cb1a start_thread (libc.so.6 + 0x8db1a)
                                               #5  0x00007fbab3ad1650 __clone3 (libc.so.6 + 0x112650)
                                               
                                               Stack trace of thread 866:
                                               #0  0x00007fbab3ac07ac read (libc.so.6 + 0x1017ac)
                                               #1  0x000000000045a080 fuse_kern_chan_receive (mergerfs + 0x5a080)
                                               #2  0x0000000000454849 fuse_ll_receive_buf (mergerfs + 0x54849)
                                               #3  0x0000000000451d65 fuse_do_work (mergerfs + 0x51d65)
                                               #4  0x00007fbab3a4cb1a start_thread (libc.so.6 + 0x8db1a)
                                               #5  0x00007fbab3ad1650 __clone3 (libc.so.6 + 0x112650)
                                               
                                               Stack trace of thread 860:
                                               #0  0x00007fbab3a4973a __futex_abstimed_wait_common (libc.so.6 + 0x8a73a)
                                               #1  0x00007fbab3a54ac8 __new_sem_wait_slow64.constprop.0 (libc.so.6 + 0x95ac8)
                                               #2  0x0000000000451ff2 fuse_session_loop_mt (mergerfs + 0x51ff2)
                                               #3  0x000000000045a416 fuse_loop_mt (mergerfs + 0x5a416)
                                               #4  0x000000000045777e fuse_main_real (mergerfs + 0x5777e)
                                               #5  0x000000000040aab1 _ZN1l4mainEiPPc (mergerfs + 0xaab1)
                                               #6  0x00007fbab39ec560 __libc_start_call_main (libc.so.6 + 0x2d560)
                                               #7  0x00007fbab39ec60c __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x2d60c)
                                               #8  0x000000000040c875 _start (mergerfs + 0xc875)
                                               
                                               Stack trace of thread 861:
                                               #0  0x00007fbab3a973a5 clock_nanosleep@GLIBC_2.2.5 (libc.so.6 + 0xd83a5)
                                               #1  0x00007fbab3a9bfa7 __nanosleep (libc.so.6 + 0xdcfa7)
                                               #2  0x00007fbab3a9bede sleep (libc.so.6 + 0xdcede)
                                               #3  0x0000000000450766 fuse_maintenance_loop (mergerfs + 0x50766)
                                               #4  0x00007fbab3a4cb1a start_thread (libc.so.6 + 0x8db1a)
                                               #5  0x00007fbab3ad1650 __clone3 (libc.so.6 + 0x112650)
Feb 15 17:12:20 stor01 systemd[1]: systemd-coredump@0-1983-0.service: Deactivated successfully.
Feb 15 17:12:20 stor01 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-coredump@0-1983-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Feb 15 17:12:20 stor01 audit: BPF prog-id=0 op=UNLOAD
Feb 15 17:12:20 stor01 audit: BPF prog-id=0 op=UNLOAD
Feb 15 17:12:20 stor01 audit: BPF prog-id=0 op=UNLOAD
Feb 15 17:12:20 stor01 kernel: ------------[ cut here ]------------
Feb 15 17:12:20 stor01 kernel: nfsd: non-standard errno: -107
Feb 15 17:12:20 stor01 kernel: WARNING: CPU: 7 PID: 1052 at fs/nfsd/nfsproc.c:886 nfserrno+0x57/0x70 [nfsd]
Feb 15 17:12:20 stor01 kernel: Modules linked in: rpcrdma rdma_cm iw_cm ib_cm ib_core qrtr rfkill ses enclosure mpt3sas raid_class i2c_piix4 joydev scsi_transport_sas virtio_balloon nfsd auth_rpcgss nfs_acl lockd grace sunrpc fuse zram ip_tables xfs bochs drm_vram_helper drm_kms_helper cec drm_ttm_helper ttm virtio_net drm net_failover failover ata_generic serio_raw pata_acpi virtio_blk qemu_fw_cfg
Feb 15 17:12:20 stor01 kernel: CPU: 7 PID: 1052 Comm: nfsd Not tainted 5.16.5-200.fc35.x86_64 #1
Feb 15 17:12:20 stor01 kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Feb 15 17:12:20 stor01 kernel: RIP: 0010:nfserrno+0x57/0x70 [nfsd]
Feb 15 17:12:20 stor01 kernel: Code: 66 e9 06 00 00 b8 00 00 00 05 74 01 c3 48 83 ec 08 89 fe 48 c7 c7 06 91 5f c0 c6 05 4a e9 06 00 01 89 44 24 04 e8 95 ce 67 e0 <0f> 0b 8b 44 24 04 48 83 c4 08 c3 66 66 2e 0f 1f 84 00 00 00 00 00
Feb 15 17:12:20 stor01 kernel: RSP: 0018:ffffaf92421ffe30 EFLAGS: 00010286
Feb 15 17:12:20 stor01 kernel: RAX: 000000000000001e RBX: ffff9254eb300000 RCX: 0000000000000000
Feb 15 17:12:20 stor01 kernel: RDX: 0000000000000001 RSI: ffffffffa1634c5d RDI: 00000000ffffffff
Feb 15 17:12:20 stor01 kernel: RBP: ffff9254ebdb4000 R08: 0000000000000000 R09: ffffaf92421ffc70
Feb 15 17:12:20 stor01 kernel: R10: ffffaf92421ffc68 R11: ffffffffa1f462a8 R12: ffff9254eb2f8000
Feb 15 17:12:20 stor01 kernel: R13: ffff9254ebdb41f0 R14: ffffffffc05f3a98 R15: ffff925580e6c100
Feb 15 17:12:20 stor01 kernel: FS:  0000000000000000(0000) GS:ffff9255bbdc0000(0000) knlGS:0000000000000000
Feb 15 17:12:20 stor01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 15 17:12:20 stor01 kernel: CR2: 00007fbf7f42d2f8 CR3: 0000000056204000 CR4: 00000000000006e0
Feb 15 17:12:20 stor01 kernel: Call Trace:
Feb 15 17:12:20 stor01 kernel:  <TASK>
Feb 15 17:12:20 stor01 kernel:  nfsd3_proc_getattr+0x9a/0xd0 [nfsd]
Feb 15 17:12:20 stor01 kernel:  nfsd_dispatch+0x149/0x270 [nfsd]
Feb 15 17:12:20 stor01 kernel:  svc_process_common+0x39b/0x5c0 [sunrpc]
Feb 15 17:12:20 stor01 kernel:  ? nfsd_svc+0x370/0x370 [nfsd]
Feb 15 17:12:20 stor01 kernel:  ? nfsd_shutdown_threads+0x90/0x90 [nfsd]
Feb 15 17:12:20 stor01 kernel:  svc_process+0xb7/0xf0 [sunrpc]
Feb 15 17:12:20 stor01 kernel:  nfsd+0xed/0x150 [nfsd]
Feb 15 17:12:20 stor01 kernel:  kthread+0x16b/0x190
Feb 15 17:12:20 stor01 kernel:  ? set_kthread_struct+0x40/0x40
Feb 15 17:12:20 stor01 kernel:  ret_from_fork+0x22/0x30
Feb 15 17:12:20 stor01 kernel:  </TASK>
Feb 15 17:12:20 stor01 kernel: ---[ end trace 2d61902326538459 ]---
Feb 15 17:12:20 stor01 abrt-server[1995]: Package 'mergerfs' isn't signed with proper key
Feb 15 17:12:20 stor01 abrt-server[1995]: 'post-create' on '/var/spool/abrt/ccpp-2022-02-15-17:12:20.431355-860' exited with 1
Feb 15 17:12:20 stor01 abrt-server[1995]: Deleting problem directory '/var/spool/abrt/ccpp-2022-02-15-17:12:20.431355-860'
Feb 15 17:12:21 stor01 abrt-dump-journal-oops[952]: abrt-dump-journal-oops: Found oopses: 1
Feb 15 17:12:21 stor01 abrt-dump-journal-oops[952]: abrt-dump-journal-oops: Creating problem directories
Feb 15 17:12:21 stor01 abrt-server[1998]: Deleting problem directory oops-2022-02-15-17:12:21-952-0 (dup of oops-2022-02-15-11:40:06-954-0)

The second part is just (obviously) NFS having issues since the mount became unavailable.

@trapexit (Owner)

Thanks for the info. The stack trace shows mergerfs doing a lookup but otherwise idle. I'll look around that code to see if anything stands out and, when I get cycles, try to reproduce it.

@DomiStyle

Any update on this? Still experiencing those crashes.

Logs
Mar 31 08:05:17 stor01 systemd-coredump[3435]: [🡕] Process 865 (mergerfs) of user 0 dumped core.
                                               
                                               Found module linux-vdso.so.1 with build-id: fb3d9d23e731a0e438355b1502ea5441c3ae2d6d
                                               Found module libnss_systemd.so.2 with build-id: babba3904ec3983f952d1a2e49984c3d375448d8
                                               Found module libnss_sss.so.2 with build-id: f507413541bb0cb10590bccb2ead1bed2599c74b
                                               Found module ld-linux-x86-64.so.2 with build-id: 4e2c368b1240116994daff7f0f248a01f6e53198
                                               Found module libc.so.6 with build-id: 881b4882a363c5c193cddf6e4991e1873f593364
                                               Found module libgcc_s.so.1 with build-id: 88564abce789aa42536da1247a57ff6062d61dcb
                                               Found module libm.so.6 with build-id: 47343f97c29c4e91c841e2bfd344329a264075ad
                                               Found module libstdc++.so.6 with build-id: 64dc8524cb0109c7f3d3c0ef3d4d6ec9979db75c
                                               Found module mergerfs with build-id: 1bb1e816543d21163515be40fa7ca2ad6dcb0c59
                                               Stack trace of thread 867:
                                               #0  0x000000000044fd5a fuse_lib_lookup (mergerfs + 0x4fd5a)
                                               #1  0x0000000000453c22 fuse_ll_process_buf (mergerfs + 0x53c22)
                                               #2  0x0000000000451de5 fuse_do_work (mergerfs + 0x51de5)
                                               #3  0x00007f316cab4b1a start_thread (libc.so.6 + 0x8db1a)
                                               #4  0x00007f316cb39650 __clone3 (libc.so.6 + 0x112650)
                                               
                                               Stack trace of thread 865:
                                               #0  0x00007f316cab173a __futex_abstimed_wait_common (libc.so.6 + 0x8a73a)
                                               #1  0x00007f316cabcac8 __new_sem_wait_slow64.constprop.0 (libc.so.6 + 0x95ac8)
                                               #2  0x0000000000452042 fuse_session_loop_mt (mergerfs + 0x52042)
                                               #3  0x000000000045a466 fuse_loop_mt (mergerfs + 0x5a466)
                                               #4  0x00000000004577ce fuse_main_real (mergerfs + 0x577ce)
                                               #5  0x000000000040aac1 _ZN1l4mainEiPPc (mergerfs + 0xaac1)
                                               #6  0x00007f316ca54560 __libc_start_call_main (libc.so.6 + 0x2d560)
                                               #7  0x00007f316ca5460c __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x2d60c)
                                               #8  0x000000000040c885 _start (mergerfs + 0xc885)
                                               
                                               Stack trace of thread 866:
                                               #0  0x00007f316caff3a5 clock_nanosleep@GLIBC_2.2.5 (libc.so.6 + 0xd83a5)
                                               #1  0x00007f316cb03fa7 __nanosleep (libc.so.6 + 0xdcfa7)
                                               #2  0x00007f316cb03ede sleep (libc.so.6 + 0xdcede)
                                               #3  0x00000000004507db fuse_maintenance_loop (mergerfs + 0x507db)
                                               #4  0x00007f316cab4b1a start_thread (libc.so.6 + 0x8db1a)
                                               #5  0x00007f316cb39650 __clone3 (libc.so.6 + 0x112650)
                                               
                                               Stack trace of thread 871:
                                               #0  0x00007f316cb287ac read (libc.so.6 + 0x1017ac)
                                               #1  0x000000000045a0d0 fuse_kern_chan_receive (mergerfs + 0x5a0d0)
                                               #2  0x0000000000454899 fuse_ll_receive_buf (mergerfs + 0x54899)
                                               #3  0x0000000000451db5 fuse_do_work (mergerfs + 0x51db5)
                                               #4  0x00007f316cab4b1a start_thread (libc.so.6 + 0x8db1a)
                                               #5  0x00007f316cb39650 __clone3 (libc.so.6 + 0x112650)
                                               
                                               Stack trace of thread 872:
                                               #0  0x00007f316cb287ac read (libc.so.6 + 0x1017ac)
                                               #1  0x000000000045a0d0 fuse_kern_chan_receive (mergerfs + 0x5a0d0)
                                               #2  0x0000000000454899 fuse_ll_receive_buf (mergerfs + 0x54899)
                                               #3  0x0000000000451db5 fuse_do_work (mergerfs + 0x51db5)
                                               #4  0x00007f316cab4b1a start_thread (libc.so.6 + 0x8db1a)
                                               #5  0x00007f316cb39650 __clone3 (libc.so.6 + 0x112650)
                                               
                                               Stack trace of thread 869:
                                               #0  0x00007f316cb287ac read (libc.so.6 + 0x1017ac)
                                               #1  0x000000000045a0d0 fuse_kern_chan_receive (mergerfs + 0x5a0d0)
                                               #2  0x0000000000454899 fuse_ll_receive_buf (mergerfs + 0x54899)
                                               #3  0x0000000000451db5 fuse_do_work (mergerfs + 0x51db5)
                                               #4  0x00007f316cab4b1a start_thread (libc.so.6 + 0x8db1a)
                                               #5  0x00007f316cb39650 __clone3 (libc.so.6 + 0x112650)
                                               
                                               Stack trace of thread 873:
                                               #0  0x00007f316cb287ac read (libc.so.6 + 0x1017ac)
                                               #1  0x000000000045a0d0 fuse_kern_chan_receive (mergerfs + 0x5a0d0)
                                               #2  0x0000000000454899 fuse_ll_receive_buf (mergerfs + 0x54899)
                                               #3  0x0000000000451db5 fuse_do_work (mergerfs + 0x51db5)
                                               #4  0x00007f316cab4b1a start_thread (libc.so.6 + 0x8db1a)
                                               #5  0x00007f316cb39650 __clone3 (libc.so.6 + 0x112650)
                                               
                                               Stack trace of thread 874:
                                               #0  0x00007f316cb287ac read (libc.so.6 + 0x1017ac)
                                               #1  0x000000000045a0d0 fuse_kern_chan_receive (mergerfs + 0x5a0d0)
                                               #2  0x0000000000454899 fuse_ll_receive_buf (mergerfs + 0x54899)
                                               #3  0x0000000000451db5 fuse_do_work (mergerfs + 0x51db5)
                                               #4  0x00007f316cab4b1a start_thread (libc.so.6 + 0x8db1a)
                                               #5  0x00007f316cb39650 __clone3 (libc.so.6 + 0x112650)
                                               
                                               Stack trace of thread 870:
                                               #0  0x00007f316cb2e42d writev (libc.so.6 + 0x10742d)
                                               #1  0x000000000045a1c3 fuse_kern_chan_send (mergerfs + 0x5a1c3)
                                               #2  0x0000000000453006 send_reply_iov (mergerfs + 0x53006)
                                               #3  0x000000000045502d fuse_reply_attr (mergerfs + 0x5502d)
                                               #4  0x000000000044fa5d fuse_lib_getattr (mergerfs + 0x4fa5d)
                                               #5  0x000000000045221b do_getattr (mergerfs + 0x5221b)
                                               #6  0x0000000000453c22 fuse_ll_process_buf (mergerfs + 0x53c22)
                                               #7  0x0000000000451de5 fuse_do_work (mergerfs + 0x51de5)
                                               #8  0x00007f316cab4b1a start_thread (libc.so.6 + 0x8db1a)
                                               #9  0x00007f316cb39650 __clone3 (libc.so.6 + 0x112650)
                                               
                                               Stack trace of thread 868:
                                               #0  0x00007f316cb287ac read (libc.so.6 + 0x1017ac)
                                               #1  0x000000000045a0d0 fuse_kern_chan_receive (mergerfs + 0x5a0d0)
                                               #2  0x0000000000454899 fuse_ll_receive_buf (mergerfs + 0x54899)
                                               #3  0x0000000000451db5 fuse_do_work (mergerfs + 0x51db5)
                                               #4  0x00007f316cab4b1a start_thread (libc.so.6 + 0x8db1a)
                                               #5  0x00007f316cb39650 __clone3 (libc.so.6 + 0x112650)

@trapexit (Owner)

Unfortunately not. There just isn't any information in those traces: one thread is writing data back to the kernel, and all the others are sitting around waiting. Unless it is some subtle memory corruption, I don't know what it could be.

@DomiStyle

Any way to diagnose this further? I already compiled mergerfs with make DEBUG=1 and replaced the binary. Does mergerfs log anywhere, or do you need the strace?

Unless it is some subtle memory corruption I don't know what it could be.

mergerfs is running on a system with ECC RAM inside of a virtual machine so hopefully there is no memory corruption.

@Go0oSer (Author) commented Mar 31, 2022

Any way to diagnose this further? I already compiled mergerfs with make DEBUG=1 and replaced the binary. Does mergerfs log anywhere, or do you need the strace?

Unless it is some subtle memory corruption I don't know what it could be.

mergerfs is running on a system with ECC RAM inside of a virtual machine so hopefully there is no memory corruption.

My machine is also running as a virtual machine on Proxmox. Passing through an HBA via SR-IOV.

Of note, I haven't suffered any segfaults since February 13. I noticed they started sometime in December, once I deployed Jellyfin on a couple of systems and had it actively monitoring the media directories (the mergerfs mount point served via NFS) for changes. This particular segfault was random, as I hadn't really seen one since enabling rpc-statd on or around Jan 15th.

I suspect this isn't that helpful, but I would be happy to provide any additional information about my system if it would help with troubleshooting.

@DomiStyle

My machine is also running as a virtual machine on Proxmox. Passing through an HBA via SR-IOV.

Our setup is very similar so this is probably not a coincidence.

I'm also passing through an HBA to a VM in Proxmox. I'm also running Jellyfin on another VM which uses the mergerfs mount via NFS.

Haven't had a segfault for a long while but it just started again last week. Looking at the log again, the crash happened at 8:05, which is the time Jellyfin starts scanning. So that might be related.

@trapexit (Owner) commented Mar 31, 2022

I never added logging to mergerfs because the data would almost always be useless and it would kill performance; strace is almost always sufficient. And for a segfault they are basically the same unless I logged a crazy amount of data, but even then a stack trace is best.

You can run the debug build of mergerfs, attach with gdb, and just let it wait until it segfaults... then give me the stack trace (thread apply all bt).

gdb attach PID

Then, when it starts, just hit 'c' and ENTER to continue running.

Unfortunately it'd be a major PITA to try to recreate your setup. I haven't even used NFS in years. I could try a simpler version. An strace of jellyfin when it happens would be helpful.

@DomiStyle

Thanks, I attached gdb to mergerfs and will keep it running until it crashes.

Missing separate debuginfos, use: dnf debuginfo-install mergerfs-2.33.4-1.fc35.x86_64
I assume that's OK?

@trapexit (Owner)

Are you running the debug build you created, or did you install the RPM I provided? If the latter, then you should do the former.

@DomiStyle

According to the MD5 checksum it's my build, although I did install the RPM first and then replaced the mergerfs binary at /usr/bin/mergerfs.

@DomiStyle

So this is what I got out of the last crash. Not sure if it's of much use, since some things appear to be missing. Also, I manually got the stack trace with bt and bt full; I'm not sure if gdb is supposed to print out segfaults on its own.

This one was most likely caused by Samba, by the way, so it seems NFS is not the only trigger.

Attaching to process 882
[New LWP 883]
[New LWP 884]
[New LWP 885]
[New LWP 886]
[New LWP 887]
[New LWP 888]
[New LWP 889]
[New LWP 890]
[New LWP 891]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f9dcbf3d73a in __futex_abstimed_wait_common () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install mergerfs-2.33.4-1.fc35.x86_64
Continuing.

Thread 5 "mergerfs" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f9dca29a640 (LWP 886)]
0x0000561fddd8eb59 in fuse_lib_lookup (req=0x7f9dccd80160, parent=1, name=0x0) at lib/fuse.c:2066
2066    lib/fuse.c: No such file or directory.
#0  0x0000561fddd8eb59 in fuse_lib_lookup (req=0x7f9dccd80160, parent=1, name=0x0) at lib/fuse.c:2066
#1  0x0000561fddda8dd8 in do_lookup (req=0x7f9dccd80160, nodeid=1, inarg=0x7f9dca29b038) at lib/fuse_lowlevel.c:1011
#2  0x0000561fdddb47da in fuse_ll_process_buf (data=0x561fdea46ec0, buf=0x7f9dca2915d0, ch=0x561fdea46a10) at lib/fuse_lowlevel.c:2498
#3  0x0000561fdddba618 in fuse_session_process_buf (se=0x561fdea46b10, buf=0x7f9dca2915d0, ch=0x561fdea46a10) at lib/fuse_session.c:80
#4  0x0000561fddda0129 in fuse_do_work (data=0x561fdea45940) at lib/fuse_loop_mt.c:101
#5  0x00007f9dcbf40b1a in start_thread () from /lib64/libc.so.6
#6  0x00007f9dcbfc5660 in clone3 () from /lib64/libc.so.6
#0  0x0000561fddd8eb59 in fuse_lib_lookup (req=0x7f9dccd80160, parent=1, name=0x0) at lib/fuse.c:2066
        f = 0x561fdea46be0
        e = {ino = 65552, generation = 3907356, attr = {st_dev = 140315678282480, st_ino = 94694865973101, st_nlink = 140315678282560, 
            st_mode = 3721970175, st_uid = 22047, st_gid = 3391689608, __pad0 = 32669, st_rdev = 94694867567744, st_size = 4096, 
            st_blksize = 140315723301216, st_blocks = 140315678282656, st_atim = {tv_sec = 94694867567808, tv_nsec = 140315678282640}, 
            st_mtim = {tv_sec = 94694866033587, tv_nsec = 140315678322704}, st_ctim = {tv_sec = 94694867567744, 
              tv_nsec = 140315678282704}, __glibc_reserved = {140315723301216, 0, 94694867567808}}, timeout = {
            entry = 140315678282704, attr = 94694866035296}}
        path = 0x7f9dcbf40840 <start_thread> "\363\017\036\372USH\201", <incomplete sequence \354\210>
        err = 0
        dot = 0x0
#1  0x0000561fddda8dd8 in do_lookup (req=0x7f9dccd80160, nodeid=1, inarg=0x7f9dca29b038) at lib/fuse_lowlevel.c:1011
        name = 0x7f9dca29b038 ".."
#2  0x0000561fdddb47da in fuse_ll_process_buf (data=0x561fdea46ec0, buf=0x7f9dca2915d0, ch=0x561fdea46a10) at lib/fuse_lowlevel.c:2498
        f = 0x561fdea46ec0
        write_header_size = 80
        bufv = {count = 1, idx = 0, off = 0, buf = {{size = 43, flags = 0, mem = 0x7f9dca29b010, fd = 0, pos = 0}}}
        tmpbuf = {count = 1, idx = 0, off = 0, buf = {{size = 80, flags = 0, mem = 0x0, fd = -1, pos = 0}}}
        in = 0x7f9dca29b010
        inarg = 0x7f9dca29b038
        req = 0x7f9dccd80160
        mbuf = 0x0
        err = 38
        res = -559650288
#3  0x0000561fdddba618 in fuse_session_process_buf (se=0x561fdea46b10, buf=0x7f9dca2915d0, ch=0x561fdea46a10) at lib/fuse_session.c:80
        __PRETTY_FUNCTION__ = "fuse_session_process_buf"
#4  0x0000561fddda0129 in fuse_do_work (data=0x561fdea45940) at lib/fuse_loop_mt.c:101
        res = 43
        fbuf = {size = 43, flags = 0, mem = 0x7f9dca29b010, fd = 0, pos = 0}
        ch = 0x561fdea46a10
        w = 0x561fdea45940
        mt = 0x7ffc47b7e660
#5  0x00007f9dcbf40b1a in start_thread () from /lib64/libc.so.6
No symbol table info available.
#6  0x00007f9dcbfc5660 in clone3 () from /lib64/libc.so.6
No symbol table info available.

@trapexit (Owner)

If that trace is accurate then it would explain the segfault, but I'm not sure how it could come to be in that state. The name=0x0 part: the value is passed in from do_lookup's inarg variable, and there you see name = "..". name in fuse_lib_lookup should be identical to inarg from do_lookup. This is bad, as it could mean there is some subtle memory corruption. I'll keep looking to see if anything sticks out that might be easier to address.

@trapexit (Owner)

Oh wait... I figured out why name = 0x0. I have an idea now. Let me dig in a bit.

@DomiStyle

I will run some memtests, both on the host and in the VM, just to make sure anyway.

@trapexit (Owner)

I figured out why it segfaults, in the sense that I know what request is triggering it. But I'm not sure why the request is sent; as far as I can tell, such a request shouldn't come in, or it would break any libfuse filesystem using the higher-level API. To be fair, mergerfs has deviated from that, but this part is basically the same. The request is "look up the parent of root"... which doesn't exist. In the code it's set to NULL: the root is the only node in the tree without a parent, given that it's the top node. If this were a regular thing, mergerfs would have been crashing left and right for years, so something about your setup is atypical. I'm not even sure how to force it, or even how to respond to the request. I would have thought the operating system would be managing that.
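
To make the failure mode concrete, here is a minimal, self-contained C model of the node tree described above (identifiers are illustrative, not the literal lib/fuse.c source). A ".." lookup resolves through the node's parent pointer, and the root is the only node where that pointer is NULL:

/* Minimal model of the high-level libfuse node tree described above.
   Illustrative names only; not the literal lib/fuse.c source. */
#include <stdio.h>

struct node {
    struct node  *parent;   /* NULL for the root: it has no parent */
    unsigned long nodeid;
};

int main(void)
{
    struct node root  = { NULL,  1 };
    struct node child = { &root, 42 };

    /* A ".." lookup resolves through node->parent->nodeid.  For any
       ordinary node that is fine: */
    printf("child/.. -> nodeid %lu\n", child.parent->nodeid);

    /* For the root it is a NULL-pointer dereference.  Reading a field
       at a small offset into address 0 is consistent with the
       "segfault at 10" seen in the kernel logs above. */
    if (root.parent == NULL)
        printf("root/.. -> parent is NULL; dereferencing would segfault\n");
    return 0;
}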

@trapexit (Owner)

I've fired off an email to the FUSE mailing list asking some questions about this situation. It should be pretty easy to fix once I know what the appropriate response is. Unfortunately, I've yet to figure out how to trigger it.

@DomiStyle

Good to hear. I ran memtest both on the host and in the VM and found no errors.

@trapexit (Owner) commented Apr 7, 2022

I've gotten no replies yet regarding my questions on the situation. What I could do is make a small patch and you could try it? The problem is that I can't manually recreate the situation and I'm not sure what the appropriate reply is. I could return the details of the root node, or I could return an error. Both seem wrong... but so would faking a parent for root; there isn't one.

I could return an ENOENT error first... and if that causes you an issue we could try just returning the root's details?

@trapexit (Owner) commented Apr 7, 2022

Can you try building this branch: https://github.com/trapexit/mergerfs/tree/root-parent ?

I'm curious how the OS / client software will respond to an ENOENT here, so you might need to keep an eye on whatever software triggers it. If it causes an issue we can try returning the same details as root.

@DomiStyle

Sure, will try it.

Haven't had a crash in the last 7 days for some reason, but I will play around with it a little bit.

@trapexit (Owner)

Still no response from the FUSE folks on this situation. I'm thinking I'll just merge the patch I made into the master branch and put out a patch release for now, until I get more information. It might result in some ENOENT errors when the situation occurs, but that's better than crashing.

@DomiStyle

I have been running this branch for a while now with no issues. I assume there is nothing I can check to see whether the error occurred?

@trapexit (Owner)

Nothing from mergerfs itself. The request coming into mergerfs was "what is the parent of /", i.e. ".." for "/", which, as far as I know, doesn't make sense because root doesn't have a parent. That is what was causing the segfault: the code wasn't checking for that case, and the parent of root was NULL. Now I check and return ENOENT, which under normal circumstances would manifest to the user application as the same error. But it is possible that NFS is swallowing the error.
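
A sketch of that guard, using the same simplified model as before (this is not the literal patch from the root-parent branch):

/* Sketch of the check-and-return-ENOENT behavior described above,
   in the same simplified model; not the literal root-parent patch. */
#include <errno.h>
#include <stdio.h>

struct node {
    struct node  *parent;   /* NULL for the root */
    unsigned long nodeid;
};

/* Resolve a ".." lookup: returns 0 and fills *out on success, or
   -ENOENT when the node is the root and therefore has no parent
   (previously an unchecked node->parent->nodeid dereference). */
static int resolve_dotdot(const struct node *node, unsigned long *out)
{
    if (node->parent == NULL)
        return -ENOENT;
    *out = node->parent->nodeid;
    return 0;
}

int main(void)
{
    struct node root = { NULL, 1 };
    unsigned long p;

    printf("root/..: %s\n",
           resolve_dotdot(&root, &p) == -ENOENT ? "ENOENT (guarded)" : "ok");
    return 0;
}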

@trapexit (Owner)

The FUSE author got back to me. It sounds like not only is this situation not supposed to happen, but there are some checks in place to catch it. So either something about mergerfs is seriously messed up or there is an interesting kernel bug.

The author would like a request log, but 1) I stripped logging out of mergerfs in preparation for redoing it, and 2) such logging, if the issue is not immediately reproducible, would be large. So I might just have to try building a similar setup. My main server is Ubuntu 20.04. What packages did you install for NFS? What is your NFS config? Did the segfault always happen when a remote computer with the NFS export mounted started a heavy scan of the filesystem? Any other patterns noticed?

@DomiStyle

My main server is Ubuntu 20.04. What packages did you install for NFS? What is your NFS config?

The server is running Fedora Server inside of a Proxmox (QEMU) virtual machine.

My NFS config looks like this:

/export/pool01                  192.168.10.0/24(fsid=5b8fb1b6-8769-11ec-98ae-0371ad815c94,rw,no_root_squash,no_subtree_check,insecure)

/export                         192.168.10.0/24(ro,fsid=0,no_root_squash,no_subtree_check,hide)

/export/pool01 is a symlink to /srv/pools/pool01, which is where the mergerfs mount sits.

The segv always happened when a remote computer with the NFS export mounted started a heavy scan of the filesystem? Any other patterns noticed?

It also happened randomly, but I/O-heavy tasks like backups or Jellyfin scans provoked it more often. It might also be related to file locking, since it happens much quicker over NFSv4.

Any other patterns noticed?

I think it might be related to the ballooning RAM feature of QEMU: the crashes stopped happening after I disabled it. I'm going to enable it again to see if it recurs.

@DomiStyle

Just a small update: after using the old mergerfs with ballooning RAM, it crashed again 2 days later, so it seems like it really is related somehow.

That is strange, considering memtest and all other software run perfectly fine, even with ballooning RAM enabled.

@fosslinux

Just as another datapoint: I had this exact problem, serving the mergerfs mount over NFS. mergerfs was running in a QEMU VM. After I updated to the most recent version, it hasn't recurred for over a week.

@Go0oSer (Author) commented Jul 14, 2022

I haven’t had a crash since Feb. Do we still want this open?

@trapexit (Owner)

If you all are fine with closing this, I am too. This does appear to be a kernel bug, but I've not had the time to make a simple example for the devs.

@DomiStyle

I also don't mind; it's definitely related to KVM and ballooning RAM, but it's hard to reproduce.

@aqxa1 commented Nov 17, 2022

I haven't tested yet but I suspect this was fixed in 2.33.4 or 2.33.5. Annoyingly, the latest version that Ubuntu includes is 2.33.3, which appears to have the bug for me. Hopefully, 2.33.5 fixes the issue.

@trapexit (Owner)

Debian/Ubuntu and other non-rolling-release distros will almost never have up-to-date software, and they don't focus on bug fixes for secondary software. That's why I create my own packages.
