Mergerfs Segfault #1004
Comments
Pulled the following trace from journalctl. Not sure how useful it will be, but I figured I would supply it. Jan 07 09:00:42.836335 beast kernel: ------------[ cut here ]------------ |
That's about NFS. Are you serving mergerfs mount via NFS or is an NFS mount a branch in mergerfs? |
I am serving the mergerfs mount via NFS to other clients in my network. |
This morning there was another segfault waiting for me. I was able to capture strace data from umount and mount commands Kernel Logging. Output of LS command of mergerfs mountpoint:
|
Once it's crashed there isn't anything to gather from strace. In this case the usefulness comes from maybe tracing mergerfs as it segvs. |
OK, thanks. That will be difficult because these seem to be random at this point. I'll see what I can do. For example, regarding the randomness, here are the kernel messages for the 3 times it seems to have happened. It may have started with the previous and most recent versions. I'll start backtracking any other changes I've made. The system mainly runs Docker containers and serves up file storage, so in terms of the OS and packages there is not all that much change.
|
That's just saying that mergerfs died. Not super useful by itself. You would have to rebuild the source with debugging enabled and core files enabled to grab a core dump from the segv. |
OK. I'll see if I can get that in place. |
To be sure the build flag I should use is - |
Yup. The executable will end up in ./build directory. |
I compiled the software in DEBUG mode last night and ran it. It had crashed this morning. I was unable to gather a coredump as I'm not that familiar with how to gather one. I was anticipating a file would be created in /var/crash based on the ubuntu documentation. Other than replacing the executable with the newly compiled file, is there anything else I should be doing in order to gather the proper information for you? That said there was a .crash titled _usr_bin_mergerfs.0.crash file from Jan 5 in the /var/crash location, which would have been from prior to compiling with the DEBUG flag enabled. Would this be of any use? |
Sorry. Should have been more thorough. https://www.ibm.com/support/pages/how-do-i-enable-core-dumps You should make sure to check that core dumps are enabled. |
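As a rough illustration of "check that core dumps are enabled", the sketch below reads the two settings that usually decide where a core file ends up on Linux: the `RLIMIT_CORE` resource limit and `/proc/sys/kernel/core_pattern`. The function name `core_dump_status` is hypothetical, and the limit it reports is that of the Python process running it, not of the mergerfs process, which gets its limits from whatever started it (for example a systemd unit or the shell that ran `mount`). On Ubuntu the core pattern is commonly piped to apport, which would explain expecting a file in /var/crash.

```python
import resource
from pathlib import Path


def core_dump_status(pattern_path="/proc/sys/kernel/core_pattern"):
    """Report the core-dump settings visible to the current process.

    Illustrative helper only: it inspects this process, not mergerfs.
    """
    # Soft/hard limits on core file size; 0 means "no core file" for
    # plain-file core patterns, RLIM_INFINITY means unlimited.
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)

    try:
        pattern = Path(pattern_path).read_text().strip()
    except OSError:
        # Not Linux, or /proc is not readable.
        pattern = None

    return {
        "soft_limit": soft,
        "hard_limit": hard,
        "soft_limit_nonzero": soft != 0,
        # A leading "|" means cores are piped to a handler program
        # (e.g. apport or systemd-coredump) instead of written to a file.
        "piped_to_handler": bool(pattern) and pattern.startswith("|"),
        "core_pattern": pattern,
    }


if __name__ == "__main__":
    for key, value in core_dump_status().items():
        print(f"{key}: {value}")
```

If the soft limit is 0 and the pattern is a plain file name, raising the limit (for example `ulimit -c unlimited` in the shell that starts mergerfs) is the usual first step.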
I wanted to send an update on this issue. I found that for some reason the server which I use MergerFS on, and use NFS to serve up the storage, was not running the rpc-statd daemon for whatever reason. I enabled the service and haven't had a segfault since. I had been seeing locking messages in the kernel ring buffer as well as the segfaults and the nfs stack traces. It seems that the locking with NFS may have been causing the mergerfs application to hang or crash. I'm not sure I'm going to be able to gather the proper information to help with the segfaults, but I think they were spawning from another issue with my server in the first place. |
Curious. I'm not sure why that would cause segvs unless somehow it was leading mergerfs to misunderstand a situation. We can keep this open for now and if I find time I can try to reproduce it. |
I'm also getting segfaults with error 4 since switching to Fedora Server. Not sure if the logs are useful but just in case: Logs
The second part is just (obviously) NFS having issues since the mount became unavailable. |
Thanks for the info. The stack trace of mergerfs shows it doing a lookup but otherwise idle. I'll look around that code to see if anything stands out and when I get cycles try to reproduce. |
Any update on this? Still experiencing those crashes. Logs
|
Unfortunately not. There just isn't any information in those traces. One thread is writing back data to the kernel. All others are sitting around waiting. Unless it is some subtle memory corruption I don't know what it could be. |
Any way to diagnose this further? I already compiled mergerfs with
mergerfs is running on a system with ECC RAM inside of a virtual machine so hopefully there is no memory corruption. |
My machine is also running as a virtual machine on Proxmox. Passing through an HBA via SR-IOV. Of note, I haven't suffered any segfault since February 13. I noticed they started sometime in December once I deployed Jellyfin on a couple of systems and had it actively monitoring the media directories (mergerfs mount point served via NFS) for changes. This particular segfault was random, as I hadn't really seen one before that since enabling rpc-statd on or around Jan 15th. I suspect this isn't that helpful, but I would be happy to provide any additional information about my system if it would help with troubleshooting. |
Our setup is very similar so this is probably not a coincidence. I'm also passing through an HBA to a VM in Proxmox. I'm also running Jellyfin on another VM which uses the mergerfs mount via NFS. Haven't had a segfault for a long while but it just started again last week. Looking at the log again, the crash happened at 8:05, which is the time Jellyfin starts scanning. So that might be related. |
I never added logging to mergerfs because the data would almost always be useless and would kill performance. strace is almost always sufficient. And for segv they are basically the same unless I logged a crazy amount of data, but even then a stack trace is best. You can run the debug build of mergerfs and then attach with gdb and just let it wait till it segvs... then give me the stack trace. gdb attach PID Unfortunately it'd be a major PITA to try to recreate your setup. I haven't even used NFS in years. I could try a simpler version. An strace of jellyfin when it happens would be helpful. |
Thanks, I attached gdb to mergerfs and will keep it running until it crashes.
|
Are you running the debug build you created or did you install the RPM I provided? If the latter then you should do the former. |
According to the MD5 checksum it's my build. Although I did install the RPM first and then replaced the mergerfs binary at |
So this is what I got out of the last crash. Not sure if it's of much use since some things appear to be missing. Also I manually got the stacktrace with This one was most likely caused by samba by the way, so it seems not only NFS causes it.
|
If that trace is accurate then it would explain the segv but I'm not sure how it could come to be in that state. The |
Oh wait... I figured out why name = 0x0. I have an idea now. Let me dig in a bit. |
I will do some memtests both on host and in VM just to make sure anyway. |
I figured out why it segv's in the sense that I know what request is triggering it. But I'm not sure why the request is sent, and as far as I can tell such a request shouldn't come in or... it would break any libfuse filesystem using the higher level API. To be fair, mergerfs has deviated from that, but this part is basically the same. The request is "lookup the parent of root"... which doesn't exist. And in the code it's set to NULL. It's the only node in the tree without a parent given it's the top node. If this was a regular thing mergerfs would be crashing left and right for years, so something about your setup is atypical. I'm not even sure how to force it. Or even how to respond to the request. I would have thought the operating system would be managing that. |
I've fired off an email to the fuse mailing list asking some questions about this situation. Should be pretty easy to fix once I know what the appropriate response is. Unfortunately, I've yet to figure out how to trigger it. |
Good to hear. I ran memtest both on host and in VM and found no errors. |
I've had no replies yet regarding my questions on the situation. What I could do is make a small patch and you could try it? The problem is I can't manually recreate the situation and I'm not sure what the appropriate reply is. I could return the details for the root node, I could return an error. Both seem wrong... but so would faking a parent to root... there isn't one. I could return an ENOENT error first... and if that causes you an issue we could try just returning the root's details? |
Can you try building this branch: https://github.com/trapexit/mergerfs/tree/root-parent ? I'm curious how the OS / client software will respond to a ENOENT here so you might need to keep an eye out on whatever software triggers it. If it causes an issue we can try returning the same details as root. |
Sure, will try it. Haven't had a crash in those 7 days for some reason but will play around with it a little bit. |
Still no response from FUSE folks on this situation. I'm thinking I'll just merge the patch I made into master branch and put out a patch release for now till I get more information on the situation. Might result in some ENOENT errors when it occurs but that's better than crashing. |
I have been running this branch for a while now with no issues. I assume there is nothing I can check to see if the error occurred? |
Nothing from mergerfs itself. The request coming into mergerfs was "what is the parent of /", i.e. ".." for "/", which, as far as I know, doesn't make sense because root doesn't have a parent. And that was what was causing the segv. It wasn't checking for that, and the parent of root was null. Now I check and return ENOENT, which under normal situations would manifest to the user app as the same error. But it is possible that NFS is swallowing the error. |
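The shape of that fix can be sketched in a few lines. This is not mergerfs code (mergerfs is C++, and the real change lives in its FUSE request handling). `Node` and `lookup_parent` are hypothetical names used only to show the idea: the root is the one node whose parent is null, so a ".." lookup has to check for that and return ENOENT instead of dereferencing it.

```python
import errno


class Node:
    """Minimal stand-in for an in-memory filesystem tree node."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent  # None only for the root node


def lookup_parent(node):
    """Resolve ".." for a node.

    Returns (0, parent) on success, or (-errno.ENOENT, None) when the
    node has no parent, which is only true for the root.
    """
    if node.parent is None:
        # Without this check, the caller would go on to read fields of a
        # null parent, which is the crash described in this thread.
        return -errno.ENOENT, None
    return 0, node.parent


if __name__ == "__main__":
    root = Node("/")
    media = Node("media", parent=root)
    print(lookup_parent(media)[1].name)
    print(lookup_parent(root))
```

Returning the negative errno mirrors the common FUSE convention for reporting errors. Whether ENOENT is the right reply to this request is exactly the open question raised above.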
FUSE author got back to me. Sounds like not only is this situation not supposed to happen but there are some checks in place to catch it. So something about mergerfs is seriously messed up or there is an interesting kernel bug. The author would like a request log but 1) I stripped logging out of mergerfs in prep to redo it and 2) such logging if not immediately reproducible would be large. So I might just have to try building a similar setup. My main server is Ubuntu 20.04. What packages did you install for NFS? What is your NFS config? The segv always happened when a remote computer with the NFS export mounted started a heavy scan of the filesystem? Any other patterns noticed? |
The server is running Fedora Server inside of a Proxmox (QEMU) virtual machine. My NFS config looks like this:
It also happened randomly but I/O heavy tasks like backups or Jellyfin scans provoked it more often. Might also be related to file locking since it happens much quicker over NFSv4.
I think it might be related to the ballooning RAM feature of QEMU. The crashes stopped happening after disabling it. I'm gonna enable it again to see if it happens again. |
Just a small update, after using the old mergerfs with the ballooning RAM it crashed again 2 days later so it seems like it really is related somehow. Which is strange considering memtest and all other software runs perfectly fine, even with ballooning RAM enabled. |
Just as another datapoint: I had this exact problem, serving the mergerfs mount over NFS. mergerfs was running in a QEMU VM. After I updated to the most recent version, it hasn't recurred for over a week. |
I haven’t had a crash since Feb. Do we still want this open? |
If you all are fine closing this I am too. This does appear to be a kernel bug but I've not had the time to make a simple example for the devs. |
I also don't mind, it's definitely related to KVM and ballooning RAM but it's hard to reproduce. |
I haven't tested yet but I suspect this was fixed in 2.33.4 or 2.33.5. Annoyingly, the latest version that Ubuntu includes is 2.33.3, which appears to have the bug for me. Hopefully, 2.33.5 fixes the issue. |
Debian/Ubuntu and other non-rolling release distros will almost never have up-to-date software. And they don't focus on secondary software bug fixes. That's why I create my own packages. |
Describe the bug
Jan 07 08:36:38 beast kernel: mergerfs[345]: segfault at 10 ip 000055dfac03ccb2 sp 00007fa1f77c53e0 error 4 in mergerfs[55dfabff3000+58000]
beast kernel: [149142.660381] mergerfs[345]: segfault at 10 ip 000055dfac03ccb2 sp 00007fa1f77c53e0 error 4 in mergerfs[55dfabff3000+58000] Jan 7 08:36:38 beast kernel: [149142.660396] Code: 80 7b 02 00 0f 85 70 fe ff ff 49 8d 6f 60 31 db 48 89 ef e8 20 7d fb ff 4c 89 ee 4c 89 ff e8 f5 d6 ff ff 48 89 ef 48 8b 40 20 <4c> 8b 68 10 e8 d5 77 fb ff e9 42 fe ff ff 48 89 df e8 c8 77 fb ff
fuse: bad mount point /mnt/storage "Input/output error"
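The kernel segfault line above carries more information than it first appears to. The sketch below (the helper name `decode_segfault` is made up for illustration) parses it, subtracts the mapping base from the instruction pointer, and decodes the x86 page-fault error code, whose low bits mean: bit 0 protection fault rather than page not present, bit 1 write rather than read, bit 2 user mode.

```python
import re

# Matches the x86 Linux "segfault at ... ip ... sp ... error ... in obj[base+size]" line.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return a dict describing the fault, or None if the line does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"], 16)  # the kernel prints the error code in hex
    return {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "fault_addr": int(m["addr"], 16),
        "offset_in_mapping": ip - base,
        "page_present": bool(err & 1),
        "write_access": bool(err & 2),
        "user_mode": bool(err & 4),
    }


LINE = (
    "Jan 07 08:36:38 beast kernel: mergerfs[345]: segfault at 10 "
    "ip 000055dfac03ccb2 sp 00007fa1f77c53e0 error 4 "
    "in mergerfs[55dfabff3000+58000]"
)

if __name__ == "__main__":
    info = decode_segfault(LINE)
    print(hex(info["offset_in_mapping"]))  # 0x49cb2
    print(info)
```

For this report, the result is a user-mode read of an unmapped page at address 0x10. That is consistent with reading a field at offset 0x10 through a null pointer, and the marked bytes `<4c> 8b 68 10` in the Code: line decode to a load from `0x10(%rax)`. With a matching binary that has debug symbols, the offset could be fed to a tool such as `addr2line`. It may first need adjusting by the mapping's file offset, because the base printed by the kernel is the start of the mapped region, not necessarily the ELF load address.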
The master branch is not to be considered production ready. Feel free to file bug reports but do so indicating clearly that you are testing unreleased code.
To Reproduce
I'm not exactly sure how to reproduce the problem. It seems fairly random. I can't force it to happen. I'll log into a client system and notice the SMB mount isn't working, for example. I'll log onto the server and run ls -l /mnt/storage (mergerfs volume), and the permissions and other information for the mergerfs mount are peppered with ? characters.
Expected behavior
Mount point operates normally. The mount point is becoming unavailable.
System information:
uname -a
Linux beast 5.11.0-44-generic #48~20.04.2-Ubuntu SMP Tue Dec 14 15:36:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
mergerfs -V
/mnt/disk* /mnt/storage fuse.mergerfs defaults,nonempty,allow_other,noforget,use_ino,inodecalc=path-hash,cache.files=off,moveonenospc=true,dropcacheonclose=true,minfreespace=200G,fsname=mergerfs 0 0
df -h
Filesystem Size Used Avail Use% Mounted on
udev 7.8G 0 7.8G 0% /dev
tmpfs 1.6G 4.7M 1.6G 1% /run
/dev/sda2 49G 24G 23G 52% /
tmpfs 7.9G 0 7.9G 0% /dev/shm
tmpfs 5.0M 4.0K 5.0M 1% /run/lock
tmpfs 7.9G 0 7.9G 0% /sys/fs/cgroup
mergerfs 29T 3.3T 25T 13% /mnt/storage
/dev/loop0 66M 66M 0 100% /snap/gtk-common-themes/1515
/dev/sda1 511M 5.3M 506M 2% /boot/efi
/dev/sdi1 2.7T 168G 2.4T 7% /mnt/disk4
/dev/sdb1 2.7T 174G 2.4T 7% /mnt/disk3
/dev/sdd1 7.3T 1.5T 5.5T 22% /mnt/disk1
/dev/sdc1 7.3T 1.5T 5.4T 22% /mnt/disk2
/dev/sdh1 9.1T 21G 8.6T 1% /mnt/disk5
/dev/sde1 13T 1.5T 11T 13% /mnt/parity1
/dev/loop2 66M 66M 0 100% /snap/gtk-common-themes/1519
/dev/loop3 9.2M 9.2M 0 100% /snap/canonical-livepatch/119
/dev/loop1 128K 128K 0 100% /snap/bare/5
/dev/loop5 9.2M 9.2M 0 100% /snap/canonical-livepatch/126
/dev/loop4 56M 56M 0 100% /snap/core18/2253
/dev/loop6 100M 100M 0 100% /snap/core/11798
/dev/loop7 56M 56M 0 100% /snap/core18/2246
/dev/loop8 219M 219M 0 100% /snap/gnome-3-34-1804/72
/dev/loop9 248M 248M 0 100% /snap/gnome-3-38-2004/87
/dev/loop10 43M 43M 0 100% /snap/snapd/14066
/dev/loop11 44M 44M 0 100% /snap/snapd/14295
/dev/loop12 55M 55M 0 100% /snap/snap-store/558
/dev/loop13 62M 62M 0 100% /snap/core20/1270
/dev/loop14 100M 100M 0 100% /snap/core/11993
/dev/loop15 51M 51M 0 100% /snap/snap-store/547
/dev/loop16 219M 219M 0 100% /snap/gnome-3-34-1804/77
/dev/loop17 62M 62M 0 100% /snap/core20/1242
lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 65.1M 1 loop /snap/gtk-common-themes/1515
loop1 7:1 0 4K 1 loop /snap/bare/5
loop2 7:2 0 65.2M 1 loop /snap/gtk-common-themes/1519
loop3 7:3 0 9M 1 loop /snap/canonical-livepatch/119
loop4 7:4 0 55.5M 1 loop /snap/core18/2253
loop5 7:5 0 9M 1 loop /snap/canonical-livepatch/126
loop6 7:6 0 99.5M 1 loop /snap/core/11798
loop7 7:7 0 55.5M 1 loop /snap/core18/2246
loop8 7:8 0 219M 1 loop /snap/gnome-3-34-1804/72
loop9 7:9 0 247.9M 1 loop /snap/gnome-3-38-2004/87
loop10 7:10 0 42.2M 1 loop /snap/snapd/14066
loop11 7:11 0 43.3M 1 loop /snap/snapd/14295
loop12 7:12 0 54.2M 1 loop /snap/snap-store/558
loop13 7:13 0 61.9M 1 loop /snap/core20/1270
loop14 7:14 0 99.4M 1 loop /snap/core/11993
loop15 7:15 0 51M 1 loop /snap/snap-store/547
loop16 7:16 0 219M 1 loop /snap/gnome-3-34-1804/77
loop17 7:17 0 61.9M 1 loop /snap/core20/1242
sda 8:0 0 50G 0 disk
├─sda1 8:1 0 512M 0 part /boot/efi
└─sda2 8:2 0 49.5G 0 part /
sdb 8:16 0 2.7T 0 disk
└─sdb1 8:17 0 2.7T 0 part /mnt/disk3
sdc 8:32 0 7.3T 0 disk
└─sdc1 8:33 0 7.3T 0 part /mnt/disk2
sdd 8:48 0 7.3T 0 disk
└─sdd1 8:49 0 7.3T 0 part /mnt/disk1
sde 8:64 0 12.8T 0 disk
└─sde1 8:65 0 12.8T 0 part /mnt/parity1
sdf 8:80 0 7.3T 0 disk
├─sdf1 8:81 0 7.3T 0 part
└─sdf9 8:89 0 8M 0 part
sdg 8:96 0 7.3T 0 disk
├─sdg1 8:97 0 7.3T 0 part
└─sdg9 8:105 0 8M 0 part
sdh 8:112 0 9.1T 0 disk
└─sdh1 8:113 0 9.1T 0 part /mnt/disk5
sdi 8:128 0 2.7T 0 disk
└─sdi1 8:129 0 2.7T 0 part /mnt/disk4
sr0 11:0 1 1024M 0 rom
strace -fvTtt -s 256 -o /tmp/app.strace.txt <cmd>
strace -fvTtt -s 256 -o /tmp/app.strace.txt -p <appPID>
strace -fvTtt -s 256 -p <mergerfsPID> -o /tmp/mergerfs.strace.txt
Additional context
I rebooted the machine after the issue arose a second time in ~2 weeks this morning. If it happens again, I will work to get the strace data for you. Apologies for not being able to file all of the requested data.