GCSFuse process crashing without logging anything #1603
Comments
Thanks @davidkrikava for reaching out and for the detailed description! I am looking into this; in the meantime I need one clarification: do you see this issue after every ~20 hrs?
Hi @davidkrikava, thanks for the patience! GCSFuse doesn't log when it is killed externally, but in most cases it does log stack traces for crashes within the program. GCSFuse logs the reason for the crash and a stack trace if it is mounted in foreground mode. When it is mounted in background mode (the default), it logs crash info except for crashes inside jacobsa/fuse (one of gcsfuse's core dependencies). In the meantime, I tried reproducing this but haven't succeeded yet. If you could provide minimal steps to reproduce this issue, that would help us. Also, you could consider cleaning up objects with a storage lifecycle delete rule and see if the issue occurs with that too. Regards,
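For example, a foreground mount that would print a crash trace straight to the terminal might look like the sketch below (bucket name and mount point are placeholders; the debug flags mirror the fstab entry later in this issue):

# Run gcsfuse in the foreground so a panic and its stack trace
# land on the terminal instead of being lost with the daemon.
gcsfuse --foreground \
    --debug_fuse --debug_gcs \
    --log-file /var/log/gcsfuse-fg.log \
    BUCKET_NAME /mount/path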
Hi, at this point I remounted with
Hi,
I left it running in a screen session, so if the process crashed I would get the output on the command line, and then I tried running those cleanup commands (finds with delete).
The whole trace I captured:
@davidkrikava Thank you for sharing the crash logs. That was helpful, and we were able to reproduce the issue on our side. We are working on next steps and a fix, and will keep you updated.
@davidkrikava The issue mostly happens when listing and the creation of new files happen concurrently on the same mount. While we work on a fix, we suggest the following to avoid the issue:
Thanks for your patience, and please let us know if the above works for you.
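To illustrate the trigger, a minimal repro sketch under the assumption that concurrent listing and file creation in the same mount is enough (paths are placeholders; run the two loops at the same time, e.g. in two terminals):

# Terminal 1: continuously create new files in the mount.
while true; do
    touch "/mount/path/testdir/file_$RANDOM"
done

# Terminal 2: concurrently list the same directory tree.
while true; do
    ls -lR /mount/path/testdir > /dev/null
done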
Hi @davidkrikava,
Hello,
By this I understand that from the same mount (in my case the same machine) I can trigger the crash if I create a file and list that same file at the same time (by "same time" I presume while the file is being created/uploaded?). As I list through the bucket, though, files are being created by other processes, so while going through all the files during this cleanup I may list a file that is being created. Given all that, I still believe this is a valid issue that should be corrected, since it can happen to anyone even without running find over the whole bucket :)
@davidkrikava Thank you for the detailed response!
It doesn't need to be the same file: "listing in the mount" and "creation of files in the mount" happening concurrently can cause this issue.
Got it; I think this is the real cause, i.e. listing and creation being performed concurrently in the same mount. While we work on a permanent fix, we suggest avoiding this scenario at your end if possible.
This makes sense. We were not aware that listing and create operations were being performed concurrently at your end (as the original issue was only created around
We are working towards the fix and will update this issue once it is in. Thanks again!
Yes, I thought as well that the delete was the issue, but it seems that the excessive listing by the find, combined with regular usage (we have an nginx root serving files from there, creating and listing files), is the problem. We are moving towards the bucket lifecycle. Yes, we will definitely update once the fix is released; IMO the listing greatly increased the chances of a crash, but since we create and read files there, this can theoretically happen at any time now. I now consider my participation in this case finished, but do contact me if you need anything else in the future.
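For reference, a bucket lifecycle rule roughly equivalent to the 20-day find cleanup might look like this sketch (bucket name is a placeholder; matchesPrefix scopes the rule to the 01/ folder, and note that lifecycle age is based on object creation time, not mtime):

# lifecycle.json: delete objects under 01/ once they are 20 days old.
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 20, "matchesPrefix": ["01/"]}
    }
  ]
}
EOF

# Apply the rule to the bucket.
gcloud storage buckets update gs://BUCKET_NAME --lifecycle-file=lifecycle.json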
Thanks @davidkrikava! Lowering the priority of this issue, but keeping it open until we fix the crash.
Hi @davidkrikava, Thanks for reporting this issue! The fix has been merged into master and will be included in the March 2024 release. Thanks,
This issue is resolved in GCSFuse v2.0.0. Please upgrade and reopen the issue if necessary. Thanks,
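For example, on CentOS, assuming gcsfuse was installed from Google's yum repository (an assumption about the install method):

# Upgrade gcsfuse and confirm the new version.
sudo yum upgrade -y gcsfuse
gcsfuse --version    # should now report 2.0.0 or newer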
System:
Google VM instance (E2 machine type)
CentOS Linux release 7.9.2009
GCSFuse version: 1.4.0 (issue was present on 1.3.0 also)
Bucket in use: regional (mounted from two VMs from different zones within that region)
mount options:
bucketname /mount/path _netdev,noatime,rw,stat_cache_capacity=81920,max_retry_sleep=0,stat_cache_ttl=5m0s,type_cache_ttl=30m0s,client_protocol=http2,max_conns_per_host=200,allow_other,file_mode=0775,dir_mode=0775,uid=XXX,gid=XXX,log_file=/var/log/gcsfuse.log,debug_fuse,debug_gcs,debug_mutex
VM resources:
during crashes we were nowhere near exhausting the VM's resources
RAM: ~30% used
CPU: ~10% used
Sizing, request rates:
bucket size: ~5.5 TB
object count: ~1.3M
request rate during crashes: ~100/s
(the above info was taken from Google Metrics Explorer for the bucket)
Issue description:
We have two VMs (01 and 02) that have a regional bucket mounted via gcsfuse. In that bucket we have two folders, 01 and 02 (and then a subtree of folders/files); each machine writes into its own folder but reads from both.
Both machines also clean up their respective folders every morning, and we think this cleanup is what's causing the crash, as the time of the crashes correlates with the high request count.
Cleanup command (yes, this is not optimal; we should use a bucket lifecycle rule instead, but we switched to a cron/find script a long time ago for some reason):
find /mountpoint/01/ -mtime +20 -type f | while read path; do rm "$path"; done
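As an aside, an equivalent but more robust form of this cleanup (assuming GNU find) avoids the pitfalls of read without -r and spawns no per-file rm:

# Same cleanup; -delete handles odd filenames safely and skips the per-file rm.
find /mountpoint/01/ -mtime +20 -type f -delete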
First occurrence: 25 Dec 2023 (on version 1.3.0; it continues to happen with version 1.4.0).
Just the find part on its own is enough to cause the gcsfuse crash.
What actually happens: the gcsfuse process crashes and is no longer running; it can be fixed by first unmounting and then mounting again.
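As a stopgap until a fix lands, a small watchdog could detect and remount a dead mount (a sketch, assuming the fstab entry above so a bare mount call works; the mount point is a placeholder):

#!/bin/bash
# Remount the gcsfuse mount point if it has stopped responding.
MNT=/mount/path
if ! stat "$MNT" > /dev/null 2>&1; then
    # A crashed gcsfuse leaves a stale FUSE mount behind
    # ("Transport endpoint is not connected").
    fusermount -u "$MNT" 2>/dev/null || umount -l "$MNT"
    mount "$MNT"    # re-reads the fstab entry
fi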
Logs:
As I posted under "mount options", we are using several debug options, yet when the crash occurs nothing is logged.
The log file ends with the last successful transaction.
The same goes for syslog: nothing is logged there except the first failed access:
Jan 10 10:56:37 server-01 collectd[1440]: statvfs(/XXX) failed: Transport endpoint is not connected
Nothing can be found in dmesg either.
To sum up:
I am not aware of reaching any quotas (maybe there are some I am not aware of? Though I would expect errors to be logged rather than the process crashing in such a case). This process has been running without changes (and, until that date, without issues) for the past 7 months.
What can we do to try to figure out the root cause of these crashes, given that nothing is logged anywhere?
Thank you for your help.