Removing Larger Folders with rmdir() is Unreliable in Alpine-based FPM environments when Data Folder is Mounted via NFS or SMBv3 #17980
…ders

For some reason, on SMBv3/Azure Files, when there are more than 62 files in a folder, `RecursiveDirectoryIterator` cannot reliably iterate over the contents.

I've prepared a patch (linked above) that works for us, but it may not be the best way to solve this long term. I would expect any place in the Nextcloud codebase where
It looks like this is actually caused by a bug in a standard library in Alpine Linux. I am going to update this issue to reflect that. I am digging further... will update when I have more info.
After a LOT of digging, it seems as though the musl C library in Alpine (which stands in for what would be GNU libc on an Ubuntu system) has a bug in the way it navigates through folders with the

The bug actually seems to exactly match a bug from 2012 on other distros, such as this one from CentOS: https://bugs.centos.org/view.php?id=5496

It seems like

Since I cannot reproduce this issue with

Given that the defect presents itself because the contents of the folder are changing as files are being removed, I believe that another workaround within Nextcloud would be to load the listing of all files to be removed before removing them, then remove based on that list. That implementation would only fail if new files were added to the folder in the interim.
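The snapshot-then-delete workaround suggested above can be sketched in shell (a minimal illustration of the idea only, not the actual Nextcloud patch; the folder, file names, and file count are my own choices):

```shell
#!/bin/sh
# Snapshot-then-delete: capture the folder listing once, up front, then
# delete from that fixed list, so the directory is never being iterated
# while its entries are changing (the pattern that trips the musl bug).
dir=$(mktemp -d)

# Create 70 dummy zero-length files (more than the 62-file threshold).
i=1
while [ "$i" -le 70 ]; do
    : > "$dir/chunk-$i"
    i=$((i + 1))
done

# Snapshot the listing BEFORE any deletion happens.
files=$(find "$dir" -mindepth 1 -maxdepth 1 -type f)

# Delete strictly from the snapshot, not from a live directory walk.
for f in $files; do
    rm -f "$f"
done

rmdir "$dir" && echo "folder removed"
```

As noted, this approach only fails if new files land in the folder between taking the snapshot and the final `rmdir`.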
I've filed the upstream ticket here: https://gitlab.alpinelinux.org/alpine/aports/issues/10960

Working with the musl library team on IRC, I now have a workaround that involves patching and recompiling the musl library that's used in the Nextcloud container. See this commit:

There's also been some movement on this from the Azure Files kernel team:

Looks to me like this needs to be fixed upstream.
@szaimen This was not fixed upstream. The Alpine MUSL C team concluded it's a kernel bug, while the AKS team concluded that it is not a bug when using glibc. So... we are still having to use this workaround. |
When running containers on Azure Kubernetes Service (AKS), the only practical storage option for persistent volumes is Azure Files, which is CIFS/SMB-based. So, Nextcloud has the `data` directory mounted over SMBv3. In this configuration, when using Alpine FPM containers and performing a chunked upload that has more than 62 chunks, the chunked upload will fail with a `403` error at the end. The issue does not appear to affect Nextcloud running on Ubuntu-based containers -- just Alpine.

Based on my analysis, the root cause of this issue lies somewhere inside the standard libraries for Alpine, though Nextcloud could implement a workaround until such time as the issue is fixed in Alpine's standard libraries.
Steps to reproduce
1. Install Nextcloud on Alpine Linux 3.10, or use the Nextcloud 16.0.6 Alpine FPM Docker image.
2. Mount the Nextcloud data folder on a CIFS/SMBv3 share, such that chunked uploads for users get saved to the file share. For example, given a test user named `TestUser`, the folder `/var/www/html/data/TestUser/uploads` must be a folder that lives on an SMBv3 share. We are using the following CIFS mount options: `rw,relatime,vers=3.0,cache=strict,uid=33,forceuid,gid=33,forcegid,file_mode=0770,dir_mode=0770,soft,persistenthandles,nounix,serverino,mapposix,nostrictsync,rsize=1048576,wsize=1048576,echo_interval=60,actimeo=5`. Adjusting these values -- including setting `actimeo` to `1` or `0` -- does not appear to affect this bug.
3. Create a Nextcloud user account called `TestUser`.
4. Using the web interface for Nextcloud, log in as `TestUser`.
5. On your client machine, create a 630 MB sample file to upload using the following Linux CLI command:
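The command itself did not survive in this copy of the issue; one way to create such a file (my own example, not necessarily the reporter's original command) is:

```shell
# Create a 630 MB file of zeros named "test-630". Assuming the roughly
# 10 MB chunk size Nextcloud used at the time, this yields 63 chunks --
# just over the 62-file threshold at which the bug appears.
dd if=/dev/zero of=test-630 bs=1M count=630
```

Any command that produces a 630 MB file should work; for example, `truncate -s 630M test-630` creates a sparse equivalent instantly.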
6. Attempt to upload the `test-630` file that was created in step 5 by dragging and dropping it onto the files area of the Nextcloud interface.
7. Wait for the upload to complete; it should get to 100% and then move into the "Processing files..." step.
8. Wait until either an error appears on the page or the newly-uploaded file appears in the file listing area of the page.
9. Refresh the page.
10. Shell in to the VM/container/host running Nextcloud PHP-FPM. How you do this depends on how you're running Nextcloud; we're using Azure Kubernetes Service, so we do this with `kubectl exec -it nextcloud-POD-SUFFIX sh`.
11. Change into the `/var/www/html/data/TestUser/uploads` folder on the host running Nextcloud PHP-FPM.
12. Execute `ls -l` to locate the most recent `web-file-upload` folder.
13. Change into the folder identified in the previous step.
14. Execute `ls -l` to see what files are left.

Expected behaviour
`web-file-upload` folder (it should have been deleted).

Actual behaviour

`Error when assembling chunks, status code 403` appears.

`web-file-upload` folder corresponding to the upload.

The `403` error message comes from `\OCA\DAV\Connector\Sabre\Directory::delete()` on line 314, but the root cause is that not all of the files in the chunked-upload source folder got removed before Nextcloud attempted to delete the folder with `rmdir`. Stepping through `\OC\Files\Storage\Local::rmdir()`, the `RecursiveDirectoryIterator` is not getting a full list of the files in the folder, which is why it does not delete all the files there. This behavior

The root cause is not Nextcloud's fault -- on Alpine Linux, even `rm -rf` cannot remove the folder in a single try if it contains more than 62 files. I do not know why 62 is significant, other than that, if you count `.` and `..`, the folder would contain 64 entries, but it's more nuanced than that.

The point of my filing this ticket is to come up with a mitigation/fix for this issue on the Nextcloud side, since we can't do much about the Alpine bug. Presumably, the mitigation in Nextcloud is to keep trying to remove the files in the folder until either the folder is removed or Nextcloud notices that the number of files in the folder isn't decreasing.
See "Reproducing the Root Cause without Nextcloud" below to understand why I think this is an appropriate mitigation, and to see how I know the root cause lies outside Nextcloud.
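The retry-until-stable mitigation proposed above could look roughly like this in shell (a sketch under my own assumptions; the real fix would live in Nextcloud's PHP `rmdir()` code path, and the folder here is a temporary stand-in):

```shell
#!/bin/sh
# Retry-based removal: keep attempting to delete the folder until it is
# gone, bailing out if the file count stops decreasing between passes
# (so a genuinely stuck folder does not loop forever).
dir=$(mktemp -d)

# Populate the folder with 100 dummy files to exercise the loop.
i=1
while [ "$i" -le 100 ]; do
    : > "$dir/file-$i"
    i=$((i + 1))
done

prev_count=-1
while [ -d "$dir" ]; do
    count=$(ls -1 "$dir" | wc -l)
    if [ "$count" -eq "$prev_count" ]; then
        echo "stuck: file count is no longer decreasing" >&2
        exit 1
    fi
    prev_count=$count
    rm -rf "$dir" 2>/dev/null
done
echo "folder removed"
```

On a correctly-behaving system the loop completes in one pass; on the affected Alpine/SMBv3 setup it would take several passes, mirroring what the script below demonstrates with raw `rm -rf`.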
Server configuration
Operating system:
4.15.0-1063-azure #68-Ubuntu SMP Fri Nov 8 09:30:20 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Web server: `nginx/1.15.10` with PHP-FPM

Database: Azure MySQL 5.7
PHP version: 7.3.11
Nextcloud version: 16.0.6
Updated from an older Nextcloud/ownCloud or fresh install: No.
Where did you install Nextcloud from:
Custom containers, based on https://github.com/GuyPaddock/inveniem-nextcloud-azure/tree/942313031ddf99608073fe09a1d50fe49f5746a0.
Signing status: N/A
List of activated apps: N/A
Nextcloud configuration: N/A
Are you using external storage, if yes which one: No.
Are you using encryption: No.
Are you using an external user-backend, if yes which one: No.
Client configuration
Browser: Google Chrome 78.0.3904.97 (Official Build) (64-bit)
Operating system: Windows 10.0.18362 Build 18362
Logs
Web server error log
N/A
Nextcloud log (data/nextcloud.log)
Browser log
Reproducing the Root Cause without Nextcloud
I created the following script as `test.sh` inside the `/var/www/html/data/TestUser/uploads` folder:

This script basically creates a bunch of zero-length files in a new folder under the current working directory (presumably, an SMBv3-mounted share), and then tries to remove the folder with `rm -rf`. It counts how many files are in the folder before the removal attempt vs. after it, as well as how many files a verbose `rm -rf` indicates it is removing. The script then loops until the folder is actually removed, printing the counts along the way.
With a properly-functioning storage driver, kernel, and standard library, the script should be able to remove all files in a single pass. But with Alpine Linux 3.9 and 3.10, the script has to make several passes once it creates more than 62 files in the test folder. This demonstrates the issue we are seeing with Nextcloud, but at a lower level of the system.
What follows are various results of running it. At first it seems like the way Alpine behaves always forces the last few batches of files to be powers of 62 files, but that pattern breaks down as soon as there are more than 250 files (see the results for powers of 3).
My script is not able to reproduce the issue when running in an Ubuntu-based PHP container such as `php:7.3.11-cli`, even though the container is using the same file share, same mount settings, and same Kubernetes node (so the exact same kernel).
Sequentially Increasing Files (1-50)
Creating Files in Powers of 2
Creating Files in Powers of 3