
Sporadic I/O Errors Lead to Files Getting Sporadically Locked on azureFile Volumes #1593

Closed
GuyPaddock opened this issue May 6, 2020 · 14 comments


What happened:

  • We are using several azureFile volumes in AKS, mounted through manually-managed PV claims.
  • We have the volumes mounted inside Linux pods running Alpine Linux 3.10.3 and CentOS 7.7.1908.
  • We are seeing strange behavior when extracting archives or copying many files (both between shares as well as within the same share):
    • Typically, files sporadically fail to copy or fail to be created ("read error", "write error", "I/O error"); afterwards, the offending file cannot be deleted from the share. The only workaround is to use Get-AzStorageFileHandle and Close-AzStorageFileHandle from a PowerShell terminal to locate and release the file lock that has been left behind.
    • Other times, files copy fine, but the folder they were placed in cannot be deleted afterwards (i.e. when we are done working with the files).
  • This happens even when the files are created by low-level tools like cp, unzip, mv, and 7z -- as far as I know, these tools do not normally lock files.
  • This happens at least twice a day; there is no apparent common factor in the tools or files involved, and it occurs across all of our Azure Files shares.
  • On one of the shares that this just happened on, we have a 5 TiB quota and are only using 151.7 GiB, leaving 4.9 TiB free -- so, this isn't a quota issue.
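The PowerShell workaround above can also be scripted from the Azure CLI. This is only a sketch: the `az storage share list-handle`/`close-handle` subcommand and flag names below are from memory and should be verified against `az storage share close-handle --help`, and the account/share/file names are placeholders.

```shell
# List open handles on the share to find the stuck file.
# (Subcommand/flag names are assumptions; verify with `az ... --help`.)
az storage share list-handle \
    --account-name ourazurefilesaccount \
    --name client-smb \
    --recursive

# Close all handles held on the stuck file so it can be deleted.
az storage share close-handle \
    --account-name ourazurefilesaccount \
    --name client-smb \
    --path my_file.tgz \
    --close-all
```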

What you expected to happen:

  • File copy/move/create operations should not sporadically fail.
  • Whether or not a file operation has failed, it should be possible to delete files that the underlying command has not locked, without having to manually remove the file lock using PowerShell.

How to reproduce it (as minimally and precisely as possible):
Not sure if all of the following steps are required, but here's what we are doing that eventually leads to this issue:

  1. Create an Alpine Linux Docker container that has both SFTP and 7-Zip (7z) inside. For these repro steps, ensure there is a user account inside with a uid of 33 and a gid of 33; in our infrastructure, this is needed for compatibility with the Nextcloud file application.
  2. Run the container in a Kubernetes pod that has an azureFile volume mounted via a Persistent Volume, with the specified mountOptions:
    • uid=33
    • gid=33
    • dir_mode=0770
    • file_mode=0770
    • actimeo=2
  3. Copy a large (100+ GiB) ZIP file containing several hundred 50-200 MB files to the file share using SFTP.
  4. Use kubectl exec to open a shell in the pod.
  5. Go to where the ZIP file was transferred.
  6. Unzip the ZIP file with 7z x.
  7. If the unzip operation succeeds, make a new directory on the file share (mkdir test) and copy the extracted files into it recursively with cp -Rv <extracted folder> test, where <extracted folder> is the name of the folder created by unzipping in step 6.
  8. Use rm -rf on both the folder that was originally extracted in step 6 and the new folder created in step 7.
  9. Repeat steps 6-8 about three times.
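Steps 6-9 boil down to a copy/delete loop on the share. As a scaled-down sketch (a temp directory and a few tiny files stand in for the azureFile mount and the extracted 100+ GiB archive; all paths here are hypothetical):

```shell
#!/bin/sh
# Scaled-down sketch of repro steps 6-9. On the real share this loop
# eventually hits sporadic I/O errors and leaves files locked.
set -e

workdir=$(mktemp -d)   # stand-in for /mnt/share/<client>
cd "$workdir"

# Stand-in for the folder produced by '7z x' in step 6.
mkdir extracted
for i in 1 2 3; do
  echo "data $i" > "extracted/file$i"
done

# Steps 7-8, repeated three times (step 9).
for run in 1 2 3; do
  mkdir test                 # step 7: new directory on the share
  cp -Rv extracted test/     # step 7: recursive copy
  rm -rf test                # step 8: delete the copy
done
rm -rf extracted             # step 8: delete the original extracted folder
echo "completed 3 runs"
```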

Anything else we need to know?:

  • Environment: Linux pods running Alpine Linux 3.10.3 and CentOS 7.7.1908.
  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:16:51Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"windows/amd64"}
    Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.10", GitCommit:"059c666b8d0cce7219d2958e6ecc3198072de9bc", GitTreeState:"clean", BuildDate:"2020-04-03T15:17:29Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
    
  • Size of cluster (how many worker nodes are in the cluster?): 3
  • General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.): Video transcoding and file hosting/file sharing.
  • Others:
@GuyPaddock GuyPaddock changed the title Files Keep Getting Locked on azureFile Volumes Files Getting Sporadically Locked on azureFile Volumes May 6, 2020

GuyPaddock commented May 6, 2020

Today is pretty bad... just got this at the end of creating a small tgz file:

gzip: close failed: I/O error

This is inside one of the Alpine containers. This is what followed:

/mnt/share/client-redacted $ rm my_file.tgz
/mnt/share/client-redacted $ df -h .
Filesystem                Size      Used Available Use% Mounted on
//ourazurefilessubdomain.file.core.windows.net/client-redacted
                          5.0T    244.1G      4.8T   5% /mnt/share/client-redacted

/mnt/share/client-redacted $ tar -czvf my_file.tgz my_folder/
tar (child): my_file.tgz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
my_folder/
my_folder/media/
my_folder/media/some_file.m4a
tar: my_file.tgz: Cannot write: Broken pipe
tar: Child returned status 2
tar: Error is not recoverable: exiting now

/mnt/share/client-redacted $ touch test
/mnt/share/client-redacted $ rm test

/mnt/share/client-redacted $ tar -czvf my_file.tgz my_folder/
tar (child): my_file.tgz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
my_folder/
my_folder/media/
my_folder/media/some_file.m4a
tar: my_file.tgz: Cannot write: Broken pipe
tar: Child returned status 2
tar: Error is not recoverable: exiting now

/mnt/share/client-redacted $ rm my_file.tgz
rm: can't remove 'my_file.tgz': No such file or directory

/mnt/share/client-redacted $ ls -alh
total 4K
drwxrwx---    2 www-data www-data       0 Mar  5  2019 .
drwxr-xr-x   22 root     root        4.0K May  6 17:52 ..
-rwxrwx---    0 www-data www-data    1.2G May  6 18:10 my_file.tgz
drwxrwx---    2 www-data www-data       0 May  6 18:06 my_folder
/mnt/share/client-redacted $ rm my_file.tgz
rm: can't remove 'my_file.tgz': No such file or directory

Sure enough:

Get-AzStorageFileHandle -Context $Context -ShareName "client-smb" -Recursive

HandleId     Path                            ClientIp ClientPort OpenTime             LastReconnectTime    FileId              ParentId SessionId
--------     ----                            -------- ---------- --------             -----------------    ------              -------- ---------
166104042281 my_file.tgz 10.1.1.4 43330      2020-05-06 18:10:06Z 2020-05-06 18:12:13Z 9223444604622209024 0        9953474151815577669

@GuyPaddock GuyPaddock changed the title Files Getting Sporadically Locked on azureFile Volumes Sporadic I/O Errors Lead to Files Getting Sporadically Locked on azureFile Volumes May 6, 2020
@andyzhangx

@GuyPaddock is it possible to try the ubuntu:16.04 image?

@GuyPaddock

@andyzhangx At the moment, the workloads are provided by a vendor rather than ones we built in-house. Is there a particular reason why Ubuntu 16.04 would handle this better? My understanding is that the SMB volumes are mounted on the host node rather than by the container itself.


jnoller commented May 11, 2020

@GuyPaddock the errors you pasted are the result of IOPS starvation/saturation -- see issue #1373. "No such file or directory" occurs from within the container when the node VM itself falls victim to the OS-disk or VM-level IOPS throttle.

@GuyPaddock

@jnoller What is our next step if Azure Support is merely focused on "fixing" CoreDNS and is not working with us on the IOPS issue? I'm waiting to hear back from Azure Support after sending them my latest info, but so far they've only focused on adjusting CoreDNS to auto-scale for us.


jnoller commented May 11, 2020

@GuyPaddock I would send them this issue/thread as well as issue #1373 via the ticket. You should also ask them to check for IOPS throttling on the VMs; this can be triggered by the OS disk or by the VM SKU limit.

@GuyPaddock

@jnoller Both were included in the original issue summary...

@andyzhangx

> @GuyPaddock the issue I see based on the errors you pasted are the result of IOPS starvation/saturation - see issue #1373 - no such file or directory occurs from within the container when the node VM itself is victim to the OS disk or VM level IOPS throttle.

I don't think it's related to IOPS: with the ubuntu:16.04 or centos image it works as expected, while on alpine:3.10 it does not. The test was done in the same environment, and it's reproducible:

sudo mount -t cifs //accountname.file.core.windows.net/test /tmp/test -o vers=3.0,username=accountname,password=xxx,dir_mode=0777,file_mode=0777,cache=strict,actimeo=30
wget -O /tmp/test/test.sh https://raw.githubusercontent.com/andyzhangx/demo/master/debug/test.sh

docker run -it -v /tmp/test:/var/www/html/data/ --name alpine alpine:3.10 sh
docker run -it -v /tmp/test:/var/www/html/data/ --name ubuntu ubuntu:16.04 sh
docker run -it -v /tmp/test:/var/www/html/data/ --name centos centos sh
sh-4.4# cd /var/www/html/data/
sh-4.4# ./test.sh 128
Creating '128' test files...

Trying to delete test files...
DELETED: 129  BEFORE: 128  AFTER: 0
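The linked test.sh isn't reproduced here, but judging from its output it appears to create N files and then delete them, counting the files before and after. A rough local equivalent (against a temp directory rather than the CIFS mount; the real script's internals may differ) would be:

```shell
#!/bin/sh
# Rough local equivalent of the linked test.sh, inferred from its output:
# create N files, delete them, and report the before/after counts.
set -e
count=${1:-128}
dir=$(mktemp -d)          # stand-in for the mounted share

echo "Creating '$count' test files..."
i=1
while [ "$i" -le "$count" ]; do
  : > "$dir/testfile$i"   # create an empty test file
  i=$((i + 1))
done
before=$(ls "$dir" | wc -l)

echo "Trying to delete test files..."
rm -f "$dir"/testfile*
after=$(ls "$dir" | wc -l)

echo "BEFORE: $before  AFTER: $after"
```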

cc @smfrench


jnoller commented May 12, 2020

@andyzhangx

This is not tied to the DNS issue you're citing or to the kernel revision. While you can re-create similar failures, you can clearly see the container's inability to read its own filesystem and socket.

I can re-create this using any Docker image or application, regardless of the contents of the image OS.

If the system's IOPS issues are unresolved, even after applying the DNS fixes or other options in this thread, the system will still fail under load.


GuyPaddock commented May 14, 2020

@andyzhangx I believe you are confusing this issue with my other issue, #1325.

Issue #1325 is about folders mounted over SMB having problems in Alpine containers when they contain more than 64 files. This issue (#1593) is about a high volume of reads or writes to an SMB volume in AKS causing the volume to fail sporadically with I/O errors that leave files locked, regardless of the container type (Alpine, CentOS, etc.).



ghost commented Jul 26, 2020

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label Jul 26, 2020
@palma21 palma21 added azure/files and removed Needs Attention 👋 Issues needs attention/assignee/owner action-required labels Jul 27, 2020
@ghost ghost added the action-required label Aug 21, 2020
@ghost ghost added the stale Stale issue label Oct 20, 2020

ghost commented Oct 20, 2020

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

@ghost ghost closed this as completed Nov 4, 2020

ghost commented Nov 4, 2020

This issue will now be closed because it hasn't had any activity for 15 days after being marked stale. GuyPaddock, feel free to comment again within the next 7 days to reopen it, or open a new issue after that time if you still have a question/issue or suggestion.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 5, 2020