
Sporadic I/O Errors Lead to Files Getting Sporadically Locked on azureFile Volumes #1593

Closed
GuyPaddock opened this issue May 6, 2020 · 14 comments


What happened:

  • We are using several azureFile volumes in AKS, mounted through manually-managed PV claims.
  • We have the volumes mounted inside Linux pods running Alpine Linux 3.10.3 and CentOS 7.7.1908.
  • We are seeing strange behavior when extracting archives or copying many files (both between shares as well as within the same share):
    • Typically, files sporadically fail to copy or fail to be created ("read error", "write error", "I/O error"); afterwards, the offending file cannot be deleted from the share. The only workaround is to use Get-AzStorageFileHandle and Close-AzStorageFileHandle from a PowerShell terminal to locate and release the file lock that has been left behind.
    • Other times, files copy fine, but the folder they were placed in cannot be deleted afterwards (i.e. when we are done working with the files).
  • This happens even when the files are created by low-level tools like cp, unzip, mv, and 7z -- as far as I know, these tools do not normally lock files.
  • This happens at least twice a day; there is no apparent common factor in the tools or files involved, and it occurs across all of our Azure Files shares.
  • On one of the shares that this just happened on, we have a 5 TiB quota and are only using 151.7 GiB, leaving 4.9 TiB free -- so, this isn't a quota issue.
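The PowerShell workaround above can also be scripted from the Azure CLI. This is only a sketch: the `az storage share list-handle`/`close-handle` subcommand and flag names below are from memory and should be verified against `az storage share close-handle --help`, and the account/share/file names are placeholders.

```shell
# List open handles on the share to find the stuck file.
# (Subcommand/flag names are assumptions; verify with `az ... --help`.)
az storage share list-handle \
    --account-name ourazurefilesaccount \
    --name client-smb \
    --recursive

# Close all handles held on the stuck file so it can be deleted.
az storage share close-handle \
    --account-name ourazurefilesaccount \
    --name client-smb \
    --path my_file.tgz \
    --close-all
```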

What you expected to happen:

  • File copy/move/create operations should not sporadically fail.
  • Whether or not a file operation has failed, it should be possible to delete files that the underlying command has not locked, without having to manually remove the file lock using PowerShell.

How to reproduce it (as minimally and precisely as possible):
Not sure if all of the following steps are required, but here's what we are doing that eventually leads to this issue:

  1. Create an Alpine Linux Docker container that has both SFTP and 7-Zip (7z) inside. For these repro steps, ensure there is a user account inside with a uid of 33 and a gid of 33; in our infrastructure, this is needed for compatibility with the Nextcloud file application.
  2. Run the container in a Kubernetes pod that has an azureFile volume mounted via a Persistent Volume, with the specified mountOptions:
    • uid=33
    • gid=33
    • dir_mode=0770
    • file_mode=0770
    • actimeo=2
  3. Copy a large (100+ GiB) ZIP file containing several hundred 50-200 MB files to the file share using SFTP.
  4. Use kubectl exec to open a shell in the pod.
  5. Go to where the ZIP file was transferred.
  6. Unzip the ZIP file with 7z x.
  7. If the unzip operation succeeds, make a new directory on the file share (mkdir test) and copy the extracted files into it recursively with cp -Rv <extracted folder> test, where <extracted folder> is the name of the folder created by unzipping in step 6.
  8. Use rm -rf on both the folder that was originally extracted in step 6 and the new folder created in step 7.
  9. Repeat steps 6-8 about three times.
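Steps 6-9 boil down to a copy/delete loop on the share. As a scaled-down sketch (a temp directory and a few tiny files stand in for the azureFile mount and the extracted 100+ GiB archive; all paths here are hypothetical):

```shell
#!/bin/sh
# Scaled-down sketch of repro steps 6-9. On the real share this loop
# eventually hits sporadic I/O errors and leaves files locked.
set -e

workdir=$(mktemp -d)   # stand-in for /mnt/share/<client>
cd "$workdir"

# Stand-in for the folder produced by '7z x' in step 6.
mkdir extracted
for i in 1 2 3; do
  echo "data $i" > "extracted/file$i"
done

# Steps 7-8, repeated three times (step 9).
for run in 1 2 3; do
  mkdir test                 # step 7: new directory on the share
  cp -Rv extracted test/     # step 7: recursive copy
  rm -rf test                # step 8: delete the copy
done
rm -rf extracted             # step 8: delete the original extracted folder
echo "completed 3 runs"
```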

Anything else we need to know?:

  • Environment: Linux pods running Alpine Linux 3.10.3 and CentOS 7.7.1908.
  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:16:51Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"windows/amd64"}
    Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.10", GitCommit:"059c666b8d0cce7219d2958e6ecc3198072de9bc", GitTreeState:"clean", BuildDate:"2020-04-03T15:17:29Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
    
  • Size of cluster (how many worker nodes are in the cluster?): 3
  • General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.): Video transcoding and file hosting/file sharing.
  • Others:
@GuyPaddock GuyPaddock changed the title Files Keep Getting Locked on azureFile Volumes Files Getting Sporadically Locked on azureFile Volumes May 6, 2020

GuyPaddock commented May 6, 2020

Today is pretty bad... just got this at the end of creating a small tgz file:

gzip: close failed: I/O error

This is inside one of the Alpine containers. This is what followed:

/mnt/share/client-redacted $ rm my_file.tgz
/mnt/share/client-redacted $ df -h .
Filesystem                Size      Used Available Use% Mounted on
//ourazurefilessubdomain.file.core.windows.net/client-redacted
                          5.0T    244.1G      4.8T   5% /mnt/share/client-redacted

/mnt/share/client-redacted $ tar -czvf my_file.tgz my_folder/
tar (child): my_file.tgz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
my_folder/
my_folder/media/
my_folder/media/some_file.m4a
tar: my_file.tgz: Cannot write: Broken pipe
tar: Child returned status 2
tar: Error is not recoverable: exiting now

/mnt/share/client-redacted $ touch test
/mnt/share/client-redacted $ rm test

/mnt/share/client-redacted $ tar -czvf my_file.tgz my_folder/
tar (child): my_file.tgz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
my_folder/
my_folder/media/
my_folder/media/some_file.m4a
tar: my_file.tgz: Cannot write: Broken pipe
tar: Child returned status 2
tar: Error is not recoverable: exiting now

/mnt/share/client-redacted $ rm my_file.tgz
rm: can't remove 'my_file.tgz': No such file or directory

/mnt/share/client-redacted $ ls -alh
total 4K
drwxrwx---    2 www-data www-data       0 Mar  5  2019 .
drwxr-xr-x   22 root     root        4.0K May  6 17:52 ..
-rwxrwx---    0 www-data www-data    1.2G May  6 18:10 my_file.tgz
drwxrwx---    2 www-data www-data       0 May  6 18:06 my_folder
/mnt/share/client-redacted $ rm my_file.tgz
rm: can't remove 'my_file.tgz': No such file or directory

Sure enough:

Get-AzStorageFileHandle -Context $Context -ShareName "client-smb" -Recursive

HandleId     Path                            ClientIp ClientPort OpenTime             LastReconnectTime    FileId              ParentId SessionId
--------     ----                            -------- ---------- --------             -----------------    ------              -------- ---------
166104042281 my_file.tgz 10.1.1.4 43330      2020-05-06 18:10:06Z 2020-05-06 18:12:13Z 9223444604622209024 0        9953474151815577669

@GuyPaddock GuyPaddock changed the title Files Getting Sporadically Locked on azureFile Volumes Sporadic I/O Errors Lead to Files Getting Sporadically Locked on azureFile Volumes May 6, 2020
@andyzhangx

@GuyPaddock is it possible to try the ubuntu:16.04 image?

@GuyPaddock

@andyzhangx At the moment, the workloads are provided by a vendor rather than ones we built in-house. Is there a particular reason why Ubuntu 16.04 would handle this better? My understanding is that the SMB volumes are mounted on the host node rather than by the container itself.


jnoller commented May 11, 2020

@GuyPaddock the errors you pasted are the result of IOPS starvation/saturation -- see issue #1373. "No such file or directory" occurs from within the container when the node VM itself falls victim to the OS-disk or VM-level IOPS throttle.

@GuyPaddock

@jnoller What is our next step if Azure Support is merely focused on "fixing" CoreDNS and is not working with us on the IOPS issue? I'm waiting to hear back from Azure Support after sending them my latest info, but so far they've only focused on adjusting CoreDNS to auto-scale for us.


jnoller commented May 11, 2020

@GuyPaddock I would send them this issue/thread as well as issue #1373 via the ticket. You should also ask them to check for IOPS throttling on the VMs; this can be triggered by the OS disk or by the VM SKU limit.

@GuyPaddock

@jnoller Both were included in the original issue summary...

@andyzhangx

> @GuyPaddock the issue I see based on the errors you pasted are the result of IOPS starvation/saturation - see issue #1373 - no such file or directory occurs from within the container when the node VM itself is victim to the OS disk or VM level IOPS throttle.

I don't think it's related to IOPS: with the ubuntu:16.04 or centos image it works as expected, while on alpine:3.10 it does not. The test was done in the same environment, and it's reproducible:

sudo mount -t cifs //accountname.file.core.windows.net/test /tmp/test -o vers=3.0,username=accountname,password=xxx,dir_mode=0777,file_mode=0777,cache=strict,actimeo=30
wget -O /tmp/test/test.sh https://raw.githubusercontent.com/andyzhangx/demo/master/debug/test.sh

docker run -it -v /tmp/test:/var/www/html/data/ --name alpine alpine:3.10 sh
docker run -it -v /tmp/test:/var/www/html/data/ --name ubuntu ubuntu:16.04 sh
docker run -it -v /tmp/test:/var/www/html/data/ --name centos centos sh
sh-4.4# cd /var/www/html/data/
sh-4.4# ./test.sh 128
Creating '128' test files...

Trying to delete test files...
DELETED: 129  BEFORE: 128  AFTER: 0
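The linked test.sh isn't reproduced here, but judging from its output it appears to create N files and then delete them, counting the files before and after. A rough local equivalent (against a temp directory rather than the CIFS mount; the real script's internals may differ) would be:

```shell
#!/bin/sh
# Rough local equivalent of the linked test.sh, inferred from its output:
# create N files, delete them, and report the before/after counts.
set -e
count=${1:-128}
dir=$(mktemp -d)          # stand-in for the mounted share

echo "Creating '$count' test files..."
i=1
while [ "$i" -le "$count" ]; do
  : > "$dir/testfile$i"   # create an empty test file
  i=$((i + 1))
done
before=$(ls "$dir" | wc -l)

echo "Trying to delete test files..."
rm -f "$dir"/testfile*
after=$(ls "$dir" | wc -l)

echo "BEFORE: $before  AFTER: $after"
```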

cc @smfrench


jnoller commented May 12, 2020

@andyzhangx

This is not tied to the DNS issue you're citing or to the kernel revision. While you can re-create similar failures, you can clearly see the container's inability to read its own filesystem and socket.

I can re-create this using any Docker image or application, regardless of the contents of the image OS.

If the system's IOPS issues are unresolved, even after applying the DNS fixes or other options in this thread, the system will still fail under load.


GuyPaddock commented May 14, 2020

@andyzhangx I believe you are confusing this issue with my other issue, #1325.

Issue #1325 is about folders mounted over SMB having problems in Alpine containers when they contain more than 64 files. This issue (#1593) is about a high volume of reads or writes to an SMB volume in AKS causing the volume to fail sporadically with I/O errors that leave files locked, regardless of the container type (Alpine, CentOS, etc.).



ghost commented Jul 26, 2020

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label Jul 26, 2020
@palma21 palma21 added azure/files and removed Needs Attention 👋 Issues needs attention/assignee/owner action-required labels Jul 27, 2020
@ghost ghost added the action-required label Aug 21, 2020
@ghost ghost added the stale Stale issue label Oct 20, 2020

ghost commented Oct 20, 2020

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

@ghost ghost closed this as completed Nov 4, 2020

ghost commented Nov 4, 2020

This issue will now be closed because it hasn't had any activity for 15 days after being marked stale. GuyPaddock, feel free to comment again within the next 7 days to reopen it, or open a new issue after that time if you still have a question/issue or suggestion.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 5, 2020