Docker's large default NFS rsize/wsize hanging container when accessing >~200kb files in Ventura #6544

Open
3 tasks done
karlshea opened this issue Oct 28, 2022 · 24 comments

@karlshea

karlshea commented Oct 28, 2022

  • I have tried with the latest version of Docker Desktop
  • I have tried disabling enabled experimental features
  • I have uploaded Diagnostics
  • Diagnostics ID: 45967571-6F52-4272-8359-CBA300C14077/20221028162118

Expected behavior

Everything works normally.

Actual behavior

The container hangs. If caught within a second or two, ^C can stop the container when it is running in the foreground (docker compose up). Otherwise Docker itself can hang to the point where its process has to be killed. The message nfsd send error 40 also appears in the macOS Console.

Information

  • macOS Version: Ventura 13.0
  • Intel chip or Apple chip: Apple M2
  • Docker Desktop Version: v4.13.0

Output of /Applications/Docker.app/Contents/MacOS/com.docker.diagnose check

Starting diagnostics

[PASS] DD0027: is there available disk space on the host?
[PASS] DD0028: is there available VM disk space?
[PASS] DD0018: does the host support virtualization?
[PASS] DD0001: is the application running?
[PASS] DD0017: can a VM be started?
[PASS] DD0016: is the LinuxKit VM running?
[PASS] DD0011: are the LinuxKit services running?
[PASS] DD0004: is the Docker engine running?
[PASS] DD0015: are the binary symlinks installed?
[PASS] DD0031: does the Docker API work?
[PASS] DD0013: is the $PATH ok?
[PASS] DD0003: is the Docker CLI working?
[PASS] DD0014: are the backend processes running?
[PASS] DD0007: is the backend responding?
[PASS] DD0008: is the native API responding?
[PASS] DD0009: is the vpnkit API responding?
[PASS] DD0010: is the Docker API proxy responding?
[PASS] DD0012: is the VM networking working?
[SKIP] DD0030: is the image access management authorized?
[PASS] DD0019: is the com.docker.vmnetd process responding?
[PASS] DD0033: does the host have Internet access?
[PASS] DD0018: does the host support virtualization?
[PASS] DD0001: is the application running?
[PASS] DD0017: can a VM be started?
[PASS] DD0016: is the LinuxKit VM running?
[PASS] DD0011: are the LinuxKit services running?
[PASS] DD0004: is the Docker engine running?
[PASS] DD0015: are the binary symlinks installed?
[PASS] DD0031: does the Docker API work?
[PASS] DD0032: do Docker networks overlap with host IPs?
segment 2022/10/28 11:24:45 ERROR: sending request - Post "https://api.segment.io/v1/batch": dial tcp: lookup api.segment.io: no such host
segment 2022/10/28 11:24:45 ERROR: 1 messages dropped because they failed to be sent and the client was closed
No fatal errors detected.

(Pi-hole is blocking api.segment.io)

Steps to reproduce the behavior

  1. Clone https://github.com/karlshea/docker-nfs
  2. Fix your path in docker-compose.yaml
  3. Follow the tests in that repo's README.md
  4. Another user reports a hang when iterating over a directory with a large number of files, or when using cat/md5sum: macOS Ventura + nfs_mount_enabled: 504 Error on laravel and bedrock sites ddev/ddev#4122 (comment)

Plain NFS access from another Mac seems to work normally.

Related issue: ddev/ddev#4122

@tsrivishnu

Facing the same issue with a Ruby on Rails project. Moving away from NFS mount removes the problem but we would need NFS for the project to work efficiently.

@noud-github

noud-github commented Oct 31, 2022

FYI:
adding ,wsize=32768,rsize=3276 to our docker NFS mount options seems to fix this issue.

nfsmount_xdebug:
    driver: local
    driver_opts:
      type: nfs
      o: addr=host.docker.internal,rw,nolock,hard,nointr,nfsvers=3,wsize=32768,rsize=3276
      device: ":${PWD}/xdebug"

edit:
Unexpectedly, the protocol maximum also seems to work: wsize=65536,rsize=65536

@ryanchapman

Following @noud-github's suggestion, adding wsize and rsize options fixes the issue for me as well.

Although I went with ,wsize=32768,rsize=32768 (versus 3276 for rsize).

@karlshea
Author

I think wsize/rsize might just be masking the problem. Trying an md5sum on a 100MB zip file through Docker still hangs, while it succeeds on a normal NFS mount on another Mac.

@noud-github

@karlshea Using wsize=65536,rsize=65536 I can run md5sum 100MB.zip on that NFS share in Docker.
Did you remove the NFS volume (shown by docker volume list) after editing the options?
If you don't (or forget, like I did the first time), the new settings are not applied.

@karlshea
Author

karlshea commented Nov 1, 2022

@noud-github You're right, I didn't! wsize=32768,rsize=32768 does indeed fix it for me. It looks like 32768 is the default for recent distros, so I'm curious what the Docker driver is using instead.

I still believe this is covering up a deeper issue (why are smaller values breaking?), but at least it's fixing the immediate problem.

@noud-github

@karlshea I think you are right that this is just covering up a deeper issue.
The fact that this is not an issue across two systems is what pointed me toward this solution in the first place. When you connect to another system, you use two "real" network stacks, including buffer sizes etc. But if you run this on your local system you use the "local interface", and this is not the first time I have seen "unexpected" behavior when only the "local interface" is involved. So my guess is that Ventura has some "bug" or feature in the local interface stack that is triggered when wsize and rsize are not set in NFS.

@karlshea
Author

karlshea commented Nov 1, 2022

macOS defaults

macOS defaults (from man mount_nfs) are 8192 for UDP mounts and 32768 for TCP mounts.

Additional notes from wsize param: "Note that both the rsize and wsize options should only be used as a last ditch effort at improving performance when mounting servers that do not support TCP mounts."

nfsstat -m for a Mac-to-Mac NFS mount using all default options (mount -t nfs server-mac:/server-path directory):

General mount flags: 0x4000018 nodev,nosuid,multilabel
NFS parameters: vers=3,tcp,port=2049,nomntudp,hard,nointr,noresvport,negnamecache,callumnt,locks,quota,rsize=32768,wsize=32768,readahead=16,dsize=32768,rdirplus,nodumbtimer,timeo=10,maxgroups=16,acregmin=5,acregmax=60,acdirmin=5,acdirmax=60,nomutejukebox,nonfc,sec=sys

Docker defaults

I tried to find defaults by mounting with no options other than addr:

volumes:
  nfsmount-repo:
    driver: local
    driver_opts:
      type: nfs
      o: "addr=host.docker.internal"
      device: ":/Users/karl/Sites/nfs-test"

Then got into the Docker VM using justincormack/nsenter1 and ran mount:

/Users/karl/Sites/nfs-test on /var/lib/docker/volumes/nfs-test_nfsmount-repo/_data type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.65.2,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.65.2

Which is pretty strange, since our fixes actually seem to be setting the values lower. man mount_nfs says

The default read and write sizes are 8K when using UDP, and 32K when using TCP. Values over 16K are only supported for TCP, where 2M is the maximum.

Any value over 32K is unlikely to get you more performance, unless you have a very fast network.

If the network interface cannot handle larger packet sizes or a long train of back to back packets, you may see low performance figures or even temporary hangups during NFS activity.

This seems to possibly point to the root cause.
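
To compare effective sizes quickly, the rsize and wsize values can be pulled out of mount output like the line shown earlier in this comment. This is only a minimal sketch: the function name nfs_sizes is made up for illustration, and it assumes the options appear as rsize=N followed later by wsize=N on the same line.

```shell
# Hypothetical helper: print the rsize/wsize options found in mount output.
# Reads mount lines on stdin and prints e.g. "rsize=1048576 wsize=1048576".
# Assumes rsize appears before wsize on the line; other lines are skipped.
nfs_sizes() {
  sed -n 's/.*[(,]\(rsize=[0-9][0-9]*\).*,\(wsize=[0-9][0-9]*\).*/\1 \2/p'
}

# Example (inside the Docker VM): mount | grep ' type nfs ' | nfs_sizes
```

The same function should also work on the NFS parameters line printed by nfsstat -m on macOS, since that line uses the same rsize=/wsize= spelling.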

@noud-github

Looks like Docker is trying to use a pretty high default packet size. That, in combination with:

If the network interface cannot handle larger packet sizes or a long train of back to back packets, you may see low performance figures or even temporary hangups during NFS activity.

makes a solid case for using a lower packet size, as we found Ventura's "local interface" doing just that.
Makes you wonder what they changed there ;-)

@karlshea
Author

karlshea commented Nov 1, 2022

Those sizes are supposed to be powers of 2 according to the man page. I tested up to 262144 (md5sum hangs), the biggest that worked was 131072.
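
For anyone repeating this kind of test, the candidate values can be generated rather than typed by hand. The sketch below is an illustration only: both function names are hypothetical, and it merely lists and checks powers of two. It does not mount anything.

```shell
# Hypothetical helper: succeed if $1 is a positive power of two.
is_power_of_two() {
  n=$1
  [ "$n" -gt 0 ] || return 1
  while [ $((n % 2)) -eq 0 ]; do
    n=$((n / 2))
  done
  [ "$n" -eq 1 ]
}

# Hypothetical helper: print every power of two from $1 up to $2 inclusive,
# assuming $1 is itself a power of two.
nfs_size_candidates() {
  size=$1
  while [ "$size" -le "$2" ]; do
    echo "$size"
    size=$((size * 2))
  done
}
```

For example, nfs_size_candidates 8192 262144 lists the sizes from 8192 through 262144, the range covered in this thread. Each value still has to be tried in the volume's o: options, and the volume recreated each time. Note that is_power_of_two 3276 fails while is_power_of_two 32768 succeeds.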

@karlshea changed the title from "Accessing >~200kb files from NFS volumes in Ventura can hang container" to "Docker's large default NFS rsize/wsize hanging container when accessing >~200kb files in Ventura" on Nov 1, 2022

@rfay

rfay commented Nov 1, 2022

Oh I forgot - I had tested this problem with Colima in ddev/ddev#4122 (comment) - So this is not strictly a Docker Desktop issue I don't think.

@noud-github

noud-github commented Nov 2, 2022

So this is not strictly a Docker Desktop issue I don't think.

I am using Rancher Desktop, so I can concur on that.

@shinde-rahul

Setting wsize=8192,rsize=8192 fixes my issue too.

@Carpenter0100

Carpenter0100 commented Nov 8, 2022

@noud-github

Fuck this! I spent the whole day trying to find a solution to this problem.
Thank you very much.

Could you explain how you came up with the solution and how you investigated the problem? Thank you.

@karlshea
Author

karlshea commented Nov 8, 2022

Could you explain how you came up with the solution and how you investigated the problem? Thank you.

All of the investigation is in this issue and ddev/ddev#4122. I believe all of us looking into it thought we were raising the Docker defaults to fix the problem, but it turns out we were lowering them.

@notinaboat

notinaboat commented Dec 6, 2022

FWIW: I'm seeing nfsd send error 40 and md5sum bigfile hanging with Ventura NFS server and Raspberry Pi clients over WiFi (no Docker involved). Was reliable before Ventura. So this looks like a macOS problem.

wsize=65536,rsize=65536 works for me.

@Krilo89

Krilo89 commented Dec 13, 2022

Exactly what @Carpenter0100 says ;).
How did you come up with those params @noud-github ?

@freef4ll

The recently released Ventura 13.1 now has issues with wsize=32768,rsize=32768 and fails to use them:

nfssvc_addsock: socket buffer setting error(s) 22

Error code 22 is EINVAL.

Bumping the NFS socket buffer to 65536 solves the issue for me.

@herveguetin

Same issue with a Magento 2 project using a huge number of Composer dependencies.
Using wsize=65536,rsize=65536 fixed the issue.
Once the YAML file is updated, do not forget to:

  1. docker compose down
  2. docker volume rm [NFSMOUNT_VOLUME_NAME]
  3. docker-compose up -d
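
The three steps above can be wrapped in a small function so they are not forgotten after each edit. This is a sketch under assumptions: the function name recreate_nfs_volume and the DRY_RUN switch are made up for illustration, and it uses the docker compose (v2) spelling for both the down and up steps.

```shell
# Hypothetical helper: recreate an NFS volume so edited mount options apply.
# $1 is the volume name as shown by `docker volume ls`.
# Set DRY_RUN=1 to print the commands instead of running them.
recreate_nfs_volume() {
  volume=$1
  if [ -z "$volume" ]; then
    echo "usage: recreate_nfs_volume VOLUME_NAME" >&2
    return 2
  fi
  run=""
  if [ "${DRY_RUN:-0}" = "1" ]; then
    run="echo"
  fi
  # Stop the stack, drop the stale volume, then start again; stop at the
  # first step that fails.
  $run docker compose down &&
    $run docker volume rm "$volume" &&
    $run docker compose up -d
}
```

Running it with DRY_RUN=1 first shows exactly which commands would be executed for a given volume name, without touching any containers.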

@noud-github

noud-github commented Jan 6, 2023

@Krilo89

Actually, it was down to more than two decades of experience in DevOps.
I cannot find the issue anymore, but there was a Ventura/Docker/NFS issue where someone mentioned that running NFS on one laptop and Docker on another did not have the problem. That reminded me of a two-decade-old issue on Windows where the MTU size was not respected/applied by the local interface (lo), breaking iSCSI on a local system.
So I actually only googled NFS and packet size to find the solution.

[edit] The working cross-Mac setup came from this thread:
ddev/ddev#4122 (comment)

@noud-github

@Carpenter0100 see above

@docker-robott
Collaborator

There hasn't been any activity on this issue for a long time.
If the problem is still relevant, mark the issue as fresh with a /remove-lifecycle stale comment.
If not, this issue will be closed in 30 days.

Prevent issues from auto-closing with a /lifecycle frozen comment.

/lifecycle stale

@karlshea
Author

karlshea commented Apr 6, 2023

/remove-lifecycle stale
/lifecycle frozen

There are workarounds, but with out of the box defaults it's broken. If anyone from the Docker org bothered replying to shed any light on this situation maybe it could move more towards "fixed".

@colradec

Has anybody encountered a drastic decrease in performance (both r/w) of nfs mounts in 13.4? Installing our CI takes over ~40 minutes instead of ~5 (both M1 and M1 Pro)

I was debugging if the docker version caused this, however with another Mac with 13.3.1 no performance issues were noticeable.

To all of you, do NOT update to 13.4! The VirtIO FS does not have any performance issues but this problem #6820 seems to be present.
