Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/http: short writes with FileServer on macos #70000

Closed
a-palchikov opened this issue Oct 23, 2024 · 23 comments
Closed

net/http: short writes with FileServer on macos #70000

a-palchikov opened this issue Oct 23, 2024 · 23 comments
Assignees
Labels
NeedsFix The path to resolution is known, but the work has not been done. OS-Darwin
Milestone

Comments

@a-palchikov
Copy link

Go version

go version go1.23.2 darwin/arm64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOCACHE='/user/Library/Caches/go-build'
GOENV='/user/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/user/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='darwin'
GOPATH='/user/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/user/dev/go/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/user/dev/go/go/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.23.2'
GODEBUG='httpmuxgo121=0'
GOTELEMETRY='local'
GOTELEMETRYDIR='/user/Library/Application Support/go/telemetry'
GCCGO='gccgo'
GOARM64='v8.0'
AR='ar'
CC='clang'
CXX='clang++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK='/user/dev/path/to/go.work'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/_y/<snip>/T/go-build<id>=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

I have a small webserver that serves static files in a testing setup.

What did you see happen?

I'm on go 1.23.2 on macos 14.7 and I'm having issues with the stdlib http.FileServer.
After much of poking around, I found that with the BSD sendfile implementation in internal/poll/sendfile_bsd.go having
the maxSendfileSize set to 4MiB is causing all large file transfers to fail. For example, the following is the curl failure:

...
{ [512 bytes data]
* transfer closed with 361300436 bytes remaining to read
  1  350M    1 5567k    0     0  80.7M      0  0:00:04 --:--:--  0:00:04 81.1M
* Closing connection
curl: (18) transfer closed with 361300436 bytes remaining to read

so all requests are failing with short writes.

If I update the maxSendfileSize to maximum (adopting the change from this commit) I don't observe any failures.

Just setting the maxSendfileSize feels like a hack though (maybe I'm wrong) so I'd appreciate any hints on this behavior.

What did you expect to see?

The sendfile API should be able to send files in their entirety.

@ianlancetaylor
Copy link
Member

Can you show us code that we can use the replicate the problem? Thanks.

CC @golang/darwin

@ianlancetaylor ianlancetaylor added OS-Darwin NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Oct 23, 2024
@ianlancetaylor
Copy link
Member

CC @panjf2000

@a-palchikov
Copy link
Author

a-palchikov commented Oct 23, 2024

There's nothing special to it, really:

m := http.NewServeMux()
m.Handle("/large-payload/", http.StripPrefix("/large-payload/", http.FileServer(http.Dir("path/to/large-payload"))))

it looks at a directory with a bunch of numeric files all 350MiB in size:

$ tree path/to/large-payload | head -n3
path/to/large-payload
├── 1
├── 10
...

and I simply hit the server with:

curl localhost:44444/large-payload/1 -o ./test

In an OK case, this results in:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  350M  100  350M    0     0  1199M      0 --:--:-- --:--:-- --:--:-- 1202M

In the error case (go 1.23.2):

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  2  350M    2 8031k    0     0   361M      0 --:--:-- --:--:-- --:--:--  373M
curl: (18) transfer closed with 358777300 bytes remaining to read

@panjf2000
Copy link
Member

Can we get the error returned by your server-side? This 970b1c0 will ship in Go1.24.

@a-palchikov
Copy link
Author

a-palchikov commented Oct 23, 2024

@panjf2000 I'll gladly comply if you can narrow the context. On high-level it writes short here:

io.CopyN(w, sendContent, sendSize)

From what I observed, there's no error, e.g.:

written, werr, handled = poll.SendFile(&c.pfd, int(fd), pos, remain)

returns with

written=4194304, werr=<nil>, handled=true

the number of bytes written fluctuates, e.g.:

written=12524536, werr=<nil>, handled=true

the main point is that it's a short write.

Please let me know if you need more details.

@jiahongli18
Copy link

Can we get the error returned by your server-side? This 970b1c0 will ship in Go1.24.

As @a-palchikov mentioned, we don't see any errors from the server itself. The observation from the client is a partially fetched file.

@prattmic
Copy link
Member

cc @neild

@a-palchikov
Copy link
Author

a-palchikov commented Oct 23, 2024

This seems to do the trick:

diff --git a/src/internal/poll/sendfile_bsd.go b/src/internal/poll/sendfile_bsd.go
index 669df94cc1..35338730aa 100644
--- a/src/internal/poll/sendfile_bsd.go
+++ b/src/internal/poll/sendfile_bsd.go
@@ -47,7 +47,7 @@ func SendFile(dstFD *FD, src int, pos, remain int64) (written int64, err error,
                // don't implement sendfile), and other errors.
                // We should end the loop when there is no error
                // returned from sendfile(2) or it is not a retryable error.
-               if err != syscall.EAGAIN {
+               if err != nil && err != syscall.EAGAIN {
                        break
                }
                if err = dstFD.pd.waitWrite(dstFD.isFile); err != nil {

The sequence of events is like following:

n= 1408044 , err= resource temporarily unavailable
n= 434176 , err= resource temporarily unavailable
n= 4194304

given the series of EAGAIN followed by no error, when it reaches the limit (4MiB), it bails out.

@ianlancetaylor
Copy link
Member

@a-palchikov Thanks. Looks like we need something similar in the other sendfile implementations as well.

@neild neild self-assigned this Oct 23, 2024
@jonjohnsonjr
Copy link
Contributor

Just wanted to chime in that this exhibits worse behavior on the client side when uploading large files. I have a client that will hang indefinitely on macOS pretty regularly, which I believe is due to this issue, as it uses an os.File as the Body of a request.

This was a bit of a heisenbug for me because when I tried wrapping the os.File in another struct to observe what was going on, the problem disappeared. That was enough of a clue to lead me here (and also serve as a workaround until it's fixed).

Thanks for uncovering this!

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/622235 mentions this issue: internal/poll: keep copying after successful Sendfile return on BSD

@ianlancetaylor
Copy link
Member

@gopherbot Please open backport issues.

This causes inexplicable and difficult to avoid failures on BSD systems. We should backport the fix.

@gopherbot
Copy link
Contributor

gopherbot commented Oct 24, 2024

Backport issue(s) opened: #70019 (for 1.22), #70020 (for 1.23).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

gopherbot pushed a commit that referenced this issue Oct 24, 2024
The BSD implementation of poll.SendFile incorrectly halted
copying after succesfully writing one full chunk of data.
Adjust the copy loop to match the Linux and Solaris
implementations.

In testing, empirically macOS appears to sometimes return
EAGAIN from sendfile after successfully copying a full
chunk. Add a check to all implementations to return nil
after successfully copying all data if the last sendfile
call returns EAGAIN.

For #70000

Change-Id: I57ba649491fc078c7330310b23e1cfd85135c8ff
Reviewed-on: https://go-review.googlesource.com/c/go/+/622235
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
@panjf2000
Copy link
Member

This seems to do the trick:

diff --git a/src/internal/poll/sendfile_bsd.go b/src/internal/poll/sendfile_bsd.go
index 669df94cc1..35338730aa 100644
--- a/src/internal/poll/sendfile_bsd.go
+++ b/src/internal/poll/sendfile_bsd.go
@@ -47,7 +47,7 @@ func SendFile(dstFD *FD, src int, pos, remain int64) (written int64, err error,
                // don't implement sendfile), and other errors.
                // We should end the loop when there is no error
                // returned from sendfile(2) or it is not a retryable error.
-               if err != syscall.EAGAIN {
+               if err != nil && err != syscall.EAGAIN {
                        break
                }
                if err = dstFD.pd.waitWrite(dstFD.isFile); err != nil {

The sequence of events is like following:

n= 1408044 , err= resource temporarily unavailable
n= 434176 , err= resource temporarily unavailable
n= 4194304

given the series of EAGAIN followed by no error, when it reaches the limit (4MiB), it bails out.

Thanks for the details! I think I've found the root cause and you would get rid of this issue after adding the below changes to your source code of go1.23.2:

diff --git a/src/internal/poll/sendfile_bsd.go b/src/internal/poll/sendfile_bsd.go
index f2d13e1069..1f9c42253a 100644
--- a/src/internal/poll/sendfile_bsd.go
+++ b/src/internal/poll/sendfile_bsd.go
@@ -35,12 +35,16 @@ func SendFile(dstFD *FD, src int, pos, remain int64) (written int64, err error,
 		if int64(n) > remain {
 			n = int(remain)
 		}
+		m := n
 		pos1 := pos
 		n, err = syscall.Sendfile(dst, src, &pos1, n)
 		if n > 0 {
 			pos += int64(n)
 			written += int64(n)
 			remain -= int64(n)
+			if n == m {
+				continue
+			}
 		}
 		if err == syscall.EINTR {
 			continue

Please try it out and let me know if it works, thanks! @a-palchikov

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/622255 mentions this issue: internal/poll: handle the special case of sendfile(2) sending the full chunk

@a-palchikov
Copy link
Author

@panjf2000 Yes, it works as well and is short-circuiting the wait write (which I assume is only necessary when handling the EAGAIN). Thanks!

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/622696 mentions this issue: [release-branch.go1.23] internal/poll: keep copying after successful Sendfile return on BSD

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/622697 mentions this issue: [release-branch.go1.23] internal/poll: handle the special case of sendfile(2) sending the full chunk

@dmitshur dmitshur added this to the Go1.24 milestone Oct 26, 2024
@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Oct 26, 2024
gopherbot pushed a commit that referenced this issue Oct 30, 2024
…Sendfile return on BSD

The BSD implementation of poll.SendFile incorrectly halted
copying after succesfully writing one full chunk of data.
Adjust the copy loop to match the Linux and Solaris
implementations.

In testing, empirically macOS appears to sometimes return
EAGAIN from sendfile after successfully copying a full
chunk. Add a check to all implementations to return nil
after successfully copying all data if the last sendfile
call returns EAGAIN.

For #70000
For #70020

Change-Id: I57ba649491fc078c7330310b23e1cfd85135c8ff
Reviewed-on: https://go-review.googlesource.com/c/go/+/622235
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
(cherry picked from commit bd388c0)
Reviewed-on: https://go-review.googlesource.com/c/go/+/622696
gopherbot pushed a commit that referenced this issue Oct 30, 2024
…dfile(2) sending the full chunk

CL 622235 would fix #70000 while resulting in one extra sendfile(2) system
call when sendfile(2) returns (>0, EAGAIN).
That's also why I left sendfile_bsd.go behind, and didn't make it line up
with other two implementations: sendfile_linux.go and sendfile_solaris.go.

Unlike sendfile(2)'s on Linux and Solaris that always return (0, EAGAIN),
sendfile(2)'s on *BSD and macOS may return (>0, EAGAIN) when using a socket
marked for non-blocking I/O. In that case, the current code will try to re-call
sendfile(2) immediately, which will most likely get us a (0, EAGAIN).
After that, it goes to `dstFD.pd.waitWrite(dstFD.isFile)` below,
which should have been done in the first place.

Thus, the real problem that leads to #70000 is that the old code doesn't handle
the special case of sendfile(2) sending the exact number of bytes the caller requested.

Fixes #70000
Fixes #70020

Change-Id: I6073d6b9feb58b3d7e114ec21e4e80d9727bca66
Reviewed-on: https://go-review.googlesource.com/c/go/+/622255
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Damien Neil <dneil@google.com>
Run-TryBot: Andy Pan <panjf2000@gmail.com>
Reviewed-on: https://go-review.googlesource.com/c/go/+/622697
@ldemailly
Copy link

is this a regression or has been bugged for a long time? ie what is the range of go versions with that issue?

@a-palchikov
Copy link
Author

I believe it was introduced in this commit so affected only the 1.23 release.

wyf9661 pushed a commit to wyf9661/go that referenced this issue Dec 16, 2024
…Sendfile return on BSD

The BSD implementation of poll.SendFile incorrectly halted
copying after succesfully writing one full chunk of data.
Adjust the copy loop to match the Linux and Solaris
implementations.

In testing, empirically macOS appears to sometimes return
EAGAIN from sendfile after successfully copying a full
chunk. Add a check to all implementations to return nil
after successfully copying all data if the last sendfile
call returns EAGAIN.

For golang#70000
For golang#70020

Change-Id: I57ba649491fc078c7330310b23e1cfd85135c8ff
Reviewed-on: https://go-review.googlesource.com/c/go/+/622235
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
(cherry picked from commit bd388c0)
Reviewed-on: https://go-review.googlesource.com/c/go/+/622696
wyf9661 pushed a commit to wyf9661/go that referenced this issue Dec 16, 2024
…Sendfile return on BSD

The BSD implementation of poll.SendFile incorrectly halted
copying after succesfully writing one full chunk of data.
Adjust the copy loop to match the Linux and Solaris
implementations.

In testing, empirically macOS appears to sometimes return
EAGAIN from sendfile after successfully copying a full
chunk. Add a check to all implementations to return nil
after successfully copying all data if the last sendfile
call returns EAGAIN.

For golang#70000
For golang#70020

Change-Id: I57ba649491fc078c7330310b23e1cfd85135c8ff
Reviewed-on: https://go-review.googlesource.com/c/go/+/622235
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
(cherry picked from commit bd388c0)
Reviewed-on: https://go-review.googlesource.com/c/go/+/622696
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsFix The path to resolution is known, but the work has not been done. OS-Darwin
Projects
None yet
Development

No branches or pull requests

12 participants