gRPC connection closing with unexpected EOF on high-latency connections #5358
Comments
Setting this flag to false will disable BDP and dynamic windows for the grpc connection that the client uses. This is a workaround for grpc/grpc-go#5358, which affects clients with high latency. Change-Id: Id43097bd32df6801fbbc0011886589a06d883903
Hello, thank you for bringing up this issue/potential bug. In its current state the team is unable to reproduce this, since it requires pinging a remote peer running CentOS 7 configured for a high ping. Could you please try reproducing this on the same machine, or even better, with the same binary? grpc-go provides a latency package (https://pkg.go.dev/google.golang.org/grpc/benchmark/latency) that can inject real-world latency into network connections. Reproducing it this way would (a) let the team reproduce it ourselves, and (b) get rid of all the other variables that may have caused this bug, narrowing the system under question down to our grpc code and the TCP stack it runs on top of.
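For reference, here is a minimal sketch (untested) of the kind of in-process repro harness being suggested, using the latency package linked above; the service registration is a placeholder and the 150ms figure is just an illustrative round-trip delay:

```go
package main

import (
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/benchmark/latency"
)

func main() {
	lis, err := net.Listen("tcp", "localhost:0")
	if err != nil {
		panic(err)
	}

	// Wrap the listener so every accepted connection behaves like a slow link.
	slow := &latency.Network{Kbps: 0, Latency: 150 * time.Millisecond, MTU: 1500}
	slowLis := slow.Listener(lis)

	srv := grpc.NewServer()
	// pb.RegisterYourServiceServer(srv, &yourServiceImpl{}) // placeholder registration
	go srv.Serve(slowLis)

	conn, err := grpc.Dial(lis.Addr().String(), grpc.WithInsecure()) // plaintext: local test only
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Issue unary or streaming RPCs over conn in a loop and watch for
	// "code = Internal desc = unexpected EOF".
}
```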
This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.
Hi, I wrote a test using the latency package as you described, but unfortunately I haven't yet been able to reproduce the bug with it. I also recorded some debug logs showing the case where the bug hits in contrast to the case where it does not. Something that may be relevant here is that the remote machine is very powerful relative to the local machine.
My worry is that there's something else in the system impacting it, like a firewall or proxy. It might be helpful to inspect the traffic with something like Wireshark.
This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.
Hi, the CentOS remote machine is behind a VPN, although it's worth noting that this issue can be hit even when the client is on the same VPN (it's just a lot less frequent, presumably because the network connection is better). Other points of note regarding the architecture(s) required to reproduce this issue:
I believe we're also seeing this issue, but there is no high latency, as both the client and server are running in the same process. It's rare (1 out of every 3000-5000 attempts), but we've seen it on both macOS and Linux. Setting a large window size does seem to resolve the issue. In our case, it's just a Go test that spins up a server and client and makes a simple (non-streaming) RPC.
A number of our gRPC tests have been failing with an unexpected EOF error. It appears to be related to grpc/grpc-go#5358. Apply the suggested window-size workaround in tests that have been known to fail with this error.
@zmb3 I wrote a small unit test that minimally reproduces the unexpected EOF error; I haven't yet looked into the cause.
Thank you @brianneville! Appreciate you putting together a smaller reproduction - it was on my list of things to do.
Thanks for providing a smaller test case that reproduces the error seen in TestListResources. I was able to reproduce this error locally and am looking into it further now; it's interesting that it's also fixed with the window-size workaround. However, the orthogonal issue that this issue was originally filed about is still too difficult for the team to reproduce (given the requirements of the machine the server is running on).
That workaround (delaying ending the stream) works for both streaming and unary RPCs.
The EOF appears when there is just one streaming RPC running, i.e. one client streaming to one server, without any other clients streaming at the same time. If there are concurrency issues between writing and closing, then they're happening in some layer below the service (gRPC layer or lower).
https://github.com/brianneville/grpcbug doesn't use keepalive settings, but for the original program where this issue was observed (which the grpcbug repo minimally reproduces), yes, the following keepalive settings are being used:
grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{MinTime: time.Second}),
grpc.KeepaliveParams(keepalive.ServerParameters{Time: 10 * time.Minute})
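For context, here is a sketch of how those options would typically be passed when constructing the server; service registration is omitted, and only the standard grpc and keepalive packages are assumed:

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServer builds a gRPC server with the keepalive settings quoted above.
func newServer() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{MinTime: time.Second}),
		grpc.KeepaliveParams(keepalive.ServerParameters{Time: 10 * time.Minute}),
	)
}
```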
Bumps google.golang.org/grpc to v1.49.0 which includes a fix for grpc/grpc-go#5358 which has been causing tests to fail.
I also have the same EOF message. When I use the 'get' command in etcdctl on Mac, it always outputs 'unexpected EOF' (ping to endpoint: 15ms). I modified the etcdctl source and then got all the results.
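The commenter's actual modification isn't shown above. Purely as a hypothetical illustration, a change of roughly this shape, passing the window-size DialOptions through the etcd clientv3 config, would apply the same workaround (the endpoint is a placeholder and the field names assume the go.etcd.io/etcd/client/v3 API):

```go
package main

import (
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"google.golang.org/grpc"
)

// newEtcdClient is a hypothetical sketch only, not the commenter's actual patch.
func newEtcdClient() (*clientv3.Client, error) {
	return clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://example-endpoint:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
		DialOptions: []grpc.DialOption{
			grpc.WithInitialWindowSize(1 << 20),     // > 65535, disables BDP/dynamic windows
			grpc.WithInitialConnWindowSize(1 << 20), // should also exceed the largest response
		},
	})
}
```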
I recently encountered the same issue. The conclusion was that when communicating between the office network and the development-environment network within my company, there is a "man-in-the-middle" that identifies and forwards communication content. In a specific scenario, the "man-in-the-middle" failed to forward a segment of an HTTP/2 frame, resulting in the unexpected EOF. Here's how I discovered the issue: I started a grpc client on the office network and a grpc server in the development environment. When the server returned content with a length greater than 65535, there was a chance of getting the unexpected EOF error. After capturing and comparing packets on both the client and server sides, I found that the packets on the two sides could not be matched one-to-one, and some packet contents were modified in the requests that caused the error.
Therefore, I think this is not a grpc bug. If you need more detailed information, such as packet capture files, please let me know, and I will organize and share them.
We also came across this issue in our organization. It reproduced only in a local environment when connecting to a remote server behind a VPN (tried and reproduced on macOS and on an Ubuntu VM on Windows). Most probably there are more proxies on the way that I don't know about. Setting larger window sizes, as suggested above, helped. One interesting thing: it never works with the grpc-go client until those options are tuned, yet it works when sending the request from Postman (though not always; sometimes Postman shows an empty response with status OK). Possibly Postman has bigger defaults in its underlying client or doesn't use some of the smarter algorithms; hard to say.
One thing we suspect now is that the server might not be setting the linger socket option (need to check on this), which could lead to this kind of behavior if there is any TCP error after the server finishes writing the RPC data and closes the connection. Update: it looks like Go sets this by default, so I don't suspect this has anything to do with it anymore.
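As an aside, for anyone wanting to rule this out explicitly, linger behavior can be controlled per connection via net.TCPConn.SetLinger; the wrapper below is a hypothetical sketch, not something the server in question actually does:

```go
package main

import "net"

// lingerListener wraps a net.Listener and explicitly sets linger behavior on
// each accepted TCP connection. SetLinger(-1) selects the default behavior,
// where pending data is flushed in the background after Close.
type lingerListener struct{ net.Listener }

func (l lingerListener) Accept() (net.Conn, error) {
	c, err := l.Listener.Accept()
	if err != nil {
		return nil, err
	}
	if tc, ok := c.(*net.TCPConn); ok {
		_ = tc.SetLinger(-1)
	}
	return c, nil
}
```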
Thank you to @s-matyukevich for figuring out what was going on here and sending an initial fix! (See #6977 for more details)
What version of gRPC are you using?
1.46.2
What version of Go are you using (go version)?
go version go1.18 darwin/amd64
What operating system (Linux, Windows, …) and version?
Client is running macOS:
ProductName: macOS
ProductVersion: 11.6
BuildVersion: 20G165
Server is running Linux (CentOS 7)
What did you do?
Note: I have posted a small repo that can be used to reproduce this bug, please check it out here:
https://github.com/brianneville/grpcbug
For high-latency connections, I noticed that RPCs were having their connection terminated with an internal error:
code = Internal desc = unexpected EOF
This hits on both unary and streaming RPCs, and the ease with which this can be reproduced seems to vary with the client-server latency.
For high-latency connections (such as a 150ms ping), this error hits pretty much every time an RPC is called.
For low-latency connections (<2ms ping) this error is much more infrequent, and hundreds of thousands of messages may be streamed over this connection before the error is hit even once.
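For illustration, here is a self-contained sketch of how this surfaces on the client; it uses the standard gRPC health-check service purely as a stand-in for the actual service in the repro repo, and the target address is a placeholder:

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health/grpc_health_v1"
	"google.golang.org/grpc/status"
)

func main() {
	// Placeholder target; the real repro dials the high-latency server and
	// calls the service from the linked repro repo instead.
	conn, err := grpc.Dial("remote-host:50051", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := grpc_health_v1.NewHealthClient(conn)
	for i := 0; ; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		_, err := client.Check(ctx, &grpc_health_v1.HealthCheckRequest{})
		cancel()
		if err != nil {
			s, _ := status.FromError(err)
			// Affected connections report: code = Internal desc = unexpected EOF
			log.Fatalf("rpc #%d failed: code=%s desc=%q", i, s.Code(), s.Message())
		}
	}
}
```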
I did a bit of digging through the grpc library code, and found two ways that the error can be prevented:
1. Client side - configure window sizes
The error can be prevented from the client side by disabling the dynamic window and BDP estimation for flow control when dialing the grpc server.
That is, set DialOptions on the client side that fix both the initial (per-stream) window size and the initial connection window size to some value largerWindowSize (as sketched below), where largerWindowSize is both:
- greater than 65535 (so that dynamic window/flow estimation is turned off), and
- greater than the size of the largest RPC response message (with a bit of overhead, for some reason).
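A sketch of that client-side change, assuming plaintext dialing for brevity; the 1 MiB value is only illustrative and should be sized per the two conditions above:

```go
package main

import "google.golang.org/grpc"

// dialWithLargeWindows applies the client-side workaround: fixed window sizes
// greater than 65535 turn off BDP estimation and dynamic windows, and the value
// should also exceed the largest expected response message.
func dialWithLargeWindows(addr string) (*grpc.ClientConn, error) {
	const largerWindowSize = 1 << 20 // 1 MiB; illustrative value
	return grpc.Dial(addr,
		grpc.WithInsecure(), // plaintext for brevity; use real credentials as appropriate
		grpc.WithInitialWindowSize(largerWindowSize),
		grpc.WithInitialConnWindowSize(largerWindowSize),
	)
}
```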
2. Server-side - delay ending the stream
The error can be prevented by delaying the calls that write the StatusOk into the transport. Specifically, the error does not occur if the END_STREAM header is delayed from being put into the controlBuf at google.golang.org/grpc/internal/transport/http2_server.go#finishStream. That is, you can change *http2Server.finishStream so that the t.controlBuf.put(hdr) call is delayed (for example, by sleeping briefly before it), and the error will not be present.
When the t.controlBuf.put(hdr) line is delayed in this way, the RPC is allowed to complete normally, and the client will see the response as intended. Note that if you add the sleep after the t.controlBuf.put(hdr) call instead, the error will still be present (i.e. delaying the finishStream function itself is not what prevents the error).
Would anyone know what might be going on here, or be able to give me some insight/advice for continuing to debug this issue?
What did you expect to see?
Connection latency does not affect RPCs
What did you see instead?
High-latency connections (such as ~150ms ping) reliably hit the error
code = Internal desc = unexpected EOF