
gRPC connection closing with unexpected EOF on high-latency connections #5358

Closed
brianneville opened this issue May 16, 2022 · 39 comments · Fixed by #5494 or #6977
@brianneville

What version of gRPC are you using?

1.46.2

What version of Go are you using (go version)?

go version go1.18 darwin/amd64

What operating system (Linux, Windows, …) and version?

Client is running macOS:
ProductName: macOS
ProductVersion: 11.6
BuildVersion: 20G165

Server is running Linux (CentOS 7)

What did you do?

Note: I have posted a small repo that can be used to reproduce this bug; please check it out here:
https://github.com/brianneville/grpcbug


For high-latency connections, I noticed that RPCs were having their connection terminated with an internal error:
code = Internal desc = unexpected EOF

This hits on both unary and streaming RPCs, and the ease with which this can be reproduced seems to vary with the client-server latency.
For high-latency connections (such as a 150ms ping), this error hits pretty much every time an RPC is called.
For low-latency connections (<2ms ping) this error is much more infrequent, and hundreds of thousands of messages may be streamed over this connection before the error is hit even once.

I did a bit of digging through the grpc library code, and found two ways that the error can be prevented:

1. Client side - configure window sizes

The error can be prevented from the client side by disabling the dynamic window and BDP estimation for flow control when dialing the grpc server.
That is, setting the DialOptions on the client side to use:

    opts := []grpc.DialOption{
        grpc.WithInitialWindowSize(largerWindowSize),
        grpc.WithInitialConnWindowSize(largerWindowSize),
    }

Where both:

  • largerWindowSize is greater than 65535 (so that dynamic window/flow estimation is turned off)
  • largerWindowSize is greater than the size of the largest RPC response messages (with a bit of overhead for some reason).
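
For reference, a rough sketch of what this client-side workaround could look like in full; the window size value, package name, and credentials here are illustrative placeholders, not taken from the report:

    package grpcbugclient // hypothetical package name

    import (
        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
    )

    // DialWithLargeWindows applies the window-size workaround described above.
    // largerWindowSize is an example value: it must exceed 65535 (so that BDP
    // estimation is disabled) and the largest expected response message.
    func DialWithLargeWindows(target string) (*grpc.ClientConn, error) {
        const largerWindowSize = 1 << 20 // 1 MiB
        return grpc.Dial(target,
            grpc.WithTransportCredentials(insecure.NewCredentials()),
            grpc.WithInitialWindowSize(largerWindowSize),
            grpc.WithInitialConnWindowSize(largerWindowSize),
        )
    }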

2. Server-side - delay ending the stream

The error can also be prevented from the server side by delaying the call that writes the OK status into the transport. Specifically, the error does not occur if the header carrying END_STREAM is delayed from being put into the controlBuf in google.golang.org/grpc/internal/transport/http2_server.go#finishStream.
That is, you can make any of the following changes to *http2Server.finishStream, and the error will not be present:

    time.Sleep(1 * time.Second) // <-- sleeping before put = EOF error is prevented
    t.controlBuf.put(hdr)

or

    go func() {         // <-- allowing finishStream to progress and sleep before put = EOF error is prevented
      time.Sleep(1 * time.Second)
      t.controlBuf.put(hdr)
    }()

When the t.controlBuf.put(hdr) line is delayed in this way, the RPC is allowed to complete normally, and the client will see the response as intended.

Note that if you add the sleep after the t.controlBuf.put(hdr) call, the error will still be present (i.e. delaying the finishStream function itself is not what prevents the error):

    t.controlBuf.put(hdr)
    time.Sleep(1 * time.Second) // <-- sleeping after put = EOF error still occurs 

Would anyone know what might be going on here, or be able to give me some insight/advice for continuing to debug this issue?

What did you expect to see?

Connection latency should not affect whether RPCs complete successfully.

What did you see instead?

High-latency connections (such as a ~150ms ping) reliably hit the error code = Internal desc = unexpected EOF.

@zasweq zasweq self-assigned this May 24, 2022
Arista-Jenkins pushed a commit to aristanetworks/goarista that referenced this issue May 25, 2022
Setting this flag to false will disable BDP and dynamic windows for the grpc connection that the client uses.
This is a workaround for grpc/grpc-go#5358, which affects clients with high latency.

Change-Id: Id43097bd32df6801fbbc0011886589a06d883903
@zasweq
Contributor

zasweq commented May 25, 2022

Hello, thank you for bringing up this issue/potential bug. In its current state, the team is unable to reproduce this: it requires pinging a remote peer running CentOS 7, configured for a high ping. Could you please try reproducing this on the same machine, or even better, in the same binary? grpc-go provides a latency package (https://pkg.go.dev/google.golang.org/grpc/benchmark/latency) that provides a way to inject real-world latency into network connections. Reproducing it in this way would (a) let the team reproduce it ourselves, and (b) get rid of all the other variables that may have caused this bug, narrowing the system in question down to our gRPC code and the TCP stack it runs on top of.
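
(For anyone attempting this, a rough, untested sketch of how such a test might wrap the client and server connections with that package is shown below; the package name, service registration, and RPC call are placeholders for the grpcbug service, and the latency value is just an example.)

    // Hypothetical reproduction test using google.golang.org/grpc/benchmark/latency
    // to simulate a ~150ms link between an in-process client and server.
    package grpcbug_test

    import (
        "context"
        "net"
        "testing"
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/benchmark/latency"
        "google.golang.org/grpc/credentials/insecure"
    )

    func TestHighLatencyEOF(t *testing.T) {
        // Unlimited bandwidth, 150ms latency, 1500-byte MTU.
        slow := latency.Network{Kbps: 0, Latency: 150 * time.Millisecond, MTU: 1500}

        lis, err := net.Listen("tcp", "localhost:0")
        if err != nil {
            t.Fatal(err)
        }
        lis = slow.Listener(lis) // server side of the simulated link

        srv := grpc.NewServer()
        // pb.RegisterGrpcBugServer(srv, &grpcBugServer{}) // placeholder registration
        go func() { _ = srv.Serve(lis) }()
        defer srv.Stop()

        // Client side of the simulated link.
        dial := slow.ContextDialer((&net.Dialer{}).DialContext)
        conn, err := grpc.Dial(lis.Addr().String(),
            grpc.WithTransportCredentials(insecure.NewCredentials()),
            grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
                return dial(ctx, "tcp", addr)
            }),
        )
        if err != nil {
            t.Fatal(err)
        }
        defer conn.Close()

        // Issue the unary/streaming RPC on conn here and check whether it fails
        // with code = Internal, desc = "unexpected EOF".
    }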

@github-actions

github-actions bot commented Jun 1, 2022

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Jun 1, 2022
@brianneville
Author

Hi,

I wrote a test using the latency package as you described, but unfortunately as of yet I can't seem to reproduce the bug with this test.
I wrote the test using the same RPC that I have been using to trigger the bug - https://github.com/brianneville/grpcbug
This test (at the time of this comment) is available here: brianneville@e1ce0e5

I also recorded some debug logs showing the case where the bug hits in contrast to the case where it does not hit.
https://github.com/brianneville/grpcbug/tree/master/logs
It's interesting that the bug only seems to hit when the server is on the remote (CentOS) machine - hosting the server on my local (Mac) machine and attempting the RPC does not produce this EOF error.

Something that may be relevant here is that the remote machine is very powerful relative to the local machine.
I wonder whether this would allow the remote machine to experience some unanticipated interleaving of goroutines (reading/writing) that could contribute to the error along with the high ping.

@github-actions github-actions bot removed the stale label Jun 6, 2022
@dfawley
Member

dfawley commented Jun 6, 2022

My worry is that there's something else in the system impacting it, like a firewall or proxy. It might be helpful to inspect the traffic with something like Wireshark.

@github-actions

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@brianneville
Author

Hi,

The CentOS remote machine is behind a VPN, although it's worth noting that this issue can be hit even when the client is on the same VPN (it's just a lot less frequent, presumably because the network connection is better).

Other points of note regarding the architecture(s) required to reproduce this issue:
The issue is reproducible when running the https://github.com/brianneville/grpcbug server on a Debian machine (tested with a Debian container on the CentOS remote).
It is also reproducible when the client is either a Mac or Ubuntu machine.

@zmb3

zmb3 commented Jun 21, 2022

I believe we're also seeing this issue, but there is no high latency, as both the client and server are running in the same process. It's rare (1 out of every 3000-5000 attempts), but we've seen it on both macOS and Linux. Setting a large window size does seem to resolve the issue.

In our case, it's just a Go test that spins up a server and client and makes a simple (non-streaming) RPC.

zmb3 added a commit to gravitational/teleport that referenced this issue Jun 21, 2022
A number of our GRPC tests have been failing with an unexpected EOF
error. It appears to be related to grpc/grpc-go#5358.

Apply the suggested window size workaround in tests that have been known
to fail with this error.
@easwars easwars assigned easwars and unassigned brianneville Jun 21, 2022
@brianneville
Author

brianneville commented Jun 22, 2022

@zmb3
TestListResources appears to be failing due to a different problem - although it is odd that the window-size workaround for this bug can also be applied to it, so it's still worth mentioning here.

I wrote a small unit test that minimally reproduces the TestListResources error seen in gravitational/teleport#13548 and added info on the bug in the commit message: brianneville/grpcbug@2b04ba8 (briefly, the client is slow to clean up resources after receiving responses which exceed its grpc.max_receive_message_length, and so it tells the server to terminate the stream when it cannot allocate resources for a follow-up response).
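
(As a hypothetical illustration of the receive limit involved: a client can cap the size of messages it accepts per call, and a response over that cap is rejected by the client. The helper name and the 1 KiB limit below are placeholders, not values from the test.)

    // Hypothetical helper: dial with a small per-call receive cap so that any
    // response larger than the cap is rejected by the client
    // (codes.ResourceExhausted). Assumes "google.golang.org/grpc" and
    // "google.golang.org/grpc/credentials/insecure" are imported.
    func dialWithSmallRecvLimit(target string) (*grpc.ClientConn, error) {
        return grpc.Dial(target,
            grpc.WithTransportCredentials(insecure.NewCredentials()),
            grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(1024)), // 1 KiB cap, example only
        )
    }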

I haven't yet looked into the TestGRPCErrorWrapping failure mentioned at gravitational/teleport#9492 (comment)

@zmb3

zmb3 commented Jun 22, 2022

Thank you @brianneville! Appreciate you putting together a smaller reproduction - it was on my list of things to do.

@dfawley dfawley assigned zasweq and unassigned easwars Jun 28, 2022
@zasweq
Contributor

zasweq commented Jun 29, 2022

Thanks for providing a smaller test case that reproduces the error seen in TestListResources. I was able to reproduce this error locally and am looking into it further now; it's interesting that it is also fixed by the window-size workaround. However, the orthogonal issue that this issue was originally filed about is still too difficult for the team to reproduce, given the requirements on the machine the server is running on.

@dfawley dfawley added the P0 label Jul 7, 2022
@brianneville
Author

Is this streaming RPCs, unary RPCs, or both?

That workaround (delaying ending the stream) works for both streaming and unary RPCs.

If streaming, is it possible you are returning from your handler while another goroutine is attempting to write to the stream?

The EOF appears when there is just one streaming RPC running, i.e. one client streaming to one server, without any other clients streaming at the same time. If there are concurrency issues between writing and closing, then they're happening in some layer below the service (gRPC layer or lower).

Are you using any keepalive settings, by any chance?

https://github.com/brianneville/grpcbug doesn't use keepalive settings, but for the original program where this issue was observed (which the grpcbug repo is minimally reproducing), yes, the following keepalive settings are being used:

    grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{MinTime: time.Second}),
    grpc.KeepaliveParams(keepalive.ServerParameters{Time: 10 * time.Minute}),
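
(For completeness, a sketch of those options in context; this only shows server construction, with listening and service registration omitted, and assumes the grpc, keepalive, and time packages are imported.)

    // The keepalive settings above, passed as server options.
    srv := grpc.NewServer(
        grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{MinTime: time.Second}),
        grpc.KeepaliveParams(keepalive.ServerParameters{Time: 10 * time.Minute}),
    )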

@github-actions github-actions bot removed the stale label Sep 3, 2022
@zasweq zasweq assigned zasweq and unassigned brianneville Sep 6, 2022
rosstimothy added a commit to gravitational/teleport that referenced this issue Sep 7, 2022
Bumps google.golang.org/grpc to v1.49.0 which includes a fix for
grpc/grpc-go#5358 which has been causing
tests to fail.
@arvindbr8 arvindbr8 assigned arvindbr8 and unassigned zasweq Dec 6, 2022
@ysoftman

I also see the same EOF message rpc error: code = Internal desc = unexpected EOF in etcdctl.

When I use the 'get' command in etcdctl on a Mac, it always outputs 'unexpected EOF' (ping to endpoint: 15ms).
But it's fine from another Linux client (ping to endpoint: < 2ms), or for short results (e.g. etcdctl get --prefix "/blabla" --limit 100).

I modified the etcdctl source as below and got all the results back.
https://github.com/etcd-io/etcd/blob/87e271701b3449c35d9ef41f855c7792488773ce/client/v3/client.go#L285

    opts := append(dopts, grpc.WithResolvers(c.resolver), grpc.WithInitialWindowSize(1_000_000), grpc.WithInitialConnWindowSize(1_000_000))

@vimt

vimt commented May 24, 2023

I recently encountered the same issue. The conclusion was that when communicating between the office network and the development-environment network within my company, there is a "man-in-the-middle" that identifies and forwards communication content. In a specific scenario, the "man-in-the-middle" failed to forward a segment of an HTTP/2 frame, resulting in the unexpected EOF error.

Here's how I discovered the issue:

I started a gRPC client on the office network and a gRPC server in the development environment. When the server returned content with a length greater than 65535, there was a chance of getting the unexpected EOF error.

After capturing and comparing packets on both the client and server sides, I found that the packets on both sides could not be matched one-to-one, and some packet contents were modified. In the request that caused the unexpected EOF error, the server sent 5 DATA frames, but the client only received 4.

Other findings:

  • Setting grpc.WithInitialWindowSize to a value greater than the size of the body returned by the server prevents the unexpected EOF error.
  • With the gRPC Python client, everything works fine, and there are no errors. (Why??)

Therefore, I think this is not a gRPC bug. If you need more detailed information, such as packet capture files, please let me know and I will organize and share them.

@FZambia

FZambia commented Jun 27, 2023

We also came across this issue in our organization. It reproduced only in a local environment when connecting to a remote server behind a VPN (tried and reproduced on macOS and on an Ubuntu VM on Windows). Most probably there are additional proxies along the way that I do not know about. Setting larger window sizes as suggested above helped. An interesting detail: it never works with the grpc-go client until those options are tuned, yet it does work when sending the request from Postman (though not always; sometimes Postman shows an empty response with status OK). Possibly Postman has bigger defaults in its underlying client, or does not use some of the smart algorithms - hard to say.

@dfawley
Member

dfawley commented Oct 17, 2023

One thing we suspect now is that the server might not be setting the linger socket option (need to check on this), which could lead to this kind of behavior if there is any TCP error after the server finishes writing the RPC data and closes the connection.

Update: it looks like Go sets this by default, so I don't suspect this has anything to do with it anymore.

@dfawley
Member

dfawley commented Feb 12, 2024

Thank you to @s-matyukevich for figuring out what was going on here and sending an initial fix!

(See #6977 for more details)

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 11, 2024