Unexpected transport closing: too many pings from client #1882

immesys · 2018-02-24T23:42:16Z

What version of gRPC are you using?

583a630

What version of Go are you using (`go version`)?

1.10

What operating system (Linux, Windows, …) and version?

Linux AMD64, Kernel 4.10

What did you do?

When I have the server configured with GZIP compression as so:

gzcp := grpc.NewGZIPCompressor()
grpcServer := grpc.NewServer(grpc.RPCCompressor(gzcp))

Then when serving thousands of concurrent requests a second, clients will occasionally be disconnected with

rpc error: code = Unavailable desc = transport is closing

I see no errors from the server, and the both the client and server are far from overloaded (<10% CPU usage etc). Not all clients are affected at once, it will just be one connection which gets this error.

While trying to debug this, I disabled GZIP compression so I could use more easily look at packet captures. I am unable to reproduce this error once the GZIP compressor is no longer in use.

This issue is mostly to ask what the best way to proceed with diagnosing the problem is, or if there are any reasons why having a compressor would change the behavior of the system (aside from CPU usage which I don't think is a problem).

The text was updated successfully, but these errors were encountered:

immesys · 2018-02-25T00:26:10Z

Sorry, I can reproduce this even without the GZIP compressor. I still don't know how to debug it though. The only error I get on the client with loglevel set to info is:

pickfirstBalancer: HandleSubConnStateChange: 0xc421bc6160, TRANSIENT_FAILURE

MakMukhi · 2018-02-25T00:42:53Z

Turn on transport level logs by setting these two env variables: export GRPC_GO_LOG_VERBOSITY_LEVEL=2 export GRPC_GO_LOG_SEVERITY_LEVEL=info May be the transport on the server of client side sees some error and closes.

…

On Sat, Feb 24, 2018 at 4:26 PM, Michael Andersen ***@***.***> wrote: Sorry, I can reproduce this even without the GZIP compressor. I still don't know how to debug it though. The only error I get on the client with loglevel set to info is: pickfirstBalancer: HandleSubConnStateChange: 0xc421bc6160, TRANSIENT_FAILURE — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#1882 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ATtnR6yLJFRNEv-Nn5h7ILZfqwbUZoMFks5tYKisgaJpZM4SSF8Z> .

immesys · 2018-02-25T00:52:29Z

The client respects those variables, but the server does not, perhaps the etcd client in the server (which uses grpc for its own stuff) is overriding it? Is there a way to set the log level in the code?

EDIT: I see, etcd calls SetLoggerV2, pity that is global, but I can work around that.

immesys · 2018-02-25T03:18:05Z

I finally got a new piece of information. The server prints this shortly before the client disconnects:

ERROR: 2018/02/24 19:16:02 transport: Got too many pings from the client, closing the connection.
ERROR: 2018/02/24 19:16:02 transport: Error while handling item. Err: connection error: desc = "transport is closing"

What does "too many pings" mean?

The client prints this with the new verbosity:

INFO: 2018/02/24 19:16:02 Client received GoAway with http2.ErrCodeEnhanceYourCalm.
INFO: 2018/02/24 19:16:02 pickfirstBalancer: HandleSubConnStateChange: 0xc420b7c4b0, TRANSIENT_FAILURE

immesys · 2018-02-25T03:21:56Z

Incidentally, neither the server nor the client passes any keep alive parameters, everything is default

MakMukhi · 2018-02-25T04:39:02Z

The client is pinging the server too many times. Perhaps the keepalive settings on the client and the server are not the same.

…

On Sat, Feb 24, 2018, 7:18 PM Michael Andersen ***@***.***> wrote: I finally got a new piece of information. The server prints this shortly before the client disconnects: ERROR: 2018/02/24 19:16:02 transport: Got too many pings from the client, closing the connection. ERROR: 2018/02/24 19:16:02 transport: Error while handling item. Err: connection error: desc = "transport is closing" What does "too many pings" mean? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1882 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ATtnR11QehqgSkWt1PprRu28OaEXAN7Lks5tYND3gaJpZM4SSF8Z> .

immesys · 2018-02-25T04:48:47Z

But both sides are built with the same version of grpc, neither specify any keepalive parameters and it only happens under heavy load.

MakMukhi · 2018-02-25T06:51:09Z

If it's not keepalive then the only thing inside gRPC that sends pings is bdp estimator which will be triggered on receiving data. However, every time the server sends data frames it resets its pingStrike counter. To quickly check if it's the bdp estimator or not, try turning it off by setting WithInitialWindowSize and WIthInitialConnectionWindowSize dial options on the client.

immesys · 2018-02-25T07:30:24Z

Thanks for the suggestion. I have set both of those options (to 1MB) and will leave it running overnight. It typically takes somewhere between 2 minutes and 20 minutes under load to trigger, so a few hours should be conclusive.

immesys · 2018-02-25T17:57:01Z

It's been running without any problem for about 9 hours, so I think that confirms that the BDP estimator is at fault. I know nothing about how pings or BDP estimation works inside grpc, but surely if the pings are sent when data goes from client to the server and the pingStrike counter is only reset when data is sent the other way then if the RPC operations spike in latency (e.g mine go to hundreds of milliseconds) then you can exceed the counter simply because the server isn't yet responding to any RPC calls so isn't doing any resets?

MakMukhi · 2018-02-25T20:45:22Z

A bdp estimator ping is sent by the client when it receives data from the server. And the server must have reset its counter for that dataFrame. In an old version the server was buggy and was resetting its counter only once for a message but you don't seem to be using that.
Thanks for narrowing it down. I'll take a look at it Monday. I don't suppose it would be easy for you to create a repro of it but if you find a way can you please upload it.

immesys · 2018-02-26T00:17:32Z

I made a simple reproducer where a program connects to itself and invokes an operation that has roughly the latency I observed with roughly the concurrency I had, but it doesn't seem to trigger it, sorry. Let me know if there is anything else I can do to help

MakMukhi · 2018-02-26T06:55:13Z

Thanks for trying to reproduce it. We might need to collaborate to debug it further. I'll look at the code again tomorrow to see if I can find any obvious mistakes.

…

On Sun, Feb 25, 2018, 4:17 PM Michael Andersen ***@***.***> wrote: I made a simple reproducer where a program connects to itself and invokes an operation that has roughly the latency I observed with roughly the concurrency I had, but it doesn't seem to trigger it, sorry. Let me know if there is anything else I can do to help — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1882 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ATtnR2G2PhzC3C28Mkgu9vdg9P-T_2Zgks5tYfgigaJpZM4SSF8Z> .

immesys · 2018-03-01T00:44:26Z

I didn't succeed in making a standalone reproducer but I can reliably reproduce it in-situ even on completely different hardware. More than willing to help track this down, just let me know how.

MakMukhi · 2018-03-01T01:15:13Z

Hey, that's great that you can reliably reproduce it in your set up.
Following is what we know:

Here's where the client sends a bdp ping:

grpc-go/transport/http2_client.go

Line 840 in f0a1202

t.controlBuf.put(bdpPing)

As you see this is inside of the data handler which is called every time the client gets a dataFrame.
Moreover,client sends ping only if it already got an ack back from a previous ping.

On the server-side whenever a dataFrame is sent the server resets the ping counter:

grpc-go/transport/http2_server.go

Line 901 in f0a1202

atomic.StoreUint32(&t.resetPingStrikes, 1)

Client sees data and sends ping.
Client sees more data doesn't send ping.
Client sees more data doesn't send ping.
Server sees ping, but the counter was reset.
Server sends back ping-ack and now sets the ping strike counter again.
Client sees ping-ack and records bdp.
The whole process is repeated.

I would start with making sure the faulty pings are always bdp pings. (Assuming you can update your local code):

Make sure there's nothing else that's pinging the server: In the server's handlePing, right after this line

I'd add the following:

if f.Data != bdpPing.data {
  panic("Some other pinger?")
}

Perhaps, print out time whenever server sends data and receives ping.

I'm sorry I have not been able to find out anything obvious yet. But my best guess so far is that something else is pinging the server.
Let's see what you find and we'll go from there. I'd like to stare at these print statements and logs once you have them.

Also, thanks for helping out. :)

dfawley · 2018-03-01T17:27:31Z

@immesys, Random thought along the lines of @MakMukhi's reasoning: you aren't connecting through any kind of proxy, are you?

immesys · 2018-03-01T21:01:04Z

@dfawley no, there is no http2 proxy involved. I was initially running in kubernetes, but I have reproduced it on bare metal with no fancy networking at all.

EDIT: actually I lie there is probably some overlay networking still going on, but that would be at layer 3, nothing should be injecting HTTP2 pings

dfawley · 2018-03-01T22:27:53Z

Hmm, sounds like there should be no pings in the system except BDP pings, then. Hopefully you can try the suggestions from @MakMukhi and get back to us. Thanks!

immesys · 2018-03-01T22:33:35Z

Ok I reproduced it with the debug statements in place.
Here are the changes I made https://gist.github.com/immesys/be704af11354ba943f35b801abc9caf8
Here are the 2000 lines preceding the event. I included two files, one with all the log output and one filtered to only include log output from the specific http2Server that tripped (there are 10 concurrent connections): https://gist.github.com/immesys/aff2b454b9e0eab843e062e76bfb65c3

Let me know if I can re-run this with different print statements

immesys · 2018-03-01T22:34:51Z

@dfawley yeah in the entire log the OTHER PING DATA print statement never triggers, so I think this is just BDP pings.

immesys · 2018-03-01T22:38:23Z

Is there a reason the ping strikes are only reset when sending headers in itemHandler? Why not when sending dataFrame?

immesys · 2018-03-01T22:40:55Z

To save you reading the whole logs, here is the important sequence:

1519942464739172903 BDPDEBUG[http2Server=0xc423778f00] writing data1
1519942464739261537 BDPDEBUG[http2Server=0xc423778f00] handlePing
1519942464739266862 BDPDEBUG[http2Server=0xc423778f00] addPingStrikes2
1519942464739318578 BDPDEBUG[http2Server=0xc423778f00] writing data1
1519942464739452781 BDPDEBUG[http2Server=0xc423778f00] handlePing
1519942464739459106 BDPDEBUG[http2Server=0xc423778f00] addPingStrikes2
1519942464739535670 BDPDEBUG[http2Server=0xc423778f00] writing data1
1519942464739659435 BDPDEBUG[http2Server=0xc423778f00] handlePing
1519942464739674955 BDPDEBUG[http2Server=0xc423778f00] addPingStrikes2
1519942464739678469 BDPDEBUG[http2Server=0xc423778f00] >maxPingStrikes

Basically because we are writing three data frames and no header frames, we don't reset the ping strikes.

MakMukhi · 2018-03-01T22:45:30Z

They are reset while scheduling dataFrame here: https://github.com/grpc/grpc-go/blob/master/transport/http2_server.go#L901 But I think you're right if we moved this to just before writing dataFrame on the wire it would work out. Can you try doing that?

…

On Thu, Mar 1, 2018 at 2:41 PM, Michael Andersen ***@***.***> wrote: To save you reading the whole logs, here is the important sequence: 1519942464739172903 BDPDEBUG[http2Server=0xc423778f00] writing data1 1519942464739261537 BDPDEBUG[http2Server=0xc423778f00] handlePing 1519942464739266862 BDPDEBUG[http2Server=0xc423778f00] addPingStrikes2 1519942464739318578 BDPDEBUG[http2Server=0xc423778f00] writing data1 1519942464739452781 BDPDEBUG[http2Server=0xc423778f00] handlePing 1519942464739459106 BDPDEBUG[http2Server=0xc423778f00] addPingStrikes2 1519942464739535670 BDPDEBUG[http2Server=0xc423778f00] writing data1 1519942464739659435 BDPDEBUG[http2Server=0xc423778f00] handlePing 1519942464739674955 BDPDEBUG[http2Server=0xc423778f00] addPingStrikes2 1519942464739678469 BDPDEBUG[http2Server=0xc423778f00] >maxPingStrikes Basically because we are writing three data frames and no header frames, we don't reset the ping strikes. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1882 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ATtnR9_TJ7joHxuljxqyp2kjDJH0pojtks5taHkAgaJpZM4SSF8Z> .

immesys · 2018-03-01T22:51:54Z

Should it be before writing it to the wire or just after (the header is just after) ?

MakMukhi · 2018-03-01T23:24:53Z

I'd put it right before, but it shouldn't matter as much.

…

On Thu, Mar 1, 2018 at 2:52 PM, Michael Andersen ***@***.***> wrote: Should it be before writing it to the wire or just after (the header is just after) ? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1882 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ATtnRw5oyiJb42TnfzK7L8UjQXon4iEtks5taHuSgaJpZM4SSF8Z> .

immesys · 2018-03-01T23:26:52Z

It's been running for a while with no problems. I'll run it overnight and see. Does this completely remove the race condition or just make it harder to hit?

immesys · 2018-03-02T02:24:29Z

Still didn't hit this case now after four hours. I stopped the test because I needed the machine for a different test. Is it always the case that every data frame gets an interleaved reply 1:1 ? You can't queue a few frames in the TCP window, clearing the reset strikes after each and then receive the three pings in response all together? I don't know the code, just wondering

MakMukhi · 2018-03-02T19:14:18Z

I suspect the problem was a ping ack being scheduled between the two steps where write method resets the counter and schedules data. This scheduling of ping-ack sets the counter back. If this indeed was the case then moving the reset instruction in itemHandler will eliminate the race. Multiple data frames get queued up in gRPC itself and I'm sure in the OS kernel too.

…

On Thu, Mar 1, 2018 at 6:24 PM, Michael Andersen ***@***.***> wrote: Still didn't hit this case now after four hours. I stopped the test because I needed the machine for a different test. Is it always the case that every data frame gets an interleaved reply 1:1 ? You can't queue a few frames in the TCP window, clearing the reset strikes after each and then receive the three pings in response all together? I don't know the code, just wondering — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1882 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ATtnR81g0qy4vfeZaUvmQ3YinIBywbxGks5taK1mgaJpZM4SSF8Z> .

dfawley · 2018-03-02T19:24:14Z

Is there any way to figure out why this was happening to @immesys so reproducibly, but none of our tests encountered it?

immesys · 2018-03-02T19:25:24Z

Well it only happens after a few minutes under very high load. I think it has to do with the order that goroutines get scheduled. We could probably make a reproducer by putting in some sleeps in the code in the right places

praserx · 2018-03-07T15:05:10Z

Similar/same issue here. We've got a one client, one 'relay' server and one server. If we create many concurrent requests (thousands as its mentioned above), we got same error on client:

rpc error: code = Unavailable desc = transport is closing

Thanks to @MakMukhi

... try turning it off by setting WithInitialWindowSize and WIthInitialConnectionWindowSize dial options on the client.

That solve our problem.

MakMukhi · 2018-03-07T18:43:27Z

@praserx We'd ideally like BDP estimator to be turned on. Is it possible for you to turn it back on and patch #1905 to see if that fixes your problem as well.

praserx · 2018-03-08T07:36:07Z

@MakMukhi It seems that it is working perfectly. Thank you for response.

immesys changed the title ~~Unexpected transport closing under load with GZIPCompressor~~ Unexpected transport closing: too many pings from client Feb 25, 2018

dfawley added P1 Status: Requires Reporter Clarification Type: Bug labels Mar 1, 2018

dfawley removed the Status: Requires Reporter Clarification label Mar 2, 2018

dfawley assigned MakMukhi Mar 2, 2018

MakMukhi mentioned this issue Mar 7, 2018

transport: fix race that could lead to the server closing a connection due to too many pings #1905

Merged

dfawley closed this as completed in #1905 Mar 12, 2018

lock bot locked as resolved and limited conversation to collaborators Sep 26, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unexpected transport closing: too many pings from client #1882

Unexpected transport closing: too many pings from client #1882

immesys commented Feb 24, 2018 •

edited

Loading

immesys commented Feb 25, 2018

MakMukhi commented Feb 25, 2018 via email

immesys commented Feb 25, 2018 •

edited

Loading

immesys commented Feb 25, 2018 •

edited

Loading

immesys commented Feb 25, 2018

MakMukhi commented Feb 25, 2018 via email

immesys commented Feb 25, 2018

MakMukhi commented Feb 25, 2018 •

edited

Loading

immesys commented Feb 25, 2018

immesys commented Feb 25, 2018

MakMukhi commented Feb 25, 2018

immesys commented Feb 26, 2018

MakMukhi commented Feb 26, 2018 via email

immesys commented Mar 1, 2018

MakMukhi commented Mar 1, 2018

dfawley commented Mar 1, 2018

immesys commented Mar 1, 2018 •

edited

Loading

dfawley commented Mar 1, 2018

immesys commented Mar 1, 2018

immesys commented Mar 1, 2018

immesys commented Mar 1, 2018

immesys commented Mar 1, 2018

MakMukhi commented Mar 1, 2018 via email

immesys commented Mar 1, 2018

MakMukhi commented Mar 1, 2018 via email

immesys commented Mar 1, 2018

immesys commented Mar 2, 2018

MakMukhi commented Mar 2, 2018 via email

dfawley commented Mar 2, 2018

immesys commented Mar 2, 2018

praserx commented Mar 7, 2018

MakMukhi commented Mar 7, 2018 •

edited by dfawley

Loading

praserx commented Mar 8, 2018

Unexpected transport closing: too many pings from client #1882

Unexpected transport closing: too many pings from client #1882

Comments

immesys commented Feb 24, 2018 • edited Loading

What version of gRPC are you using?

What version of Go are you using (go version)?

What operating system (Linux, Windows, …) and version?

What did you do?

immesys commented Feb 25, 2018

MakMukhi commented Feb 25, 2018 via email

immesys commented Feb 25, 2018 • edited Loading

immesys commented Feb 25, 2018 • edited Loading

immesys commented Feb 25, 2018

MakMukhi commented Feb 25, 2018 via email

immesys commented Feb 25, 2018

MakMukhi commented Feb 25, 2018 • edited Loading

immesys commented Feb 25, 2018

immesys commented Feb 25, 2018

MakMukhi commented Feb 25, 2018

immesys commented Feb 26, 2018

MakMukhi commented Feb 26, 2018 via email

immesys commented Mar 1, 2018

MakMukhi commented Mar 1, 2018

dfawley commented Mar 1, 2018

immesys commented Mar 1, 2018 • edited Loading

dfawley commented Mar 1, 2018

immesys commented Mar 1, 2018

immesys commented Mar 1, 2018

immesys commented Mar 1, 2018

immesys commented Mar 1, 2018

MakMukhi commented Mar 1, 2018 via email

immesys commented Mar 1, 2018

MakMukhi commented Mar 1, 2018 via email

immesys commented Mar 1, 2018

immesys commented Mar 2, 2018

MakMukhi commented Mar 2, 2018 via email

dfawley commented Mar 2, 2018

immesys commented Mar 2, 2018

praserx commented Mar 7, 2018

MakMukhi commented Mar 7, 2018 • edited by dfawley Loading

praserx commented Mar 8, 2018

immesys commented Feb 24, 2018 •

edited

Loading

What version of Go are you using (`go version`)?

immesys commented Feb 25, 2018 •

edited

Loading

immesys commented Feb 25, 2018 •

edited

Loading

MakMukhi commented Feb 25, 2018 •

edited

Loading

immesys commented Mar 1, 2018 •

edited

Loading

MakMukhi commented Mar 7, 2018 •

edited by dfawley

Loading