Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Server tries to route traffic through "hard" disconnected agents #152

Closed
Avatat opened this issue Oct 15, 2020 · 22 comments
Closed

Server tries to route traffic through "hard" disconnected agents #152

Avatat opened this issue Oct 15, 2020 · 22 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@Avatat
Copy link
Contributor

Avatat commented Oct 15, 2020

Hello,
I have Kubernetes 1.18 cluster where Konnectivity v0.0.12 is running.
Most of the time, everything works perfectly, but after ~20 days of konnectivity-server uptime, some DIAL_REQ are getting nowhere.
For example, kubectl logs -n kube-system konnectivity-agent-h6bt9 -f only worked after the third execution.

Server log:

I1015 11:35:55.013608       1 server.go:224] proxy request from client, userAgent [grpc-go/1.26.0]
I1015 11:35:55.013887       1 server.go:257] start serving frontend stream
I1015 11:35:55.013926       1 server.go:268] >>> Received DIAL_REQ
I1015 11:35:55.013940       1 backend_manager.go:170] pick agentID=f42e628f-f3a5-42cc-bb75-8065768b1be1 as backend
I1015 11:35:55.014079       1 server.go:290] >>> DIAL_REQ sent to backend

I1015 11:36:10.378211       1 server.go:224] proxy request from client, userAgent [grpc-go/1.26.0]
I1015 11:36:10.378345       1 server.go:257] start serving frontend stream
I1015 11:36:10.378357       1 server.go:268] >>> Received DIAL_REQ
I1015 11:36:10.378363       1 backend_manager.go:170] pick agentID=f42e628f-f3a5-42cc-bb75-8065768b1be1 as backend
I1015 11:36:10.378418       1 server.go:290] >>> DIAL_REQ sent to backend

I1015 11:36:18.243790       1 server.go:224] proxy request from client, userAgent [grpc-go/1.26.0]
I1015 11:36:18.243873       1 server.go:257] start serving frontend stream
I1015 11:36:18.244049       1 server.go:268] >>> Received DIAL_REQ
I1015 11:36:18.244062       1 backend_manager.go:170] pick agentID=15fb1f4a-55a5-454d-9257-a717e70dc5dd as backend
I1015 11:36:18.244138       1 server.go:290] >>> DIAL_REQ sent to backend
I1015 11:36:18.247179       1 server.go:522] <<< Received DIAL_RSP(rand=2352227733928774978), agentID 15fb1f4a-55a5-454d-9257-a717e70dc5dd, connID 84)
I1015 11:36:18.247268       1 server.go:144] register frontend &{grpc 0xc020108df0 <nil> 0xc02cc82960 84 15fb1f4a-55a5-454d-9257-a717e70dc5dd {13824409201659358837 1806906010189152 0x227f060} 0xc02cd49da0} for agentID 15fb1f4a-55a5-454d-9257-a717e70dc5dd, connID 84
I1015 11:36:18.247867       1 server.go:308] >>> Received 239 bytes of DATA(id=84)
I1015 11:36:18.248012       1 server.go:324] >>> DATA sent to Backend
I1015 11:36:18.256637       1 server.go:551] <<< Received 2171 bytes of DATA from agentID 15fb1f4a-55a5-454d-9257-a717e70dc5dd, connID 84
...

Agent log:

Nothing before...
I1015 11:36:18.244396       1 client.go:271] [tracing] recv packet, type: DIAL_REQ
I1015 11:36:18.244482       1 client.go:280] received DIAL_REQ
I1015 11:36:18.247643       1 client.go:271] [tracing] recv packet, type: DATA
I1015 11:36:18.247680       1 client.go:339] received DATA(id=84)
I1015 11:36:18.247775       1 client.go:413] [connID: 84] write last 239 data to remote
I1015 11:36:18.255019       1 client.go:384] received 2171 bytes from remote for connID[84]
I1015 11:36:18.257829       1 client.go:271] [tracing] recv packet, type: DATA
I1015 11:36:18.257850       1 client.go:339] received DATA(id=84)
I1015 11:36:18.257908       1 client.go:413] [connID: 84] write last 64 data to remote
I1015 11:36:18.258552       1 client.go:271] [tracing] recv packet, type: DATA
I1015 11:36:18.258661       1 client.go:339] received DATA(id=84)
I1015 11:36:18.258777       1 client.go:413] [connID: 84] write last 203 data to remote
I1015 11:36:18.260344       1 client.go:384] received 106 bytes from remote for connID[84]
I1015 11:36:18.264715       1 client.go:384] received 183 bytes from remote for connID[84]

Sometimes, when I want to view logs from the second agent, I'm getting these warnings:
Server log:

I1015 11:51:13.078407       1 server.go:224] proxy request from client, userAgent [grpc-go/1.26.0]
I1015 11:51:13.078709       1 server.go:257] start serving frontend stream
I1015 11:51:13.078763       1 server.go:268] >>> Received DIAL_REQ
I1015 11:51:13.078790       1 backend_manager.go:170] pick agentID=d700cd66-6ae4-416f-89be-4523adda93c3 as backend
W1015 11:51:13.078907       1 server.go:288] >>> DIAL_REQ to Backend failed: rpc error: code = Unavailable desc = transport is closing
I1015 11:51:13.078966       1 server.go:290] >>> DIAL_REQ sent to backend
I1015 11:51:15.008189       1 server.go:224] proxy request from client, userAgent [grpc-go/1.26.0]
I1015 11:51:15.008413       1 server.go:257] start serving frontend stream
I1015 11:51:15.008458       1 server.go:268] >>> Received DIAL_REQ
I1015 11:51:15.008485       1 backend_manager.go:170] pick agentID=d700cd66-6ae4-416f-89be-4523adda93c3 as backend
W1015 11:51:15.008574       1 server.go:288] >>> DIAL_REQ to Backend failed: rpc error: code = Unavailable desc = transport is closing
I1015 11:51:15.008627       1 server.go:290] >>> DIAL_REQ sent to backend

Agent logs nothing.

I believe, that konnectivity-server (+ kube-apiserver) pod restart would help (because it helped in the past), but I don't want to do it, because maybe you will need some more tests.

@cheftako
Copy link
Contributor

Couple of quick clarifying questions. What does you setup look like? Single APIServer, Konnectivity Server and three Nodes each running an agent? There should be a line like "AgentID set to d700cd66-6ae4-416f-89be-4523adda93c3." in one of the agent logs. Can we get its log around W1015 11:51:13.078907?

@Avatat
Copy link
Contributor Author

Avatat commented Oct 15, 2020

My environment looks like below:
Network diagram

That's weird, but I can't see an agent with that AgentID...
The first node and its agent:

grep "AgentID set to" konnectivity-agent-h6bt9.log 
W1015 08:37:48.199164       1 main.go:115] AgentID set to f42e628f-f3a5-42cc-bb75-8065768b1be1.

The second node and its agent:

grep "AgentID set to" konnectivity-agent-s2nxb.log 
W0826 11:25:49.812128       1 main.go:115] AgentID set to 15fb1f4a-55a5-454d-9257-a717e70dc5dd.

I will add that the first konnectivity-agent was recreated ~13 h ago (probably it's related with kubelet restart on this node):

NAME                       READY   STATUS    RESTARTS   AGE
konnectivity-agent-h6bt9   1/1     Running   0          13h
konnectivity-agent-s2nxb   1/1     Running   0          50d

@cheftako
Copy link
Contributor

cheftako commented Oct 16, 2020

So the problem certainly looks like a failure to clean up a backend tunnel to an agent which has gone away. There were some known issues which sound similar in v0.0.11, fixed by @Jefftree but I believe those fixes are in v0.0.12. I've tried generating this issue at head using two agents and a server but any attempt to kill/restart an agent is seen by the server. May try to get another box involved so I can see if having a box go aware repro's the issue. Quick question, do you see any references to d700cd66-6ae4-416f-89be-4523adda93c3 in the server log, other than the attempts to pick it as an backend? Eg Do we know when it connected, do we see any indications that the server might have seen a broken connection but failed to do the cleanup? Do we know if (and when last) it was successfully used as a backend?

@Avatat
Copy link
Contributor Author

Avatat commented Oct 16, 2020

@cheftako, I'm sorry, but I don't have any logs collection there and I lost the log after pod rotation.

But! I replicated the issue.
I disabled network on the node with konnectivity-agent-h6bt9, and restarted an agent container, to generate new AgentID and new connection. Look at the konnectivity-server log:

I1016 09:47:24.534547       1 server.go:449] Connect request from agent 88a4a176-4579-4f2b-a6b9-7998c8cba754
I1016 09:47:24.534580       1 backend_manager.go:99] register Backend &{0xc02c58c3c0} for agentID 88a4a176-4579-4f2b-a6b9-7998c8cba754
I1016 09:49:04.565670       1 server.go:224] proxy request from client, userAgent [grpc-go/1.26.0]
I1016 09:49:04.565776       1 server.go:257] start serving frontend stream
I1016 09:49:04.565825       1 server.go:268] >>> Received DIAL_REQ
I1016 09:49:04.565834       1 backend_manager.go:170] pick agentID=f42e628f-f3a5-42cc-bb75-8065768b1be1 as backend
W1016 09:49:04.565912       1 server.go:288] >>> DIAL_REQ to Backend failed: rpc error: code = Unavailable desc = transport is closing
I1016 09:49:04.565935       1 server.go:290] >>> DIAL_REQ sent to backend

First, we see the new backend connection - that's OK.
Then we see DIAL_REQ from my kubectl logs, and konnectivity-server picks the old agent, with a broken connection.
That's definitely an issue with cleaning up agents lists, after connection losses.

@cheftako
Copy link
Contributor

Awesome work! That seems to confirm what I was suspecting. The problem would seem to occur when the connection is "silently" broken. On a single machine running both agents and a server, even if I "kill -9" the operating system will detect the connection having been broken and inform the server. I think by disabling the node your simulating a network failure. At that point no actual end connection message has ever been received. So the connection appears to still be alive, we're just not getting a response to anything we send. Anything we do send will timeout on the transport. We clearly want to at a minimum mark the backend as degraded/broken. I would like to think about whether we should immediately do a full cleanup on it.
In your case the agent is actually gone. However as mentioned earlier it could also be caused be a broken network connection. In that case the agent will (presumably) come back. The server cannot tell the difference between these cases and we want to ensure the solution works for both.

@Jefftree
Copy link
Member

Thanks for the detailed explanation! At the moment, only the agent is responsible for checking and disconnecting unhealthy connections so a "silently" broken connection coupled with the agent unable to reach the server seems to contribute to the connection not being properly cleaned up on the server. @caesarxuchao did originally bring up that we may need to perform health checks on the server side too, but we haven't encountered situations where it is required before. This issue seems to provide a compelling reason for further investigation on that front.

I agree that on the server side it would be difficult determine whether the network disconnect is transient, so we may not have enough information to determine whether to perform a full cleanup. Marking the connection and asking the server not to send traffic over a potentially broken connection does seem like a good first step though.

@Avatat
Copy link
Contributor Author

Avatat commented Oct 16, 2020

I tested, how konnectivity behave after connection loss - an agent reconnects and everything works fine:

I1016 17:48:36.357601       1 main.go:352] Starting master server for client connections.
I1016 17:48:36.360146       1 main.go:358] Starting agent server for tunnel connections.
I1016 17:48:36.379407       1 main.go:364] Starting admin server for debug connections.
I1016 17:48:36.379577       1 main.go:370] Starting health server for healthchecks.
I1016 17:48:37.120884       1 server.go:449] Connect request from agent 15fb1f4a-55a5-454d-9257-a717e70dc5dd
I1016 17:48:37.120923       1 backend_manager.go:99] register Backend &{0xc0000460c0} for agentID 15fb1f4a-55a5-454d-9257-a717e70dc5dd
I1016 17:48:37.361173       1 server.go:449] Connect request from agent 88a4a176-4579-4f2b-a6b9-7998c8cba754
I1016 17:48:37.361198       1 backend_manager.go:99] register Backend &{0xc000238240} for agentID 88a4a176-4579-4f2b-a6b9-7998c8cba754
...
W1016 17:59:11.560737       1 server.go:482] stream read error: rpc error: code = Canceled desc = context canceled
I1016 17:59:11.560791       1 backend_manager.go:121] remove Backend &{0xc0000460c0} for agentID 15fb1f4a-55a5-454d-9257-a717e70dc5dd
I1016 17:59:11.560836       1 server.go:584] <<< Close backend &{0xc0000460c0} of agent 15fb1f4a-55a5-454d-9257-a717e70dc5dd
I1016 17:59:11.560858       1 server.go:501] <<< Close 0 frontends connected to agent 15fb1f4a-55a5-454d-9257-a717e70dc5dd
I1016 17:59:13.980529       1 server.go:449] Connect request from agent 15fb1f4a-55a5-454d-9257-a717e70dc5dd
I1016 17:59:13.980553       1 backend_manager.go:99] register Backend &{0xc0000460c0} for agentID 15fb1f4a-55a5-454d-9257-a717e70dc5dd

@cheftako, @Jefftree, can't we simply send "ping"/NULL packets at some interval?
Can't we use Linux TCP keepalive feature? I see, that TCP keepalive is supported in Go. It's even implemented in GRPC library. Am I right?
If I'm right, and you will decide to use keepalive to solve this problem, how complicated it is? If it's easy, I would love to put some lines of code to konnectivity project, but I have very little experience with software developing ;)

@Jefftree
Copy link
Member

I will defer to @cheftako as for performance implications but TCP keepalive seems like it should be able to solve the problem.

As for the code change, it should be a very small change to pass in additional KeepaliveParams when starting the proxy server. (https://github.com/kubernetes-sigs/apiserver-network-proxy/blob/master/cmd/server/main.go#L545). Something like replacing that line with

grpcServer := grpc.NewServer(serverOption, grpc.KeepaliveParams(keepalive.ServerParameters{
   Time: 2 * time.Hour,
}))

might already be enough. The difficult part is probably fine tuning the parameter values (the default keepalive time of 2 hours seems a bit long for the server -> agent connections) and verifying whether the addition actually solves this bug.

It would be awesome if you'd like to contribute to the codebase, we're all here and in the slack channel if you have any questions :).

@Avatat
Copy link
Contributor Author

Avatat commented Oct 16, 2020

Thank you for the description and your support! It's very kind :)

I will wait for @cheftako opinion and I will dig it through the weekend. That's a great opportunity for me!

@Avatat Avatat changed the title Agents do not always get DIAL_REQ Server tries to route traffic through "hard" disconnected agents Oct 17, 2020
@cheftako
Copy link
Contributor

I will defer to Walter Fender as for performance implications but TCP keepalive seems like it should be able to solve the problem.

As for the code change, it should be a very small change to pass in additional KeepaliveParams when starting the proxy server. (https://github.com/kubernetes-sigs/apiserver-network-proxy/blob/master/cmd/server/main.go#L545). Something like replacing that line with

grpcServer := grpc.NewServer(serverOption, grpc.KeepaliveParams(keepalive.ServerParameters{
   Time: 2 * time.Hour,
}))

might already be enough. The difficult part is probably fine tuning the parameter values (the default keepalive time of 2 hours seems a bit long for the server -> agent connections) and verifying whether the addition actually solves this bug.

It would be awesome if you'd like to contribute to the codebase, we're all here and in the slack channel if you have any questions :).

So a packet every two hours is unlikely to cause any performance implications, however having a dead connection for upto 2 hours, while better than the current situation (?forever?) will have performance implications. I see no reason not to merge such a fix right now, as its strictly better than where we are. However I still think we should be looking to mark the connection as unhealthy. That is more work. Especially if we want to be able to detect that it has become healthy again.

@Avatat
Copy link
Contributor Author

Avatat commented Oct 19, 2020

What do you think about setting the Time to 30 seconds?
I added keepalive option:

	serverOption := grpc.Creds(credentials.NewTLS(tlsConfig))
	grpcServer := grpc.NewServer(serverOption, grpc.KeepaliveParams(keepalive.ServerParameters{Time: 30 * time.Second}))

And I believe that it solved the issue:

# date && ip link set dev eth0 down
Mon Oct 19 17:19:50 UTC 2020
17:20:29.201132 IP ....8132 > ....13451: Flags [FP.], seq 1332:1356, ack 675, win 12, options [nop,nop,TS val 996085842 ecr 2285780756], length 24
I1019 17:19:07.961996       1 main.go:362] Starting agent server for tunnel connections.
I1019 17:19:07.990979       1 main.go:367] Starting admin server for debug connections.
I1019 17:19:07.991116       1 main.go:372] Starting health server for healthchecks.
I1019 17:19:09.199058       1 server.go:450] "Connect request from agent" agentID="c0f61873-801b-4ee7-bde3-f1b9374490bd"
I1019 17:19:09.199143       1 backend_manager.go:125] "Register backend for agent" connection=&agent.agentServiceConnectServer{ServerStream:(*grpc.serverStream)(0xc000046180)} agentID="c0f61873-801b-4ee7-bde3-f1b9374490bd"
I1019 17:19:10.637155       1 server.go:450] "Connect request from agent" agentID="4802f592-7ea3-43ba-ad7a-772fa04d6c88"
I1019 17:19:10.637219       1 backend_manager.go:125] "Register backend for agent" connection=&agent.agentServiceConnectServer{ServerStream:(*grpc.serverStream)(0xc000046240)} agentID="4802f592-7ea3-43ba-ad7a-772fa04d6c88"
E1019 17:20:29.200748       1 server.go:483] "stream read failure" err="rpc error: code = Canceled desc = context canceled"
I1019 17:20:29.200800       1 backend_manager.go:147] "Remove connection for agent" connection=&agent.agentServiceConnectServer{ServerStream:(*grpc.serverStream)(0xc000046180)} agentID="c0f61873-801b-4ee7-bde3-f1b9374490bd"
I1019 17:20:29.200866       1 server.go:585] "Close backend of agent" backend=&agent.agentServiceConnectServer{ServerStream:(*grpc.serverStream)(0xc000046180)} agentID="c0f61873-801b-4ee7-bde3-f1b9374490bd"
I1019 17:20:29.200882       1 server.go:502] "Close frontends connected to agent" count=0 agentID="c0f61873-801b-4ee7-bde3-f1b9374490bd"

And reconnecting:

# date && ip link set dev eth0 up
Mon Oct 19 17:33:29 UTC 2020
I1019 17:33:33.666782       1 server.go:450] "Connect request from agent" agentID="c0f61873-801b-4ee7-bde3-f1b9374490bd"
I1019 17:33:33.666878       1 backend_manager.go:125] "Register backend for agent" connection=&agent.agentServiceConnectServer{ServerStream:(*grpc.serverStream)(0xc000046480)} agentID="c0f61873-801b-4ee7-bde3-f1b9374490bd"

I'm not sure, if adding grpc.KeepaliveParams() directly as grpc.NewServer() option is good. Should I create serverOptions array and pass it to grpc.NewServer()?

@caesarxuchao
Copy link
Member

Great findings, @Avatat!

I have some questions on implementing the health check.

Is the grpc.KeepaliveParams the same as the TCP keepalive, or is it at the GRPC level? It would be problematic if it's the TCP keepalive, because TCP keepalive won't kick in until the TCP max retransimission has been reached (see [1]).

Is there a way to check the GRPC connection state on the server-side, like what we did on the client-side?

@Avatat
Copy link
Contributor Author

Avatat commented Oct 19, 2020

It looks that GRPC keepalive is something different than "classic" TCP keepalive:

The keepalive ping is a way to check if a channel is currently working by sending HTTP2 pings over the transport. It is sent periodically, and if the ping is not acknowledged by the peer within a certain timeout period, the transport is disconnected.

Source

I don't know the answer to the second question, but I will dig it :)

@caesarxuchao
Copy link
Member

Nice, the GRPC keepalive looks promising. The godoc of ServerParameters .Time says that "After a duration of this time if the server doesn't see any activity it pings the client to see if the transport is still alive." Not sure what is considered "activity" in GRPC protocol. If the server is trying to send or retransmit package over a broken connection, is that considered "activity"? If so, then it has the same issue as the TCP keepalive.

I don't know the answer to the second question, but I will dig it :)

Thanks. Let's find out the canonical way of performing health check on the server-side in GRPC.

@Jefftree
Copy link
Member

I believe we don't have access to the underlying connection on the server side (grpc/grpc-go#2456) so health checks would have to be performed differently than on the client side. Interestingly enough, that issue in the grpc repo also points to using KeepaliveParams for closing problematic connections.

Nice, the GRPC keepalive looks promising. The godoc of ServerParameters .Time says that "After a duration of this time if the server doesn't see any activity it pings the client to see if the transport is still alive." Not sure what is considered "activity" in GRPC protocol. If the server is trying to send or retransmit package over a broken connection, is that considered "activity"? If so, then it has the same issue as the TCP keepalive.

That's an interesting case we will need to verify

I'm not sure, if adding grpc.KeepaliveParams() directly as grpc.NewServer() option is good. Should I create serverOptions array and pass it to grpc.NewServer()?

+1. The array might need to be unfolded with grpc.NewServer(serverOptions...) but grouping the options into an array is definitely the way to go.

@Avatat
Copy link
Contributor Author

Avatat commented Oct 20, 2020

On the client-side, we use grpc.ClientConn.GetState which works for client only, and I didn't find anything similar but for a server.

@Avatat
Copy link
Contributor Author

Avatat commented Oct 20, 2020

Maybe it's too early, but I've created a PR: #154

@cheftako
Copy link
Contributor

So I've been able to reproduce the problem using "sudo ifconfig down". This further confirms the belief in the problem. It also means I can implement a mechanism to mark a backend as unhealthy.

@Jefftree
Copy link
Member

Nice, the GRPC keepalive looks promising. The godoc of ServerParameters .Time says that "After a duration of this time if the server doesn't see any activity it pings the client to see if the transport is still alive." Not sure what is considered "activity" in GRPC protocol. If the server is trying to send or retransmit package over a broken connection, is that considered "activity"? If so, then it has the same issue as the TCP keepalive.

I just looked into the grpc code, "activity" is defined by read activity on the connection, so the client needs to send data to the server to be quantified as "activity". server -> client transmit attempts should not be considered read activity so I think this keepalive option does actually solve our issue.

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 21, 2021
@Jefftree
Copy link
Member

Jefftree commented Feb 4, 2021

@Avatat It seems like #154 solved the issue?

I'm going to close this, feel free to reopen if that is not the case.

/close

@k8s-ci-robot
Copy link
Contributor

@Jefftree: Closing this issue.

In response to this:

@Avatat It seems like #154 solved the issue?

I'm going to close this, feel free to reopen if that is not the case.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.
Projects
None yet
Development

No branches or pull requests

6 participants