Stolon-proxy has no TCP keepalive #323
In case it helps: I killed https://gist.github.com/Dirbaio/6d25d3baa843662cb44821513815a9d6
Okay, I think I know what's going on. My theory is this:

Maybe this could be solved by enabling some sort of keepalive, so stolon-proxy can notice a client is dead?
@Dirbaio Thanks for the detailed report. I'll take a detailed look at it tomorrow. I never noticed the stolon proxy leaking connections (though we should add some integration tests in pollon to catch possible regressions). What version of stolon are you using (git master, a specific commit, a released version)? We should add a version option to print it... If instead it's a problem in the k8s netfilter-based service proxy or in the pod network between nodes, I'm not sure we can do anything inside stolon (we could add an option to set custom socket keepalive values, but that would be a workaround for the real problem). I'll try to reproduce this behavior on a multi-node cluster. What k8s pod network communication type are you using (static, flannel, something custom like GCE, etc.)?
Some details on my setup: K8s 1.5.2 on GCE, Ubuntu 16.04 VMs. Pod network is Calico with IP-IP tunneling and NAT enabled. EDIT: and the stolon version is 0.5.0 plus the patch for #257. Anyway, I think in the general case you just can't rely on the remote end to properly shut down connections. This could also be caused by the VM on the remote end being forcibly shut down or crashing, by network issues, or by many other causes. IMO stolon-proxy should simply time out these connections and move on. Also, I'm pretty sure the Postgres server itself enables TCP keepalive for exactly this reason :)
I agree. I think we should add options to the proxy to enable TCP keepalive (or perhaps enable it by default on the socket, like Postgres does) and to set the related parameters (since the OS defaults, if not changed, are usually too high), like Postgres already permits. I'll open a new issue, since there are some implementation details related to how to do this in Go.
I tried to simulate a missing connection reset on a small k8s cluster, but when I delete a pod (even forcing it without a grace period) the process is always stopped before the container's virtual interface is removed, so a FIN/RST is always sent to the proxy. Obviously I can reproduce this if I detach the k8s node's network cable or abruptly shut down the node. @Dirbaio Since it looks like the issue isn't stolon-proxy leaking connections, can you please change the title? Thanks.
Changed the title :) Yeah, I tried to reproduce it too, but couldn't: shutting down a pod the regular way always sends FIN. I don't know why it happened to me; maybe there's something else going on, but it did happen. IMO adding keepalive by default to the proxy is a great fix.
Fixed by #357
We had some issues in prod due to stolon-proxy not closing connections to the master when the incoming connection is closed. This caused all the available connections on the master to run out, and nothing would fix it other than killing `stolon-proxy`.

Here you can see the conntrack output (`100.92.128.134` is the IP of `stolon-proxy`, `100.122.206.245` is the IP of the current master keeper). There are 49 `ESTABLISHED` connections from the proxy to the master, but just 1 `ESTABLISHED` connection from somewhere else to the proxy. Shouldn't the proxy close all the connections?
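A count like the one above can be reproduced without root using `ss` from iproute2 instead of conntrack (a sketch; `100.122.206.245` is the master keeper IP from my setup, so substitute your own):

```shell
# Count ESTABLISHED TCP connections from this host to the master keeper.
# Run on the stolon-proxy host; the IP is an example and should be
# replaced with your master keeper's address.
ss -tn state established dst 100.122.206.245 | tail -n +2 | wc -l
```

Run the same count with `dst` set to the proxy's own IP on the clients' side to compare incoming vs. outgoing connection counts.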