sysctl net.ipv4.tcp_keepalive_time / other kernel parameter setting needed #165

Closed
h0nIg opened this issue Jan 10, 2020 · 12 comments

h0nIg commented Jan 10, 2020

Description

This is a follow-up issue to #70, since that issue was closed.

We’ve seen apps on the platform (running on AWS) which talk to DBs exposed via public IPs. These connections go via an AWS NAT Gateway which has an idle timeout of 350 sec. If the apps run some queries (don’t know what these are :)) which take longer to get a response from the server, then the connection is “freed” on the NAT GW, and only later, when the app tries to send data over the connection again, it gets a RST.

The connections via the NAT GW can be kept open if one of the sides sends TCP keepalive packets. However, the containers in which the apps are running use the defaults (net.ipv4.tcp_keepalive_time = 7200), so the first probe is sent only after 2h. On the Diego Cell VM the settings are different (net.ipv4.tcp_keepalive_time = 120, see https://github.com/cloudfoundry/bosh-linux-stemcell-builder/blob/acc0c1d039be5beeb30be0c9385a1b1c54e89218/stemcell_builder/stages/bosh_sysctl/assets/60-bosh-sysctl.conf#L35), but these are not inherited by the container namespaces, where the defaults are used. So at the moment neither the app developers nor we as operators of the platform can modify these settings for the containers (at least we haven’t figured out how).

There should be a mechanism to set kernel parameters inside a container to overcome problematic defaults. Modifying cflinuxfs3 with e.g. /etc/sysctl.d/20-myconfiguration.conf will not help, because several kernel parameters cannot be changed that way: they are read-only inside the container and can only be set for privileged containers (you do not want to do this...) or during creation of the container.
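
As a quick illustration of the read-only point, here is a minimal Go sketch that simply attempts the write from inside a container and reports what happens (the exact error depends on how /proc/sys is mounted; this snippet is not part of the original report):

package main

import (
    "fmt"
    "os"
)

func main() {
    const path = "/proc/sys/net/ipv4/tcp_keepalive_time"

    // Reading works inside the container, but shows the kernel default (7200).
    current, err := os.ReadFile(path)
    if err != nil {
        fmt.Println("read failed:", err)
        return
    }
    fmt.Printf("current value: %s", current)

    // In an unprivileged garden container this write is expected to fail,
    // because /proc/sys is not writable there; on the host (as root) it works.
    if err := os.WriteFile(path, []byte("120\n"), 0644); err != nil {
        fmt.Println("write failed:", err)
    }
}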

Steps to reproduce

Log in to a CF Diego cell and compare the keepalive setting on the host with the one inside a container:

diego-cell/f287dd7c-87db-42ea-b4e0-c490172fcd5c:~$ grep tcp_keepalive_time /etc/sysctl.d/60-bosh-sysctl.conf
net.ipv4.tcp_keepalive_time=120
diego-cell/f287dd7c-87db-42ea-b4e0-c490172fcd5c:~$ cat /proc/sys/net/ipv4/tcp_keepalive_time
120
diego-cell/f287dd7c-87db-42ea-b4e0-c490172fcd5c:~$ sudo /var/vcap/packages/runc/bin/runc --root /run/containerd/runc/garden/ exec -t 85580519-104b-4a7d-4240-7007 /bin/bash
root@85580519-104b-4a7d-4240-7007:/# cat /proc/sys/net/ipv4/tcp_keepalive_time
7200

Environment:

  • Garden: 1.19.9
  • Linux kernel version: 4.15.0-72-generic / AWS & Azure ubuntu stemcell 621.29
@cf-gitbot

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/170644987

The labels on this github issue will be updated when the story is started.

julz commented Jan 16, 2020

Hi @h0nIg - if I understand correctly(?) net.ipv4.tcp_keepalive_time sets a default for the keep-alive time on TCP connections in the container/host, but you can also set the keep-alive time per socket when opening the connection. For example in Java that's ExtendedSocketOptions.TCP_KEEPIDLE and in Go it's Dialer.KeepAlive.
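
For illustration, a minimal Go sketch of that per-connection approach (the DB address is a placeholder, not taken from the thread):

package main

import (
    "log"
    "net"
    "time"
)

func main() {
    // Enable TCP keep-alive on this one connection only, with the first
    // probe after 60s of idleness, instead of relying on the container-wide
    // net.ipv4.tcp_keepalive_time default of 7200s.
    d := net.Dialer{KeepAlive: 60 * time.Second}
    conn, err := d.Dial("tcp", "db.example.com:5432") // placeholder address
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
    // ... use conn as usual; the kernel now sends keep-alive probes for it ...
}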

Do we in fact need people to change a global default on the container, or might it be possible to solve this by just setting the keep-alive time on the relevant connection in the app code? (My hesitation about doing it at the container level is that it seems quite non-12-factor to have the app rely on the particular OS configuration it's running on, and it's not clear to me from a CF CLI POV how it'd work. We could set it for every container in garden, but then your app would only work on that tweaked CF, which seems sad.)

h0nIg commented Jan 17, 2020

@julz it is quite hard to convince every library or developer to set those options for their applications in order to prevent support tickets. Let's look at how customers see this: as a customer with a deployed application running into connection problems, I barely understand what's going on with my container OS.

The behavior changed with the kernel upgrade that came with the stemcell upgrade. If restoring the kernel 4.4 behaviour (inherit from host) is not a way to go for you, please make it possible to set these values for the full CF landscape as configuration. If this is still not a good idea, what about adding a generic mechanism to let the operator of the landscape set certain parameters for all garden containers?

julz commented Feb 3, 2020

Hi @h0nIg - really sorry for the slow response on this, I somehow missed your reply.

I'd still personally say it's not great if an app relies on a particular Linux kernel tuning parameter being set in order to work correctly, BUT yeah, I totally see the problem you're pointing out here, and we'd be happy to accept a PR to allow setting a global default value for all containers if you're interested in creating one (or we can prioritise a story to do it ourselves, but it might take us a little more time to get to it that way).

arjenw commented Feb 4, 2020

To chime in on your discussion here: @julz Java only added the option to configure socket keep-alive settings in Java 11 (which is still relatively recent). That means that before Java 11 there was no way to configure keep-alive on a Socket. As a result, most libraries doing socket work have not exposed this option. This also affects, for example, JDBC drivers: they are usually compiled on an earlier Java version to keep them compatible with most relevant JREs.

The point is that the assumption that you can change it in your application itself basically prevents all Java applications from dropping connections quickly when the connection is broken. This assumption therefore also prevents proper failover in most Java applications.

Ideally this should indeed be handled by the app, but in reality in a lot of cases you still can't, as the socket is hidden by a library. The only way to configure it then is by changing the OS keep-alive settings.

h0nIg commented Feb 7, 2020

@julz:

As a result of my discussions with @krumts, and because several kernel parameters might need to be changed (net.ipv4.tcp_keepalive_time / net.ipv4.tcp_keepalive_intvl / net.ipv4.tcp_keepalive_probes), I would opt for the flexibility to set several parameters instead of a single hardcoded one. If a CF landscape operator changes certain settings required by his IaaS setup, we should trust that he knows what he is doing.

As this is important for SAP, let's try to find a way through your backlog together with our colleagues (e.g. @yulianedyalkova) who work on the backlog.

julz commented Feb 7, 2020

Hi @arjenw and @h0nIg - I understand where you're coming from, but I'd really like to avoid a situation where an app works on one CF but not on another; so far it has (I think) always been the case that an app runs the same on any CF. I'd also really like to avoid a situation where we're asking an operator to tweak lots of kernel parameters (and potentially having hard-to-debug issues where the settings differ between CFs).

Is there any chance there are reasonable default values for these things we could set that would work for all CF apps?

krumts commented Feb 7, 2020

Hi @julz

but I'd really like to avoid a situation where an app works on one CF but not another, which is something that so far has (I think) always been the case

potentially having hard to debug issues

Interesting. These are actually the two goals we also wanted to achieve when we started the discussions, so I guess we have a different understanding :)
Yes, I know I have trimmed away parts of your statements in the quotes above; I'll come back to them.

I'll try to explain the issue from my PoV.
We offer (we are the operators in this case) a CF based platform on various cloud providers (AWS, Azure, GCP, ...). On each of them, the outbound connectivity has some constraints, which are imposed by various infrastructure components we use.
To be very concrete, let's stick to a NAT Gateway:

  • idle timeout - on Azure (at present) there is an idle timeout of 240s (not configurable), on AWS it is 350s (not configurable), on GCP it is something different but can be configured, on AliCloud it is 900s (not configurable)
  • NAT behavior after a timeout - on Azure, depending on the LB/NAT type used, packets may be silently dropped or answered with a RST; on AWS the NAT will send a RST

Let's consider an app which uses a connection pool - it creates a connection, uses it, puts it in the pool, and takes it from there again after 400s.
If we run the same app on a CF installation in each of these infrastructures, we would end up with the following:

  • on AliCloud - connection would still be established and working
  • on GCP it depends on what we configured
  • on AWS the app will get a connection reset immediately when it tries to use the connection, and may need to retry
  • on Azure, if the packets are silently dropped, it would take 15+ minutes (in which the kernel would be doing retransmissions with backoff) until the connection is considered dead

And it gets really funky when the NAT GW starts reusing its local ports (which are free from its PoV) as the src port for a new connection, while the server may still think it has an ESTABLISHED connection on that port (one that timed out in the NAT GW).
Then other apps, running on different Diego cells, which have nothing to do with the timeout, start getting issues when trying to open a new connection.

Unfortunately these are not just theoretical issues, and they are definitely hard to debug - I've probably spent months of my own time on them, and that is usually after the app developers, and maybe the team operating the destination, have spent lots of their time.

My hope is that if we as operators use our knowledge of the concrete constraints of the infrastructure (apps don't know these details), we could come up with a setup which works on each infrastructure.

Now back to

I'd also really like to avoid a situation where we're asking an operator to tweak lots of kernel parameters

I understand that. For us too, fiddling with kernel parameters is not the first thing we looked for. However, we've seen concrete issues and see no real way to solve them otherwise (I don't claim there isn't another way, maybe we just don't have the experience yet).

Modifying the TCP parameters from the app is hard:

  • it can't be done via kernel settings in the container (needs privileged access)
  • setting such parameters per socket is often not an option - apps use a lib, which uses a lib, which uses a lib, ..., which uses a lib which has access to the socket, e.g. socket <- jdbc <- OR mapper <- API XYZ
  • but first of all people have to fail, then go through the painful analysis, and only afterwards start looking for options

Is there any chance there are reasonable default values for these things we could set that would work for all CF apps?

Difficult to say. I can give an answer that would fit the 4 clouds mentioned above, but I guess there are other setups too.
The cells and all other VMs have a default tcp_keepalive_time of 120s (I guess coming from the stemcell?). Maybe this is a good starting point. But it really depends on the surrounding infrastructure, and its operators should (in my PoV) have a way to control this.

I hope I could explain how we came to the discussion.
Does this sound reasonable?

julz commented Feb 10, 2020

In the past, we used to default the DNS configuration to match whatever the host was set to -- on the assumption that you probably want your containers to share a similar configuration to the host. If I understand correctly, in previous kernels that was also the behaviour of the network namespace with regard to these properties (they were inherited from the host netns). What if we restored that behaviour and set the configs to match the values on the host - would that work?
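
For context, a rough Go sketch of how that inheritance could be wired up at container creation time - reading the host's /proc/sys values and passing them to the runtime via the OCI spec's linux.sysctl map (purely illustrative; this is not garden's actual implementation):

package main

import (
    "fmt"
    "os"
    "strings"

    specs "github.com/opencontainers/runtime-spec/specs-go"
)

// copyHostSysctl reads a sysctl value from the host and records it in the
// OCI runtime spec, so the runtime applies it when the container is created.
func copyHostSysctl(spec *specs.Spec, key string) error {
    path := "/proc/sys/" + strings.ReplaceAll(key, ".", "/")
    raw, err := os.ReadFile(path)
    if err != nil {
        return err
    }
    spec.Linux.Sysctl[key] = strings.TrimSpace(string(raw))
    return nil
}

func main() {
    spec := specs.Spec{Linux: &specs.Linux{Sysctl: map[string]string{}}}
    for _, key := range []string{
        "net.ipv4.tcp_keepalive_time",
        "net.ipv4.tcp_keepalive_intvl",
        "net.ipv4.tcp_keepalive_probes",
    } {
        if err := copyHostSysctl(&spec, key); err != nil {
            fmt.Fprintln(os.Stderr, "could not copy", key, ":", err)
        }
    }
    fmt.Printf("sysctls to apply at container creation: %v\n", spec.Linux.Sysctl)
}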

krumts commented Feb 10, 2020

@julz I think this should work. At least so far I don't have a use case where we would want the host and containers set differently, so this could be a reasonable, pragmatic approach.
If the settings from the host are inherited by the container namespaces, then the current settings I see on the Diego cells would help us avoid the issues I listed above on the cloud providers mentioned.
And operators would still have an (indirect) way to manipulate the settings if this is needed in a specific environment.

Let's see what the others think.

julz commented Feb 10, 2020

So, summarising: if we allowed configuring net.ipv4.tcp_keepalive_time, net.ipv4.tcp_keepalive_intvl and net.ipv4.tcp_keepalive_probes for all containers as a BOSH property, and set them by default to the host-side values of those properties, then we think we could make it so that a user never needs to worry about this (assuming that either the hosts are already correctly set up, or an operator takes manual action to set these values to work for their environment).

@arjenw @h0nIg - does something like the above seem like a reasonable solution to you?

h0nIg commented Feb 12, 2020

@julz this would be a solution for us as well
