kops Debian images need to use a newer kernel to fix intermittent network timeouts caused by connection tracking bugs. #8224
Comments
Hello,
Oh, that's awesome. Thanks a lot. I don't suppose you know what is different between the kops images like
This is the repo responsible for building the kope.io AMIs: That file gives you a sense of what changes are made from the official Debian images. I added an item to tomorrow's Kops office hours to get newer AMIs built; I'll be sure to update this issue with the results of that discussion.
@rifelpet Do you have a result for this discussion?
Thank you @rifelpet. Please, if possible, consider generating an image for It would be really great if we can include the Many thanks!
@rifelpet will you update this issue when the Buster image is built, or will it be announced somewhere else?
I updated our existing stretch AMIs and investigated this a bit: #8361 (comment) Our AMIs do run the stock kernels, and with that it looks like:
Given that, I proposed that we stick to the stock kernels, and expedite getting buster as an option and also make it the default in a newer version of kops (1.18?). There was an iptables blocker, but that should now be fixed.
@justinsb I agree. Having one of the patches is OK, but I would rather have both of them covered. The iptables blocker still applies for Kubernetes versions below 1.17 afaik.
As I recall, that's tracked in #7379, which is still open. Is this the same issue? Or perhaps you're referring to this 1.17 patch kubernetes/kubernetes#82966 ? [edit] doh @mariusv beat me to the punch!
Yes, exactly, I'm referring to the iptables nft switch. It was fixed in kubernetes/kubernetes#82966 and that should be in k8s >= 1.17. I also did just bring up a k8s 1.17.0 with a stock buster image. So I figure I can build an image for buster for 1.17, we can test with it and then confidently close #7379 ... and work to make buster the default.
It's possible to use Debian Buster with an earlier k8s version (I tried 1.13) if you change the iptables mode to legacy. There is a way to do that with cloud-init: #7381
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Still valid, please remove stale label.
/remove-lifecycle stale
Kops now has Ubuntu Focal as the default image, and also supports using Debian Buster as an image if you set Therefore I suspect that this can now be closed.
Yeah, it should definitely be closed now.
/close
@johngmyers: Closing this issue. In response to this:
1. What `kops` version are you running? The command `kops version` will display this information.
2. What Kubernetes version are you running? `kubectl version` will print the version if a cluster is running or provide the Kubernetes version specified as a `kops` flag.
3. What cloud provider are you using?
4. What commands did you run? What is the simplest way to reproduce this issue?
We have a production cluster that is running many jobs.
We see network timeouts many times per day.
Running `conntrack -S` to show the in-kernel connection tracking system statistics for each of our nodes shows a large number of `insert_failed` entries.
5. What happened after the commands executed?
Jobs that don't handle network connection timeouts fail.
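For anyone wanting to compare nodes on the `insert_failed` counter mentioned under question 4, here is a minimal Python sketch. It assumes `conntrack -S` prints one line per CPU made of `name=value` pairs (which is what recent conntrack-tools versions print, but check your own output); the helper names `sum_conntrack_stats` and `read_conntrack_stats` are made up for this example.

```python
import re
import subprocess
from collections import Counter

# Matches "name=123" pairs; assumes `conntrack -S` prints one such line per CPU.
_PAIR = re.compile(r"(\w+)=(\d+)")


def sum_conntrack_stats(text: str) -> Counter:
    """Sum every counter in `conntrack -S` style output across all CPU lines."""
    totals = Counter()
    for line in text.splitlines():
        for name, value in _PAIR.findall(line):
            if name != "cpu":  # "cpu=N" is a row label, not a counter
                totals[name] += int(value)
    return totals


def read_conntrack_stats() -> str:
    """Run `conntrack -S` on the local node (this usually needs root)."""
    result = subprocess.run(
        ["conntrack", "-S"], check=True, capture_output=True, text=True
    )
    return result.stdout
```

On a node, `sum_conntrack_stats(read_conntrack_stats())["insert_failed"]` gives the total across CPUs. Sampling it twice, some time apart, shows whether the counter is still growing.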
6. What did you expect to happen?
We should not be seeing random network timeouts.
7. Please provide your cluster manifest. Execute `kops get --name my.example.com -o yaml` to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
8. Please run the commands with most verbose logging by adding the `-v 10` flag. Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else do we need to know?
Apologies for the sparse answers to many of the above questions; I'm not sure if they are applicable to the problem.
The following two patches, which were merged into the Linux kernel, address problems in its connection tracking that result in network connection issues:
http://patchwork.ozlabs.org/patch/937963/
http://patchwork.ozlabs.org/patch/1032812/
I believe they are both in the Linux 5.1 kernel.
There are a number of topics on the Internet discussing this issue causing DNS timeouts.
Here are a couple of bug reports talking about it:
kubernetes/kubernetes#56903
weaveworks/weave#3287
Although these reports talk about timeouts with DNS, the problem applies to everything else as well; it's just that DNS lookups happen regularly and so trip the problem frequently.
We've already put in place the node-local-dns solution as presented in these bug reports to vastly improve the issue around DNS.
However, we still have random jobs trip up on the problem, and `conntrack -S` is showing we are hitting the problem because we still have many `insert_failed` errors.
Our instance groups are using the `kope.io/k8s-1.15-debian-stretch-amd64-hvm-ebs-2019-09-26` image.
Looking at https://github.com/kubernetes/kops/blob/master/channels/stable these seem to be the latest versions of the images currently available.
They report the following kernel when running `uname -rvm` on the nodes:
Installing the `linux-source-4.9` package on Debian that matches the same kernel version, it appears that one of the 2 kernel patches has been back-ported into Debian Stretch.
I also looked at the kernel sources for Debian Buster (the current stable release of Debian) and see that both of these patches have been back-ported into the 4.19 kernel it is currently using.
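As a rough first check across many nodes, the base version in `uname -r` can be compared against 5.1, the upstream release this report believes contains both patches. The Python sketch below does only that; the function names are hypothetical, and the 5.1 threshold is the report's belief rather than a verified fact. As the Stretch and Buster findings above show, a distribution kernel with an older base version may still carry back-ported fixes, so a negative result only means the kernel sources need to be inspected.

```python
import re
from typing import Tuple

# Assumption taken from this report: both conntrack fixes are in upstream 5.1.
FIXED_UPSTREAM = (5, 1)


def kernel_major_minor(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a `uname -r` style release string."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def upstream_has_fixes(release: str) -> bool:
    """True if the upstream base version is at or past FIXED_UPSTREAM.

    False does not prove the fixes are missing: distributions may
    back-port them into older kernels.
    """
    return kernel_major_minor(release) >= FIXED_UPSTREAM
```

For example, a 4.19-based release string returns `False` here even though Buster's 4.19 kernel was found to include both patches, which is why this check can only flag kernels for a closer look.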
We are using the default Weave CNI that Kops installs.
If the kops images were updated to use the current stable version of Debian Buster, or if they were able to incorporate the missing fix into Debian Stretch, then I am hoping our connection problems will go away.