DNS lookup timeouts due to races in conntrack #3287
Whoa! Good job for finding that. However: this might be an issue. |
@bboreham my kernel networking Fu is weak, so I'm not even able to suggest any workarounds. I'm hoping others here have stronger Fu... Challenge proposed! Naysayers frequently make scary, handwavey stability arguments against container stacks. Usually I laugh in the face of danger, but this appears to be the first case I've ever seen in which a little-known kernel-level gotcha actually does create issues for containers that would otherwise be unlikely to surface. |
I just spent several hours troubleshooting this problem, ran into the same XING blog post, and then this issue report, which was opened while I was troubleshooting! Anyway, I'm seeing the same issues reported in the XING blog: DNS 5-second delays and a lot of
More details can be provided if needed. |
@btalbot one workaround you might try is to set this option in resolv.conf: options single-request-reopen. It is a workaround that basically makes glibc retry the lookup, which will work most of the time. Another bandaid that helps is to change ndots from 5 (the default) to 3, which will generate far fewer requests to your DNS servers and lessen the frequency. The problem is that it's kind of a pain to force changes into resolv.conf. It can be done with the kubelet --resolv-conf option, but then you have to create the whole file yourself, which stinks. |
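For reference, a resolv.conf carrying both mitigations might look like the following; the nameserver and search domains here are illustrative, not prescriptive:

nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:3 single-request-reopen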
@bboreham it does appear that the patched iptables is available. Can weave use a patched iptables? |
The easiest thing is to use an iptables from a released Alpine package. From there it gets progressively harder. (Sorry for closing/reopening - finger slipped) |
BTW my top tip to reduce DNS requests is to put a dot at the end when you know the full address, so the resolver treats the name as fully qualified instead of walking the search list. For an in-cluster address, if you know the namespace you can construct the FQDN yourself. |
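The concrete hostnames in the comment above did not survive extraction, so here is an illustrative version (the names are made up):

# Unqualified: with the default ndots:5 the resolver walks every search domain
# before trying the bare name, generating several extra queries per lookup.
getent hosts example.com
# Fully qualified (note the trailing dot): looked up exactly once.
getent hosts example.com.
# In-cluster, if you know the namespace, construct the FQDN directly:
getent hosts my-service.my-namespace.svc.cluster.local.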
@bboreham great tip, I didn't know that one! Thanks |
I did a little investigation on netfilter.org. alpine:latest packages iptables v1.6.1; however, alpine:edge packages v1.6.2. |
This only works for some apps or resolvers. The bind tools honor that, of course, since that is a decades-old syntax for bind's zone files. But any apps that try to ... From inside an alpine container ... |
In our testing, we have found that only the options single-request-reopen change actually addresses this issue. It's a band-aid, but DNS lookups are fast, so we get aberrations of around 100ms, not 5 seconds, which is acceptable for us. Now we're trying to figure out how to inject that into resolv.conf on all the pods. Anyone know how to do that? |
I found this hack in some other related github issues and it's working for me:

apiVersion: v1
data:
  resolv.conf: |
    nameserver 1.2.3.4
    search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
    options ndots:3 single-request-reopen
kind: ConfigMap
metadata:
  name: resolvconf

Then in your affected pods and containers:

volumeMounts:
- name: resolv-conf
  mountPath: /etc/resolv.conf
  subPath: resolv.conf
...
volumes:
- name: resolv-conf
  configMap:
    name: resolvconf
    items:
    - key: resolv.conf
      path: resolv.conf |
Experiencing the same issue here. 5s delays on every, single, DNS lookup, 100% of the time. Similarly, mounting a ... Here is the flannel patch: https://gist.github.com/maxlaverse/1fb3bfdd2509e317194280f530158c98 |
@Quentin-M what k8s version are you using? I'm curious why it's 100% repeatable for some but intermittent for others. Another method to inject resolv.conf changes would be a deployment initializer. I've been trying to avoid creating one, but it's beginning to seem inevitable that in an enterprise environment you need a way to enforce various things on every launched workload in a central way. I'm still investigating the use of kubelet --resolv-conf, but what I'm really worried about is that all this is just a bandaid. The only actual fix is the iptables flag. |
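As an aside on injection methods (not raised in the thread, so treat it as an assumption about your cluster version): Kubernetes releases that support a per-pod dnsConfig let you set these resolver options without mounting a resolv.conf by hand. A minimal sketch:

apiVersion: v1
kind: Pod
metadata:
  name: example                       # illustrative pod name
spec:
  dnsPolicy: ClusterFirst
  dnsConfig:
    options:
      - name: single-request-reopen   # glibc reopens the socket and retries if one parallel reply is lost
      - name: ndots
        value: "3"
  containers:
    - name: app
      image: busybox                  # illustrative image
      command: ["sleep", "3600"]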
Has anyone tried installing and running iptables-1.6.2 from the alpine packages for edge on Alpine 3.7? |
@brb I was wondering the same thing. It would be nice to make progress and get a PR ready in anticipation of the availability of 1.6.2. My Go Fu is too weak to take a shot at making the fix, but I'm guessing the fix goes somewhere around expose.go? If it were possible to create a frankenversion that has this fix, we could test it out. |
Just installed it with
Yes, you are right.
I've just created the weave-kube image with the fix for the amd64 arch only and kernel >= 3.13 (https://github.com/weaveworks/weave/tree/issues/3287-iptables-random-fully). To use it, please change the image name of weave-kube to "brb0/weave-kube:iptables-random-fully" in the Weave DaemonSet. |
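If it helps, a sketch of swapping the image in place; the DaemonSet name weave-net and container name weave assume the stock Weave manifest:

# Point the weave container of the DaemonSet at the test image (names are assumptions).
kubectl -n kube-system set image daemonset/weave-net weave=brb0/weave-kube:iptables-random-fully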
@brb Score! that's awesome! we'll try this out asap! |
I can't think of anything which would prevent it from working. Please let us know whether it works, thanks! |
@dcowden Kubernetes 1.10.1, Container Linux 1688.5.3-1758.0.0, AWS VPCs, Weave 2.3.0, kube-proxy IPVS. My guess is that it depends on how fast/stable your network is? |
I tried it the other day; while it changed the resolv.conf of my static pods, all the other pods (with the default dnsPolicy) were still based on what dns.go constructs. Note that the DNS options are written as a constant there. No possibility to get |
@brb Thanks! I hadn't realized yesterday that the patched iptables was already in an Alpine release. My issue is surely still present and both ... |
@brb, I tried this out. I was able to upgrade successfully, but it didn't help my problems. I think maybe I don't have it installed correctly, because my iptables rules do not show the fully-random flag anywhere. Here's my daemonset (annotations and stuff after the image omitted):
The daemonset was updated OK. Here are the iptables rules I see on a host. I don't see --random-fully anywhere:
I don't know what to try next. |
@dcowden You need to make sure you are calling iptables 1.6.2, otherwise you will not see the flag. One solution is to run iptables from within the weave container. As with you, it did not help my issue; the first AAAA query still appears to be dropped. I am compiling kube-proxy/kubelet to add the fully-random flag there as well, but this is going to take a while. |
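One way to do that check, sketched here on the assumption that the DaemonSet pod is called weave-net-xxxxx and its container is named weave:

# Run iptables from inside the weave container (which carries the newer iptables)
# and look for the flag on the NAT rules; the pod name is illustrative.
kubectl -n kube-system exec weave-net-xxxxx -c weave -- iptables -t nat -S | grep random-fully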
[EDIT: I was confused so scoring out this part. See later comment too.] The problem [EDIT: in this specific GitHub issue] comes when certain DNS clients make two simultaneous UDP requests with identical source ports (and the destination port is always 53), so we get a race. The best mitigation is a DNS service which does not go via NAT. This is being worked on in Kubernetes: basically one DNS server per node, and disabling NAT for on-node connections. |
But isn't there a race condition in that source port uniqueness algorithm during SNAT, regardless of protocol and affecting different pods on the same host in the same way as the dns UDP client issue within one? Basically as in https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02 |
Sorry, yes, there is a different race condition to do with picking unique outgoing ports for SNAT. If you are actually encountering this please open a new issue giving the details. |
Thank you for the response. Indeed, I'm seeing insert_failed despite implementing several workarounds, and I'm not sure whether it's TCP, UDP, SNAT or DNAT. We can't bump the kernel yet. If I understood correctly, the SNAT case should be mitigated by the "random fully" flag, but Weave never went ahead with it? I think kubelet and kube-proxy would need it as well anyway; I don't know where things stand there. There is one more head-scratching case for me, which is how all those cases fare when one uses NodePort. Isn't there a similar conntrack problem if NodePort forwards to a cluster IP? |
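For anyone else trying to confirm they are hitting these races, the conntrack statistics are the usual place to look; a minimal sketch (run on a node, requires the conntrack tool):

# A steadily growing insert_failed counter is the tell-tale symptom of the
# conntrack insertion race discussed in this thread.
conntrack -S | grep -o 'insert_failed=[0-9]*'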
We investigated the problem reported here, and developed fixes to that problem. If someone reports symptoms that are improved by "random fully" then we might add that. We have finite resources and have to concentrate on what is actually reported (and within that set, on paying customers). Or, since it's Open Source, anyone else can do the investigation and contribute a PR. |
I understand :) I was merely trying to comprehend where things stand with regards to the different races and available mitigations, since there exist several blog posts and several github issues with a massive amount of comments to parse. From my understanding of all of it, even with 2 kernel fixes and dns workarounds and iptables flags there is still an issue at least with multipod -> Cluster IP multipod connection, and without kernel 5.0 or "random fully" also an issue with simple multipod -> External IP connection. But yeah, I'll raise a new issue if that proves true and impactful enough for us in production. Thank you |
@Quentin-M @brb We are using weave as well for our CNI and I tried to use the workaround mentioned by @Quentin-M, but I am getting an error:
I am using debian: ... And I have mounted on ... Can you please point out where I am going wrong?
Edit: |
@Krishna1408 If you change |
Hi @hairyhenderson thanks a lot, it works for me :) |
@brb May I ask if the problem (5 sec DNS delay) is solved with the 5.x kernel? Do you have some more details and feedback from people already? |
@phlegx It depends which race condition you hit. The first two out of three got fixed in the kernel, and someone reported a success (kubernetes/kubernetes#56903 (comment)). However, not much can be done from the kernel side about the third race condition. See my comments in the linked issue. |
I will repeat what a few others have said in this thread: the best way forward, if you have this problem, is “node-local dns”. Then there is no NAT on the DNS requests from pods and so no race condition. Support for this configuration is slowly improving in Kubernetes and installers. |
We upgraded to Linux 5.x now and for now the "5 second" problem seems to be "solved". Need to check about this third race condition. Thanks for your replies! |
You mean Linux 5.x as in kernel 5.x? |
I just wanted to pop in and say thanks for this excellent and detailed explanation. 2 years since it was filed and 1 year since it was fixed, some people still hit this issue, and frankly the DNAT part of it had me baffled. It took a bit of reasoning but as I understand it - the client sends multiple UDP requests on the same {source IP, source port, dest IP, dest port, protocol} and one just gets lost. Since clients are INTENTIONALLY sending them in parallel, the race is exacerbated. |
I was able to solve the issue by using the SessionAffinity feature of Kubernetes. This solution makes all DNS request packets from one pod be delivered to the same kube-dns pod, thus eliminating the problem that the conntrack DNAT race condition causes. |
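For illustration, a sketch of that Service change, assuming the stock kube-dns Service in kube-system (only the relevant fields are shown):

apiVersion: v1
kind: Service
metadata:
  name: kube-dns
  namespace: kube-system
spec:
  # Pin each client pod IP to a single DNS backend, so the parallel A/AAAA
  # queries from one pod no longer race each other through conntrack DNAT.
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800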
@DerGenaue as far as I can tell sessionAffinity only works with --proxy-mode userspace, which will slow down service traffic to an extent that some people will not tolerate. |
Session affinity should work fine in iptables, but you still have the race the first time any pod starts sending DNS, any time the chosen backend dies, and (if you use a lot of DNS) you get no balancing. It's kind of hacky, but a fair mitigation for many people. |
I checked the kube-proxy code and the iptables version generates sessionAffinity just fine. I don't think any single pod will ever do so many DNS requests as to cause any problems in this regard. Also, the way I understood it, the current plan for the future is to route all DNS requests to the pod running on the same node (aka only node-local DNS traffic), which basically would be very similar to this solution. |
NodeLocal DNS avoids this problem, yes, by avoiding conntrack. But we have definitely experienced a single pod that issues 2 DNS queries in parallel (A and AAAA) and triggers this race. |
Hi. Why not implement dnsmasq instead of working with the usual DNS clients? |
@elmiedo it is uncommon to have the opportunity to change DNS client - it's bound into each container image in code from glibc or musl or similar. And the problem we are discussing hits between that client and the Linux kernel, so the server (such as dnsmasq) does not have a chance to affect things. |
Again: the Kubernetes node-local DNS cache effort is trying to bypass these problems by using NOTRACK for connections from pods to the local cache, then using TCP exclusively from the local cache to upstream resolvers. |
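A rough sketch of what the NOTRACK part looks like in iptables terms; the link-local address below is the one commonly used for the node-local cache, and the real rules are installed by the node-local-dns component itself:

# Skip conntrack entirely for pod traffic to the node-local DNS cache,
# so the UDP insertion races described in this issue cannot occur.
iptables -t raw -A PREROUTING -d 169.254.20.10/32 -p udp --dport 53 -j NOTRACK
iptables -t raw -A OUTPUT     -d 169.254.20.10/32 -p udp --dport 53 -j NOTRACK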
@brb Thanks for your excellent explanations. But there is one small point that confuses me. I viewed the glibc source code; it used |
https://elixir.bootlin.com/linux/v5.14.14/source/net/socket.c#L2548 |
What happened?
We are experiencing random 5 second DNS timeouts in our kubernetes cluster.
How to reproduce it?
It is reproducible by requesting just about any in-cluster service and observing that periodically (in our case, 1 out of 50 or 100 times) we get a 5-second delay. It always happens in the DNS lookup.
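A minimal sketch of how the delay can be observed, assuming bash and some in-cluster service name of your own:

# Resolve a name repeatedly and print the wall-clock time of each lookup;
# affected lookups take ~5s instead of a few milliseconds.
for i in $(seq 1 100); do
  ( time getent hosts my-service.default.svc.cluster.local ) 2>&1 | grep real
done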
Anything else we need to know?
We believe this is a result of a kernel level SNAT race condition that is described quite well here:
https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02
The problem happens with non-weave CNI implementations, and is (ironically) not really even a weave issue. However, it becomes a weave issue, because the solution is to set a flag on the masquerading rules that are created, which are not in anyone's control except weave's.
What we need is the ability to apply the NF_NAT_RANGE_PROTO_RANDOM_FULLY flag on the masquerading rules that weave sets up. In the post above, Flannel was in use, and the fix was made there instead.
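For concreteness, this is roughly what such a rule looks like once the flag is applied; the chain name and CIDR here are illustrative rather than weave's exact rules:

# MASQUERADE with fully randomized source-port allocation (iptables >= 1.6.2,
# kernel >= 3.13), which avoids the SNAT port-collision race.
iptables -t nat -A WEAVE -s 10.32.0.0/12 ! -d 10.32.0.0/12 -j MASQUERADE --random-fully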
We searched for this issue and didn't see that anyone had asked for this. We're also unaware of any settings that allow setting this flag today; if that's possible, please let us know.