Envoy does not receive update from Contour anymore #1523
I've seen a similar issue in #1514, which I'm actively working on now, but I'm interested to see that you're running v0.13 and still seeing this, which makes me think it's not the same issue. I'll post back with the resolution to my issue; possibly that will resolve your problem as well. |
@FournierAlexandre thank you for raising this issue. I have determined that #1514 is not present in Contour 0.15 and earlier, so it is probable that the issue you are seeing is not #1514. The first thing to do is to upgrade to Contour 0.15.0. Can you please do that and let me know if the problem continues? Thank you. |
Hello @davecheney, thanks for the quick reply. I will upgrade Contour to 0.15 and let you know if we still have the same problem. Thanks! |
Hello @davecheney, we did update to the latest version and we still see the same behavior. The configuration is not the same on each node. In the attachment, I saved the configuration from all the nodes of the cluster; I am not sure if that can help you, let me know. Do you have any pointers on how I can debug this problem or what can cause this issue? Thanks! |
Hey @FournierAlexandre, I'm picking this issue up. I can see how an intermittent problem like this would be very frustrating, so I'd really like to get to the bottom of it with you. Firstly, though, so I can do my best to reproduce the issue, can you pass me the Contour deployment and Envoy daemonset YAML? Scrub names if you need to, but I really need to be able to make sure I'm running a config that's the same as yours to be able to troubleshoot properly. While I'm asking for info, I can see from the Envoy configs you sent that you have a reasonable number of services; can you provide any information about how you're using IngressRoute? Ideally, you could send a sample set of IngressRoute YAMLs, but if that's not practical, the information I really need is:
That way, I can see if I can replicate something close to what you have. The last thing I'd like is some sample logs from Envoy and Contour from around the time you've had the problem. I know that's a lot of information; these intermittent problems are tricky. Thanks very much! |
Hello @youngnick, thank you very much for your help! Attached is all the YAML (I think) you need to reproduce our setup; let me know if something is missing. Number of IngressRoutes: 31. We have a shared cluster for all the teams. They can create IngressRoutes in their respective namespaces and use Envoy as the single load balancer of our Kubernetes cluster. Basically, all traffic goes through Envoy and each team decides which routes go to their namespace. Here is an example of an IngressRoute: |
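For illustration, a minimal IngressRoute of the shape described above might look like the sketch below. The names, namespace, and hostname are hypothetical and are not taken from the reporter's attachment.

```yaml
# Hypothetical example of a team-owned IngressRoute (Contour v1beta1 CRD).
# All names and hostnames are illustrative.
apiVersion: contour.heptio.com/v1beta1
kind: IngressRoute
metadata:
  name: team-app
  namespace: team-a
spec:
  virtualhost:
    fqdn: app.example.com        # traffic enters through the shared Envoy daemonset
  routes:
    - match: /
      services:
        - name: app              # Service in the team's namespace
          port: 80
```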
Hi @FournierAlexandre, sorry to ask for more, but could you also check the Contour pod metrics for me? Specifically, I'm looking for average CPU usage, average memory usage, and if the pod is ever OOMkilled by the memory limit of 100Mi. |
Hello @youngnick, I did not see any restarts in the last 2 days, but here are the metrics that we have: Let me know if you need more information. On my side, I am setting up my dev environment to trace the code. I will let you know if I find something. Thanks again for your help! |
Hi @FournierAlexandre, thanks for that data, I was curious to see if there were any OOMKills from hitting the memory limit - good to know that Contour is behaving on the memory front at least. I'm still trying to narrow in on what could be the cause here, so I'm asking for a lot of info, please bear with me. I'd really like to see the full logs for a Contour and an Envoy around the time that the issue occurs. Has it happened as much since you upgraded to v0.15.0? If so, is there any way I could get some of those logs? In the past, we've seen partial configuration problems caused by Envoy reading config from a Contour that hasn't finished syncing its internal cache with the Kubernetes API yet, but that doesn't seem to apply here because the Envoy state is stable. I've been over the Envoy configs you kindly supplied a few times and I can't see anything obvious, so we are reduced to this slow information gathering. I'm sorry not to be able to help more yet. So, in summary, the information I'd like please:
Thanks again for being patient! |
Hi @FournierAlexandre, any word on a recurrence of the issue? |
Hello @youngnick, sorry for the late response about this issue. We had to recreate our Kubernetes cluster from scratch because of network configuration; there was a conflict between internal and external IPs. Also, there were lingering invalid IngressRoutes and certificates, which have been cleaned up. On a clean cluster, it seems that I am not able to reproduce the problem yet. I am setting up synthetic tests that delete pods, create IngressRoutes, etc. to accelerate the rate of change and see if we can reproduce the problem. For now, I cannot provide you with proper logs from around the exact time it happens, because I need to set Envoy logging to debug to get enough verbosity to log connection issues. To answer your questions above: How did you first notice the problem? We started to have 503 errors when accessing the different services in the Kubernetes cluster. It was usually after pods had been replaced by new ones, i.e. after a deployment. Do you have any other ingress controllers deployed in the cluster? Are you using ingress class annotations at all? There are no other ingress controllers and we are only using IngressRoute. Thanks again for your help. I will stress test our new cluster and provide you with information in the following days. |
Thanks for all the info @FournierAlexandre! I'll be waiting to see what happens for you. My fingers are crossed. Thanks for being patient with all my information requests. |
Hello @youngnick, we have been able to reproduce the problem (it is harder since the upgrade) and I have attached the logs of Contour (Envoy did not produce any relevant logs, only HTTP calls). The problem happens around 2019-10-01T13:30; it seems that there is a forced update from Contour, and I notice some empty resource_names (it may just be a logging artifact). We will try to get you debug logs from Envoy. Also, there are 3 invalid IngressRoutes in the cluster; this could be the cause of this issue. I don't know if this is relevant, but we are usually unable to reproduce it during the day after a restart of the Envoy pods. It is usually an overnight thing. It could definitely be a coincidence. Thanks again for your help! |
@FournierAlexandre thank you for your reply
|
@FournierAlexandre thank you for providing the log file. What I see is that at some point between 2019-10-01T12:43:34Z and 2019-10-01T13:30:12Z, an Envoy was added to the daemonset. My suspicion is that we don't see any connection-closed lines indicating a reset of an Envoy -- however, the log may be truncated before those events happen (probably seconds, if not minutes, depending on how the OS handles dangling connections). Something like this:
Can you please do a few things:
Could you please tell me more about the invalid ingress routes? |
I am picking up this issue from my colleague @FournierAlexandre.
Parameters:
On the logging side, we have:
This monitoring setup is meant to:
Immediately after setting up the probe, 4 out of the 16 Envoys were failing consistently. I've deleted each pod one after the other and confirmed that the new pod on the same node then passed the probe. Since then (2 days ago), I have not observed any desynchronization, which could mean:
I really hope it's the first option and not the second, but I'll be watching closely. In the meantime, I am gathering additional logs from the past to answer @davecheney's questions. I do have evidence of the … We've also fixed the … Finally, we no longer have invalid IngressRoutes, but I will try to deliberately create some to see if it triggers the condition. |
Some graphs around the event (graph: https://user-images.githubusercontent.com/81865/66226941-05581c00-e6aa-11e9-85f3-248727c363a2.png):
Legend:
- the red dotted lines are k8s events where an Envoy exited or started
- the red dotted lines correspond to manual kubectl delete pod envoy-*
- the blue dotted line shows where the Nginx rotation interval was increased from 60s to 300s (i.e. an 80% reduction of Endpoint churn)
Observations:
- the Contour DAG rebuild sawtooth looks OK, with a predictable forced refresh at regular intervals
- the four probes were initially failing with 503 errors (_cx_connect_fail) -- an EDS update bug
- the same four probes then switched to 404 errors -- a CDS update bug
|
Thank you for your detailed response.
"Immediately after setting up the probe, 4 out of the 16 Envoys were failing consistently"
Are you able to share more details about this failure? What was failing? How did you detect it? How did it manifest itself?
|
Two failure states, on the same Envoy pods. In the first case, Envoy was failing to connect to one of the upstream cluster members. This turned out to be an incorrect set of pod IP addresses that no longer matched the k8s Endpoints. This was confirmed by three sources:
We concluded that EDS was broken/stuck on those 4 pods. In a later case on the same pods, Envoy was failing to discover the new cluster/route. We transitioned to this state when the Service, IngressRoute and Pods were moved from one namespace to another. This was confirmed by two sources:
We concluded that RDS and/or CDS was also broken/stuck on those 4 pods.
The Prometheus/Alertmanager alert fired immediately for those Envoy pods once the probe and alert rules were deployed.
Any request through the beacon service's IngressRoute would consistently fail with a 503 error, then a 404 error. Before the introduction of the probe and forced testing on every Envoy pod, there was a 25% chance (4/16) of hitting one of the affected Envoys through the cloud load balancer assigned to the DNS records. Furthermore, it only manifested itself if there was traffic on those routes, which is little to none at this stage of our rollout. |
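The probe and alert rules themselves aren't included in the thread; as a rough illustration of the approach, a per-Envoy alert driven by a blackbox-exporter-style probe_success metric could look like the following. The job name and labels are assumptions, not the actual rules used here.

```yaml
# Hypothetical Prometheus alerting rule for the beacon probe described above.
# Assumes each Envoy pod is probed individually and the probe exports
# probe_success with an envoy_pod label; the real rules are not shown here.
groups:
  - name: envoy-beacon
    rules:
      - alert: EnvoyBeaconProbeFailing
        expr: probe_success{job="envoy-beacon-probe"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Beacon requests through Envoy pod {{ $labels.envoy_pod }} are failing (possible stale xDS state)"
```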
Thanks @bgagnon for all the information you've provided. The key part for me out of what you wrote in that last comment is this:
What I'm thinking is that, possibly, this is happening to the connection back to Contour. Remember that the connection back to Contour is a statically defined cluster that is then referenced as the source for EDS, CDS, RDS, SDS, and LDS. If, for some reason, that cluster (the Contour one) ended up with a stale set of IP addresses, then the problem would be unrecoverable without a restart, since Envoy has no way to reach Contour to get the updated addresses that would fix the problem. That shouldn't happen, because you are using the … However, if Envoy is ending up with a stale set of addresses for Contour, that means that we have to look at kube-proxy and kube-dns and what they are doing, as well as what Envoy is doing about caching DNS entries. Actual, testable things that come out of this theory:
Actions I'll take for now:
|
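To make the theory above concrete, the statically defined cluster that Envoy uses to reach Contour lives in the Envoy bootstrap config. A simplified sketch of its shape follows (Envoy v2 API); this is not Contour's exact generated output, and the service name, port, and discovery type are assumptions based on the thread.

```yaml
# Simplified sketch of an Envoy v2 bootstrap where a static "contour" cluster
# is the source for dynamic xDS resources. Not Contour's exact output; names,
# ports, and the discovery type are assumptions.
static_resources:
  clusters:
    - name: contour
      connect_timeout: 5s
      type: STRICT_DNS              # the Service name is resolved via kube-dns
      http2_protocol_options: {}    # xDS is served over gRPC
      hosts:
        - socket_address:
            address: contour        # Kubernetes Service; resolves to its ClusterIP
            port_value: 8001
dynamic_resources:
  cds_config:
    api_config_source:
      api_type: GRPC
      grpc_services:
        - envoy_grpc:
            cluster_name: contour   # CDS/EDS/RDS/LDS/SDS all reference this cluster
  lds_config:
    api_config_source:
      api_type: GRPC
      grpc_services:
        - envoy_grpc:
            cluster_name: contour
```

If this one cluster ends up with stale addresses, Envoy has no working path back to Contour to learn the corrected addresses, which is why that failure mode would be unrecoverable without a restart.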
Ah yes, that's a very good theory! We've obviously suspected something along the lines of a broken connection between Envoy and Contour, but we hadn't considered the possibility of an issue with the …
I checked, and unfortunately there was a single Contour pod for the entire lifetime of the cluster (since Sep 24). However, that does not mean there wasn't a DNS blip that caused the contour upstream cluster to empty or break itself, losing connectivity with Contour. I have just killed it hoping to reproduce the bug, and I have observed the following: The two yellow dotted lines represent the "contour killed" and "contour start" k8s events. There's an immediate increase in … Meanwhile, the probe and membership changes for the dynamic cluster are constant, and no probes fail. This is also normal, since we rotate the Nginx beacon pods only every 5 minutes. I will now try the following:
|
Voluntarily killing the Contour pod and having zero replicas for 10 minutes caused all probes to fail, as expected: Legend:
Since all Envoys recovered almost instantly when contour's … |
Here are some Contour logs between the creation of its pod and the last known occurrence of this bug:
|
After waiting for several days with our monitoring in place, we noticed zero occurrences of the problem. This got us to switch gears and consider that we may be experiencing the observer effect -- forcing EDS changes prevents the bug from happening. So we tried something else: over the weekend, we turned off the forced EDS updates. This caused complete xDS silence for a prolonged period -- the cluster is not busy and developers don't deploy on the weekend, so things are quiet. And behold, on Monday morning, we re-enabled the probing and 15/16 Envoys were failing: Legend:
We started looking into idle/timeout/keepalive issues and found some open bugs on Envoy, notably envoyproxy/envoy#6323. The Contour bootstrap did not mitigate this bug by setting the TCP keepalive options. The stars were aligned, because @youngnick fixed this less than 24 hours ago in #1744. While TCP keepalive might be fixed, there are still some open questions on the gRPC side after reading https://github.com/grpc/grpc/blob/master/doc/keepalive.md |
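For reference, the kind of mitigation #1744 is about is expressed through Envoy's upstream_connection_options on the contour cluster in the bootstrap. The sketch below uses illustrative values, not necessarily the values Contour actually emits.

```yaml
# Sketch: TCP keepalive on the Envoy->Contour xDS cluster so half-open
# connections are detected and torn down. Values are illustrative only.
static_resources:
  clusters:
    - name: contour
      connect_timeout: 5s
      type: STRICT_DNS
      http2_protocol_options: {}
      hosts:
        - socket_address: { address: contour, port_value: 8001 }
      upstream_connection_options:
        tcp_keepalive:
          keepalive_probes: 3       # give up after 3 unanswered probes
          keepalive_time: 30        # start probing after 30s of idle
          keepalive_interval: 5     # then probe every 5s
```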
I spent some time looking at the … Below is a recap of the investigation.

First, we locate the node where the Contour pod is running:

```
$ kubectl get pod -l app=contour -n runway-ingress -o json | jq '.items[0].spec.nodeName'
"runway-uks-ncsa-east-ne1-worker8"
```

Then we SSH to that node (runway-uks-ncsa-east-ne1-worker8) and find the Contour container:

```
runway-uks-ncsa-east-ne1-worker8$ docker ps -f "label=io.kubernetes.container.name=contour"
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
8a87e0d7b76d        978cfcd67636        "contour serve --inc…"   12 days ago         Up 12 days                              k8s_contour_contour-5cf94978b5-sqht6_runway-ingress_19c6ebe4-ea06-11e9-bc68-fa163e79a2f2_0
```

We use nsenter to list the open connections inside the container's network namespace:

```
runway-uks-ncsa-east-ne1-worker8$ nsenter -t $(docker inspect -f '{{.State.Pid}}' 8a87e0d7b76d) -n netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 runway-uks-ncsa-e:42298 192.168.128.1:https     ESTABLISHED
tcp6       0      0 192.168.10.66:8000      192.168.14.11:57150     TIME_WAIT
tcp6       0      0 192.168.10.66:8001      192.168.5.0:51586       ESTABLISHED
tcp6       0      0 192.168.10.66:8000      testy.testzonecml:49496 TIME_WAIT
tcp6       0      0 192.168.10.66:8000      192.168.14.11:56938     TIME_WAIT
tcp6       0      0 192.168.10.66:8000      testy.testzonecml:49756 TIME_WAIT
Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags       Type       State         I-Node   Path
```

We are looking for local address :8001 (the Contour xDS port). Already we know there's a problem, because we would expect N open connections for N running Envoy pods, and there is only one. The Foreign Address of that single connection, 192.168.5.0:51586, tells us which node's pod CIDR it comes from:

```
$ kubectl get nodes -o json | jq '.items[] | select(.spec.podCIDR == "192.168.5.0/24") | .metadata.name'
"runway-uks-ncsa-east-ne1-etcd4"
```

Now we know that the only connected Envoy pod is the one running on runway-uks-ncsa-east-ne1-etcd4. We confirm this from the other side by SSHing to that node:

```
runway-uks-ncsa-east-ne1-etcd4$ docker ps -f "label=io.kubernetes.container.name=envoy"
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
012018109ad8        d3340c53fdcd        "envoy -c /config/co…"   2 weeks ago         Up 2 weeks                              k8s_envoy_envoy-gwc6q_runway-ingress_a32ed7ee-e6dc-11e9-bc68-fa163e79a2f2_0
```

Knowing the container id, we resolve its PID and then list open connections with netstat:

```
runway-uks-ncsa-east-ne1-etcd4$ nsenter -t $(docker inspect -f '{{.State.Pid}}' 012018109ad8) -n netstat | grep 8001
tcp        0      0 runway-uks-ncsa-e:51586 192.168.138.21:8001     ESTABLISHED
```

The destination address, 192.168.138.21:8001, is the ClusterIP of the contour Service:

```
$ kubectl get svc contour -n runway-ingress
NAME      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
contour   ClusterIP   192.168.138.21   <none>        8001/TCP   17d
```

This connection is legit: it is the Envoy on that node talking to Contour through the contour Service. Next, we try any other node and see if it thinks it has an open connection from Envoy to Contour.

```
runway-uks-ncsa-east-ne1-worker1$ docker ps -f "label=io.kubernetes.container.name=envoy"
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
b4911deeb5c8        d3340c53fdcd        "envoy -c /config/co…"   2 weeks ago         Up 2 weeks                              k8s_envoy_envoy-xcf9j_runway-ingress_b6896a5f-e6dc-11e9-bc68-fa163e79a2f2_0
```

We run the same check:

```
runway-uks-ncsa-east-ne1-worker1$ nsenter -t $(docker inspect -f '{{.State.Pid}}' b4911deeb5c8) -n netstat | grep :8001
tcp        0      0 runway-uks-ncsa-e:37230 192.168.138.21:8001     ESTABLISHED
```

Indeed, this side thinks it is connected to Contour (192.168.138.21:8001), yet Contour sees no such connection. At this point, we have identified a half-open (or is it half-closed?) TCP connection, which is not discoverable from this side of the connection without a keep-alive strategy.

This can be shortened to a one-liner and tried on each of the 16 nodes:

```
$ nsenter -t $(docker inspect -f '{{.State.Pid}}' $(docker ps -qf 'label=io.kubernetes.container.name=envoy')) -n netstat | grep tcp | grep :8001
```

We find that 16 of them claim to be connected to Contour, but only one is actually connected, because Contour only sees one established connection on port 8001. Since Envoy is the client to Contour, there is no way for Contour to force the client to re-establish the connection. |
We have tested v0.15.3 with the TCP and gRPC keepalive patches and they fix the issue. Thanks @youngnick , @ravilr and @davecheney! |
Fantastic to hear @bgagnon! Thanks again for all your assistance. Closing. |
What steps did you take and what happened:
The problem is intermittent. Sometimes, after deploying a workload or adding a new IngressRoute, Envoy does not update its configuration on all pods. It does not happen all the time, and when it happens, Envoy and Contour seem stuck in a state that only killing the Envoy or Contour pod can fix. Usually only a few pods are not updated, but that is enough to receive 503 responses from Envoy because the IPs of the target pods are not up to date.
What did you expect to happen:
I do realize that network issues can happen, and thus the connection between Envoy and Contour can be lost. However, it should reconnect properly without any human intervention.
Anything else you would like to add:
So our setup:
I took a look at the logs of Contour and Envoy and there is nothing substantial. We do have these logs:
By looking at the Envoy metrics, we could see a lot of configuration update failures (RDS and CDS).
Envoy Configuration for Contour
Environment: