
Envoy does not receive update from Contour anymore #1523

Closed
FournierAlexandre opened this issue Sep 17, 2019 · 27 comments
@FournierAlexandre

FournierAlexandre commented Sep 17, 2019

What steps did you take and what happened:
The problem is intermittent. Sometimes, after deploying a workload or adding a new IngressRoute, Envoy does not update its configuration on all pods. It does not happen all the time, and when it does, Envoy and Contour seem to be stuck in a state that only killing the Envoy or Contour pod can fix. Only a few pods are out of date, but that is enough to get 503 responses from Envoy because the IP of the target pod is stale.

What did you expect to happen:
I do realize that network issues can happen and that the connection between Envoy and Contour can be lost. However, it should reconnect properly without any human intervention.
Anything else you would like to add:
So our setup:

  • Envoy is deployed as a daemonset (on around 15 nodes)
  • Contour is deployed using a deployment of 1 replica
    I took a look at the logs of Contour and Envoy and there is nothing substantial. We do have these logs:

msg="forcing update" context=HoldoffNotifier last_update=5.75642593s pending=1
msg="performing delayed update" context=HoldoffNotifier last_update=237.860631ms pending=39
...

Looking at the Envoy metrics, we could see a lot of configuration update failures (RDS and CDS).
Envoy Configuration for Contour

"static_resources": {
    "clusters": [
     {
      "name": "contour",
      "type": "STRICT_DNS",
      "connect_timeout": "5s",
      "circuit_breakers": {
       "thresholds": [
        {
         "priority": "HIGH",
         "max_connections": 100000,
         "max_pending_requests": 100000,
         "max_requests": 60000000,
         "max_retries": 50
        },
        {
         "max_connections": 100000,
         "max_pending_requests": 100000,
         "max_requests": 60000000,
         "max_retries": 50
        }
       ]
      },
      "http2_protocol_options": {},
      "alt_stat_name": "heptio-contour_contour_8001",
      "load_assignment": {
       "cluster_name": "contour",
       "endpoints": [
        {
         "lb_endpoints": [
          {
           "endpoint": {
            "address": {
             "socket_address": {
              "address": "contour",
              "port_value": 8001
             }
            }
           }
          }
         ]
        }
       ]
      }
     },
Cluster
----------------------
contour::default_priority::max_connections::100000
contour::default_priority::max_pending_requests::100000
contour::default_priority::max_requests::60000000
contour::default_priority::max_retries::50
contour::high_priority::max_connections::100000
contour::high_priority::max_pending_requests::100000
contour::high_priority::max_requests::60000000
contour::high_priority::max_retries::50

Environment:

  • Contour version: 0.13.0
  • Envoy: 1.10
  • Kubernetes version: 1.13
  • Rancher : 2.2.3
  • Cloud provider : Open Stack
  • OS : 4.19.43-coreos
@stevesloka
Member

I've seen similar issues in #1514, which I'm actively working on now, but I'm interested to see that you're running v0.13 and still seeing this, which makes me think it's not the same issue.

I'll post back with the resolution to my issue, possibly that will resolve your problems as well.

@davecheney
Contributor

@FournierAlexandre thank you for raising this issue. I have determined that #1514 is not present in Contour 0.15 and earlier so it is probable that the issue you are seeing is not #1514.

The first thing to do is upgrade to Contour 0.15.0. Can you please do that and let me know if the problem continues? Thank you.

@davecheney davecheney added the blocked/needs-info Categorizes the issue or PR as blocked because there is insufficient information to advance it. label Sep 18, 2019
@FournierAlexandre
Author

Hello @davecheney, thanks for the quick reply. I will upgrade Contour to 0.15 and let you know if we still have the same problem.

Thanks!

@FournierAlexandre
Author

Hello @davecheney, we did update to the latest version and we see the same behavior. The configuration is not the same on each node. I have attached the configuration dumps from all the nodes of the cluster; I am not sure whether that helps, let me know.

Do you have any pointers on how I can debug this problem, or on what can cause this issue?

Thanks!
clusterConfig2019-09-19.zip

@youngnick youngnick self-assigned this Sep 20, 2019
@youngnick
Member

youngnick commented Sep 20, 2019

Hey @FournierAlexandre, I'm picking this issue up. I can see how an intermittent problem like this would be very frustrating, so I'd really like to get to the bottom of it with you.

Firstly, though, so I can do my best to reproduce the issue, can you pass me the Contour deployment and Envoy daemonset YAML? Scrub names if you need to, but I really need to be able to make sure I'm running a config that's the same as yours to be able to troubleshoot properly.

While I'm asking for info: I can see from the Envoy configs you sent that you have a reasonable number of services. Can you provide any information about how you're using IngressRoute?

Ideally, you could send a sample set of IngressRoute YAMLs, but if that's not practical, the information I really need is:

  • how many IngressRoutes do you have?
  • How much do you use delegation?
  • How often do they change?

That way, I can see if I can replicate something close to what you have.

The last thing I'd like is some sample logs from Envoy and Contour around the time you've had the problem as well.

I know that's a lot of information, these intermittent problems are tricky.

Thanks very much!

@FournierAlexandre
Author

Hello @youngnick, thank you very much for your help! Attached is all the YAML (I think) you need to reproduce our setup. Let me know if anything is missing.

manifests.zip

Number of IngressRoutes: 31
How often do they change: several times per day
Delegation: all of them

We have a shared cluster for all the teams. They can create IngressRoutes in their respective namespaces and use Envoy as the single load balancer of our Kubernetes cluster. Basically, all traffic goes to Envoy and each team decides which routes go to their namespace.

Here is an example of an IngressRoute:

route.zip
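For readers without the attachments, a delegated setup like the one described above might look roughly like this; the names, namespaces, and FQDN here are purely illustrative, not taken from the attachment:

apiVersion: contour.heptio.com/v1beta1
kind: IngressRoute
metadata:
  name: root                      # owned by the ingress team
  namespace: runway-ingress
spec:
  virtualhost:
    fqdn: app.example.org
  routes:
  - match: /
    delegate:                     # hand this path off to the team's namespace
      name: app-routes
      namespace: team-a
---
apiVersion: contour.heptio.com/v1beta1
kind: IngressRoute
metadata:
  name: app-routes                # owned by the team; no virtualhost here
  namespace: team-a
spec:
  routes:
  - match: /
    services:
    - name: app
      port: 80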

@youngnick
Member

Hi @FournierAlexandre, sorry to ask for more, but could you also check the Contour pod metrics for me? Specifically, I'm looking for average CPU usage, average memory usage, and if the pod is ever OOMkilled by the memory limit of 100Mi.

@FournierAlexandre
Author

Hello @youngnick, I did not see any restarts over the last 2 days, but here are the metrics that we have:
ContourCPURam-23-09-2019
ContourCPURamLimit-23-09-2019

Let me know if you need more information. On my side, I am setting up my dev environment to trace the code. I will let you know if I find something.

Thanks again for your help!

@youngnick
Member

Hi @FournierAlexandre, thanks for that data, I was curious to see if there were any OOMKills from hitting the memory limit - good to know that Contour is behaving on the memory front at least.

I'm still trying to narrow in on what could be the cause here, so I'm asking for a lot of info, please bear with me.

I'd really like to see the full logs for a Contour and an Envoy around the time that the issue occurs. Has it happened as much since you upgraded to v0.15.0? If so, is there any way I could get some of those logs?

In the past, we've seen partial configuration problems caused by Envoy reading config from a Contour that hasn't finished syncing its internal cache with the Kubernetes API yet, but that doesn't seem to apply here because the Envoy state is stable.

I've been over the Envoy configs you kindly supplied a few times and I can't see anything obvious, so we are reduced to this slow information gathering. I'm sorry not to be able to help more yet.

So, in summary, the information I'd like please:

  • Contour logs from the time of the issue
  • Envoy logs from the time of the issue
  • Some answers to these questions:
    • how did you first notice the problem?
    • Do you have any other ingress controllers deployed in the cluster? Are you using ingress class annotations at all?

Thanks again for being patient!

@youngnick
Member

Hi @FournierAlexandre, any word on whether the issue has recurred?

@FournierAlexandre
Author

Hello @youngnick,

Sorry for the late response on this issue. We had to recreate our Kubernetes cluster from scratch because of network configuration problems; there were conflicts between internal and external IPs. Also, there were lingering invalid IngressRoutes and certificates, which have been cleaned up.

On a clean cluster it seems that I am not able to reproduce the problem yet. I am setting up synthetic tests that delete pods, create IngressRoutes, etc. to accelerate the rate of change and see if we can reproduce the problem. For now, I cannot provide proper logs around the exact time it happens, because I need to set Envoy logging to debug to get enough verbosity to log connection issues.

To answer your questions above

How did you first notice the problem: we started to get 503 errors when accessing the different services in the Kubernetes cluster. It was usually after pods had been replaced by new ones, i.e. after a deployment.

Do you have any other ingress controllers deployed in the cluster? Are you using ingress class annotations at all: there are no other ingress controllers and we are only using IngressRoute.

Thanks again for your help, I will stress test our new cluster and provide you with information in the following days.

@youngnick
Member

Thanks for all the info @FournierAlexandre! I'll be waiting to see what happens for you. My fingers are crossed.

Thanks for being patient with all my information requests.

@FournierAlexandre
Author

FournierAlexandre commented Oct 1, 2019

Hello @youngnick ,

We have been able to reproduce the problem (it is harder since the upgrade) and I have attached the Contour logs (Envoy did not produce any relevant logs, only HTTP calls).

The problem happens around 2019-10-01T13:30; it seems that there is a forced update from Contour and I notice some empty resource_names (it may be a log thing). We will try to get you debug logs from Envoy. Also, there are 3 invalid IngressRoutes in the cluster; this could be the cause of the issue.

I don't know if this is relevant but we are usually unable to reproduce it during the day after a restart of the Envoy pods. It is usually an overnight thing. It definitely could be a coincidence.

Thanks again for your help!

contour.log

@davecheney
Contributor

@FournierAlexandre thank you for your reply

The problem happens around 2019-10-01T13:30; it seems that there is a forced update from Contour and I notice some empty resource_names (it may be a log thing)

resource_names is an optional filter that Envoy can supply to Contour to subscribe to updates for only a particular resource. Depending on the type of resource being subscribed to, Envoy may want to filter for specific resources (ClusterLoadAssignments, for example) or to ask for all resources (Listeners, for example).
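For context, this is the rough shape of the xDS v2 DiscoveryRequest Envoy sends over the gRPC stream; the values are illustrative, and the EDS resource name shown is only an example of the namespace/service/port style of name, not necessarily what Contour emits verbatim:

{
  "version_info": "3",
  "node": {"id": "envoy-7xk2p", "cluster": "envoy"},
  "type_url": "type.googleapis.com/envoy.api.v2.ClusterLoadAssignment",
  "resource_names": ["team-a/app/80"],
  "response_nonce": "42"
}

For EDS, resource_names pins the subscription to specific clusters; for LDS and CDS it is typically left empty, meaning "send me everything", which is one reason an empty resource_names in the logs is not by itself an error.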

@davecheney
Contributor

davecheney commented Oct 3, 2019

@FournierAlexandre thank you for providing the log file. What I see is that at some point between 2019-10-01T12:43:34Z and 2019-10-01T13:30:12Z an Envoy was added to the daemonset. What is suspicious is that we don't see any connection-closed lines indicating a reset of an Envoy -- however, the log may be truncated before those events happen (probably seconds if not minutes later, depending on how the OS handles dangling connections). Something like this:

time="2019-10-03T03:52:52Z" level=error msg="stream terminated" connection=7 context=grpc error="context canceled"

Can you please do a few things:

  1. Can you provide the envoy logs for the corresponding period
  2. Can you please check the events for that time period.
  3. Can you please make the following changes to your Envoy deployment manifests, https://github.com/projectcontour/contour/blob/master/examples/contour/03-envoy.yaml#L30 so that Envoy identifies itself with its corresponding pod name (sketched below).
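For reference, a minimal sketch of that daemonset change, assuming the pod name is injected through the Kubernetes downward API (container spec abbreviated, image tag illustrative):

containers:
- name: envoy
  image: docker.io/envoyproxy/envoy:v1.11.1   # use whatever image you run today
  args:
  - -c
  - /config/envoy.json
  - --service-cluster
  - envoy
  - --service-node
  - $(POD_NAME)                               # unique per pod, so each Envoy identifies itself distinctly
  env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name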

Also, there are 3 invalid IngressRoute in the cluster. This could be the cause of this issue.

Could you please tell me more about the invalid ingress routes.

@bgagnon
Contributor

bgagnon commented Oct 4, 2019

I am picking up this issue from my colleague @FournierAlexandre.
To improve observability, I've set up the following:

  • a vanilla nginx Deployment and a matching Service
  • an HTTP IngressRoute with a known FQDN, beacon.run.ubisoft.org, that points to that service
  • Prometheus discovery that keeps track of all Envoy pods (Daemonset, 16 nodes, 16 pods)
  • blackbox_exporter probes through each specific Envoy pod (using http://$POD_IP:8443), with an override of the Host header to match the beacon FQDN and satisfy the Contour/Envoy routing
  • another Deployment with a simple bash script that periodically patches the Nginx deployment to change an annotation on the pod template spec (see the loop sketched after this list)
  • some Prometheus alerts to catch any failures of the beacon probe
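The rotation script in the last-but-one bullet boils down to something like this; the namespace, deployment name, annotation key, and interval here are illustrative, not the exact script we run:

#!/bin/bash
# Periodically touch an annotation on the pod template so the Deployment
# rolls its pods and forces Endpoint (EDS) churn through Contour/Envoy.
while true; do
  kubectl -n beacon patch deployment nginx --type=merge \
    -p "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"beacon/touched-at\":\"$(date +%s)\"}}}}}"
  sleep 300
done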

Parameters:

  • Prometheus scrapes each of the 16 probes every 30 seconds
  • the Nginx deployment is invalidated every 300 seconds
  • the probe reports a failure if the status code is not 200 or if the page does not contain the phrase "Welcome to nginx!" (a matching blackbox_exporter module is sketched below)
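A blackbox_exporter module expressing those probe criteria looks roughly like this; the module name is ours, and the Host header matches the beacon FQDN mentioned above:

modules:
  beacon_http:
    prober: http
    timeout: 5s
    http:
      headers:
        Host: beacon.run.ubisoft.org        # route through Contour/Envoy by FQDN
      valid_status_codes: [200]
      fail_if_body_not_matches_regexp:
      - "Welcome to nginx!"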

On the logging side, we have:

  • centralized logging for all containers and control plane with 30 days retention
  • persistent logging for all Kubernetes Events (also 30 days)
  • Envoy logging set to the default level of info
  • Contour logging with default configuration

This monitoring setup is meant to:

  • force regular traffic through Envoy and a known FQDN
  • force EDS updates at regular intervals
  • catch the exact time where an Envoy stops updating itself

Immediately after setting up the probe, 4 out of the 16 Envoys were failing consistently. I've deleted each pod one after the other and confirmed that the new pod on the same node then passed the probe.

Since then (2 days ago), I have not observed any desynchronization, which could mean:

  • it takes more time to reach that state
  • forcing frequent updates prevents the condition from happening

I really hope it's the first option and not the second, but I'll be watching closely.

In the meantime, I am gathering additional logs from the past to answer @davecheney's questions. I do have evidence of the stream terminated errors in the logs which were not included in @FournierAlexandre's attachment.

We've also fixed the --service-node flag to be unique for each Envoy (ie. pass in the pod name).

Finally, we no longer have invalid ingress routes, but I will try to deliberately create some to see if it triggers the condition.

@bgagnon
Contributor

bgagnon commented Oct 4, 2019

Some graphs around the event:

[graph screenshot]

Legend:

  • the red dotted lines are k8s events where an Envoy exited or started; they correspond to manual kubectl delete pod envoy-*
  • the blue dotted line shows where the Nginx rotation interval was increased from 60s to 300s (ie. 80% reduction of Endpoint churn)

Observations:

  • the Contour DAG rebuild sawtooth looks OK with a predictable forced refresh at regular intervals
  • the four probes were initially failing on 503 errors (_cx_connect_fail) -- an EDS update bug
  • the same four probes then switched to 404 errors -- a CDS update bug

@davecheney
Contributor

davecheney commented Oct 4, 2019 via email

@bgagnon
Contributor

bgagnon commented Oct 7, 2019

What was failing?

Two failure states, on the same Envoy pods.

In the first case, Envoy was failing to connect to one of the upstream cluster members. This turned out to be an incorrect set of pod IP addresses that no longer matched the k8s Endpoints.

This was confirmed by three sources:

  • the envoy_cluster_upstream_cx_connect_fail Prometheus counter monotonically increasing on those pods
  • the UF response flag (documented here) in the Envoy access logs next to each of the 503 errors
  • manual stats dumps on each Envoy's admin interface, which revealed inconsistencies in the cluster member IP addresses -- note that this is not available in a config dump and not exposed to Prometheus metrics, so the non-Prometheus stats are the only way we found to detect this

We concluded that EDS was broken/stuck on those 4 pods.

In a later case on the same pods, Envoy was failing to discover the new cluster/route. We transitioned to this state when the Service, IngressRoute and Pods were moved from one namespace to another.

This was confirmed by two sources:

  • the 404 status code on each request
  • the absence of any envoy_cluster_* time series on the 4 Envoy pods for the specific envoy_cluster_name label (the 12 other Envoys had such time series)

We concluded that RDS and/or CDS was also broken/stuck on those 4 pods.

How did you detect it?

The Prometheus/Alertmanager alert fired immediately for those Envoy pods once the probe and alert rules were deployed.

How did it manifest itself?

Any request through the beacon service's IngressRoute would consistently fail with a 503 error, then a 404 error.

Before the introduction of the probe and forced testing on every Envoy pod, there was a 25% chance (4/16) of hitting one of the affected Envoys through the cloud load balancer assigned to the DNS records.

Furthermore, it only manifested itself if there was traffic on those routes, of which there is little to none at this stage of our rollout.

@youngnick
Member

Thanks @bgagnon for all the information you've provided.

The key part for me out of what you wrote in that last comment is this:

This turned out to be an incorrect set of pod IP addresses that no longer matched the k8s Endpoints.

What I'm thinking is that possibly, this is happening to the connection back to Contour. Remember that the connection back to Contour is a statically defined cluster that is then referenced as the source for EDS, CDS, RDS, SDS, and LDS.
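For anyone following along, the bootstrap wires that static cluster in as the xDS source roughly like this (v2 bootstrap field names, trimmed to just LDS/CDS for brevity):

"dynamic_resources": {
  "lds_config": {
    "api_config_source": {
      "api_type": "GRPC",
      "grpc_services": [
        {"envoy_grpc": {"cluster_name": "contour"}}
      ]
    }
  },
  "cds_config": {
    "api_config_source": {
      "api_type": "GRPC",
      "grpc_services": [
        {"envoy_grpc": {"cluster_name": "contour"}}
      ]
    }
  }
}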

If, for some reason, that cluster (the Contour one) ended up with a stale set of IP addresses, then the problem would be unrecoverable without a restart, since Envoy has no way to reach Contour to get the updated addresses that would fix the problem.

That shouldn't happen, because you are using the contour DNS entry, which should always be pointing to the set of Kubernetes Endpoints that are the Contour pods, via the Contour Service. (That's why we recommend doing it that way).

However, if Envoy is ending up with a stale set of addresses for Contour, that means that we have to look at kube-proxy and kube-dns and what they are doing, as well as what Envoy is doing about caching DNS entries.

Actual, testable things that come out of this theory:

  • Contour restarts may be a trigger. Contour pods cycling is the only thing that would induce a change in the contour Endpoint set. Do you see any correlation between Contour restarts and this problem? (A quick way to check is sketched below.)
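One quick way to check for that correlation, sketched here with an assumed namespace and an assumed Envoy admin port of 9001 (adjust both to your deployment):

# Watch the contour Service endpoints for churn (i.e. Contour pod restarts):
kubectl -n runway-ingress get endpoints contour -w

# Compare against what a given Envoy has cached for its static "contour" cluster:
kubectl -n runway-ingress port-forward <envoy-pod> 9001:9001 &
curl -s http://127.0.0.1:9001/clusters | grep '^contour::'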

Actions I'll take for now:

  • Read and grok the Envoy documentation for STRICT_DNS clusters (which the Contour cluster is), and make sure there are no settings to tune there.
  • Check and see if there are any stats I can find that would show information about Contour cluster members, so we can try and check the history.

@bgagnon
Contributor

bgagnon commented Oct 8, 2019

If, for some reason, that cluster (the Contour one) ended up with a stale set of IP addresses, then the problem would be unrecoverable without a restart, since Envoy has no way to reach Contour to get the updated addresses that would fix the problem.

Ah yes, that's a very good theory!

We had obviously suspected something along the lines of a broken connection between Envoy and Contour, but we hadn't considered the possibility of an issue with the STRICT_DNS discovery and/or CoreDNS-related problems.

Contour restarts may be a trigger. Contour pods cycling is the only thing that would induce a change in the contour Endpoint set. Do you see any correlation between Contour restarts and this problem?

I checked, and unfortunately there was a single Contour pod for the entire lifetime of the cluster (since Sep 24). However, that does not mean there wasn't a DNS blip that caused the contour upstream cluster to empty or break itself, losing connectivity with Contour.

I have just killed it hoping to reproduce the bug, and I have observed the following:

[graph screenshot]

The two yellow dotted lines represent the "contour killed" and "contour start" k8s events.

There's an immediate increase in envoy_cluster_upstream_cx_connect_fail right after; this is expected since the contour pod IP has changed. However, it resolves itself quickly, in less than a minute or so, after which the errors no longer increase.

Meanwhile, the probes and membership changes for the dynamic cluster are constant, and no probes fail. This is also expected since we rotate the Nginx beacon pods only every 5 minutes.

I will now try the following:

  • simulate a prolonged loss of Contour, long enough to cause the probes to fail
  • I'll do this by setting the spec.replicas of the Contour deployment to 0
  • I will increase spec.replicas back to 1 about 10 minutes later (commands sketched below)
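Concretely, something along these lines (namespace assumed):

# Simulate a prolonged Contour outage:
kubectl -n runway-ingress scale deployment contour --replicas=0
# ...wait ~10 minutes, long enough for probes to fail...
kubectl -n runway-ingress scale deployment contour --replicas=1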

@bgagnon
Contributor

bgagnon commented Oct 8, 2019

Voluntarily killing the Contour pod and having zero replicas for 10 minutes caused all probes to fail, as expected:

[graph screenshot]

Legend:

  • yellow markers for contour kill / contour start
  • purple markers for nginx pod rotations

Since all Envoys recovered almost instantly when contour's spec.replicas was scaled back up to 1, this experiment was not successful in "corrupting" Envoys or causing them to stop receiving updates.

@bgagnon
Contributor

bgagnon commented Oct 8, 2019

Here are some Contour logs between the creation of its pod and the last known occurrence of this bug:

contour-5cf94978b5-krkzq	time="2019-10-01T07:33:55Z" level=info msg="forcing update" context=contourEventHandler last_update=1h32m12.378831299s outstanding=1
contour-5cf94978b5-krkzq	time="2019-10-01T10:01:42Z" level=error msg="stream terminated" connection=3281 context=grpc error="context canceled"
...previous message repeated several hundred times for every connection...
contour-5cf94978b5-krkzq	time="2019-10-01T10:01:42Z" level=error msg="stream terminated" connection=3306 context=grpc error="rpc error: code = Unavailable desc = transport is closing"
...previous message repeated several hundred times for every connection...
contour-5cf94978b5-krkzq	time="2019-10-01T07:33:55Z" level=info msg="forcing update" context=contourEventHandler last_update=1h32m12.378831299s outstanding=1
contour-5cf94978b5-krkzq	time="2019-10-01T07:33:55Z" level=info msg="performing delayed update" context=contourEventHandler last_update=198.782526ms outstanding=15

@bgagnon
Contributor

bgagnon commented Oct 21, 2019

After waiting for several days with our monitoring in place, we noticed zero occurrences of the problem. This got us to switch gears and consider that we may be experiencing the observer effect -- forcing EDS changes prevents the bug from happening.

So we tried something else: over the weekend, we turned off the forced EDS updates. This caused complete xDS silence for a prolonged period -- the cluster is not busy and developers don't deploy on the weekend, so things are quiet.

And behold, on Monday morning, we re-enabled the probing and 15/16 Envoys were failing:

[graph screenshot]

Legend:

  • orange dotted lines represent the stop of forced EDS updates on 10/19 and the re-activation on 10/21 (to detect the problematic envoys)
  • green/red bar chart shows number of Envoy pods affected (out of 16 total)

We started looking into idle / timeout / keepalive issues and found some open bugs on Envoy, notably envoyproxy/envoy#6323

The Contour bootstrap did not mitigate this bug by setting the TCP keepalive options. The stars were aligned: @youngnick fixed this less than 24 hours ago in #1744.

While TCP keepalive might be fixed, there are still some open questions on the gRPC side after reading https://github.com/grpc/grpc/blob/master/doc/keepalive.md
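For reference, the TCP keepalive mitigation amounts to setting upstream connection options on the static contour cluster in the bootstrap, roughly like this (values are illustrative, not necessarily the exact ones #1744 ships):

"clusters": [
  {
    "name": "contour",
    "type": "STRICT_DNS",
    "upstream_connection_options": {
      "tcp_keepalive": {
        "keepalive_probes": 3,
        "keepalive_time": 30,
        "keepalive_interval": 5
      }
    }
  }
]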

@bgagnon
Contributor

bgagnon commented Oct 21, 2019

I spent some time looking at the netstat information on each side (Contour and Envoy), and at this point it's pretty clear that 15 out of the 16 TCP connections are half-closed. This explains why Envoy is not receiving any new xDS updates over gRPC.

Below is a recap of the investigation.


First, we locate the node where the Contour pod is running:

$ kubectl get pod -l app=contour -n runway-ingress -o json | jq '.items[0].spec.nodeName'
"runway-uks-ncsa-east-ne1-worker8"

Then we SSH to the node (runway-uks-ncsa-east-ne1-worker8), and find contour's Docker container id:

runway-uks-ncsa-east-ne1-worker8$ docker ps -f "label=io.kubernetes.container.name=contour"
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
8a87e0d7b76d        978cfcd67636        "contour serve --inc…"   12 days ago         Up 12 days                              k8s_contour_contour-5cf94978b5-sqht6_runway-ingress_19c6ebe4-ea06-11e9-bc68-fa163e79a2f2_0

We use netstat to list open connections within that network namespace, via nsenter:

runway-uks-ncsa-east-ne1-worker8$ nsenter -t $(docker inspect -f '{{.State.Pid}}' 8a87e0d7b76d) -n netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 runway-uks-ncsa-e:42298 192.168.128.1:https     ESTABLISHED
tcp6       0      0 192.168.10.66:8000      192.168.14.11:57150     TIME_WAIT
tcp6       0      0 192.168.10.66:8001      192.168.5.0:51586       ESTABLISHED
tcp6       0      0 192.168.10.66:8000      testy.testzonecml:49496 TIME_WAIT
tcp6       0      0 192.168.10.66:8000      192.168.14.11:56938     TIME_WAIT
tcp6       0      0 192.168.10.66:8000      testy.testzonecml:49756 TIME_WAIT
Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags       Type       State         I-Node   Path

We are looking for connections to the local listening port 8001, and we see only one open connection:

tcp6       0      0 192.168.10.66:8001      192.168.5.0:51586       ESTABLISHED

Already we know there's a problem, because we would expect N open connections for N running Envoy pods
(N=16 in our environment).

The Foreign Address of 192.168.5.0:51586 identifies the node within the overlay network.
For this, we check the spec.podCIDR attribute of every node and look for 192.168.5.0/24:

$ kubectl get nodes -o json | jq '.items[] | select(.spec.podCIDR == "192.168.5.0/24") | .metadata.name'
"runway-uks-ncsa-east-ne1-etcd4"

Now we know that the only connected Envoy pod is the one running on runway-uks-ncsa-east-ne1-etcd4.

We confirm this from the other side by SSH to the node.
First, we find the docker container for Envoy:

runway-uks-ncsa-east-ne1-etcd4$ docker ps -f "label=io.kubernetes.container.name=envoy"
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
012018109ad8        d3340c53fdcd        "envoy -c /config/co…"   2 weeks ago         Up 2 weeks                              k8s_envoy_envoy-gwc6q_runway-ingress_a32ed7ee-e6dc-11e9-bc68-fa163e79a2f2_0

Knowing the container id, we resolve its PID and then list open connections with netstat, via nsenter:

runway-uks-ncsa-east-ne1-etcd4$ nsenter -t $(docker inspect -f '{{.State.Pid}}' 012018109ad8) -n netstat | grep 8001
tcp        0      0 runway-uks-ncsa-e:51586 192.168.138.21:8001     ESTABLISHED

The destination address is 192.168.138.21:8001 which matches the cluster service IP:

$ kubectl get svc contour -n runway-ingress
NAME      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
contour   ClusterIP   192.168.138.21   <none>        8001/TCP   17d

This connection is legit. It is ESTABLISHED on both sides and the addresses are those we expect.

Next, we try any other node and see if it thinks it has an open connection from Envoy to Contour.
For example, runway-uks-ncsa-east-ne1-worker1:

runway-uks-ncsa-east-ne1-worker1$ docker ps -f "label=io.kubernetes.container.name=envoy"
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
b4911deeb5c8        d3340c53fdcd        "envoy -c /config/co…"   2 weeks ago         Up 2 weeks                              k8s_envoy_envoy-xcf9j_runway-ingress_b6896a5f-e6dc-11e9-bc68-fa163e79a2f2_0

We run netstat via nsenter, looking for port :8001:

runway-uks-ncsa-east-ne1-worker1$ nsenter -t $(docker inspect -f '{{.State.Pid}}' b4911deeb5c8) -n netstat | grep :8001
tcp        0      0 runway-uks-ncsa-e:37230 192.168.138.21:8001     ESTABLISHED

Indeed, this side thinks it is connected to 192.168.138.21:8001, but there is no sign of this connection on the other side. Envoy thinks it has a live connection, but that's not true.

At this point, we have identified a half open (or is it half-closed?) TCP connection, which is not discoverable from this side of the connection without a keep-alive strategy.

This can be shortened to a one-liner and tried on each of the 16 nodes:

$ nsenter -t $(docker inspect -f '{{.State.Pid}}' $(docker ps -qf 'label=io.kubernetes.container.name=envoy')) -n netstat | grep tcp | grep :8001

We find that 16 of them claim to be connected to Contour, but only one is actually connected because Contour only sees one ESTABLISHED connection.

Since Envoy is the client to Contour, there is no way for Contour to force the client to re-establish
a TCP connection.

@bgagnon
Contributor

bgagnon commented Oct 25, 2019

We have tested v0.15.3 with the TCP and gRPC keepalive patches and they fix the issue.
This issue can be closed.

Thanks @youngnick , @ravilr and @davecheney!

@youngnick
Member

Fantastic to hear @bgagnon! Thanks again for all your assistance. Closing.
