Orphan sidecar proxies affect service mesh #15908
Comments
@mr-miles How best can we contact you? This issue is very interesting and we would love to touch base with you and get a bit more detail. Thanks.
@nrichu-hcp I'll mail you directly now.
Hi everyone, it seems I've hit the same issue.
I chatted with @nrichu-hcp last week about it, hoping that HashiCorp can give me some more pointers on where to look for the smoking gun. When it happens to a terminating gateway, it is a killer for us. We had an instance over the weekend. Looking at the logs and comparing them with a cycle caused by a deployment, I noted:
Currently I suspect there is some event-order dependence in the controller, but I couldn't work out what exactly it is subscribed to in order to investigate further. We also use Karpenter, and that seems to be implicated somehow (it certainly tries to consolidate instances in the middle of the night, which is also when this occurs quite often). Maybe it is shutting the node down rather than the pods, and not all the events are making it out... or something.
@vorobiovv can you give us more context on how this issue starts? @mr-miles unfortunately, for us to help you further we do need you to be able to reproduce the issue; that way we can take a crack at it ourselves and see where it leads.
I've had some success reproducing this at last. The steps with the greatest chance of success seem to be:
I'm tailing events from the cluster and see things like this:
{"level":"info","ts":1674767502.7706947,"logger":"event","caller":"kube-event-tail/main.go:98","msg":"Error updating Endpoint Slices for Service consul/consul-ingress-gateway: skipping Pod consul-ingress-gateway-787c754bf9-xbrbz for Service consul/consul-ingress-gateway: Node ip-xxx Not Found","event":{"namespace":"consul","name":"consul-ingress-gateway.173df76d57f3d267","involvedObject":{"name":"Service/consul-ingress-gateway"},"reason":"FailedToUpdateEndpointSlices","source.component":"endpoint-slice-controller","firstTimestamp":1674766685,"lastTimestamp":1674766685,"count":1,"type":"Warning"}}
In one occurrence, I saw only the EndpointSlice error. In a subsequent attempt I got these two errors. I also observe that the endpoints controller doesn't receive the event about the service to deregister, so it doesn't seem like the problem is in the service catalog.
Last of all, I picked through the Helm chart and noticed that the connect-injector pod doesn't have any anti-affinity or topologySpreadConstraints by default. Does that make it possible for there to be nothing running to pick the events up in some circumstances? How long a backlog of events will the controller receive when it becomes leader, if the original leader is brutally killed?
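For illustration, spreading the injector replicas across nodes might look like the sketch below. This is only a sketch under assumptions: the deployment name, namespace, and labels are taken from a default `consul` Helm release (verify them with `kubectl get pods --show-labels`), and a direct patch like this would be overwritten by the next `helm upgrade`, so the chart's own values are the better long-term place for it if your chart version exposes one.

```sh
# Sketch only: spread connect-injector replicas across nodes so a single
# node removal is less likely to take out the leader along with its event backlog.
# Names and labels are assumptions from a default "consul" release.
kubectl -n consul patch deployment consul-connect-injector --type merge -p '{
  "spec": {
    "template": {
      "spec": {
        "topologySpreadConstraints": [{
          "maxSkew": 1,
          "topologyKey": "kubernetes.io/hostname",
          "whenUnsatisfiable": "ScheduleAnyway",
          "labelSelector": {"matchLabels": {"app": "consul", "component": "connect-injector"}}
        }]
      }
    }
  }
}'
```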
@nrichu-hcp - have you tried to reproduce the issue? Do you need any more information? There are a few tickets across the various repositories relating to orphans in the service catalog that all implicate node removal.
The following PR, which may address your issues, is now merged: hashicorp/consul-k8s#2571. This should be released in consul-k8s 1.2.x, 1.1.x, and 1.0.x by mid-August with our next set of patch releases. I will go ahead and close this, as it is also a duplicate of hashicorp/consul-k8s#2491 and hashicorp/consul-k8s#1817.
Overview of the Issue
We are running a Consul service mesh on Kubernetes (EKS) with external servers and ingress/terminating gateways. Most of the time this works great!
I've noticed over time that pods in the mesh do not always deregister themselves from Consul cleanly when they are shut down. I am seeing:
The cluster is in our dev environment, and I initially thought this was down to the level of churn as we were iterating on a few things. However, I am still seeing it every few days even with limited deployments. I suspect a race condition, maybe due to a leadership change or some timeout. I also suspect it may be related to Karpenter rearranging the nodes periodically, but I don't have any specific evidence for that.
It is quite impactful, since Envoy keeps routing to the now-non-existent pods in a round-robin way, so every other request gets a 503 response, which makes things very broken.
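One way to confirm this symptom is to check Envoy's own view of the upstream endpoints via its admin API. This is a diagnostic sketch only: it assumes the sidecar's Envoy admin interface listens on port 19000 (a common default for Consul sidecars; adjust if yours differs), and `my-namespace`, `my-app`, and `my-upstream` are placeholder names.

```sh
# Sketch only: inspect Envoy's cluster membership for a stale pod IP.
# Forward the Envoy admin port from one of the affected pods.
kubectl -n my-namespace port-forward deploy/my-app 19000:19000 &
sleep 2

# /clusters lists every endpoint Envoy is load-balancing to, with health flags.
# Stale entries show pod IPs that no longer appear in the Kubernetes endpoints.
curl -s http://127.0.0.1:19000/clusters | grep my-upstream

# Compare against what Kubernetes thinks is live.
kubectl -n my-namespace get endpoints my-upstream -o wide

kill %1
```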
Questions
I have looked through the server and pod logs but haven't found anything useful. Is there a namespace or phrase that is particularly useful to search for in the logs?
What are the mechanics of deregistration when using consul-dataplane in a pod? How does it guard against stragglers when there's an unclean shutdown of the pod/node? (A manual-cleanup sketch follows these questions.)
Are there any other areas where you think deregistration might not complete? This would help guide some more specific testing.
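For the deregistration question above, a rough manual check and cleanup against the Consul catalog HTTP API might look like the sketch below. It is a workaround for stragglers rather than a fix: `my-app`, the node name, and the service ID are placeholders, `CONSUL_HTTP_ADDR` must point at a reachable server/API endpoint, and `CONSUL_HTTP_TOKEN` is only needed if ACLs are enabled. Sidecar proxies typically register under their own service name (e.g. `<service>-sidecar-proxy`), so check both entries.

```sh
# Sketch only: list registered instances of a service and where they claim to run.
curl -s -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
  "$CONSUL_HTTP_ADDR/v1/catalog/service/my-app" \
  | jq -r '.[] | "\(.Node) \(.ServiceID) \(.ServiceAddress)"'

# Compare the addresses above against the pods that actually exist.
kubectl get pods -o wide -l app=my-app

# Force-deregister a stale instance (manual workaround, not a fix).
curl -s -X PUT -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
  -d '{"Node": "example-node-name", "ServiceID": "example-stale-service-id"}' \
  "$CONSUL_HTTP_ADDR/v1/catalog/deregister"
```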
Consul info for both Client and Server
Consul 1.14.3 servers, 5 nodes in AWS
EKS 1.21 cluster, same region
Consul clients installed via Helm chart 1.0.2