Orphan sidecar proxies affect service mesh #15908
Comments
@mr-miles How best can we contact you? This issue is very interesting and we would love to touch base with you and get a bit more detail. Thanks.
@nrichu-hcp I'll mail you directly now.
Hi everyone, it seems I've hit the same issue.
I chatted with @nrichu-hcp last week about it, hoping that HashiCorp can give me some more pointers on where to look for the smoking gun. When it happens to a terminating gateway, it is a killer for us. We had an instance over the weekend. Looking at the logs and comparing them with a cycle caused by a deployment, I noted:
Currently I suspect there is some event-order dependence in the controller, but I couldn't work out what exactly it is subscribed to in order to investigate further. We also use Karpenter, and that seems to be implicated somehow (it certainly tries to consolidate instances in the middle of the night, which is also when this occurs quite often). Maybe it is shutting the node down rather than the pods, and not all the events are making it out... or something.
@vorobiovv can you give us more context on how this issue starts? @mr-miles unfortunately, for us to help you further we do need you to be able to reproduce the issue; that way we can take a crack at it ourselves and see where it leads.
I've had some success reproducing this at last. The steps with the greatest chance of success seem to be:
I'm tailing events from the cluster and see things like this:
{"level":"info","ts":1674767502.7706947,"logger":"event","caller":"kube-event-tail/main.go:98","msg":"Error updating Endpoint Slices for Service consul/consul-ingress-gateway: skipping Pod consul-ingress-gateway-787c754bf9-xbrbz for Service consul/consul-ingress-gateway: Node ip-xxx Not Found","event":{"namespace":"consul","name":"consul-ingress-gateway.173df76d57f3d267","involvedObject":{"name":"Service/consul-ingress-gateway"},"reason":"FailedToUpdateEndpointSlices","source.component":"endpoint-slice-controller","firstTimestamp":1674766685,"lastTimestamp":1674766685,"count":1,"type":"Warning"}}
In one occurrence, I saw only the EndpointSlice error. In a subsequent attempt I got these two errors. I also observe that the endpoints controller doesn't receive the event about the service to deregister, so it doesn't seem like the problem is in the service catalog.
Last of all, I picked through the Helm chart and noticed that the connect-injector pod doesn't have any anti-affinity or topologySpreadConstraints by default. Does that make it possible for there to be nothing running to pick the events up in some circumstances? How long a backlog of events will the controller receive when it becomes leader, if the original leader is brutally killed?
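For illustration, spreading the injector replicas across nodes might look like the sketch below. This is only a sketch under assumptions: the deployment name, namespace, and labels are taken from a default `consul` Helm release (verify them with `kubectl get pods --show-labels`), and a direct patch like this would be overwritten by the next `helm upgrade`, so the chart's own values are the better long-term place for it if your chart version exposes one.

```sh
# Sketch only: spread connect-injector replicas across nodes so a single
# node removal is less likely to take out the leader along with its event backlog.
# Names and labels are assumptions from a default "consul" release.
kubectl -n consul patch deployment consul-connect-injector --type merge -p '{
  "spec": {
    "template": {
      "spec": {
        "topologySpreadConstraints": [{
          "maxSkew": 1,
          "topologyKey": "kubernetes.io/hostname",
          "whenUnsatisfiable": "ScheduleAnyway",
          "labelSelector": {"matchLabels": {"app": "consul", "component": "connect-injector"}}
        }]
      }
    }
  }
}'
```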
@nrichu-hcp - have you tried to reproduce the issue? Do you need any more information? There are a few tickets across the various repositories relating to orphans in the service catalog that all implicate node removal.
The following PR, which may address your issues, is now merged: hashicorp/consul-k8s#2571. This should be released in consul-k8s 1.2.x, 1.1.x, and 1.0.x by mid-August with our next set of patch releases. I will go ahead and close this, as it is also a duplicate of hashicorp/consul-k8s#2491 and hashicorp/consul-k8s#1817.
Overview of the Issue
We are running a Consul service mesh on Kubernetes (EKS) with external servers and ingress/terminating gateways. Most of the time this works great!
I've noticed over time that pods in the mesh do not always deregister themselves from Consul cleanly when they are shut down. I am seeing:
The cluster is in our dev environment, and I initially thought this was down to the level of churn as we were iterating on a few things. However, I am still seeing it every few days even with limited deployments. I suspect a race condition, maybe due to a leadership change or some timeout. I also suspect it may be related to Karpenter rearranging the nodes periodically, but I don't have any specific evidence for that.
It is quite impactful, since Envoy keeps routing to the now-non-existent pods in a round-robin way, so every other request gets a 503 response, which makes things very broken.
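One way to confirm this symptom is to check Envoy's own view of the upstream endpoints via its admin API. This is a diagnostic sketch only: it assumes the sidecar's Envoy admin interface listens on port 19000 (a common default for Consul sidecars; adjust if yours differs), and `my-namespace`, `my-app`, and `my-upstream` are placeholder names.

```sh
# Sketch only: inspect Envoy's cluster membership for a stale pod IP.
# Forward the Envoy admin port from one of the affected pods.
kubectl -n my-namespace port-forward deploy/my-app 19000:19000 &
sleep 2

# /clusters lists every endpoint Envoy is load-balancing to, with health flags.
# Stale entries show pod IPs that no longer appear in the Kubernetes endpoints.
curl -s http://127.0.0.1:19000/clusters | grep my-upstream

# Compare against what Kubernetes thinks is live.
kubectl -n my-namespace get endpoints my-upstream -o wide

kill %1
```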
Questions
I have looked through the server and pod logs but haven't found anything useful. Is there a namespace or phrase that is particularly useful to search for in the logs?
What are the mechanics of deregistration when using consul-dataplane in a pod? How does it guard against stragglers when there's an unclean shutdown of the pod/node? (A manual-cleanup sketch follows these questions.)
Are there any other areas where you think deregistration might not complete? This would help guide some more specific testing.
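For the deregistration question above, a rough manual check and cleanup against the Consul catalog HTTP API might look like the sketch below. It is a workaround for stragglers rather than a fix: `my-app`, the node name, and the service ID are placeholders, `CONSUL_HTTP_ADDR` must point at a reachable server/API endpoint, and `CONSUL_HTTP_TOKEN` is only needed if ACLs are enabled. Sidecar proxies typically register under their own service name (e.g. `<service>-sidecar-proxy`), so check both entries.

```sh
# Sketch only: list registered instances of a service and where they claim to run.
curl -s -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
  "$CONSUL_HTTP_ADDR/v1/catalog/service/my-app" \
  | jq -r '.[] | "\(.Node) \(.ServiceID) \(.ServiceAddress)"'

# Compare the addresses above against the pods that actually exist.
kubectl get pods -o wide -l app=my-app

# Force-deregister a stale instance (manual workaround, not a fix).
curl -s -X PUT -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
  -d '{"Node": "example-node-name", "ServiceID": "example-stale-service-id"}' \
  "$CONSUL_HTTP_ADDR/v1/catalog/deregister"
```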
Consul info for both Client and Server
Consul 1.14.3 servers, 5 nodes in AWS
EKS 1.21 cluster, same region
Consul clients installed via Helm chart 1.0.2