Unable to auto re-register application pods to consul cluster in k8s upon killing a consul client pod #1206
Hi @ashwinkupatkar, this is a well-known limitation of the Consul client today, but we accommodate for it in Consul K8s using our endpoints controller, which reconciles endpoints from K8s to Consul. When a client pod is rescheduled and healthy, we get all Kubernetes endpoints and filter the addresses of endpoints that belong to this client by comparing the Kubernetes nodeName of each address with the nodeName of the Consul client pod. Each endpoint is then re-registered by the endpoints controller's reconciler. I think we need the exact commands, as well as your config.yaml and deployment.yaml, to see if this is indeed an issue and something that is reproducible.
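As a minimal sketch of the filtering step described above (this is not the actual consul-k8s source; the type and function names here are hypothetical simplifications), the nodeName comparison might look like:

```go
package main

import "fmt"

// Address is a simplified stand-in for a Kubernetes endpoint address:
// each address carries the name of the node its pod runs on.
type Address struct {
	IP       string
	NodeName string
}

// filterByNode keeps only the endpoint addresses whose pods run on the
// same Kubernetes node as the rescheduled Consul client pod, mirroring
// the nodeName comparison described in the comment above.
func filterByNode(addrs []Address, clientNodeName string) []Address {
	var out []Address
	for _, a := range addrs {
		if a.NodeName == clientNodeName {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	addrs := []Address{
		{IP: "10.0.0.1", NodeName: "kubeworker03"},
		{IP: "10.0.0.2", NodeName: "kubeworker01"},
	}
	// Only addresses on kubeworker03 belong to the client that was
	// rescheduled onto that node, so only they are re-registered.
	for _, a := range filterByNode(addrs, "kubeworker03") {
		fmt.Println(a.IP)
	}
}
```

Each address that survives the filter would then be re-registered with the restarted client by the reconciler.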
Sure @david-yu, let me provide you with the commands to reproduce the issue. Will reply back.
The steps are as below:
This shows that despite the application pod being healthy and serviceable, it won't receive traffic, since it's not registered with Consul.
Hey @ashwinkupatkar, I just tried these steps and was not able to reproduce. I tried it on kind with k8s version 1.21.2. We also have an acceptance test that verifies that services are re-registered if a Consul client pod is killed here: consul-k8s/acceptance/tests/connect/connect_inject_test.go Lines 222 to 224 in 53e191b
We run this test on different versions of Kubernetes and different Kubernetes distributions, and it has been passing. Could you give us more detailed steps so we can reproduce this issue? What are your Helm values?
Hi @ishustava, I could reproduce the issue again today. I am using Kubernetes v1.21.9. I have a federated Consul cluster setup, with the primary on VMs and the secondary on K8s. Consul -> 1.11.4. My override configuration looks like below (redacted a few details):

```yaml
global:
  datacenter: <primary-dc>
  image: "consul:1.11.4"
  imageK8S: "hashicorp/consul-k8s-control-plane:0.42.0"
  name: "consul"
  tls:
    enabled: true
    enableAutoEncrypt: true
    caCert:
      secretName: <consul-ca-cert-name>
      secretKey: tls.crt
    caKey:
      secretName: <consul-ca-key-name>
      secretKey: tls.key
  imageEnvoy: "envoyproxy/envoy-alpine:v1.20.0"
  federation:
    enabled: true
    primaryDatacenter: "<primary-dc>"
    primaryGateways: ["<primary_meshgateway_ip>:<port>"]
    k8sAuthMethodHost: "https://<k8s_host_ip>:<port>"
  gossipEncryption:
    secretName: <consul-encrypt-key-name>
    secretKey: gossipEncryptionKey
  acls:
    manageSystemACLs: true
    replicationToken:
      secretName: <consul-replication-token-name>
      secretKey: replicationToken
  metrics:
    enabled: true
    enableAgentMetrics: false
    agentMetricsRetentionTime: 1m
    enableGatewayMetrics: true
  consulSidecarContainer:
    resources:
      requests:
        memory: xxxxMi
        cpu: xxxxm
      limits:
        memory: xxxxMi
        cpu: xxxxm
server:
  enabled: true
  containerSecurityContext:
    # The consul server agent container
    # @type: map
    # @recurse: false
    server:
      runAsNonRoot: true
      runAsGroup: xxxx
      runAsUser: xxxx
      allowPrivilegeEscalation: false
  storageClass: <storage-class-name>
  resources:
    requests:
      memory: "xxxxMi"
      cpu: "xxxxm"
    limits:
      memory: "xxxxMi"
      cpu: "xxxxm"
  extraConfig: |
    {
      "primary_datacenter": "<primary-dc>",
      "primary_gateways": ["<primary_meshgateway_ip>:<port>"],
      "ports": {"serf_wan": <serf_port>},
      "telemetry": {
        "prometheus_retention_time": "1m"
      }
    }
client:
  containerSecurityContext:
    # The consul client agent container
    # @type: map
    # @recurse: false
    client:
      runAsNonRoot: true
      runAsGroup: xxxx
      runAsUser: xxx
      allowPrivilegeEscalation: false
    # The acl-init initContainer
    # @type: map
    # @recurse: false
    aclInit:
      runAsNonRoot: true
      runAsGroup: xxxx
      runAsUser: xxx
      allowPrivilegeEscalation: false
    # The tls-init initContainer
    # @type: map
    # @recurse: false
    tlsInit:
      runAsNonRoot: true
      runAsGroup: xxxx
      runAsUser: xxx
      allowPrivilegeEscalation: false
  extraConfig: |
    {
      "ports": {"serf_wan": <serf_port>},
      "telemetry": {
        "prometheus_retention_time": "1m"
      }
    }
connectInject:
  enabled: true
  default: false
  transparentProxy:
    defaultEnabled: false
  metrics:
    defaultEnableMerging: true
  failurePolicy: "Ignore"
  sidecarProxy:
    resources:
      requests:
        memory: xxxxMi
        cpu: xxxxm
      limits:
        memory: xxxxMi
        cpu: xxxxm
  envoyExtraArgs: "- -- --restart-epoch 0"
controller:
  enabled: true
meshGateway:
  enabled: true
  replicas: 1
  service:
    enabled: true
    type: NodePort
    nodePort: <node_port_for_meshgateway>
  consulServiceName: "mesh-gateway"
  resources:
    requests:
      memory: "xxxxMi"
      cpu: "xxxxm"
    limits:
      memory: "xxxxMi"
      cpu: "xxxxm"
ingressGateways:
  defaults:
    replicas: 1
  enabled: true
  gateways:
    - name: <secondary-ingress-gateway-name>
      service:
        type: NodePort
        ports:
          - port: <http_port_for_ingressgateway>
            nodePort: <http_node_port_for_ingressgateway>
          - port: <https_port_for_ingressgateway>
            nodePort: <https_node_port_for_ingressgateway>
```

Please let me know what configuration, if any, I am missing here. Thank you.
The steps to reproduce the issue are as provided above. Let me put them in simpler terms:

1. Deploy some app on k8s with the service mesh.
2. Make a note of the k8s worker node to which the Consul agent pod is tied; the application is registered to this Consul agent pod. For example, a service named "Testapp" with service id "Testapp-9hds9-hi89-Testapp" is registered to a Consul agent pod, and this Consul agent pod is running on some k8s worker node, say kubeworker03.
3. Manually delete this Consul agent pod using the kubectl delete pod command.
4. After some time a new Consul agent pod comes up and re-registers the kubeworker03 node, but the "Testapp" service with service id "Testapp-9hds9-hi89-Testapp" does not show up.

This is a big problem. Despite the service instance being healthy, it won't receive any traffic from the mesh, since it's not part of the catalog itself. Let me know if any more inputs are required from my end; I am happy to provide them. Thanks
Hey @ashwinkupatkar could you provide your values overrides as YAML?
I do not see a way to properly represent it in a comment. I have provided it as an attachment now.
Hi @ashwinkupatkar, thank you for the information and your patience. As @ishustava mentioned, we have an acceptance test that validates this scenario. I've also tried recreating this on kind and in GKE with the same versions of Consul and Consul on Kubernetes that you used. I've attached a video of each attempt (delete.client.pod.on.kind.mov). I have a couple of questions:
Thank you for your help and patience!
@jmurret Thanks for the elaborate reply. I will go through the video and then comment back. Thank you again.
@jmurret, as per the above video this seems to be working as expected in your environment (GKE). Our deployment environment is vSphere, and this issue is very consistent there every time we manually delete a Consul client pod. Auto-scaling is not in the picture as of now. As requested, I will grab the connect-injector logs when we delete the Consul agent pod manually and post them here. Thanks!
Hey @ashwinkupatkar, we also run the test John and I linked on GKE, AKS, EKS and kind, and it has not failed in a while, so I'm confident it works in all those environments. Could you try it on a different Kubernetes cluster (like kind or GKE) and see if you're seeing the same issue? If not, then this is likely something related to your environment and not a bug in the code base. We will try to help you debug if you provide the logs, but unfortunately, because of our current workload and because we cannot reproduce with the steps you provided, we may not be able to respond quickly.
Hi @jmurret, @ishustava. After deleting the Consul agent pod, the logs produced on the connect-injector pods are as below:
Can you please let me know if this points to the cause of the issue? I am curious to know what the last lines of the above log output mean; I think that's the reason for the issue. Also, can you please format the output? I am not sure how to do that here. Thanks!
Hey @ashwinkupatkar, based on these logs (if they are from after the Consul client pod has been restarted), it looks like the
The last line is saying that it's ignoring the One thing that might be happening is that the client agent has not reconnected to the rest of the cluster, and so the UI cannot display the information about the services registered with it. You can check if the client agent is aware of this service by kubectl exec'ing into the client pod and running consul catalog services after restart.
Hi @ishustava, the log lines:
These occur when the static-server application is deployed, so yes, the static-server application does get registered and looks as desired. The actual log lines from:

are the ones that are output once I delete the Consul agent pod manually; beyond this no more logs are produced. (Sorry, my above comment and logs misled you.)

Comment: One thing that might be happening is that the client agent is not reconnected to the rest of the cluster and so the UI cannot display the information about the services registered with it.

Response: The new client agent is able to connect back to the cluster. I can see the k8s worker node register back as part of the cluster, the agent health shown is healthy, and the Consul agent pod is running and healthy. The consul members command from this new client agent pod shows me all the peers.

Comment: You can check if the client agent is aware of this service by kubectl exec'ing into the client pod and running consul catalog services after restart. If the static-server service is there, then it probably means your consul cluster is in a bad state.

Response: After exec'ing into the new client agent pod and running consul catalog services, I see all the registered services, including static-server. However, my service instance is lost: if I navigate in the UI to the node (whose client agent pod was deleted) and then to the service instances, I see nothing; it's empty, with no service instance registered. Before restarting the client agent pod, the service instance was there.

Comment: You could then look at client logs after restart to check if there's any errors there.

Response: The agent logs are good. No relevant errors in the output.
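The mismatch described here (the service appears in the catalog-wide service list, yet the node's page shows no instance) can be checked programmatically. As a hedged sketch in pure Go, using in-memory string slices rather than real Consul API calls, a diagnostic comparison might look like:

```go
package main

import "fmt"

// missingInstances reports the service instance IDs that appear in the
// catalog-wide list but not in the per-node view, i.e. the symptom
// described above: the service shows up in `consul catalog services`,
// yet the node page in the UI lists no instance for it.
func missingInstances(catalog, nodeView []string) []string {
	seen := make(map[string]bool, len(nodeView))
	for _, id := range nodeView {
		seen[id] = true
	}
	var missing []string
	for _, id := range catalog {
		if !seen[id] {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	// Hypothetical snapshots: the catalog knows both instances, while
	// the restarted node's view only shows one of them.
	catalog := []string{"Testapp-9hds9-hi89-Testapp", "static-server"}
	nodeView := []string{"static-server"}
	fmt.Println(missingInstances(catalog, nodeView))
}
```

In a real diagnosis the two lists would come from the Consul HTTP API (catalog endpoints versus the agent/node view); the comparison itself is what pinpoints which instances were dropped after the client pod restart.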
Closing as unable to reproduce. |
Hello, this bug has been filed based on https://discuss.hashicorp.com/t/auto-re-register-application-pods-to-consul-cluster/38797
Consul Binary -> 1.11.4
Consul-K8S -> 0.42.0
Envoy -> 1.20.0
Steps to reproduce the issue: