
Unable to auto re-register application pods to consul cluster in k8s upon killing a consul client pod #1206

Closed
ashwinkupatkar opened this issue May 3, 2022 · 16 comments
Labels
type/bug Something isn't working

Comments

@ashwinkupatkar

Hello, this bug has been filed based on https://discuss.hashicorp.com/t/auto-re-register-application-pods-to-consul-cluster/38797

Consul Binary -> 1.11.4
Consul-K8S -> 0.42.0
Envoy -> 1.20.0

Steps to reproduce the issue:

  1. Deploy a consul cluster on k8s
  2. Deploy an application on k8s
  3. Notice the application gets registered to the consul cluster by navigating to the UI
  4. Now manually delete one of the consul client pods belonging to any one of the k8s worker nodes where the application pod is also running.
  5. Once the new consul client pod comes up on that specific worker node, check whether the application pod is registered back to Consul or not (see the command sketch below).
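
In command form, steps 4 and 5 boil down to roughly the following (the pod and namespace names are placeholders to fill in from your cluster):

# find the consul client pod running on the same worker node as the application pod
kubectl get pods -n <namespace_name> -o wide

# delete that client pod and wait for its DaemonSet to schedule a replacement
kubectl delete pod <name_of_the_consul_client_pod> -n <namespace_name>
kubectl get pods -n <namespace_name> -w
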
@ashwinkupatkar ashwinkupatkar added the type/bug Something isn't working label May 3, 2022
@ashwinkupatkar ashwinkupatkar changed the title from "Unable to auto re-register application pods to consul cluster in k8s upon killing a consul agent" to "Unable to auto re-register application pods to consul cluster in k8s upon killing a consul client pod" May 3, 2022
@david-yu
Contributor

david-yu commented May 3, 2022

Hi @ashwinkupatkar, this is a well-known limitation that exists today with the Consul client, but we accommodate for it in Consul K8s using our endpoints controller, which reconciles endpoints from K8s to Consul.

When a client pod is rescheduled and healthy, we get all Kubernetes endpoints and filter the endpoint addresses that belong to this client by comparing the Kubernetes nodeName of each address with the nodeName of the Consul client pod. Each endpoint is then re-registered using the endpoints controller's reconciler.
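
For reference, a rough way to eyeball that mapping from the command line (the service, namespace, and pod names below are placeholders) is to compare the nodeName recorded on the service's Kubernetes Endpoints addresses against the node the Consul client pod is scheduled on:

# nodeName recorded for each endpoint address of the service
kubectl get endpoints static-server -n consul \
  -o jsonpath='{range .subsets[*].addresses[*]}{.targetRef.name}{"\t"}{.nodeName}{"\n"}{end}'

# node the Consul client pod is running on
kubectl get pod <consul-client-pod> -n consul -o jsonpath='{.spec.nodeName}'
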

I think we need the exact commands as well as your config.yaml and deployment.yaml to see whether this is indeed an issue and something that is reproducible.

@david-yu david-yu added the waiting-reply Waiting on the issue creator for a response before taking further action label May 3, 2022
@ashwinkupatkar
Author

Sure @david-yu let me provide you the commands to reproduce the issue. Will reply back.

@ashwinkupatkar
Author

The steps are as below:

  1. To validate the issue, deploy a consul cluster on k8s using default configurations.

  2. Deploy the HashiCorp-provided application (static-server and static-client) mentioned on this link in the k8s namespace of your choice.

  3. Once the above application comes up, go to the Consul UI and check the k8s instance where the application is running.

  4. On the K8S cluster, manually delete the consul client pod on the worker node where the application pod is also running. Run the command -> kubectl delete pod <name_of_the_consul_client_pod> -n <namespace_name>

  5. Wait for the new consul client pod to come up on that K8S worker node. Once the new consul client pod comes up, navigate to the Consul UI; you will not find the application pod on that K8S worker node registered to Consul.

This shows that despite the application pod being healthy and serviceable, it won't receive traffic since it's not registered with Consul.

@ishustava
Contributor

Hey @ashwinkupatkar

I just tried these steps and was not able to reproduce. I tried it on kind with k8s version 1.21.2.

We also have an acceptance test that verifies that services are re-registered if a consul client pod is killed here:

// Test that when Consul clients are restarted and lose all their registrations,
// the services get re-registered and can continue to talk to each other.
func TestConnectInject_RestartConsulClients(t *testing.T) {

We run this test on different versions of Kubernetes and different Kubernetes distributions, and it has been passing.

Could you give us more detailed steps so we can reproduce this issue? What are your Helm values?

@ashwinkupatkar
Author

ashwinkupatkar commented May 16, 2022

Hi @ishustava, I could reproduce the issue again today. I am using Kubernetes v1.21.9.

I have a federated Consul cluster setup, with the primary datacenter on VMs and the secondary on K8S.

Consul -> 1.11.4
Consul Helm -> 0.42.0
Envoy -> 1.20.0

My override configuration looks like below (a few details redacted):

global:
  datacenter: <primary-dc>
  image: "consul:1.11.4"
  imageK8S: "hashicorp/consul-k8s-control-plane:0.42.0"
  name: "consul"
  tls:
    enabled: true
    enableAutoEncrypt: true
    caCert:
      secretName: <consul-ca-cert-name>
      secretKey: tls.crt
    caKey:
      secretName: <consul-ca-key-name>
      secretKey: tls.key
  imageEnvoy: "envoyproxy/envoy-alpine:v1.20.0"
  federation:
    enabled: true
    primaryDatacenter: "<primary-dc>"
    primaryGateways: ["<primary_meshgateway_ip>:<port>"]
    k8sAuthMethodHost: "https://<k8s_host_ip>:<port>"
  gossipEncryption:
     secretName: <consul-encrypt-key-name>
     secretKey: gossipEncryptionKey
  acls:
    manageSystemACLs: true
    replicationToken:
      secretName: <consul-replication-token-name>
      secretKey: replicationToken
  metrics:
    enabled: true
    enableAgentMetrics: false
    agentMetricsRetentionTime: 1m
    enableGatewayMetrics: true
  consulSidecarContainer:
    resources:
      requests:
        memory: xxxxMi
        cpu: xxxxm
      limits:
        memory: xxxxMi
        cpu: xxxxm
server:
  enabled: true
  containerSecurityContext:
    # The consul server agent container
    # @type: map
    # @recurse: false
    server:
     runAsNonRoot: true
     runAsGroup: xxxx
     runAsUser: xxxx
     allowPrivilegeEscalation: false
  storageClass: <storage-class-name>
  resources:
    requests:
      memory: "xxxxMi"
      cpu: "xxxxm"
    limits:
      memory: "xxxxMi"
      cpu: "xxxxm"
  extraConfig: |
    {
      "primary_datacenter": "<primary-dc>",
      "primary_gateways":["<primary_meshgateway_ip>:<port>"],
      "ports": {"serf_wan": <serf_port>},
      "telemetry": {
        "prometheus_retention_time": "1m"
       }
    }
client:
  containerSecurityContext:
    # The consul client agent container
    # @type: map
    # @recurse: false
    client:
     runAsNonRoot: true
     runAsGroup: xxxx
     runAsUser: xxx
     allowPrivilegeEscalation: false
    # The acl-init initContainer
    # @type: map
    # @recurse: false
    aclInit:
     runAsNonRoot: true
     runAsGroup: xxxx
     runAsUser: xxx
     allowPrivilegeEscalation: false
    # The tls-init initContainer
    # @type: map
    # @recurse: false
    tlsInit:
     runAsNonRoot: true
     runAsGroup: xxxx
     runAsUser: xxx
     allowPrivilegeEscalation: false
  extraConfig: |
    {
      "ports": {"serf_wan": <serf_port>},
      "telemetry": {
        "prometheus_retention_time": "1m"
       }
    }
connectInject:
  enabled: true
  default: false
  transparentProxy:
    defaultEnabled: false
  metrics:
    defaultEnableMerging: true
  failurePolicy: "Ignore"
  sidecarProxy:
    resources:
      requests:
        memory: xxxxMi
        cpu: xxxxm
      limits:
        memory: xxxxMi
        cpu: xxxxm
  envoyExtraArgs: "- -- --restart-epoch 0"
controller:
  enabled: true
meshGateway:
  enabled: true
  replicas: 1
  service:
    enabled: true
    type: NodePort
    nodePort: <node_port_for_meshgateway>
  consulServiceName: "mesh-gateway"
  resources:
    requests:
      memory: "xxxxMi"
      cpu: "xxxxm"
    limits:
      memory: "xxxxMi"
      cpu: "xxxxm"
ingressGateways:
  defaults:
    replicas: 1
  enabled: true
  gateways:
    - name: <secondary-ingress-gateway-name>
      service:
        type: NodePort
        ports:
          - port: <http_port_for_ingressgateway>
            nodePort: <http_node_port_for_ingressgateway>
          - port: <https_port_for_ingressgateway>
            nodePort: <https_node_port_for_ingressgateway>

Please let me know what configuration, if any, I am missing here. Thank you.

@ashwinkupatkar
Author

ashwinkupatkar commented May 16, 2022

The steps to reproduce the issue are as provided above.

Let me put it in simpler steps:

Deploy some app on k8s with service mesh.

Make a note of the k8s worker node that the consul agent pod is tied to; the application is registered to this consul agent pod.

For example, if a service named "Testapp" with service id "Testapp-9hds9-hi89-Testapp" is registered to a consul agent pod, and this consul agent pod is running on some k8s worker node, say kubeworker03, then manually delete this consul agent pod using the kubectl delete pod command.

After some time a new consul agent pod comes up and re-registers the kubeworker03 node, but the "Testapp" service with service id "Testapp-9hds9-hi89-Testapp" does not show up. This is a big problem. Despite the service instance being healthy, it won't receive any traffic from the mesh since it's not part of the catalog itself.

Let me know if any more inputs are required from my end; I am happy to provide them. Thanks

@ishustava
Contributor

Hey @ashwinkupatkar could you provide your values overrides as YAML?

@ashwinkupatkar
Author

I do not see a way to represent it properly in the comment, so I have provided it as an attachment now.
consul-override.txt

@jmurret jmurret removed the waiting-reply Waiting on the issue creator for a response before taking further action label May 18, 2022
@jmurret
Member

jmurret commented May 18, 2022

Hi @ashwinkupatkar, thank you for the information and your patience. As @ishustava mentioned, we have an acceptance test that validates this scenario. I've also tried recreating this on kind and in GKE with the same versions as you have used for Consul and Consul on Kubernetes. I've attached the videos here for each:
https://user-images.githubusercontent.com/2481360/169104006-d5a5ac9f-3b71-48ee-9ef3-936bd33342bf.mov

delete.client.pod.on.kind.mov

I have a couple of questions:

  1. Did you see this behavior happen when manually deleting the pod, or was some other event happening, like auto-scaling or a spot instance terminating? Also, how consistently can you recreate this issue?

  2. Would it be possible for you to send the logs of your connect injector pods from when this occurs? A command sketch for pulling those logs follows below. (We had issue Endpoints Controller queuing up service registrations/deregistrations when request to agent on a terminated pod does not time out #714 with similar symptoms, where it was very hard to recreate and only occurred under load and with auto-scaling back down and spot instance terminations. It took us looking at the connect injector logs to get a hypothesis for what was occurring and ultimately fix the problem.)
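
For reference, one way to pull those logs is to find the connect injector pod by name and dump its recent output (the pod name below is a placeholder; this assumes the chart is installed in the consul namespace):

kubectl get pods -n consul | grep connect-injector
kubectl logs <connect-injector-pod-name> -n consul --tail=500 > connect-injector.log
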

Thank you for your help and patience!

@ashwinkupatkar
Author

@jmurret Thanks for the elaborate reply. I will go through the video and then comment back. Thank you again.

@ashwinkupatkar
Author

@jmurret, as per the above video this seems to be working as expected in your environment (GKE). My deployment environment is vSphere.

This issue is very consistent in our environment: it occurs every time we manually delete a consul client pod. Auto-scaling is not in the picture here as of now.

As requested, I will grab the connect injector logs when we delete the consul agent pod manually and post them here.

Thanks!

@ishustava
Contributor

Hey @ashwinkupatkar, we also run the test John and I linked on GKE, AKS, EKS, and kind, and it has not failed in a while, so I'm confident it works in all those environments.

Could you try it on a different Kubernetes cluster (like kind or GKE) and see if you're seeing the same issue? If not, then I think this is likely something related to your environment and not a bug in the code base.

We will try to help you debug if you provide the logs, but unfortunately, because of our current workload and because we cannot reproduce with the steps you provided, we may not be able to respond quickly.

@ashwinkupatkar
Author

ashwinkupatkar commented May 19, 2022

Hi @jmurret, @ishustava,

After deleting the consul agent pod, the logs produced on the connect injector pods are as below:

2022-05-19T18:10:51.376Z        INFO    controller.endpoints    retrieved       {"name": "static-server", "ns": "consul"}
2022-05-19T18:10:56.324Z        INFO    controller.endpoints    retrieved       {"name": "static-server", "ns": "consul"}
2022-05-19T18:10:56.324Z        INFO    controller.endpoints    registering service with Consul {"name": "static-server", "id": "static-server-5c5b685965-kt79v-static-server", "agentIP": "10.45.xxx.xxx"}
2022-05-19T18:10:56.363Z        INFO    controller.endpoints    registering proxy service with Consul   {"name": "static-server-sidecar-proxy"}
2022-05-19T18:10:56.483Z        INFO    controller.endpoints    updating health check status for service        {"name": "static-server", "reason": "Pod \"consul/static-server-5c5b685965-kt79v\" is not ready", "status": "critical"}
2022-05-19T18:10:56.644Z        INFO    controller.endpoints    updating health check   {"id": "consul/static-server-5c5b685965-kt79v-static-server/kubernetes-health-check"}
2022-05-19T18:11:13.291Z        INFO    controller.endpoints    retrieved       {"name": "static-server", "ns": "consul"}
2022-05-19T18:11:13.291Z        INFO    controller.endpoints    registering service with Consul {"name": "static-server", "id": "static-server-5c5b685965-kt79v-static-server", "agentIP": "10.45.xxx.xxx"}
2022-05-19T18:11:13.314Z        INFO    controller.endpoints    registering proxy service with Consul   {"name": "static-server-sidecar-proxy"}
2022-05-19T18:11:13.510Z        INFO    controller.endpoints    updating health check status for service        {"name": "static-server", "reason": "Kubernetes health checks passing", "status": "passing"}
2022-05-19T18:11:13.513Z        INFO    controller.endpoints    updating health check   {"id": "consul/static-server-5c5b685965-kt79v-static-server/kubernetes-health-check"}
2022-05-19T18:12:25.372Z        INFO    controller.endpoints    retrieved       {"name": "consul-dns", "ns": "consul"}
2022-05-19T18:12:25.372Z        INFO    controller.endpoints    ignoring because endpoints pods have not been injected  {"name": "consul-dns", "ns": "consul"}
2022-05-19T18:12:39.508Z        INFO    controller.endpoints    retrieved       {"name": "consul-dns", "ns": "consul"}
2022-05-19T18:12:39.508Z        INFO    controller.endpoints    ignoring because endpoints pods have not been injected  {"name": "consul-dns", "ns": "consul"}
2022-05-19T18:12:56.193Z        INFO    controller.endpoints    retrieved       {"name": "consul-dns", "ns": "consul"}
2022-05-19T18:12:56.203Z        INFO    controller.endpoints    ignoring because endpoints pods have not been injected  {"name": "consul-dns", "ns": "consul"}

Can you please let me know if this provides the cause of the issue?

I am curious to know what the last lines of the above log output mean; I think that's the reason for the issue.

Can you please format the output? I am not sure how to do it here.

Thanks!

@ishustava
Contributor

ishustava commented May 19, 2022

Hey @ashwinkupatkar

Based on these logs (if they are from after the consul client pod was restarted), it looks like the static-server service and its proxy have been registered without errors, as indicated by these lines:

2022-05-19T18:10:56.324Z        INFO    controller.endpoints    registering service with Consul {"name": "static-server", "id": "static-server-5c5b685965-kt79v-static-server", "agentIP": "10.45.xxx.xxx"}
2022-05-19T18:10:56.363Z        INFO    controller.endpoints    registering proxy service with Consul   {"name": "static-server-sidecar-proxy"}

The last line says that it's ignoring the consul-dns service, which is expected because this service is not on the service mesh.

One thing that might be happening is that the client agent has not reconnected to the rest of the cluster, and so the UI cannot display the information about the services registered with it.

You can check if the client agent is aware of this service by kubectl exec'ing into the client pod and running consul catalog services after restart. If the static-server service is there, then it probably means your consul cluster is in a bad state. You could then look at the client logs after restart to check if there are any errors there.
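
(For reference, a rough version of that check might look like the commands below; the pod name is a placeholder, and depending on your TLS/ACL settings the Consul CLI inside the pod may need to be pointed at the HTTPS port with a CA file and token.)

# exec into the freshly restarted client pod and list the catalog it sees
kubectl exec -it <consul-client-pod> -n consul -- consul catalog services

# check the client agent logs for errors after the restart
kubectl logs <consul-client-pod> -n consul
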

@ashwinkupatkar
Author

ashwinkupatkar commented May 19, 2022

Hi @ishustava, the log lines:

2022-05-19T18:10:56.324Z INFO controller.endpoints registering service with Consul {"name": "static-server", "id": "static-server-5c5b685965-kt79v-static-server", "agentIP": "10.45.xxx.xxx"}
2022-05-19T18:10:56.363Z INFO controller.endpoints registering proxy service with Consul {"name": "static-server-sidecar-proxy"}

These occur when the static-server application is deployed, so yes, the static-server application does get registered and looks as desired.

The actual log lines are:

2022-05-19T18:12:25.372Z INFO controller.endpoints retrieved {"name": "consul-dns", "ns": "consul"}
2022-05-19T18:12:25.372Z INFO controller.endpoints ignoring because endpoints pods have not been injected {"name": "consul-dns", "ns": "consul"}
2022-05-19T18:12:39.508Z INFO controller.endpoints retrieved {"name": "consul-dns", "ns": "consul"}
2022-05-19T18:12:39.508Z INFO controller.endpoints ignoring because endpoints pods have not been injected {"name": "consul-dns", "ns": "consul"}
2022-05-19T18:12:56.193Z INFO controller.endpoints retrieved {"name": "consul-dns", "ns": "consul"}
2022-05-19T18:12:56.203Z INFO controller.endpoints ignoring because endpoints pods have not been injected {"name": "consul-dns", "ns": "consul"}

These are the lines output once I delete the consul agent pod manually; beyond this no more logs are produced. (Sorry, my above comment and logs misled you.)

Comment: One thing that might be happening is that the client agent has not reconnected to the rest of the cluster, and so the UI cannot display the information about the services registered with it.

Response: The new client agent is able to connect back to the cluster, as I can see the k8s worker node register back as part of the cluster; the agent health shown is healthy, and the consul agent pod is up and running. The consul members command from this new client agent pod shows me all the peers.

Comment: You can check if the client agent is aware of this service by kubectl exec'ing into the client pod and running consul catalog services after restart. If the static-server service is there, then it probably means your consul cluster is in a bad state. You could then look at the client logs after restart to check if there are any errors there.

Response: After exec'ing into the new client agent pod and running consul catalog services, I see all the registered services, including static-server. However, my service instance is lost: if I navigate to the UI, go to the node (whose client agent pod was deleted), and then to the service instances, I see nothing; it's empty, with no service instance registered. Before restarting the client agent pod, I had the service instance there.

Comment: You could then look at the client logs after restart to check if there are any errors there.

Response: The agent logs are good. No relevant errors in the output.

@david-yu
Contributor

Closing as unable to reproduce.
