Wrong Instance Count because of Cluster Restart #2085

Closed
maheshrajrp opened this issue Apr 24, 2023 · 4 comments
Labels
type/bug Something isn't working

Comments

@maheshrajrp

Overview of the Issue

When an Azure cluster is restarted, the previous pods running inside the mesh are lost and new pods are created. Post restart, Consul does not pick up that the previous pods have been deleted and no longer exist. Hence, it still tries to route to those pods, resulting in the following error when I try to consume the service via the UI/API.

[screenshot: error returned when calling the service]

Furthermore, these are the active pods:

[screenshot: the pods currently running in AKS]

Whereas the Consul UI shows this:

[screenshot: the service instances listed in the Consul UI]

Notice that only one frontend pod is running in AKS, whereas the Consul UI shows two instances of the service.
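A minimal way to see the mismatch from the command line (the label selector, the service name frontend, and the port-forward to 8500 are assumptions for illustration, not taken from the screenshots):

# Pods Kubernetes actually knows about
kubectl get pods -l app=frontend

# Instances Consul still advertises for the same service
kubectl port-forward svc/consul-server 8500:8500 &
curl -s http://localhost:8500/v1/health/service/frontend | jq 'length'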


Reproduction Steps

Steps to reproduce this issue:

  1. Create a service mesh within Azure Kubernetes Service.
  2. Stop and start the AKS cluster (see the sketch after this list).
  3. Notice that the previous pod info still exists in the Consul UI, whereas it no longer exists in AKS.
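
A hedged sketch of the stop/start cycle with the Azure CLI; the resource group and cluster names below are placeholders:

# Stop the AKS cluster: all nodes are deallocated and the running pods are lost
az aks stop --resource-group my-rg --name my-aks-cluster

# Start it again: new pods come up with new IPs
az aks start --resource-group my-rg --name my-aks-cluster

# Compare what Kubernetes is running with what Consul still advertises
kubectl get pods -A
curl -s http://localhost:8500/v1/catalog/services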


Consul info for both Client and Server

Client info:

/ $ consul info
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 7c04b6a0
    version = 1.15.1
    version_metadata =
consul:
    acl = disabled
    bootstrap = true
    known_datacenters = 1
    leader = true
    leader_addr = 10.244.0.12:8300
    server = true
raft:
    applied_index = 2213
    commit_index = 2213
    fsm_pending = 0
    last_contact = 0
    last_log_index = 2213
    last_log_term = 4
    last_snapshot_index = 0
    last_snapshot_term = 0
    latest_configuration = [{Suffrage:Voter ID:b9744a41-cccd-861f-eca2-f3b18496e5b4 Address:10.244.0.12:8300}]
    latest_configuration_index = 0
    num_peers = 0
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 4
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 259
    max_procs = 4
    os = linux
    version = go1.20.1
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 1
    event_time = 4
    failed = 0
    health_score = 0
    intent_queue = 1
    left = 0
    member_time = 4
    members = 1
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1

Operating system and Environment details

Kubernetes Version: 1.24.10
Cloud Provider: Azure
Environment: Azure Kubernetes Service

@david-yu
Contributor

Thanks for re-opening. Could you attach your Helm values.yaml file and the YAML file you use to deploy your apps? What versions of Consul and Consul K8s are you using?

@FelipeEmerim

#2065 is probably related to this issue. We see this in our environment because we use spot VMs; sadly, we have to manually deregister 'dead' nodes.
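
For reference, a minimal sketch of that manual cleanup, assuming direct access to a Consul server on port 8500; the node and service IDs below are placeholders, not our real names:

# Tell the cluster the dead node has left and prune it from the catalog
consul force-leave -prune aks-nodepool1-00000000-vmss000000

# Or deregister a stale service instance directly through the catalog API
curl -s -X PUT http://localhost:8500/v1/catalog/deregister \
  -d '{"Datacenter": "dc1", "Node": "aks-nodepool1-00000000-vmss000000", "ServiceID": "frontend-sidecar-proxy"}'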

@FelipeEmerim

We have had two occurrences of this, in two different environments. After a node restart, the services of that node were still marked as healthy. We run 2 replicas of our services, but Consul was showing 3 healthy replicas, which means that 1 in every 3 requests to our services failed with a Connection Reset error.

This happens on consul-k8s 1.1.1 and Consul 1.15.2 (forced via the image tag, as 1.15.1 has a bug in cert rotation).
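
As a quick check, something like the following lists the stale third instance and its reported health (the service name and the port-forward to 8500 are assumptions, a sketch rather than our exact commands):

# List every instance Consul reports for the service, with node, address, and check status
curl -s http://localhost:8500/v1/health/service/my-service | \
  jq '[.[] | {node: .Node.Node, address: .Service.Address, checks: [.Checks[].Status]}]'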

Helm values:

global:
  enableConsulNamespaces: false
  name: consul
  datacenter: redacted
  image: "hashicorp/consul:1.15.2"
  logLevel: "error"
  logJSON: true
  metrics:
    enabled: true
    enableAgentMetrics: true
  tls:
    enabled: true
    enableAutoEncrypt: true
    verify: true
    httpsOnly: false
    serverAdditionalDNSSANs:
      ## Add the K8s domain name to the consul server certificate
      - "redacted"
  ## For production turn on ACLs and gossipEncryption:
  # acls:
  #   manageSystemACLs: true
  # gossipEncryption:
  #   secretName: "consul-gossip-encryption-key"
  #   secretKey: "key"

server:
  replicas: 3
  resources:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "1Gi"
      cpu: "500m"
  securityContext:
    runAsNonRoot: false
    runAsUser: 0
  # Change to DEBUG for debugging
  # Point to specific grafana dash:
  # "service": "https://grafana-lab.randon.com.br/d/lDlaj-NGz/service-overview?orgId=1&var-service={{ "{{" }}Service.Name}}&var-namespace={{ "{{" }}Service.Namespace}}&var-dc={{ "{{" }}Datacenter}}"
  extraConfig: |
    {
      "log_level": "error",
      "ui_config": {
        "dashboard_url_templates": {
          "service": "redacted"
        }
      }
    }

client:
  enabled: false

ui:
  enabled: true
  service:
    type: "ClusterIP"
  metrics:
    baseURL: redacted

# This is a demo prometheus, for demo purposes only
prometheus:
  enabled: false

acls:
  manageSystemACLs: false

connectInject:
  # This method will inject the sidecar container into Pods:
  enabled: true
  transparentProxy:
    defaultEnabled: true
    defaultOverwriteProbes: true
  # But not by default, only do this for Pods that have the explicit annotation:
  #        consul.hashicorp.com/connect-inject: "true"
  default: false
  replicas: 2
  resources:
    requests:
      cpu: 125m
      memory: 128Mi
    limits:
      cpu: 250m
      memory: 256Mi
  sidecarProxy:
    resources:
      requests:
        cpu: 125m
        memory: 128Mi
      limits:
        cpu: 125m
        memory: 128Mi

syncCatalog:
  # This method will automatically synchronize Kubernetes services to Consul:
  # (No sidecar is injected by this method):
  enabled: true
  # But not by default, only for Services that have the explicit annotation:
  #        consul.hashicorp.com/service-sync: "true"
  default: false
  # Synchronize from Kubernetes to Consul:
  toConsul: true
  # But not from Consul to K8s:
  toK8S: false
  k8sAllowNamespaces: ["*"]
  addK8SNamespaceSuffix: true
  # Change to debug for debugging
  logLevel: "error"
  resources:
    requests:
      memory: "128Mi"
      cpu: "100m"
    limits:
      memory: "128Mi"
      cpu: "100m"

@david-yu
Contributor

Closing as the PR is now merged: #2571. This should be released in 1.2.x, 1.1.x, and 1.0.x in the mid-August timeframe with our next set of patch releases.
