Wrong Instance Count because of Cluster Restart #2085

Closed
maheshrajrp opened this issue Apr 24, 2023 · 4 comments
Labels
type/bug Something isn't working

Comments

@maheshrajrp

Overview of the Issue

When an Azure cluster is restarted, the previous pods running inside the mesh are lost and new pods are created. Post restart, Consul does not pick up that the previous pods have been deleted and no longer exist. Hence, it still tries to route to those pods, resulting in the following error when I try to consume the service via the UI/API.

[screenshot: error returned when calling the service]

Furthermore, these are the active pods:

[screenshot: the pods currently running in AKS]

Whereas the Consul UI shows this:

[screenshot: the service instances listed in the Consul UI]

Notice that only one frontend pod is running in AKS, whereas the Consul UI shows two instances of the service.
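A minimal way to see the mismatch from the command line (the label selector, the service name frontend, and the port-forward to 8500 are assumptions for illustration, not taken from the screenshots):

# Pods Kubernetes actually knows about
kubectl get pods -l app=frontend

# Instances Consul still advertises for the same service
kubectl port-forward svc/consul-server 8500:8500 &
curl -s http://localhost:8500/v1/health/service/frontend | jq 'length'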


Reproduction Steps

Steps to reproduce this issue:

  1. Create a service mesh within Azure Kubernetes Service.
  2. Stop and start the AKS cluster (see the sketch after this list).
  3. Notice that the previous pod info still exists in the Consul UI, whereas it no longer exists in AKS.
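
A hedged sketch of the stop/start cycle with the Azure CLI; the resource group and cluster names below are placeholders:

# Stop the AKS cluster: all nodes are deallocated and the running pods are lost
az aks stop --resource-group my-rg --name my-aks-cluster

# Start it again: new pods come up with new IPs
az aks start --resource-group my-rg --name my-aks-cluster

# Compare what Kubernetes is running with what Consul still advertises
kubectl get pods -A
curl -s http://localhost:8500/v1/catalog/services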


Consul info for both Client and Server

Client info:

/ $ consul info
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 7c04b6a0
    version = 1.15.1
    version_metadata =
consul:
    acl = disabled
    bootstrap = true
    known_datacenters = 1
    leader = true
    leader_addr = 10.244.0.12:8300
    server = true
raft:
    applied_index = 2213
    commit_index = 2213
    fsm_pending = 0
    last_contact = 0
    last_log_index = 2213
    last_log_term = 4
    last_snapshot_index = 0
    last_snapshot_term = 0
    latest_configuration = [{Suffrage:Voter ID:b9744a41-cccd-861f-eca2-f3b18496e5b4 Address:10.244.0.12:8300}]
    latest_configuration_index = 0
    num_peers = 0
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 4
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 259
    max_procs = 4
    os = linux
    version = go1.20.1
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 1
    event_time = 4
    failed = 0
    health_score = 0
    intent_queue = 1
    left = 0
    member_time = 4
    members = 1
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1

Operating system and Environment details

Kubernetes Version: 1.24.10
Cloud Provider: Azure
Environment: Azure Kubernetes Service

@david-yu
Contributor

Thanks for re-opening. Could you attach your Helm values.yaml file and the YAML file you use to deploy your apps? What versions of Consul and Consul K8s are you using?

@FelipeEmerim

#2065 is probably related to this issue. We see this in our environment because we use spot VMs; sadly, we have to manually deregister 'dead' nodes.
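
For reference, a minimal sketch of that manual cleanup, assuming direct access to a Consul server on port 8500; the node and service IDs below are placeholders, not our real names:

# Tell the cluster the dead node has left and prune it from the catalog
consul force-leave -prune aks-nodepool1-00000000-vmss000000

# Or deregister a stale service instance directly through the catalog API
curl -s -X PUT http://localhost:8500/v1/catalog/deregister \
  -d '{"Datacenter": "dc1", "Node": "aks-nodepool1-00000000-vmss000000", "ServiceID": "frontend-sidecar-proxy"}'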

@FelipeEmerim

We have had two occurrences of this, in two different environments. After a node restart, the services of that node were still marked as healthy. We run 2 replicas of our services, but Consul was showing 3 healthy replicas, which means that 1 in every 3 requests to our services failed with a Connection Reset error.

This happens on consul-k8s 1.1.1 and Consul 1.15.2 (forced via the image tag, as 1.15.1 has a bug in cert rotation).
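
As a quick check, something like the following lists the stale third instance and its reported health (the service name and the port-forward to 8500 are assumptions, a sketch rather than our exact commands):

# List every instance Consul reports for the service, with node, address, and check status
curl -s http://localhost:8500/v1/health/service/my-service | \
  jq '[.[] | {node: .Node.Node, address: .Service.Address, checks: [.Checks[].Status]}]'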

Helm values:

global:
  enableConsulNamespaces: false
  name: consul
  datacenter: redacted
  image: "hashicorp/consul:1.15.2"
  logLevel: "error"
  logJSON: true
  metrics:
    enabled: true
    enableAgentMetrics: true
  tls:
    enabled: true
    enableAutoEncrypt: true
    verify: true
    httpsOnly: false
    serverAdditionalDNSSANs:
      ## Add the K8s domain name to the consul server certificate
      - "redacted"
  ## For production turn on ACLs and gossipEncryption:
  # acls:
  #   manageSystemACLs: true
  # gossipEncryption:
  #   secretName: "consul-gossip-encryption-key"
  #   secretKey: "key"

server:
  replicas: 3
  resources:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "1Gi"
      cpu: "500m"
  securityContext:
    runAsNonRoot: false
    runAsUser: 0
  # Change to DEBUG for debugging
  # Point to specific grafana dash:
  # "service": "https://grafana-lab.randon.com.br/d/lDlaj-NGz/service-overview?orgId=1&var-service={{ "{{" }}Service.Name}}&var-namespace={{ "{{" }}Service.Namespace}}&var-dc={{ "{{" }}Datacenter}}"
  extraConfig: |
    {
      "log_level": "error",
      "ui_config": {
        "dashboard_url_templates": {
          "service": "redacted"
        }
      }
    }

client:
  enabled: false

ui:
  enabled: true
  service:
    type: "ClusterIP"
  metrics:
    baseURL: redacted

# This is a demo prometheus, for demo purposes only
prometheus:
  enabled: false

acls:
  manageSystemACLs: false

connectInject:
  # This method will inject the sidecar container into Pods:
  enabled: true
  transparentProxy:
    defaultEnabled: true
    defaultOverwriteProbes: true
  # But not by default, only do this for Pods that have the explicit annotation:
  #        consul.hashicorp.com/connect-inject: "true"
  default: false
  replicas: 2
  resources:
    requests:
      cpu: 125m
      memory: 128Mi
    limits:
      cpu: 250m
      memory: 256Mi
  sidecarProxy:
    resources:
      requests:
        cpu: 125m
        memory: 128Mi
      limits:
        cpu: 125m
        memory: 128Mi

syncCatalog:
  # This method will automatically synchronize Kubernetes services to Consul:
  # (No sidecar is injected by this method):
  enabled: true
  # But not by default, only for Services that have the explicit annotation:
  #        consul.hashicorp.com/service-sync: "true"
  default: false
  # Synchronize from Kubernetes to Consul:
  toConsul: true
  # But not from Consul to K8s:
  toK8S: false
  k8sAllowNamespaces: ["*"]
  addK8SNamespaceSuffix: true
  # Change to debug for debugging
  logLevel: "error"
  resources:
    requests:
      memory: "128Mi"
      cpu: "100m"
    limits:
      memory: "128Mi"
      cpu: "100m"

@david-yu
Contributor

Closing as the PR is now merged: #2571. This should be released in 1.2.x, 1.1.x, and 1.0.x in the mid-August timeframe with our next set of patch releases.
