Wrong Instance Count because of Cluster Restart #2085
Comments
Thanks for re-opening. Could you attach your helm values.yaml file and the yaml file for how you deploy your apps? What versions of Consul and Consul K8s are you using?
#2065 is probably related to this issue. We see this in our environment because we use spot VMs; sadly, we have to manually deregister 'dead' nodes.
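For anyone facing the same situation, here is a minimal sketch of how a dead node might be deregistered manually. The node name, datacenter, and environment variables are placeholders, and ACL/TLS settings will vary per cluster:

```bash
# Sketch only: remove a node that will never come back (e.g. a reclaimed spot VM).
# CONSUL_HTTP_ADDR and CONSUL_HTTP_TOKEN are assumed to point at a reachable server.

# Option 1: mark the node as left and prune it from the member list.
consul force-leave -prune k8s-spot-node-abc123   # placeholder node name

# Option 2: deregister the node (and all services on it) via the catalog API.
curl -sS -X PUT "${CONSUL_HTTP_ADDR}/v1/catalog/deregister" \
  -H "X-Consul-Token: ${CONSUL_HTTP_TOKEN}" \
  -d '{"Datacenter": "dc1", "Node": "k8s-spot-node-abc123"}'
```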
We have had two occurrences of this. After a node restart, the services of that node were still marked as healthy. We use 2 replicas for our services, but Consul was showing 3 healthy replicas, which meant that 1 in every 3 requests to our services would fail with a Connection Reset error. We've hit this issue twice, in two different environments. This happens on consul-k8s 1.1.1 and consul 1.15.2 (forced via the image tag, as 1.15.1 has a bug in cert rotation).

Helm values:

global:
  enableConsulNamespaces: false
  name: consul
  datacenter: redacted
  image: "hashicorp/consul:1.15.2"
  logLevel: "error"
  logJSON: true
  metrics:
    enabled: true
    enableAgentMetrics: true
  tls:
    enabled: true
    enableAutoEncrypt: true
    verify: true
    httpsOnly: false
    serverAdditionalDNSSANs:
      ## Add the K8s domain name to the consul server certificate
      - "redacted"
  ## For production turn on ACLs and gossipEncryption:
  # acls:
  #   manageSystemACLs: true
  # gossipEncryption:
  #   secretName: "consul-gossip-encryption-key"
  #   secretKey: "key"
server:
  replicas: 3
  resources:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "1Gi"
      cpu: "500m"
  securityContext:
    runAsNonRoot: false
    runAsUser: 0
  # Change to DEBUG for debugging
  # Point to specific grafana dash:
  # "service": "https://grafana-lab.randon.com.br/d/lDlaj-NGz/service-overview?orgId=1&var-service={{ "{{" }}Service.Name}}&var-namespace={{ "{{" }}Service.Namespace}}&var-dc={{ "{{" }}Datacenter}}"
  extraConfig: |
    {
      "log_level": "error",
      "ui_config": {
        "dashboard_url_templates": {
          "service": "redacted"
        }
      }
    }
client:
  enabled: false
ui:
  enabled: true
  service:
    type: "ClusterIP"
  metrics:
    baseURL: redacted
# This is a demo prometheus, for demo purposes only
prometheus:
  enabled: false
acls:
  manageSystemACLs: false
connectInject:
  # This method will inject the sidecar container into Pods:
  enabled: true
  transparentProxy:
    defaultEnabled: true
    defaultOverwriteProbes: true
  # But not by default, only do this for Pods that have the explicit annotation:
  # consul.hashicorp.com/connect-inject: "true"
  default: false
  replicas: 2
  resources:
    requests:
      cpu: 125m
      memory: 128Mi
    limits:
      cpu: 250m
      memory: 256Mi
  sidecarProxy:
    resources:
      requests:
        cpu: 125m
        memory: 128Mi
      limits:
        cpu: 125m
        memory: 128Mi
syncCatalog:
  # This method will automatically synchronize Kubernetes services to Consul:
  # (No sidecar is injected by this method):
  enabled: true
  # But not by default, only for Services that have the explicit annotation:
  # consul.hashicorp.com/service-sync: "true"
  default: false
  # Synchronize from Kubernetes to Consul:
  toConsul: true
  # But not from Consul to K8s:
  toK8S: false
  k8sAllowNamespaces: ["*"]
  addK8SNamespaceSuffix: true
  # Change to debug for debugging
  logLevel: "error"
  resources:
    requests:
      memory: "128Mi"
      cpu: "100m"
    limits:
      memory: "128Mi"
      cpu: "100m"
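Note that with `connectInject.default` set to false in these values, only pods carrying the explicit `consul.hashicorp.com/connect-inject: "true"` annotation get a sidecar. A rough sketch of how one might list which pods in a namespace opted in (the namespace name is a placeholder):

```bash
# Sketch: print each pod name and the value of its connect-inject opt-in annotation.
# "my-app" is a placeholder namespace; an empty second column means the pod did not opt in.
kubectl get pods -n my-app \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.consul\.hashicorp\.com/connect-inject}{"\n"}{end}'
```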
Closing as the PR is now merged: #2571. This should be released in 1.2.x, 1.1.x, and 1.0.x in the mid-August timeframe with our next set of patch releases.
Overview of the Issue
When an Azure cluster is restarted, the pods previously running inside the mesh are lost and new pods are created. After the restart, Consul does not pick up that the previous pods have been deleted and are no longer present, so it still tries to route to them, resulting in the following error when I try to consume the service via the UI/API.
Furthermore:
These are the active pods:
Whereas, the Consul UI shows this:
Notice that only one frontend pod is running in AKS, whereas the Consul UI shows two instances of the service.
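One way to confirm the mismatch is to compare the pods Kubernetes reports with the instances Consul still has registered. A rough sketch, assuming the service is named `frontend`, the pods carry an `app=frontend` label, and the Consul HTTP API is reachable via `CONSUL_HTTP_ADDR` (all of these are assumptions):

```bash
# What Kubernetes believes is running for the service:
kubectl get pods -l app=frontend -o wide

# What Consul has registered for the same service, with health check status:
curl -sS "${CONSUL_HTTP_ADDR}/v1/health/service/frontend" \
  | jq '.[] | {node: .Node.Node, address: .Service.Address, checks: [.Checks[].Status]}'
```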
Reproduction Steps
Steps to reproduce this issue, eg:
Consul info for both Client and Server
Client info
/ $ consul info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease =
	revision = 7c04b6a0
	version = 1.15.1
	version_metadata =
consul:
	acl = disabled
	bootstrap = true
	known_datacenters = 1
	leader = true
	leader_addr = 10.244.0.12:8300
	server = true
raft:
	applied_index = 2213
	commit_index = 2213
	fsm_pending = 0
	last_contact = 0
	last_log_index = 2213
	last_log_term = 4
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:b9744a41-cccd-861f-eca2-f3b18496e5b4 Address:10.244.0.12:8300}]
	latest_configuration_index = 0
	num_peers = 0
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 4
runtime:
	arch = amd64
	cpu_count = 4
	goroutines = 259
	max_procs = 4
	os = linux
	version = go1.20.1
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 1
	event_time = 4
	failed = 0
	health_score = 0
	intent_queue = 1
	left = 0
	member_time = 4
	members = 1
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1

Operating system and Environment details
Kubernetes Version: 1.24.10
Cloud Provider: Azure
Environment: Azure Kubernetes Service