
Master nodes on upgraded cluster do not come ready after restart #3628

Closed
chreichert opened this issue Jul 22, 2020 · 24 comments
Labels: bug, stale

Comments

@chreichert

chreichert commented Jul 22, 2020

Describe the bug
After upgrading an older cluster (initially created with ACS-Engine 0.21.2) from 1.16.4 to 1.16.11, everything worked fine at first. We usually shut down all master VMs and agent VMSS overnight and over the weekends when we do not need the cluster for testing. After the first restart following the upgrade, the cluster was no longer reachable with kubectl. Looking at docker ps on the master nodes, it appears Calico networking did not start; only the api-server, controller-manager, scheduler, and addon-manager containers are running.
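
For reference, a quick way to see which control-plane containers are actually running on a master (a sketch; it assumes the Docker runtime configured for this cluster):

# list only the names of the containers of interest on a master node
sudo docker ps --format '{{.Names}}' | grep -E 'apiserver|controller-manager|scheduler|addon-manager|calico'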

Steps To Reproduce

The latest upgrade of the cluster was done with AKS-Engine version 0.45.0.
Resulting API model:

api-model

{
  "apiVersion": "vlabs",
  "location": "northeurope",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorRelease": "1.16",
      "orchestratorVersion": "1.16.4",
      "kubernetesConfig": {
        "kubernetesImageBase": "k8s.gcr.io/",
        "mcrKubernetesImageBase": "mcr.microsoft.com/k8s/core/",
        "clusterSubnet": "10.244.0.0/16",
        "dnsServiceIP": "10.0.0.10",
        "serviceCidr": "10.0.0.0/16",
        "networkPolicy": "calico",
        "networkPlugin": "kubenet",
        "containerRuntime": "docker",
        "dockerBridgeSubnet": "172.17.0.1/16",
        "mobyVersion": "3.0.8",
        "useInstanceMetadata": true,
        "enableRbac": true,
        "enableSecureKubelet": true,
        "enableAggregatedAPIs": true,
        "privateCluster": {
          "enabled": true
        },
        "gchighthreshold": 85,
        "gclowthreshold": 80,
        "etcdVersion": "3.3.15",
        "etcdDiskSizeGB": "1024",
        "enablePodSecurityPolicy": true,
        "addons": [
          {
            "name": "blobfuse-flexvolume",
            "enabled": false
          },
          {
            "name": "smb-flexvolume",
            "enabled": false
          },
          {
            "name": "keyvault-flexvolume",
            "enabled": false
          },
          {
            "name": "cluster-autoscaler",
            "enabled": false
          },
          {
            "name": "heapster",
            "enabled": true,
            "containers": [
              {
                "name": "heapster",
                "image": "k8s.gcr.io/heapster-amd64:v1.5.4",
                "cpuRequests": "88m",
                "memoryRequests": "204Mi",
                "cpuLimits": "88m",
                "memoryLimits": "204Mi"
              },
              {
                "name": "heapster-nanny",
                "image": "k8s.gcr.io/addon-resizer:1.8.5",
                "cpuRequests": "88m",
                "memoryRequests": "204Mi",
                "cpuLimits": "88m",
                "memoryLimits": "204Mi"
              }
            ]
          },
          {
            "name": "tiller",
            "enabled": true,
            "containers": [
              {
                "name": "tiller",
                "image": "gcr.io/kubernetes-helm/tiller:v2.13.1",
                "cpuRequests": "50m",
                "memoryRequests": "150Mi",
                "cpuLimits": "50m",
                "memoryLimits": "150Mi"
              }
            ],
            "config": {
              "max-history": "0"
            }
          },
          {
            "name": "aci-connector",
            "enabled": false
          },
          {
            "name": "kubernetes-dashboard",
            "enabled": true,
            "containers": [
              {
                "name": "kubernetes-dashboard",
                "image": "k8s.gcr.io/kubernetes-dashboard-amd64:v1.10.1",
                "cpuRequests": "300m",
                "memoryRequests": "150Mi",
                "cpuLimits": "300m",
                "memoryLimits": "150Mi"
              }
            ]
          },
          {
            "name": "rescheduler",
            "enabled": false
          },
          {
            "name": "metrics-server",
            "enabled": true,
            "containers": [
              {
                "name": "metrics-server",
                "image": "k8s.gcr.io/metrics-server-amd64:v0.3.4"
              }
            ]
          },
          {
            "name": "nvidia-device-plugin",
            "enabled": false
          },
          {
            "name": "container-monitoring",
            "enabled": false
          },
          {
            "name": "azure-cni-networkmonitor",
            "enabled": false
          },
          {
            "name": "azure-npm-daemonset",
            "enabled": false
          },
          {
            "name": "ip-masq-agent",
            "enabled": true,
            "containers": [
              {
                "name": "ip-masq-agent",
                "image": "k8s.gcr.io/ip-masq-agent-amd64:v2.5.0",
                "cpuRequests": "50m",
                "memoryRequests": "50Mi",
                "cpuLimits": "50m",
                "memoryLimits": "250Mi"
              }
            ],
            "config": {
              "enable-ipv6": "false",
              "non-masq-cni-cidr": "",
              "non-masquerade-cidr": "10.244.0.0/16",
              "secondary-non-masquerade-cidr": ""
            }
          },
          {
            "name": "dns-autoscaler",
            "enabled": false
          },
          {
            "name": "calico-daemonset",
            "enabled": true,
            "containers": [
              {
                "name": "calico-typha",
                "image": "calico/typha:v3.8.0"
              },
              {
                "name": "calico-cni",
                "image": "calico/cni:v3.8.0"
              },
              {
                "name": "calico-node",
                "image": "calico/node:v3.8.0"
              },
              {
                "name": "calico-pod2daemon",
                "image": "calico/pod2daemon-flexvol:v3.8.0"
              },
              {
                "name": "calico-cluster-proportional-autoscaler",
                "image": "k8s.gcr.io/cluster-proportional-autoscaler-amd64:1.1.2-r2"
              }
            ]
          },
          {
            "name": "cloud-node-manager",
            "enabled": false
          },
          {
            "name": "aad-pod-identity",
            "enabled": false
          },
          {
            "name": "appgw-ingress",
            "enabled": false
          },
          {
            "name": "azuredisk-csi-driver",
            "enabled": false
          },
          {
            "name": "azurefile-csi-driver",
            "enabled": false
          },
          {
            "name": "azure-policy",
            "enabled": false
          },
          {
            "name": "node-problem-detector",
            "enabled": false
          },
          {
            "name": "kube-dns",
            "enabled": false
          },
          {
            "name": "coredns",
            "enabled": true,
            "containers": [
              {
                "name": "coredns",
                "image": "k8s.gcr.io/coredns:1.6.5"
              }
            ],
            "config": {
              "clusterIP": "10.0.0.10",
              "domain": "cluster.local"
            }
          },
          {
            "name": "kube-proxy",
            "enabled": true,
            "containers": [
              {
                "name": "kube-proxy",
                "image": "k8s.gcr.io/hyperkube-amd64:v1.16.4"
              }
            ],
            "config": {
              "cluster-cidr": "10.244.0.0/16",
              "featureGates": "{}",
              "proxy-mode": "iptables"
            }
          }
        ],
        "kubeletConfig": {
          "--address": "0.0.0.0",
          "--anonymous-auth": "false",
          "--authentication-token-webhook": "true",
          "--authorization-mode": "Webhook",
          "--azure-container-registry-config": "/etc/kubernetes/azure.json",
          "--cgroups-per-qos": "true",
          "--client-ca-file": "/etc/kubernetes/certs/ca.crt",
          "--cloud-config": "/etc/kubernetes/azure.json",
          "--cloud-provider": "azure",
          "--cluster-dns": "10.0.0.10",
          "--cluster-domain": "cluster.local",
          "--enforce-node-allocatable": "pods",
          "--event-qps": "0",
          "--eviction-hard": "memory.available<750Mi,nodefs.available<10%,nodefs.inodesFree<5%",
          "--feature-gates": "PodPriority=true,RotateKubeletServerCertificate=true",
          "--image-gc-high-threshold": "85",
          "--image-gc-low-threshold": "80",
          "--image-pull-progress-deadline": "30m",
          "--keep-terminated-pod-volumes": "false",
          "--kubeconfig": "/var/lib/kubelet/kubeconfig",
          "--max-pods": "110",
          "--network-plugin": "cni",
          "--node-status-update-frequency": "10s",
          "--non-masquerade-cidr": "0.0.0.0/0",
          "--pod-infra-container-image": "k8s.gcr.io/pause-amd64:3.1",
          "--pod-manifest-path": "/etc/kubernetes/manifests",
          "--pod-max-pids": "-1",
          "--read-only-port": "0",
          "--rotate-certificates": "true",
          "--streaming-connection-idle-timeout": "5m",
          "--tls-cert-file": "/etc/kubernetes/certs/kubeletserver.crt",
          "--tls-cipher-suites": "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256",
          "--tls-private-key-file": "/etc/kubernetes/certs/kubeletserver.key"
        },
        "controllerManagerConfig": {
          "--allocate-node-cidrs": "true",
          "--cloud-config": "/etc/kubernetes/azure.json",
          "--cloud-provider": "azure",
          "--cluster-cidr": "10.244.0.0/16",
          "--cluster-name": "qa-qknows-k8s-8164",
          "--cluster-signing-cert-file": "/etc/kubernetes/certs/ca.crt",
          "--cluster-signing-key-file": "/etc/kubernetes/certs/ca.key",
          "--configure-cloud-routes": "true",
          "--controllers": "*,bootstrapsigner,tokencleaner",
          "--feature-gates": "LocalStorageCapacityIsolation=true,ServiceNodeExclusion=true",
          "--kubeconfig": "/var/lib/kubelet/kubeconfig",
          "--leader-elect": "true",
          "--node-monitor-grace-period": "40s",
          "--pod-eviction-timeout": "5m0s",
          "--profiling": "false",
          "--root-ca-file": "/etc/kubernetes/certs/ca.crt",
          "--route-reconciliation-period": "10s",
          "--service-account-private-key-file": "/etc/kubernetes/certs/apiserver.key",
          "--terminated-pod-gc-threshold": "5000",
          "--use-service-account-credentials": "true",
          "--v": "2"
        },
        "cloudControllerManagerConfig": {
          "--allocate-node-cidrs": "true",
          "--cloud-config": "/etc/kubernetes/azure.json",
          "--cloud-provider": "azure",
          "--cluster-cidr": "10.244.0.0/16",
          "--cluster-name": "qa-qknows-k8s-8164",
          "--configure-cloud-routes": "true",
          "--controllers": "*",
          "--kubeconfig": "/var/lib/kubelet/kubeconfig",
          "--leader-elect": "true",
          "--route-reconciliation-period": "10s",
          "--v": "2"
        },
        "apiServerConfig": {
          "--advertise-address": "<advertiseAddr>",
          "--allow-privileged": "true",
          "--anonymous-auth": "false",
          "--audit-log-maxage": "30",
          "--audit-log-maxbackup": "10",
          "--audit-log-maxsize": "100",
          "--audit-log-path": "/var/log/kubeaudit/audit.log",
          "--audit-policy-file": "/etc/kubernetes/addons/audit-policy.yaml",
          "--authorization-mode": "Node,RBAC",
          "--bind-address": "0.0.0.0",
          "--client-ca-file": "/etc/kubernetes/certs/ca.crt",
          "--cloud-config": "/etc/kubernetes/azure.json",
          "--cloud-provider": "azure",
          "--enable-admission-plugins": "NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,ValidatingAdmissionWebhook,ResourceQuota,ExtendedResourceToleration",
          "--enable-bootstrap-token-auth": "true",
          "--etcd-cafile": "/etc/kubernetes/certs/ca.crt",
          "--etcd-certfile": "/etc/kubernetes/certs/etcdclient.crt",
          "--etcd-keyfile": "/etc/kubernetes/certs/etcdclient.key",
          "--etcd-servers": "https://127.0.0.1:2379",
          "--insecure-port": "8080",
          "--kubelet-client-certificate": "/etc/kubernetes/certs/client.crt",
          "--kubelet-client-key": "/etc/kubernetes/certs/client.key",
          "--oidc-client-id": "***",
          "--oidc-groups-claim": "groups",
          "--oidc-issuer-url": "***",
          "--oidc-username-claim": "oid",
          "--profiling": "false",
          "--proxy-client-cert-file": "/etc/kubernetes/certs/proxy.crt",
          "--proxy-client-key-file": "/etc/kubernetes/certs/proxy.key",
          "--requestheader-allowed-names": "",
          "--requestheader-client-ca-file": "/etc/kubernetes/certs/proxy-ca.crt",
          "--requestheader-extra-headers-prefix": "X-Remote-Extra-",
          "--requestheader-group-headers": "X-Remote-Group",
          "--requestheader-username-headers": "X-Remote-User",
          "--secure-port": "443",
          "--service-account-key-file": "/etc/kubernetes/certs/apiserver.key",
          "--service-account-lookup": "true",
          "--service-cluster-ip-range": "10.0.0.0/16",
          "--storage-backend": "etcd3",
          "--tls-cert-file": "/etc/kubernetes/certs/apiserver.crt",
          "--tls-cipher-suites": "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA",
          "--tls-private-key-file": "/etc/kubernetes/certs/apiserver.key",
          "--v": "4"
        },
        "schedulerConfig": {
          "--kubeconfig": "/var/lib/kubelet/kubeconfig",
          "--leader-elect": "true",
          "--profiling": "false",
          "--v": "2"
        },
        "cloudProviderBackoffMode": "v2",
        "cloudProviderBackoff": true,
        "cloudProviderBackoffRetries": 6,
        "cloudProviderBackoffJitter": 1,
        "cloudProviderBackoffDuration": 5,
        "cloudProviderBackoffExponent": 1.5,
        "cloudProviderRateLimit": false,
        "cloudProviderRateLimitQPS": 3,
        "cloudProviderRateLimitQPSWrite": 30,
        "cloudProviderRateLimitBucket": 10,
        "cloudProviderRateLimitBucketWrite": 300,
        "cloudProviderDisableOutboundSNAT": false,
        "loadBalancerSku": "Basic",
        "maximumLoadBalancerRuleCount": 250,
        "kubeProxyMode": "iptables"
      }
    },
    "masterProfile": {
      "count": 3,
      "dnsPrefix": "qa-qknows-k8s-8164",
      "subjectAltNames": null,
      "vmSize": "Standard_D2s_v3",
      "osDiskSizeGB": 128,
      "vnetSubnetID": "/subscriptions/***/resourceGroups/qa_004_QKNOWS_K8s/providers/Microsoft.Network/virtualNetworks/kubernetes-vnet/subnets/kubernetes-subnet",
      "vnetCidr": "10.239.0.0/16",
      "firstConsecutiveStaticIP": "10.239.255.10",
      "storageProfile": "ManagedDisks",
      "oauthEnabled": false,
      "preProvisionExtension": null,
      "extensions": [],
      "distro": "ubuntu",
      "kubernetesConfig": {
        "kubeletConfig": {
          "--address": "0.0.0.0",
          "--anonymous-auth": "false",
          "--authentication-token-webhook": "true",
          "--authorization-mode": "Webhook",
          "--azure-container-registry-config": "/etc/kubernetes/azure.json",
          "--cgroups-per-qos": "true",
          "--client-ca-file": "/etc/kubernetes/certs/ca.crt",
          "--cloud-config": "/etc/kubernetes/azure.json",
          "--cloud-provider": "azure",
          "--cluster-dns": "10.0.0.10",
          "--cluster-domain": "cluster.local",
          "--enforce-node-allocatable": "pods",
          "--event-qps": "0",
          "--eviction-hard": "memory.available<750Mi,nodefs.available<10%,nodefs.inodesFree<5%",
          "--feature-gates": "PodPriority=true,RotateKubeletServerCertificate=true",
          "--image-gc-high-threshold": "85",
          "--image-gc-low-threshold": "80",
          "--image-pull-progress-deadline": "30m",
          "--keep-terminated-pod-volumes": "false",
          "--kubeconfig": "/var/lib/kubelet/kubeconfig",
          "--max-pods": "110",
          "--network-plugin": "cni",
          "--node-status-update-frequency": "10s",
          "--non-masquerade-cidr": "0.0.0.0/0",
          "--pod-infra-container-image": "k8s.gcr.io/pause-amd64:3.1",
          "--pod-manifest-path": "/etc/kubernetes/manifests",
          "--pod-max-pids": "-1",
          "--read-only-port": "0",
          "--rotate-certificates": "true",
          "--streaming-connection-idle-timeout": "5m",
          "--tls-cert-file": "/etc/kubernetes/certs/kubeletserver.crt",
          "--tls-cipher-suites": "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256",
          "--tls-private-key-file": "/etc/kubernetes/certs/kubeletserver.key"
        },
        "cloudProviderBackoffMode": ""
      },
      "availabilityProfile": "AvailabilitySet",
      "platformFaultDomainCount": 2,
      "cosmosEtcd": false
    },
    "agentPoolProfiles": [
      {
        "name": "dynamic",
        "count": 1,
        "vmSize": "Standard_E16s_v3",
        "osDiskSizeGB": 128,
        "osType": "Linux",
        "availabilityProfile": "VirtualMachineScaleSets",
        "storageProfile": "ManagedDisks",
        "vnetSubnetID": "/subscriptions/***/resourceGroups/qa_004_QKNOWS_K8s/providers/Microsoft.Network/virtualNetworks/kubernetes-vnet/subnets/kubernetes-subnet",
        "distro": "ubuntu",
        "kubernetesConfig": {
          "kubeletConfig": {
            "--address": "0.0.0.0",
            "--anonymous-auth": "false",
            "--authentication-token-webhook": "true",
            "--authorization-mode": "Webhook",
            "--azure-container-registry-config": "/etc/kubernetes/azure.json",
            "--cgroups-per-qos": "true",
            "--client-ca-file": "/etc/kubernetes/certs/ca.crt",
            "--cloud-config": "/etc/kubernetes/azure.json",
            "--cloud-provider": "azure",
            "--cluster-dns": "10.0.0.10",
            "--cluster-domain": "cluster.local",
            "--enforce-node-allocatable": "pods",
            "--event-qps": "0",
            "--eviction-hard": "memory.available<750Mi,nodefs.available<10%,nodefs.inodesFree<5%",
            "--feature-gates": "PodPriority=true,RotateKubeletServerCertificate=true",
            "--image-gc-high-threshold": "85",
            "--image-gc-low-threshold": "80",
            "--image-pull-progress-deadline": "30m",
            "--keep-terminated-pod-volumes": "false",
            "--kubeconfig": "/var/lib/kubelet/kubeconfig",
            "--max-pods": "110",
            "--network-plugin": "cni",
            "--node-status-update-frequency": "10s",
            "--non-masquerade-cidr": "0.0.0.0/0",
            "--pod-infra-container-image": "k8s.gcr.io/pause-amd64:3.1",
            "--pod-manifest-path": "/etc/kubernetes/manifests",
            "--pod-max-pids": "-1",
            "--read-only-port": "0",
            "--rotate-certificates": "true",
            "--streaming-connection-idle-timeout": "5m",
            "--tls-cert-file": "/etc/kubernetes/certs/kubeletserver.crt",
            "--tls-cipher-suites": "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256",
            "--tls-private-key-file": "/etc/kubernetes/certs/kubeletserver.key"
          },
          "cloudProviderBackoffMode": ""
        },
        "acceleratedNetworkingEnabled": true,
        "acceleratedNetworkingEnabledWindows": false,
        "vmssOverProvisioningEnabled": false,
        "auditDEnabled": false,
        "fqdn": "",
        "preProvisionExtension": null,
        "extensions": [],
        "singlePlacementGroup": true,
        "platformFaultDomainCount": null,
        "enableVMSSNodePublicIP": false
      },
      {
        "name": "graph",
        "count": 1,
        "vmSize": "Standard_E32s_v3",
        "osDiskSizeGB": 128,
        "osType": "Linux",
        "availabilityProfile": "VirtualMachineScaleSets",
        "storageProfile": "ManagedDisks",
        "vnetSubnetID": "/subscriptions/***/resourceGroups/qa_004_QKNOWS_K8s/providers/Microsoft.Network/virtualNetworks/kubernetes-vnet/subnets/kubernetes-subnet",
        "distro": "ubuntu",
        "kubernetesConfig": {
          "kubeletConfig": {
            "--address": "0.0.0.0",
            "--anonymous-auth": "false",
            "--authentication-token-webhook": "true",
            "--authorization-mode": "Webhook",
            "--azure-container-registry-config": "/etc/kubernetes/azure.json",
            "--cgroups-per-qos": "true",
            "--client-ca-file": "/etc/kubernetes/certs/ca.crt",
            "--cloud-config": "/etc/kubernetes/azure.json",
            "--cloud-provider": "azure",
            "--cluster-dns": "10.0.0.10",
            "--cluster-domain": "cluster.local",
            "--enforce-node-allocatable": "pods",
            "--event-qps": "0",
            "--eviction-hard": "memory.available<750Mi,nodefs.available<10%,nodefs.inodesFree<5%",
            "--feature-gates": "PodPriority=true,RotateKubeletServerCertificate=true",
            "--image-gc-high-threshold": "85",
            "--image-gc-low-threshold": "80",
            "--image-pull-progress-deadline": "30m",
            "--keep-terminated-pod-volumes": "false",
            "--kubeconfig": "/var/lib/kubelet/kubeconfig",
            "--max-pods": "110",
            "--network-plugin": "cni",
            "--node-status-update-frequency": "10s",
            "--non-masquerade-cidr": "0.0.0.0/0",
            "--pod-infra-container-image": "k8s.gcr.io/pause-amd64:3.1",
            "--pod-manifest-path": "/etc/kubernetes/manifests",
            "--pod-max-pids": "-1",
            "--read-only-port": "0",
            "--rotate-certificates": "true",
            "--streaming-connection-idle-timeout": "5m",
            "--tls-cert-file": "/etc/kubernetes/certs/kubeletserver.crt",
            "--tls-cipher-suites": "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256",
            "--tls-private-key-file": "/etc/kubernetes/certs/kubeletserver.key"
          },
          "cloudProviderBackoffMode": ""
        },
        "acceleratedNetworkingEnabled": true,
        "acceleratedNetworkingEnabledWindows": false,
        "vmssOverProvisioningEnabled": false,
        "auditDEnabled": false,
        "fqdn": "",
        "preProvisionExtension": null,
        "extensions": [],
        "singlePlacementGroup": true,
        "platformFaultDomainCount": null,
        "enableVMSSNodePublicIP": false
      },
      {
        "name": "static",
        "count": 1,
        "vmSize": "Standard_E16s_v3",
        "osDiskSizeGB": 128,
        "osType": "Linux",
        "availabilityProfile": "VirtualMachineScaleSets",
        "storageProfile": "ManagedDisks",
        "vnetSubnetID": "/subscriptions/***/resourceGroups/qa_004_QKNOWS_K8s/providers/Microsoft.Network/virtualNetworks/kubernetes-vnet/subnets/kubernetes-subnet",
        "distro": "ubuntu",
        "kubernetesConfig": {
          "kubeletConfig": {
            "--address": "0.0.0.0",
            "--anonymous-auth": "false",
            "--authentication-token-webhook": "true",
            "--authorization-mode": "Webhook",
            "--azure-container-registry-config": "/etc/kubernetes/azure.json",
            "--cgroups-per-qos": "true",
            "--client-ca-file": "/etc/kubernetes/certs/ca.crt",
            "--cloud-config": "/etc/kubernetes/azure.json",
            "--cloud-provider": "azure",
            "--cluster-dns": "10.0.0.10",
            "--cluster-domain": "cluster.local",
            "--enforce-node-allocatable": "pods",
            "--event-qps": "0",
            "--eviction-hard": "memory.available<750Mi,nodefs.available<10%,nodefs.inodesFree<5%",
            "--feature-gates": "PodPriority=true,RotateKubeletServerCertificate=true",
            "--image-gc-high-threshold": "85",
            "--image-gc-low-threshold": "80",
            "--image-pull-progress-deadline": "30m",
            "--keep-terminated-pod-volumes": "false",
            "--kubeconfig": "/var/lib/kubelet/kubeconfig",
            "--max-pods": "110",
            "--network-plugin": "cni",
            "--node-status-update-frequency": "10s",
            "--non-masquerade-cidr": "0.0.0.0/0",
            "--pod-infra-container-image": "k8s.gcr.io/pause-amd64:3.1",
            "--pod-manifest-path": "/etc/kubernetes/manifests",
            "--pod-max-pids": "-1",
            "--read-only-port": "0",
            "--rotate-certificates": "true",
            "--streaming-connection-idle-timeout": "5m",
            "--tls-cert-file": "/etc/kubernetes/certs/kubeletserver.crt",
            "--tls-cipher-suites": "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256",
            "--tls-private-key-file": "/etc/kubernetes/certs/kubeletserver.key"
          },
          "cloudProviderBackoffMode": ""
        },
        "acceleratedNetworkingEnabled": true,
        "acceleratedNetworkingEnabledWindows": false,
        "vmssOverProvisioningEnabled": false,
        "auditDEnabled": false,
        "fqdn": "",
        "preProvisionExtension": null,
        "extensions": [],
        "singlePlacementGroup": true,
        "platformFaultDomainCount": null,
        "enableVMSSNodePublicIP": false
      }
    ],
    "linuxProfile": {
      "adminUsername": "azureuser",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "***"
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "***",
      "secret": "***"
    },
    "certificateProfile": {
      "caCertificate": "***",
      "caPrivateKey": "***",
      "apiServerCertificate": "***",
      "apiServerPrivateKey": "***",
      "clientCertificate": "***",
      "clientPrivateKey": "***",
      "kubeConfigCertificate": "***",
      "kubeConfigPrivateKey": "***",
      "etcdServerCertificate": "***",
      "etcdServerPrivateKey": "***",
      "etcdClientCertificate": "***",
      "etcdClientPrivateKey": "***",
      "etcdPeerCertificates": [
        "***",
        "***",
        "***"
      ],
      "etcdPeerPrivateKeys": [
        "***",
        "***",
        "***"
      ]
    },
    "aadProfile": {
      "clientAppID": "***",
      "serverAppID": "***",
      "tenantID": "***"
    },
    "telemetryProfile": {
      "applicationInsightsKey": "***"
    }
  }
}

The cluster does not come up and cannot be reached via kubectl. Agent nodes do not join the cluster.

Expected behavior
Cluster can be shut down and restarted without problems.

AKS Engine version
0.53.0

Kubernetes version
1.16.4

Additional context

@chreichert added the bug label on Jul 22, 2020
@jackfrancis
Member

Are the master VMs running Ubuntu 16.04-LTS?

@jackfrancis
Member

Do you see the calico-node daemonset on the system (kubectl get daemonsets -n kube-system)?

@chreichert
Author

chreichert commented Jul 23, 2020

Yes, it's Ubuntu 16.04-LTS. "distro" in the api model is "ubuntu", as the cluster was originally installed with ACS-Engine 0.21.2.

NAME="Ubuntu"
VERSION="16.04.6 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.6 LTS"
VERSION_ID="16.04"

Access with kubectl is not possible, neither via the load balancer nor directly on any of the three master nodes (connection refused).

Here is the docker ps output from one of the masters:

CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES

a025a29b10e1 55e951e260e5 "/hyperkube kube-sch…" 3 minutes ago Up 3 minutes k8s_kube-scheduler_kube-scheduler-k8s-master-11480702-0_kube-system_df4c22bda061bce9a8e587c471a3947d_3
738d77cb8c49 k8s.gcr.io/pause-amd64:3.1 "/pause" 3 minutes ago Up 3 minutes k8s_POD_kube-scheduler-k8s-master-11480702-0_kube-system_df4c22bda061bce9a8e587c471a3947d_2
4f58ed69c474 9a976649de57 "/opt/kube-addons.sh" 3 minutes ago Up 3 minutes k8s_kube-addon-manager_kube-addon-manager-k8s-master-11480702-0_kube-system_e9f7884002d0626b5c6d893989fda64b_2
1fb14073e706 k8s.gcr.io/pause-amd64:3.1 "/pause" 3 minutes ago Up 3 minutes k8s_POD_kube-addon-manager-k8s-master-11480702-0_kube-system_e9f7884002d0626b5c6d893989fda64b_2
08f7861c417b 55e951e260e5 "/hyperkube kube-con…" 4 minutes ago Up 4 minutes k8s_kube-controller-manager_kube-controller-manager-k8s-master-11480702-0_kube-system_973970ab1c759cdaa0685dd614886203_3
9db1e340ba4d k8s.gcr.io/pause-amd64:3.1 "/pause" 4 minutes ago Up 4 minutes k8s_POD_kube-controller-manager-k8s-master-11480702-0_kube-system_973970ab1c759cdaa0685dd614886203_2
8e2ee3be562b k8s.gcr.io/pause-amd64:3.1 "/pause" 4 minutes ago Up 4 minutes k8s_POD_kube-apiserver-k8s-master-11480702-0_kube-system_d91bc4fdadd55face6a86786d22b17e2_2

@chreichert
Author

chreichert commented Jul 23, 2020

I found the reason: etcd was stopped on all masters. After manually restarting etcd on all three master nodes, the cluster formed and became ready again.

But why did etcd not start automatically after the restart of the nodes? Does that have anything to do with the previous upgrade?
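
For anyone who lands in the same state, a minimal sketch of the manual recovery just described (run on each master; the etcdctl TLS paths are taken from the api model above and may differ on other clusters):

# start the stopped service and confirm it is running
sudo systemctl start etcd
sudo systemctl status etcd --no-pager
# optional cluster health check against the local member
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/certs/ca.crt \
  --cert=/etc/kubernetes/certs/etcdclient.crt \
  --key=/etc/kubernetes/certs/etcdclient.key \
  endpoint health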

@chreichert
Author

chreichert commented Jul 23, 2020

I did a shutdown and restart of all master nodes and agent scale sets again. Same problem: etcd does not come up automatically:

root@k8s-master-11480702-0:~# systemctl status etcd
● etcd.service - etcd - highly-available key value store
Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: enabled)
Active: inactive (dead)
Docs: https://github.com/coreos/etcd
man:etcd

Jul 23 08:00:35 k8s-master-11480702-0 systemd[1]: Stopped etcd - highly-available key value store.

After restarting etcd manually on all three masters, the cluster is up and healthy again.

journalctl -u etcd around reboot

Jul 23 07:57:17 k8s-master-11480702-0 etcd[59574]: stopped streaming with peer d99721d80760f5b (writer)
Jul 23 07:57:17 k8s-master-11480702-0 etcd[59574]: stopped HTTP pipelining with peer d99721d80760f5b
Jul 23 07:57:17 k8s-master-11480702-0 etcd[59574]: stopped streaming with peer d99721d80760f5b (stream MsgApp v2 reader)
Jul 23 07:57:17 k8s-master-11480702-0 etcd[59574]: stopped streaming with peer d99721d80760f5b (stream Message reader)
Jul 23 07:57:17 k8s-master-11480702-0 etcd[59574]: stopped peer d99721d80760f5b
Jul 23 07:57:17 k8s-master-11480702-0 etcd[59574]: stopping peer affffa36a0ce35af...
Jul 23 07:57:17 k8s-master-11480702-0 etcd[59574]: stopped streaming with peer affffa36a0ce35af (writer)
Jul 23 07:57:17 k8s-master-11480702-0 etcd[59574]: stopped streaming with peer affffa36a0ce35af (writer)
Jul 23 07:57:17 k8s-master-11480702-0 etcd[59574]: stopped HTTP pipelining with peer affffa36a0ce35af
Jul 23 07:57:17 k8s-master-11480702-0 etcd[59574]: stopped streaming with peer affffa36a0ce35af (stream MsgApp v2 reader)
Jul 23 07:57:17 k8s-master-11480702-0 etcd[59574]: stopped streaming with peer affffa36a0ce35af (stream Message reader)
Jul 23 07:57:17 k8s-master-11480702-0 etcd[59574]: stopped peer affffa36a0ce35af
Jul 23 07:57:17 k8s-master-11480702-0 systemd[1]: Stopped etcd - highly-available key value store.
-- Reboot --
Jul 23 08:00:35 k8s-master-11480702-0 systemd[1]: Stopped etcd - highly-available key value store.
Jul 23 08:02:57 k8s-master-11480702-0 systemd[1]: Stopped etcd - highly-available key value store.
Jul 23 08:02:57 k8s-master-11480702-0 systemd[1]: Starting etcd - highly-available key value store...
Jul 23 08:02:58 k8s-master-11480702-0 etcd[6236]: recognized environment variable ETCD_NAME, but unused: shadowed by corresponding flag
Jul 23 08:02:58 k8s-master-11480702-0 etcd[6236]: recognized environment variable ETCD_DATA_DIR, but unused: shadowed by corresponding flag
Jul 23 08:02:58 k8s-master-11480702-0 etcd[6236]: etcd Version: 3.3.22
Jul 23 08:02:58 k8s-master-11480702-0 etcd[6236]: Git SHA: 282cce72f
Jul 23 08:02:58 k8s-master-11480702-0 etcd[6236]: Go Version: go1.13.4
Jul 23 08:02:58 k8s-master-11480702-0 etcd[6236]: Go OS/Arch: linux/amd64
Jul 23 08:02:58 k8s-master-11480702-0 etcd[6236]: setting maximum number of CPUs to 2, total number of available CPUs is 2
Jul 23 08:02:58 k8s-master-11480702-0 etcd[6236]: found invalid file/dir lost+found under data dir /var/lib/etcddisk (Ignore this if you are upgrading etcd)
Jul 23 08:02:58 k8s-master-11480702-0 etcd[6236]: the server is already initialized as member before, starting as etcd member...
Jul 23 08:02:58 k8s-master-11480702-0 etcd[6236]: peerTLS: cert = /etc/kubernetes/certs/etcdpeer0.crt, key = /etc/kubernetes/certs/etcdpeer0.key, ca = , trusted-ca = /etc/kubernetes/certs/ca.crt, client-cert-auth = true, crl-file = 
Jul 23 08:02:58 k8s-master-11480702-0 etcd[6236]: listening for peers on https://10.239.255.10:2380
Jul 23 08:02:58 k8s-master-11480702-0 etcd[6236]: listening for client requests on 10.239.255.10:2379
Jul 23 08:02:58 k8s-master-11480702-0 etcd[6236]: listening for client requests on 127.0.0.1:2379
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: recovered store from snapshot at index 100001
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: restore compact to 87736
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: name = k8s-master-11480702-0
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: data dir = /var/lib/etcddisk
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: member dir = /var/lib/etcddisk/member
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: heartbeat = 100ms
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: election = 1000ms
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: snapshot count = 100000
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: advertise client URLs = https://10.239.255.10:2379
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: restarting member 6680b3eb661ad6b9 in cluster 7c2b12d47dbc2006 at commit index 126052
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: 6680b3eb661ad6b9 became follower at term 482
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: newRaft 6680b3eb661ad6b9 [peers: [d99721d80760f5b,6680b3eb661ad6b9,affffa36a0ce35af], term: 482, commit: 126052, applied: 100001, lastindex: 126052, lastterm: 482]
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: enabled capabilities for version 3.3
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: added member 6680b3eb661ad6b9 [https://10.239.255.10:2380] to cluster 7c2b12d47dbc2006 from store
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: added member affffa36a0ce35af [https://10.239.255.12:2380] to cluster 7c2b12d47dbc2006 from store
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: added member d99721d80760f5b [https://10.239.255.11:2380] to cluster 7c2b12d47dbc2006 from store
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: set the cluster version to 3.3 from store
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: simple token is not cryptographically signed
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: restore compact to 87736
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: starting peer d99721d80760f5b...
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: started HTTP pipelining with peer d99721d80760f5b
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: started streaming with peer d99721d80760f5b (writer)
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: started streaming with peer d99721d80760f5b (writer)
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: started peer d99721d80760f5b
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: started streaming with peer d99721d80760f5b (stream Message reader)
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: started streaming with peer d99721d80760f5b (stream MsgApp v2 reader)
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: added peer d99721d80760f5b
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: starting peer affffa36a0ce35af...
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: started HTTP pipelining with peer affffa36a0ce35af
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: started peer affffa36a0ce35af
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: added peer affffa36a0ce35af
Jul 23 08:03:02 k8s-master-11480702-0 etcd[6236]: starting server... [version: 3.3.22, cluster version: 3.3]

So why does etcd not come up automatically after system restart?

@jackfrancis
Member

Yes, that is strange. Do you get this same result from this command?

$ sudo systemctl list-unit-files | grep enabled | grep etcd
etcd.service                           enabled 

@chreichert
Author

chreichert commented Jul 24, 2020

Yes, it's enabled, but it does not start:

   # systemctl list-unit-files | grep enabled | grep etcd
   etcd.service                               enabled 

Here again is the log after the reboot:

  Jul 23 14:48:20 k8s-master-11480702-0 systemd[1]: Stopped etcd - highly-available key value store.
  -- Reboot --
  Jul 24 06:49:43 k8s-master-11480702-0 systemd[1]: Stopped etcd - highly-available key value store.

Only that one line, saying that etcd is stopped.

Here is the status output:

   # systemctl status etcd
   ● etcd.service - etcd - highly-available key value store
      Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: enabled)
      Active: inactive (dead)
        Docs: https://github.com/coreos/etcd
              man:etcd
   
   Jul 24 06:49:43 k8s-master-11480702-0 systemd[1]: Stopped etcd - highly-available key value store.
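
To dig further into why systemd refuses to start the unit, a few diagnostic commands may help (a sketch; the mount unit name var-lib-etcddisk.mount is an assumption derived from the /var/lib/etcddisk data directory and systemd's mount-unit naming):

# show what etcd.service depends on
sudo systemctl list-dependencies etcd.service
# check whether systemd considers the etcd data-disk mount active
sudo systemctl status var-lib-etcddisk.mount --no-pager
# all messages from the current boot for etcd and the mount unit
sudo journalctl -b -u etcd -u var-lib-etcddisk.mount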

@chreichert
Author

I have successfully upgraded the cluster to 1.17.7 now. I had some problems with AAD auth during and after the upgrade (see #3637), but the cluster is running.
The upgrade to 1.17.7 did not fix the issue with etcd not starting after a system shutdown.

@chreichert
Author

@jackfrancis Any idea how to solve this problem with etcd not starting after a master node reboot?

We're still not confident about upgrading our PROD cluster with this problem active, in addition to the problem with AAD not working after the upgrade to 1.17.7 (#3637).

@chreichert
Author

chreichert commented Sep 2, 2020

Just installed a brand-new cluster, k8s version 1.18.8 with AKS-Engine 0.55.1, using our API model (private cluster, RBAC enabled, 3 master nodes as an availability set, 3 agent node pools as VMSS).
Same issue here. After shutting down all 3 masters (and the scale sets) together and restarting them again, the etcd service on all three masters does not come up; it is shown as inactive (dead) in systemctl status.

Additionally, in this cluster AAD login also does not work (#3637)

@chreichert
Author

chreichert commented Sep 2, 2020

Just tried to reboot a single master instance (master-1) on this new cluster. Again, etcd does not come up on its own on the rebooted node and shows as dead in systemctl status:

root@k8s-master-11480702-1:~# systemctl status etcd
● etcd.service - etcd - highly-available key value store
   Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: https://github.com/coreos/etcd
           man:etcd

A manual start of etcd does work.

@jackfrancis
Member

@chreichert AAD not working starting with 1.17 is a known issue: #3637

I'll try to repro the etcd not coming up after reboot (we don't see this in our tests), though I'll exclude the AAD configuration. What else besides private cluster should I configure the cluster for?

@chreichert
Author

@jackfrancis I could repro it even with a brand-new cluster, k8s version 1.18.8 with AKS-Engine 0.55.1, using our API model (private cluster, RBAC enabled, AAD enabled, networking: own subnet in a separate resource group with the kubenet plugin and calico policy, 3 master nodes as an availability set, 3 agent node pools as VMSS).

The kubernetes.json used for aks-engine generate is as follows (redacted):

{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorRelease": "1.18",
      "kubernetesConfig": {
        "enableRbac": true,
        "privateCluster": {
          "enabled": true
        },
        "networkPlugin": "kubenet",
        "networkPolicy": "calico"
      }
    },
    "aadProfile": {
      "serverAppID": "***",
      "clientAppID": "***",
      "tenantID": "***"
    },
    "masterProfile": {
      "count": 3,
      "dnsPrefix": "**owndnsprefix***",
      "vmSize": "Standard_D2s_v3",
      "OSDiskSizeGB": 128,
      "vnetSubnetId": "***",
      "firstConsecutiveStaticIP": "10.239.255.10",
      "vnetCidr": "10.239.0.0/16"
    },
    "agentPoolProfiles": [
      {
        "name": "dynamic",
        "count": 1,
        "vmSize": "Standard_E16s_v3",
        "OSDiskSizeGB": 128,
        "storageProfile": "ManagedDisks",
        "availabilityProfile": "VirtualMachineScaleSets",
        "vnetSubnetId": "***"
      },
      {
        "name": "graph",
        "count": 1,
        "vmSize": "Standard_E32s_v3",
        "OSDiskSizeGB": 128,
        "storageProfile": "ManagedDisks",
        "availabilityProfile": "VirtualMachineScaleSets",
        "vnetSubnetId": "***"
      },
      {
        "name": "static",
        "count": 1,
        "vmSize": "Standard_E16s_v3",
        "OSDiskSizeGB": 128,
        "storageProfile": "ManagedDisks",
        "availabilityProfile": "VirtualMachineScaleSets",
        "vnetSubnetId": "****"
      }
    ],
    "linuxProfile": {
      "adminUsername": "azureuser",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "***"
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "***",
      "secret": "***"
    }
  }
}
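
For context, a minimal sketch of the generate-and-deploy flow that consumes a model like the one above (the output path follows aks-engine's default _output/<dnsPrefix>/ layout; the resource group name is a placeholder):

# render the ARM templates from the api model
aks-engine generate --api-model kubernetes.json
# deploy the generated templates into an existing resource group
az deployment group create \
  --resource-group <resource-group> \
  --template-file _output/<dnsPrefix>/azuredeploy.json \
  --parameters _output/<dnsPrefix>/azuredeploy.parameters.json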

@jackfrancis
Member

I wasn't able to repro on a private 1.17.11 cluster:

azureuser@k8s-master-29056579-0:~$ uptime
 16:19:00 up 12 min,  1 user,  load average: 0.18, 0.50, 0.52
azureuser@k8s-master-29056579-0:~$ sudo systemctl status etcd -n 1
● etcd.service - etcd - highly-available key value store
   Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2020-09-02 16:07:34 UTC; 11min ago
     Docs: https://github.com/coreos/etcd
           man:etcd
 Main PID: 4499 (etcd)
    Tasks: 13 (limit: 4915)
   CGroup: /system.slice/etcd.service
           └─4499 /usr/bin/etcd --name k8s-master-29056579-0 --peer-client-cert-auth --peer-trusted-ca-file=/etc/kubernetes/certs/ca.crt --peer-cert-file=/etc/kubernetes/certs/etcdpeer0.crt --peer-key-file=

Sep 02 16:18:03 k8s-master-29056579-0 etcd[4499]: finished scheduled compaction at 1723 (took 3.440608ms)
azureuser@k8s-master-29056579-0:~$ sudo /sbin/shutdown -r now; exit
Connection to k8s-master-29056579-0 closed by remote host.
Connection to k8s-master-29056579-0 closed.
azureuser@my-jb:~$ ssh k8s-master-29056579-0

Authorized uses only. All activity may be monitored and reported.
Welcome to Ubuntu 18.04.5 LTS (GNU/Linux 5.3.0-1035-azure x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

  System information as of Wed Sep  2 16:24:23 UTC 2020

  System load:  0.81               Processes:              171
  Usage of /:   30.4% of 28.90GB   Users logged in:        0
  Memory usage: 12%                IP address for docker0: 172.17.0.1
  Swap usage:   0%                 IP address for azure0:  10.255.255.5

36 packages can be updated.
20 updates are security updates.


Last login: Wed Sep  2 16:18:11 2020 from 10.240.0.4
azureuser@k8s-master-29056579-0:~$ uptime
 16:24:25 up 3 min,  1 user,  load average: 0.81, 0.63, 0.28
azureuser@k8s-master-29056579-0:~$ sudo systemctl status etcd -n 1
● etcd.service - etcd - highly-available key value store
   Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2020-09-02 16:21:24 UTC; 3min 9s ago
     Docs: https://github.com/coreos/etcd
           man:etcd
 Main PID: 1190 (etcd)
    Tasks: 12 (limit: 4915)
   CGroup: /system.slice/etcd.service
           └─1190 /usr/bin/etcd --name k8s-master-29056579-0 --peer-client-cert-auth --peer-trusted-ca-file=/etc/kubernetes/certs/ca.crt --peer-cert-file=/etc/kubernetes/certs/etcdpeer0.crt --peer-key-file=

Sep 02 16:23:03 k8s-master-29056579-0 etcd[1190]: finished scheduled compaction at 2557 (took 1.195ms)

I'll build a cluster that more closely matches yours, except with no aadProfile (again, that functionality simply doesn't work right now in >= 1.17).

@jackfrancis
Member

I built a private 1.18.8 cluster in a custom VNET with calico + kubenet; it looks like this:

azureuser@my-jb:~$ kubectl get nodes
NAME                                 STATUS   ROLES    AGE     VERSION
k8s-agentpool1-89646023-vmss000000   Ready    agent    2m17s   v1.18.8
k8s-agentpool2-89646023-vmss000000   Ready    agent    2m54s   v1.18.8
k8s-agentpool3-89646023-vmss000000   Ready    agent    2m54s   v1.18.8
k8s-master-89646023-0                Ready    master   2m54s   v1.18.8
k8s-master-89646023-1                Ready    master   119s    v1.18.8
k8s-master-89646023-2                Ready    master   2m49s   v1.18.8
azureuser@my-jb:~$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                                 READY   STATUS    RESTARTS   AGE
kube-system   azure-ip-masq-agent-6bg4c                            1/1     Running   0          2m19s
kube-system   azure-ip-masq-agent-96xt7                            1/1     Running   0          2m19s
kube-system   azure-ip-masq-agent-9jgfk                            1/1     Running   0          2m19s
kube-system   azure-ip-masq-agent-j9f6b                            1/1     Running   0          2m19s
kube-system   azure-ip-masq-agent-wz6fz                            1/1     Running   0          2m19s
kube-system   azure-ip-masq-agent-x8gv9                            1/1     Running   0          2m16s
kube-system   blobfuse-flexvol-installer-6mqp9                     1/1     Running   0          102s
kube-system   blobfuse-flexvol-installer-9j2df                     1/1     Running   0          2m
kube-system   blobfuse-flexvol-installer-th9qp                     1/1     Running   0          2m
kube-system   calico-node-97sjf                                    1/1     Running   0          2m25s
kube-system   calico-node-9dlxv                                    1/1     Running   0          2m25s
kube-system   calico-node-jvzqv                                    1/1     Running   0          2m25s
kube-system   calico-node-mffsk                                    1/1     Running   0          2m16s
kube-system   calico-node-t7mqb                                    1/1     Running   0          2m25s
kube-system   calico-node-vgt6f                                    1/1     Running   0          2m25s
kube-system   calico-typha-6d7c7868cb-hlwfn                        1/1     Running   0          2m25s
kube-system   calico-typha-horizontal-autoscaler-f5cdbfff7-r7cxp   1/1     Running   0          2m15s
kube-system   coredns-56db558fb9-hcpdc                             1/1     Running   0          2m25s
kube-system   coredns-autoscaler-5c7db64899-fv7v9                  1/1     Running   0          2m24s
kube-system   csi-secrets-store-96fjq                              3/3     Running   0          2m
kube-system   csi-secrets-store-klb7w                              3/3     Running   0          2m
kube-system   csi-secrets-store-provider-azure-ldz8w               1/1     Running   0          2m
kube-system   csi-secrets-store-provider-azure-sxvdm               1/1     Running   0          102s
kube-system   csi-secrets-store-provider-azure-wf9k6               1/1     Running   0          2m
kube-system   csi-secrets-store-qsglk                              3/3     Running   0          102s
kube-system   kube-addon-manager-k8s-master-89646023-0             1/1     Running   0          2m31s
kube-system   kube-addon-manager-k8s-master-89646023-1             1/1     Running   0          70s
kube-system   kube-addon-manager-k8s-master-89646023-2             1/1     Running   0          2m34s
kube-system   kube-apiserver-k8s-master-89646023-0                 1/1     Running   0          2m13s
kube-system   kube-apiserver-k8s-master-89646023-1                 1/1     Running   0          96s
kube-system   kube-apiserver-k8s-master-89646023-2                 1/1     Running   0          2m13s
kube-system   kube-controller-manager-k8s-master-89646023-0        1/1     Running   0          2m15s
kube-system   kube-controller-manager-k8s-master-89646023-1        1/1     Running   0          106s
kube-system   kube-controller-manager-k8s-master-89646023-2        1/1     Running   0          2m20s
kube-system   kube-proxy-5fk74                                     1/1     Running   0          2m19s
kube-system   kube-proxy-76ngv                                     1/1     Running   0          2m19s
kube-system   kube-proxy-clzqp                                     1/1     Running   0          2m19s
kube-system   kube-proxy-jp8v6                                     1/1     Running   0          2m19s
kube-system   kube-proxy-kvd4d                                     1/1     Running   0          2m19s
kube-system   kube-proxy-w29nn                                     1/1     Running   0          2m16s
kube-system   kube-scheduler-k8s-master-89646023-0                 1/1     Running   0          2m10s
kube-system   kube-scheduler-k8s-master-89646023-1                 1/1     Running   0          99s
kube-system   kube-scheduler-k8s-master-89646023-2                 1/1     Running   0          2m21s
kube-system   metrics-server-6756f7f765-x7wjk                      1/1     Running   0          2m18s

I'll see what happens if I reboot the master VMs.

@jackfrancis
Member

Confirmed that the above cluster does not repro. If you'd like to prove it definitively, you can build out a cluster without the RBAC + AAD config and see whether that repros; I strongly suspect this symptom is related to that issue.

@chreichert
Author

@jackfrancis I will need some time to test that scenario, as my Azure test env is currently blocked with other tests.
I'll get back to you later this week.

@chreichert
Author

@jackfrancis I doubt it has anything to do with the AAD config, as it started for me when updating from 1.16.4 to 1.16.11 (with 0.53.0) or to 1.16.14 (with 0.55.1), respectively. Both versions still support AAD logins.

@chreichert
Author

@jackfrancis I just did a test with our api model with the AAD and RBAC settings removed. The problem could be reproduced: when I shut down all agent VMSS and master VMs (using the Azure portal) and then restart the master VMs and agent VMSS (also in the portal), the masters come up with the etcd service dead.

Here is the exact model I used for aks-engine generate (redacted):

{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorRelease": "1.18",
      "kubernetesConfig": {
        "addons": [
          {
            "name": "blobfuse-flexvolume",
            "enabled": false
          },
          {
            "name": "smb-flexvolume",
            "enabled": false
          },
          {
            "name": "kubernetes-dashboard",
            "enabled": true
          },
          {
            "name": "heapster",
            "enabled": true
          },
          {
            "name": "tiller",
            "enabled": true
          },
          {
            "name": "keyvault-flexvolume",
            "enabled": false
          },
          {
            "name": "cluster-autoscaler",
            "enabled": false,
            "containers": [
              {
                "name": "cluster-autoscaler",
                "cpuRequests": "100m",
                "memoryRequests": "300Mi",
                "cpuLimits": "100m",
                "memoryLimits": "300Mi"
              }
            ],
            "config": {
              "maxNodes": "5",
              "minNodes": "1"
            }
          }
        ],
        "privateCluster": {
          "enabled": true
        },
        "loadBalancerSku": "Basic",
        "networkPlugin": "kubenet",
        "networkPolicy": "calico",
        "cloudProviderBackoff": true,
        "cloudProviderBackoffRetries": 6,
        "cloudProviderBackoffJitter": 1,
        "cloudProviderBackoffDuration": 5,
        "cloudProviderBackoffExponent": 1.5,
        "cloudProviderRateLimit": false,
        "cloudProviderRateLimitQPS": 3,
        "cloudProviderRateLimitBucket": 10
      }
    },
    "masterProfile": {
      "count": 3,
      "dnsPrefix": "**dnsprefix***",
      "vmSize": "Standard_D2s_v3",
      "distro": "ubuntu",
      "OSDiskSizeGB": 128,
      "vnetSubnetId": "***",
      "firstConsecutiveStaticIP": "10.239.255.10",
      "vnetCidr": "10.239.0.0/16"
    },
    "agentPoolProfiles": [
      {
        "name": "dynamic",
        "count": 1,
        "vmSize": "Standard_E16s_v3",
        "distro": "ubuntu",
        "OSDiskSizeGB": 128,
        "storageProfile": "ManagedDisks",
        "availabilityProfile": "VirtualMachineScaleSets",
        "vnetSubnetId": "***"
      },
      {
        "name": "graph",
        "count": 1,
        "vmSize": "Standard_E32s_v3",
        "distro": "ubuntu",
        "OSDiskSizeGB": 128,
        "storageProfile": "ManagedDisks",
        "availabilityProfile": "VirtualMachineScaleSets",
        "vnetSubnetId": "***"
      },
      {
        "name": "static",
        "count": 1,
        "vmSize": "Standard_E16s_v3",
        "distro": "ubuntu",
        "OSDiskSizeGB": 128,
        "storageProfile": "ManagedDisks",
        "availabilityProfile": "VirtualMachineScaleSets",
        "vnetSubnetId": "****"
      }
    ],
    "linuxProfile": {
      "adminUsername": "azureuser",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "***"
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "***",
      "secret": "***"
    }
  }
}

The cluster is deployed in the North Europe region.

We also use a Basic load balancer, as our clusters were initially created with ACS-Engine, where this was the default.
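
For reference, the shutdown/restart cycle described above can also be driven with the Azure CLI instead of the portal; a minimal sketch (resource group, VM, and scale set names are placeholders):

# stop the masters and agent scale sets
az vm deallocate -g <resource-group> -n k8s-master-<id>-0        # repeat for each master
az vmss deallocate -g <resource-group> -n k8s-dynamic-<id>-vmss  # repeat for each pool
# start them again, then check etcd on each master
az vm start -g <resource-group> -n k8s-master-<id>-0
az vmss start -g <resource-group> -n k8s-dynamic-<id>-vmss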

@chreichert
Author

@jackfrancis Could this and #3807 be related?

@Michael-Sinz
Collaborator

@jackfrancis Could this, and #3807 be related?

I don't know for sure, but with some more details it could be. We found that it matters how the nodes are restarted and which parts of cloud-init run as a result of the restart mechanism. A simple "soft" reboot (initiated by the OS itself) tends not to have this problem, but a harder reboot via Azure causes more cloud-init operations (specifically, to the point where it triggers the extra fsck of the etcd disk and a manual mount of it), which breaks the Requires chain: etcd then fails to start because systemd does not consider the mount to have been executed (since it was not executed by systemd).

During an upgrade you are most likely doing more work like that, so without a soft reboot the problem described could be the cause. Your symptoms look consistent with what you wrote here. (The key items are the logs of the /var/lib/etcddisk mount and the fsck of that disk in journalctl.)
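
To make that dependency explicit in systemd terms, a minimal sketch (assumptions: the etcd data disk is mounted at /var/lib/etcddisk and is known to systemd as var-lib-etcddisk.mount; this only illustrates the broken Requires chain being described and is not the fix that shipped in aks-engine):

# drop-in so etcd.service only starts once the data-disk mount is active
sudo mkdir -p /etc/systemd/system/etcd.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/etcd.service.d/10-etcddisk.conf
[Unit]
RequiresMountsFor=/var/lib/etcddisk
EOF
sudo systemctl daemon-reload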

@chreichert
Author

@jackfrancis I just tested upgrading from 1.16.4 to 1.17.11 with the newest AKS-Engine v0.56.0:
This version fixes the issue. The cluster survives reboots from the portal again, and etcd starts up automatically as expected.
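
A quick way to confirm the same behaviour after a portal-initiated restart (a sketch, run on each master shortly after boot):

uptime
sudo systemctl is-active etcd    # should print "active" without any manual start
sudo systemctl status etcd -n 1 --no-pager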

@Michael-Sinz
Collaborator

@chreichert - Thanks for the verification!

@stale

stale bot commented Dec 12, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Dec 12, 2020
stale bot closed this as completed on Dec 25, 2020