Upgrade older cluster from 1.16.4 to 1.16.11 fails with CSE exit code 35 #3618

chreichert · 2020-07-20T13:04:46Z

Describe the bug
Upgrading an older cluster (initially created with ACS-Engine 0.21.2) from 1.16.4 to 1.16.11 stops while deploying the first upgraded master node, with this error: "VM has reported a failure when processing extension 'cse-master-0'. Error message: \"Enable failed: failed to execute command: command terminated with exit status=35"

Steps To Reproduce

The most recent upgrade of the cluster was done with AKS Engine version 0.45.0.
Resulting API model:

api-model

{
  "apiVersion": "vlabs",
  "location": "northeurope",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorRelease": "1.16",
      "orchestratorVersion": "1.16.4",
      "kubernetesConfig": {
        "kubernetesImageBase": "k8s.gcr.io/",
        "mcrKubernetesImageBase": "mcr.microsoft.com/k8s/core/",
        "clusterSubnet": "10.244.0.0/16",
        "dnsServiceIP": "10.0.0.10",
        "serviceCidr": "10.0.0.0/16",
        "networkPolicy": "calico",
        "networkPlugin": "kubenet",
        "containerRuntime": "docker",
        "dockerBridgeSubnet": "172.17.0.1/16",
        "mobyVersion": "3.0.8",
        "useInstanceMetadata": true,
        "enableRbac": true,
        "enableSecureKubelet": true,
        "enableAggregatedAPIs": true,
        "privateCluster": {
          "enabled": true
        },
        "gchighthreshold": 85,
        "gclowthreshold": 80,
        "etcdVersion": "3.3.15",
        "etcdDiskSizeGB": "1024",
        "enablePodSecurityPolicy": true,
        "addons": [
          {
            "name": "blobfuse-flexvolume",
            "enabled": false
          },
          {
            "name": "smb-flexvolume",
            "enabled": false
          },
          {
            "name": "keyvault-flexvolume",
            "enabled": false
          },
          {
            "name": "cluster-autoscaler",
            "enabled": false
          },
          {
            "name": "heapster",
            "enabled": true,
            "containers": [
              {
                "name": "heapster",
                "image": "k8s.gcr.io/heapster-amd64:v1.5.4",
                "cpuRequests": "88m",
                "memoryRequests": "204Mi",
                "cpuLimits": "88m",
                "memoryLimits": "204Mi"
              },
              {
                "name": "heapster-nanny",
                "image": "k8s.gcr.io/addon-resizer:1.8.5",
                "cpuRequests": "88m",
                "memoryRequests": "204Mi",
                "cpuLimits": "88m",
                "memoryLimits": "204Mi"
              }
            ]
          },
          {
            "name": "tiller",
            "enabled": true,
            "containers": [
              {
                "name": "tiller",
                "image": "gcr.io/kubernetes-helm/tiller:v2.13.1",
                "cpuRequests": "50m",
                "memoryRequests": "150Mi",
                "cpuLimits": "50m",
                "memoryLimits": "150Mi"
              }
            ],
            "config": {
              "max-history": "0"
            }
          },
          {
            "name": "aci-connector",
            "enabled": false
          },
          {
            "name": "kubernetes-dashboard",
            "enabled": true,
            "containers": [
              {
                "name": "kubernetes-dashboard",
                "image": "k8s.gcr.io/kubernetes-dashboard-amd64:v1.10.1",
                "cpuRequests": "300m",
                "memoryRequests": "150Mi",
                "cpuLimits": "300m",
                "memoryLimits": "150Mi"
              }
            ]
          },
          {
            "name": "rescheduler",
            "enabled": false
          },
          {
            "name": "metrics-server",
            "enabled": true,
            "containers": [
              {
                "name": "metrics-server",
                "image": "k8s.gcr.io/metrics-server-amd64:v0.3.4"
              }
            ]
          },
          {
            "name": "nvidia-device-plugin",
            "enabled": false
          },
          {
            "name": "container-monitoring",
            "enabled": false
          },
          {
            "name": "azure-cni-networkmonitor",
            "enabled": false
          },
          {
            "name": "azure-npm-daemonset",
            "enabled": false
          },
          {
            "name": "ip-masq-agent",
            "enabled": true,
            "containers": [
              {
                "name": "ip-masq-agent",
                "image": "k8s.gcr.io/ip-masq-agent-amd64:v2.5.0",
                "cpuRequests": "50m",
                "memoryRequests": "50Mi",
                "cpuLimits": "50m",
                "memoryLimits": "250Mi"
              }
            ],
            "config": {
              "enable-ipv6": "false",
              "non-masq-cni-cidr": "",
              "non-masquerade-cidr": "10.244.0.0/16",
              "secondary-non-masquerade-cidr": ""
            }
          },
          {
            "name": "dns-autoscaler",
            "enabled": false
          },
          {
            "name": "calico-daemonset",
            "enabled": true,
            "containers": [
              {
                "name": "calico-typha",
                "image": "calico/typha:v3.8.0"
              },
              {
                "name": "calico-cni",
                "image": "calico/cni:v3.8.0"
              },
              {
                "name": "calico-node",
                "image": "calico/node:v3.8.0"
              },
              {
                "name": "calico-pod2daemon",
                "image": "calico/pod2daemon-flexvol:v3.8.0"
              },
              {
                "name": "calico-cluster-proportional-autoscaler",
                "image": "k8s.gcr.io/cluster-proportional-autoscaler-amd64:1.1.2-r2"
              }
            ]
          },
          {
            "name": "cloud-node-manager",
            "enabled": false
          },
          {
            "name": "aad-pod-identity",
            "enabled": false
          },
          {
            "name": "appgw-ingress",
            "enabled": false
          },
          {
            "name": "azuredisk-csi-driver",
            "enabled": false
          },
          {
            "name": "azurefile-csi-driver",
            "enabled": false
          },
          {
            "name": "azure-policy",
            "enabled": false
          },
          {
            "name": "node-problem-detector",
            "enabled": false
          },
          {
            "name": "kube-dns",
            "enabled": false
          },
          {
            "name": "coredns",
            "enabled": true,
            "containers": [
              {
                "name": "coredns",
                "image": "k8s.gcr.io/coredns:1.6.5"
              }
            ],
            "config": {
              "clusterIP": "10.0.0.10",
              "domain": "cluster.local"
            }
          },
          {
            "name": "kube-proxy",
            "enabled": true,
            "containers": [
              {
                "name": "kube-proxy",
                "image": "k8s.gcr.io/hyperkube-amd64:v1.16.4"
              }
            ],
            "config": {
              "cluster-cidr": "10.244.0.0/16",
              "featureGates": "{}",
              "proxy-mode": "iptables"
            }
          }
        ],
        "kubeletConfig": {
          "--address": "0.0.0.0",
          "--anonymous-auth": "false",
          "--authentication-token-webhook": "true",
          "--authorization-mode": "Webhook",
          "--azure-container-registry-config": "/etc/kubernetes/azure.json",
          "--cgroups-per-qos": "true",
          "--client-ca-file": "/etc/kubernetes/certs/ca.crt",
          "--cloud-config": "/etc/kubernetes/azure.json",
          "--cloud-provider": "azure",
          "--cluster-dns": "10.0.0.10",
          "--cluster-domain": "cluster.local",
          "--enforce-node-allocatable": "pods",
          "--event-qps": "0",
          "--eviction-hard": "memory.available<750Mi,nodefs.available<10%,nodefs.inodesFree<5%",
          "--feature-gates": "PodPriority=true,RotateKubeletServerCertificate=true",
          "--image-gc-high-threshold": "85",
          "--image-gc-low-threshold": "80",
          "--image-pull-progress-deadline": "30m",
          "--keep-terminated-pod-volumes": "false",
          "--kubeconfig": "/var/lib/kubelet/kubeconfig",
          "--max-pods": "110",
          "--network-plugin": "cni",
          "--node-status-update-frequency": "10s",
          "--non-masquerade-cidr": "0.0.0.0/0",
          "--pod-infra-container-image": "k8s.gcr.io/pause-amd64:3.1",
          "--pod-manifest-path": "/etc/kubernetes/manifests",
          "--pod-max-pids": "-1",
          "--read-only-port": "0",
          "--rotate-certificates": "true",
          "--streaming-connection-idle-timeout": "5m",
          "--tls-cert-file": "/etc/kubernetes/certs/kubeletserver.crt",
          "--tls-cipher-suites": "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256",
          "--tls-private-key-file": "/etc/kubernetes/certs/kubeletserver.key"
        },
        "controllerManagerConfig": {
          "--allocate-node-cidrs": "true",
          "--cloud-config": "/etc/kubernetes/azure.json",
          "--cloud-provider": "azure",
          "--cluster-cidr": "10.244.0.0/16",
          "--cluster-name": "qa-qknows-k8s-8164",
          "--cluster-signing-cert-file": "/etc/kubernetes/certs/ca.crt",
          "--cluster-signing-key-file": "/etc/kubernetes/certs/ca.key",
          "--configure-cloud-routes": "true",
          "--controllers": "*,bootstrapsigner,tokencleaner",
          "--feature-gates": "LocalStorageCapacityIsolation=true,ServiceNodeExclusion=true",
          "--kubeconfig": "/var/lib/kubelet/kubeconfig",
          "--leader-elect": "true",
          "--node-monitor-grace-period": "40s",
          "--pod-eviction-timeout": "5m0s",
          "--profiling": "false",
          "--root-ca-file": "/etc/kubernetes/certs/ca.crt",
          "--route-reconciliation-period": "10s",
          "--service-account-private-key-file": "/etc/kubernetes/certs/apiserver.key",
          "--terminated-pod-gc-threshold": "5000",
          "--use-service-account-credentials": "true",
          "--v": "2"
        },
        "cloudControllerManagerConfig": {
          "--allocate-node-cidrs": "true",
          "--cloud-config": "/etc/kubernetes/azure.json",
          "--cloud-provider": "azure",
          "--cluster-cidr": "10.244.0.0/16",
          "--cluster-name": "qa-qknows-k8s-8164",
          "--configure-cloud-routes": "true",
          "--controllers": "*",
          "--kubeconfig": "/var/lib/kubelet/kubeconfig",
          "--leader-elect": "true",
          "--route-reconciliation-period": "10s",
          "--v": "2"
        },
        "apiServerConfig": {
          "--advertise-address": "<advertiseAddr>",
          "--allow-privileged": "true",
          "--anonymous-auth": "false",
          "--audit-log-maxage": "30",
          "--audit-log-maxbackup": "10",
          "--audit-log-maxsize": "100",
          "--audit-log-path": "/var/log/kubeaudit/audit.log",
          "--audit-policy-file": "/etc/kubernetes/addons/audit-policy.yaml",
          "--authorization-mode": "Node,RBAC",
          "--bind-address": "0.0.0.0",
          "--client-ca-file": "/etc/kubernetes/certs/ca.crt",
          "--cloud-config": "/etc/kubernetes/azure.json",
          "--cloud-provider": "azure",
          "--enable-admission-plugins": "NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,ValidatingAdmissionWebhook,ResourceQuota,ExtendedResourceToleration",
          "--enable-bootstrap-token-auth": "true",
          "--etcd-cafile": "/etc/kubernetes/certs/ca.crt",
          "--etcd-certfile": "/etc/kubernetes/certs/etcdclient.crt",
          "--etcd-keyfile": "/etc/kubernetes/certs/etcdclient.key",
          "--etcd-servers": "https://127.0.0.1:2379",
          "--insecure-port": "8080",
          "--kubelet-client-certificate": "/etc/kubernetes/certs/client.crt",
          "--kubelet-client-key": "/etc/kubernetes/certs/client.key",
          "--oidc-client-id": "***",
          "--oidc-groups-claim": "groups",
          "--oidc-issuer-url": "***",
          "--oidc-username-claim": "oid",
          "--profiling": "false",
          "--proxy-client-cert-file": "/etc/kubernetes/certs/proxy.crt",
          "--proxy-client-key-file": "/etc/kubernetes/certs/proxy.key",
          "--requestheader-allowed-names": "",
          "--requestheader-client-ca-file": "/etc/kubernetes/certs/proxy-ca.crt",
          "--requestheader-extra-headers-prefix": "X-Remote-Extra-",
          "--requestheader-group-headers": "X-Remote-Group",
          "--requestheader-username-headers": "X-Remote-User",
          "--secure-port": "443",
          "--service-account-key-file": "/etc/kubernetes/certs/apiserver.key",
          "--service-account-lookup": "true",
          "--service-cluster-ip-range": "10.0.0.0/16",
          "--storage-backend": "etcd3",
          "--tls-cert-file": "/etc/kubernetes/certs/apiserver.crt",
          "--tls-cipher-suites": "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA",
          "--tls-private-key-file": "/etc/kubernetes/certs/apiserver.key",
          "--v": "4"
        },
        "schedulerConfig": {
          "--kubeconfig": "/var/lib/kubelet/kubeconfig",
          "--leader-elect": "true",
          "--profiling": "false",
          "--v": "2"
        },
        "cloudProviderBackoffMode": "v2",
        "cloudProviderBackoff": true,
        "cloudProviderBackoffRetries": 6,
        "cloudProviderBackoffJitter": 1,
        "cloudProviderBackoffDuration": 5,
        "cloudProviderBackoffExponent": 1.5,
        "cloudProviderRateLimit": false,
        "cloudProviderRateLimitQPS": 3,
        "cloudProviderRateLimitQPSWrite": 30,
        "cloudProviderRateLimitBucket": 10,
        "cloudProviderRateLimitBucketWrite": 300,
        "cloudProviderDisableOutboundSNAT": false,
        "loadBalancerSku": "Basic",
        "maximumLoadBalancerRuleCount": 250,
        "kubeProxyMode": "iptables"
      }
    },
    "masterProfile": {
      "count": 3,
      "dnsPrefix": "qa-qknows-k8s-8164",
      "subjectAltNames": null,
      "vmSize": "Standard_D2s_v3",
      "osDiskSizeGB": 128,
      "vnetSubnetID": "/subscriptions/***/resourceGroups/qa_004_QKNOWS_K8s/providers/Microsoft.Network/virtualNetworks/kubernetes-vnet/subnets/kubernetes-subnet",
      "vnetCidr": "10.239.0.0/16",
      "firstConsecutiveStaticIP": "10.239.255.10",
      "storageProfile": "ManagedDisks",
      "oauthEnabled": false,
      "preProvisionExtension": null,
      "extensions": [],
      "distro": "ubuntu",
      "kubernetesConfig": {
        "kubeletConfig": {
          "--address": "0.0.0.0",
          "--anonymous-auth": "false",
          "--authentication-token-webhook": "true",
          "--authorization-mode": "Webhook",
          "--azure-container-registry-config": "/etc/kubernetes/azure.json",
          "--cgroups-per-qos": "true",
          "--client-ca-file": "/etc/kubernetes/certs/ca.crt",
          "--cloud-config": "/etc/kubernetes/azure.json",
          "--cloud-provider": "azure",
          "--cluster-dns": "10.0.0.10",
          "--cluster-domain": "cluster.local",
          "--enforce-node-allocatable": "pods",
          "--event-qps": "0",
          "--eviction-hard": "memory.available<750Mi,nodefs.available<10%,nodefs.inodesFree<5%",
          "--feature-gates": "PodPriority=true,RotateKubeletServerCertificate=true",
          "--image-gc-high-threshold": "85",
          "--image-gc-low-threshold": "80",
          "--image-pull-progress-deadline": "30m",
          "--keep-terminated-pod-volumes": "false",
          "--kubeconfig": "/var/lib/kubelet/kubeconfig",
          "--max-pods": "110",
          "--network-plugin": "cni",
          "--node-status-update-frequency": "10s",
          "--non-masquerade-cidr": "0.0.0.0/0",
          "--pod-infra-container-image": "k8s.gcr.io/pause-amd64:3.1",
          "--pod-manifest-path": "/etc/kubernetes/manifests",
          "--pod-max-pids": "-1",
          "--read-only-port": "0",
          "--rotate-certificates": "true",
          "--streaming-connection-idle-timeout": "5m",
          "--tls-cert-file": "/etc/kubernetes/certs/kubeletserver.crt",
          "--tls-cipher-suites": "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256",
          "--tls-private-key-file": "/etc/kubernetes/certs/kubeletserver.key"
        },
        "cloudProviderBackoffMode": ""
      },
      "availabilityProfile": "AvailabilitySet",
      "platformFaultDomainCount": 2,
      "cosmosEtcd": false
    },
    "agentPoolProfiles": [
      {
        "name": "dynamic",
        "count": 1,
        "vmSize": "Standard_E16s_v3",
        "osDiskSizeGB": 128,
        "osType": "Linux",
        "availabilityProfile": "VirtualMachineScaleSets",
        "storageProfile": "ManagedDisks",
        "vnetSubnetID": "/subscriptions/***/resourceGroups/qa_004_QKNOWS_K8s/providers/Microsoft.Network/virtualNetworks/kubernetes-vnet/subnets/kubernetes-subnet",
        "distro": "ubuntu",
        "kubernetesConfig": {
          "kubeletConfig": {
            "--address": "0.0.0.0",
            "--anonymous-auth": "false",
            "--authentication-token-webhook": "true",
            "--authorization-mode": "Webhook",
            "--azure-container-registry-config": "/etc/kubernetes/azure.json",
            "--cgroups-per-qos": "true",
            "--client-ca-file": "/etc/kubernetes/certs/ca.crt",
            "--cloud-config": "/etc/kubernetes/azure.json",
            "--cloud-provider": "azure",
            "--cluster-dns": "10.0.0.10",
            "--cluster-domain": "cluster.local",
            "--enforce-node-allocatable": "pods",
            "--event-qps": "0",
            "--eviction-hard": "memory.available<750Mi,nodefs.available<10%,nodefs.inodesFree<5%",
            "--feature-gates": "PodPriority=true,RotateKubeletServerCertificate=true",
            "--image-gc-high-threshold": "85",
            "--image-gc-low-threshold": "80",
            "--image-pull-progress-deadline": "30m",
            "--keep-terminated-pod-volumes": "false",
            "--kubeconfig": "/var/lib/kubelet/kubeconfig",
            "--max-pods": "110",
            "--network-plugin": "cni",
            "--node-status-update-frequency": "10s",
            "--non-masquerade-cidr": "0.0.0.0/0",
            "--pod-infra-container-image": "k8s.gcr.io/pause-amd64:3.1",
            "--pod-manifest-path": "/etc/kubernetes/manifests",
            "--pod-max-pids": "-1",
            "--read-only-port": "0",
            "--rotate-certificates": "true",
            "--streaming-connection-idle-timeout": "5m",
            "--tls-cert-file": "/etc/kubernetes/certs/kubeletserver.crt",
            "--tls-cipher-suites": "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256",
            "--tls-private-key-file": "/etc/kubernetes/certs/kubeletserver.key"
          },
          "cloudProviderBackoffMode": ""
        },
        "acceleratedNetworkingEnabled": true,
        "acceleratedNetworkingEnabledWindows": false,
        "vmssOverProvisioningEnabled": false,
        "auditDEnabled": false,
        "fqdn": "",
        "preProvisionExtension": null,
        "extensions": [],
        "singlePlacementGroup": true,
        "platformFaultDomainCount": null,
        "enableVMSSNodePublicIP": false
      },
      {
        "name": "graph",
        "count": 1,
        "vmSize": "Standard_E32s_v3",
        "osDiskSizeGB": 128,
        "osType": "Linux",
        "availabilityProfile": "VirtualMachineScaleSets",
        "storageProfile": "ManagedDisks",
        "vnetSubnetID": "/subscriptions/***/resourceGroups/qa_004_QKNOWS_K8s/providers/Microsoft.Network/virtualNetworks/kubernetes-vnet/subnets/kubernetes-subnet",
        "distro": "ubuntu",
        "kubernetesConfig": {
          "kubeletConfig": {
            "--address": "0.0.0.0",
            "--anonymous-auth": "false",
            "--authentication-token-webhook": "true",
            "--authorization-mode": "Webhook",
            "--azure-container-registry-config": "/etc/kubernetes/azure.json",
            "--cgroups-per-qos": "true",
            "--client-ca-file": "/etc/kubernetes/certs/ca.crt",
            "--cloud-config": "/etc/kubernetes/azure.json",
            "--cloud-provider": "azure",
            "--cluster-dns": "10.0.0.10",
            "--cluster-domain": "cluster.local",
            "--enforce-node-allocatable": "pods",
            "--event-qps": "0",
            "--eviction-hard": "memory.available<750Mi,nodefs.available<10%,nodefs.inodesFree<5%",
            "--feature-gates": "PodPriority=true,RotateKubeletServerCertificate=true",
            "--image-gc-high-threshold": "85",
            "--image-gc-low-threshold": "80",
            "--image-pull-progress-deadline": "30m",
            "--keep-terminated-pod-volumes": "false",
            "--kubeconfig": "/var/lib/kubelet/kubeconfig",
            "--max-pods": "110",
            "--network-plugin": "cni",
            "--node-status-update-frequency": "10s",
            "--non-masquerade-cidr": "0.0.0.0/0",
            "--pod-infra-container-image": "k8s.gcr.io/pause-amd64:3.1",
            "--pod-manifest-path": "/etc/kubernetes/manifests",
            "--pod-max-pids": "-1",
            "--read-only-port": "0",
            "--rotate-certificates": "true",
            "--streaming-connection-idle-timeout": "5m",
            "--tls-cert-file": "/etc/kubernetes/certs/kubeletserver.crt",
            "--tls-cipher-suites": "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256",
            "--tls-private-key-file": "/etc/kubernetes/certs/kubeletserver.key"
          },
          "cloudProviderBackoffMode": ""
        },
        "acceleratedNetworkingEnabled": true,
        "acceleratedNetworkingEnabledWindows": false,
        "vmssOverProvisioningEnabled": false,
        "auditDEnabled": false,
        "fqdn": "",
        "preProvisionExtension": null,
        "extensions": [],
        "singlePlacementGroup": true,
        "platformFaultDomainCount": null,
        "enableVMSSNodePublicIP": false
      },
      {
        "name": "static",
        "count": 1,
        "vmSize": "Standard_E16s_v3",
        "osDiskSizeGB": 128,
        "osType": "Linux",
        "availabilityProfile": "VirtualMachineScaleSets",
        "storageProfile": "ManagedDisks",
        "vnetSubnetID": "/subscriptions/***/resourceGroups/qa_004_QKNOWS_K8s/providers/Microsoft.Network/virtualNetworks/kubernetes-vnet/subnets/kubernetes-subnet",
        "distro": "ubuntu",
        "kubernetesConfig": {
          "kubeletConfig": {
            "--address": "0.0.0.0",
            "--anonymous-auth": "false",
            "--authentication-token-webhook": "true",
            "--authorization-mode": "Webhook",
            "--azure-container-registry-config": "/etc/kubernetes/azure.json",
            "--cgroups-per-qos": "true",
            "--client-ca-file": "/etc/kubernetes/certs/ca.crt",
            "--cloud-config": "/etc/kubernetes/azure.json",
            "--cloud-provider": "azure",
            "--cluster-dns": "10.0.0.10",
            "--cluster-domain": "cluster.local",
            "--enforce-node-allocatable": "pods",
            "--event-qps": "0",
            "--eviction-hard": "memory.available<750Mi,nodefs.available<10%,nodefs.inodesFree<5%",
            "--feature-gates": "PodPriority=true,RotateKubeletServerCertificate=true",
            "--image-gc-high-threshold": "85",
            "--image-gc-low-threshold": "80",
            "--image-pull-progress-deadline": "30m",
            "--keep-terminated-pod-volumes": "false",
            "--kubeconfig": "/var/lib/kubelet/kubeconfig",
            "--max-pods": "110",
            "--network-plugin": "cni",
            "--node-status-update-frequency": "10s",
            "--non-masquerade-cidr": "0.0.0.0/0",
            "--pod-infra-container-image": "k8s.gcr.io/pause-amd64:3.1",
            "--pod-manifest-path": "/etc/kubernetes/manifests",
            "--pod-max-pids": "-1",
            "--read-only-port": "0",
            "--rotate-certificates": "true",
            "--streaming-connection-idle-timeout": "5m",
            "--tls-cert-file": "/etc/kubernetes/certs/kubeletserver.crt",
            "--tls-cipher-suites": "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256",
            "--tls-private-key-file": "/etc/kubernetes/certs/kubeletserver.key"
          },
          "cloudProviderBackoffMode": ""
        },
        "acceleratedNetworkingEnabled": true,
        "acceleratedNetworkingEnabledWindows": false,
        "vmssOverProvisioningEnabled": false,
        "auditDEnabled": false,
        "fqdn": "",
        "preProvisionExtension": null,
        "extensions": [],
        "singlePlacementGroup": true,
        "platformFaultDomainCount": null,
        "enableVMSSNodePublicIP": false
      }
    ],
    "linuxProfile": {
      "adminUsername": "azureuser",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "***"
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "***",
      "secret": "***"
    },
    "certificateProfile": {
      "caCertificate": "***",
      "caPrivateKey": "***",
      "apiServerCertificate": "***",
      "apiServerPrivateKey": "***",
      "clientCertificate": "***",
      "clientPrivateKey": "***",
      "kubeConfigCertificate": "***",
      "kubeConfigPrivateKey": "***",
      "etcdServerCertificate": "***",
      "etcdServerPrivateKey": "***",
      "etcdClientCertificate": "***",
      "etcdClientPrivateKey": "***",
      "etcdPeerCertificates": [
        "***",
        "***",
        "***"
      ],
      "etcdPeerPrivateKeys": [
        "***",
        "***",
        "***"
      ]
    },
    "aadProfile": {
      "clientAppID": "***",
      "serverAppID": "***",
      "tenantID": "***"
    },
    "telemetryProfile": {
      "applicationInsightsKey": "***"
    }
  }
}

Upgrade this cluster to 1.16.11 using AKS Engine 0.53.0 with the command:

aks-engine upgrade --subscription-id --resource-group --location northeurope --api-model deployment-20191115_131752/arm-deploy/apimodel.json --upgrade-version 1.16.11 --auth-method client_secret --client-id --client-secret --debug

Produces the following error:

INFO[0612] Finished ARM Deployment (master-20-07-20T11.40.58-410958522). Error: Code="DeploymentFailed" Message="At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/DeployOperations for usage details." Details=[{"code":"Conflict","message":"{\r\n "status": "Failed",\r\n "error": {\r\n "code": "ResourceDeploymentFailure",\r\n "message": "The resource operation completed with terminal provisioning state 'Failed'.",\r\n "details": [\r\n {\r\n "code": "VMExtensionProvisioningError",\r\n "message": "VM has reported a failure when processing extension 'cse-master-0'. Error message: \"Enable failed: failed to execute command: command terminated with exit status=35\n[stdout]\nMon Jul 20 11:42:19 UTC 2020,k8s-master-11480702 0\n\n[stderr]\n\"\r\n\r\nMore information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshoot "\r\n }\r\n ]\r\n }\r\n}"}]
INFO[0612] Error creating upgraded master VM: k8s-master-11480702-0

Expected behavior
Cluster can be upgraded to 1.16.11 without errors.

AKS Engine version
0.53.0

Kubernetes version
1.16.4

Additional context
The /var/log/azure/cluster-provision.log on the failing master node shows that the hyperkube image could not be pulled:

timeout 1200 docker pull k8s.gcr.io/oss/kubernetes/hyperkube:v1.16.11
Error response from daemon: manifest for k8s.gcr.io/oss/kubernetes/hyperkube:v1.16.11 not found: manifest unknown: Failed to fetch "v1.16.11" from request "/v2/oss/kubernetes/hyperkube/manifests/v1.16.11".

'[' 60 -eq 60 ']'

echo Executed '"docker' pull 'k8s.gcr.io/oss/kubernetes/hyperkube:v1.16.11"' 60 times
Executed "docker pull k8s.gcr.io/oss/kubernetes/hyperkube:v1.16.11" 60 times

return 1

exit 35

jackfrancis · 2020-07-20T18:08:17Z

Hi @chreichert, could you retry this upgrade, and make sure that these are the api model configuration values inside kubernetesConfig:

"kubernetesImageBase": "mcr.microsoft.com/",
"kubernetesImageBaseType": "mcr",
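
For anyone applying this by script rather than by hand, here is a minimal sketch of the edit. It assumes the api model has the same layout as the one posted above (`properties.orchestratorProfile.kubernetesConfig`). The helper names `use_mcr_image_base` and `patch_api_model_file` are hypothetical and are not part of AKS Engine.

```python
import json
from pathlib import Path


def use_mcr_image_base(api_model: dict) -> dict:
    """Set the two kubernetesConfig values suggested above, in place."""
    k8s_config = api_model["properties"]["orchestratorProfile"]["kubernetesConfig"]
    k8s_config["kubernetesImageBase"] = "mcr.microsoft.com/"
    k8s_config["kubernetesImageBaseType"] = "mcr"
    return api_model


def patch_api_model_file(path: str) -> None:
    """Rewrite an api model file in place (hypothetical helper; back up the file first)."""
    model_path = Path(path)
    api_model = json.loads(model_path.read_text())
    model_path.write_text(json.dumps(use_mcr_image_base(api_model), indent=2) + "\n")


# Trimmed stand-in for apimodel.json; a real file carries many more fields.
model = {
    "properties": {
        "orchestratorProfile": {
            "kubernetesConfig": {"kubernetesImageBase": "k8s.gcr.io/"}
        }
    }
}
print(json.dumps(use_mcr_image_base(model), indent=2))
```

Other keys, such as `mcrKubernetesImageBase`, are left untouched by this sketch.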

chreichert · 2020-07-21T15:22:47Z

Thanks @jackfrancis, this helped. After modifying kubernetesImageBase and adding kubernetesImageBaseType in our apimodel.json as mentioned by you above, I could successfully upgrade our cluster to 1.16.11.

One thing to mention: kubernetes-dashboard was not cleanly reconciled during the upgrade. I ended up with two versions of the dashboard, old in namespace kube-system and new in namespace kubernetes-dashboard. I deleted the old deployment in namespace kube-system manually to clean things up. Hope that was enough to get rid of all old dashboard artefacts?

jackfrancis · 2020-07-21T15:26:20Z

@chreichert Glad that got you through. This is a bug, btw, that I'll look into today. In the meanwhile you have a workaround :/

Correct about post-upgrade cleanup. Depending on the version-to-version path and the initial cluster configuration, there may be leftover cruft; in your example you've observed the dashboard. metrics-server and other components may also need a nudge. You're doing the right thing by auditing your cluster after upgrade. Hopefully the set of things that need manual poking is consistent across your fleet of clusters, so that poking can be automated?
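
One way such an audit could start is sketched below. It assumes the input is the parsed JSON from `kubectl get deployments --all-namespaces -o json`, and that a leftover component keeps the same deployment name in both namespaces, which may not hold for every addon. The function name is hypothetical.

```python
from collections import defaultdict


def deployments_in_multiple_namespaces(deployment_list: dict) -> dict:
    """Map each deployment name to its namespaces when it appears in more than one."""
    namespaces_by_name = defaultdict(set)
    for item in deployment_list.get("items", []):
        metadata = item["metadata"]
        namespaces_by_name[metadata["name"]].add(metadata["namespace"])
    # Keep only names that show up in at least two namespaces.
    return {
        name: sorted(namespaces)
        for name, namespaces in namespaces_by_name.items()
        if len(namespaces) > 1
    }


# Hand-written sample mirroring the dashboard situation described in this thread.
sample = {
    "items": [
        {"metadata": {"name": "kubernetes-dashboard", "namespace": "kube-system"}},
        {"metadata": {"name": "kubernetes-dashboard", "namespace": "kubernetes-dashboard"}},
        {"metadata": {"name": "metrics-server", "namespace": "kube-system"}},
    ]
}
print(deployments_in_multiple_namespaces(sample))
```

This only flags candidates. Whether the older copy is safe to delete still needs a manual check.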

chreichert · 2020-07-21T15:33:37Z

@jackfrancis We're currently working out how to upgrade to the latest 1.17 or 1.18, testing on our test cluster before doing our production cluster. Going to the latest 1.16 was the first step. After testing our apps, I will continue with upgrading to 1.17.7 and so on. There are still some manual steps involved, but it's still manageable.
Thanks again for your valuable help, as always :-)

jackfrancis · 2020-07-21T15:39:41Z

@chreichert are you able to paste the original values of kubernetesImageBase and kubernetesImageBaseType in your api model before you changed them? That will help to ensure that PR #3625 has the proper fixes so that manual step is not needed.

chreichert · 2020-07-21T15:48:23Z

@jackfrancis These were the original settings before the upgrade:

    "kubernetesImageBase": "k8s.gcr.io/",
    "mcrKubernetesImageBase": "mcr.microsoft.com/k8s/core/",

kubernetesImageBaseType was not present before.

You can find the full apimodel in the original post above (folded).
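
These values line up with the pull that failed in the provisioning log. The sketch below assumes the image reference is formed by simply joining `kubernetesImageBase`, a repository path, and a tag. That is an assumption for illustration, not confirmed AKS Engine behaviour, and `image_ref` is a hypothetical helper.

```python
def image_ref(image_base: str, repository: str, tag: str) -> str:
    """Join a registry base, repository path, and tag into an image reference."""
    # Hypothetical helper; AKS Engine's real composition logic may differ.
    return f"{image_base.rstrip('/')}/{repository}:{tag}"


# Repository path and tag as they appear in the failing docker pull.
HYPERKUBE_REPO = "oss/kubernetes/hyperkube"
TAG = "v1.16.11"

# Original base from the api model: reproduces the reference that was not found.
print(image_ref("k8s.gcr.io/", HYPERKUBE_REPO, TAG))

# Base from the workaround: the same path under mcr.microsoft.com.
print(image_ref("mcr.microsoft.com/", HYPERKUBE_REPO, TAG))
```

Under that assumption, the failure looks like the older `k8s.gcr.io/` base being combined with a repository path that is only published under the MCR base.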

chreichert added the bug label Jul 20, 2020

jackfrancis mentioned this issue Jul 21, 2020

fix: reinforce MCR migration during upgrade for older clusters #3625

Merged

4 tasks

jackfrancis closed this as completed in #3625 Jul 21, 2020

chreichert mentioned this issue Jul 22, 2020

Master nodes on upgraded cluster do not come ready after restart #3628

Closed
