
aks-engine upgrade command fails on cluster deployed with acs-engine v0.18.8 and enableRbac set to false #905

Closed
OldSchooled opened this issue Mar 28, 2019 · 3 comments · Fixed by #946
Labels
bug Something isn't working

Comments

@OldSchooled

Is this a request for help?:
Yes.

Is this an ISSUE or FEATURE REQUEST?
Issue.

What version of aks-engine?:
acs-engine 0.18.8 & aks-engine 0.33.1

The template used is here:

{
    "apiVersion": "vlabs",
    "properties": {
        "orchestratorProfile": {
            "orchestratorType": "Kubernetes",
            "orchestratorVersion": "1.10.4",
            "kubernetesConfig": {
                "enableRbac": false
            }
        },
        "agentPoolProfiles": [
            {
                "storageProfile": "ManagedDisks",
                "name": "mtfoo0327",
                "count": 2,
                "osType": "Linux",
                "vmSize": "Standard_A2m_v2",
                "availabilityProfile": "VirtualMachineScaleSets"
            }
        ],
        "servicePrincipalProfile": {
            "clientId": "",
            "secret": ""
        },
        "linuxProfile": {
            "adminUsername": "foo",
            "ssh": {
                "publicKeys": [
                    {
                        "keyData": ""
                    }
                ]
            }
        },
        "masterProfile": {
            "storageProfile": "ManagedDisks",
            "count": 1,
            "dnsPrefix": "k8s-master-foo0327",
            "vmSize": "Standard_F2s"
        }
    }
}

Kubernetes version:
1.10.4 -> 1.11.8

What happened:
I have clusters that were provisioned with acs-engine version 0.18.8 and the flag enableRbac set to false. After switching from acs-engine to aks-engine 0.33.1 and then attempting the aks-engine upgrade command, I receive the following error:

{"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details.","details":[{"code":"Conflict","message":"{\r\n \"status\": \"Failed\",\r\n \"error\": {\r\n \"code\": \"ResourceDeploymentFailure\",\r\n \"message\": \"The resource operation completed with terminal provisioning state 'Failed'.\",\r\n \"details\": [\r\n {\r\n \"code\": \"VMExtensionProvisioningError\",\r\n \"message\": \"VM has reported a failure when processing extension 'cse-master-0'. Error message: \\\"Enable failed: failed to execute command: command terminated with exit status=30\\n[stdout]\\n\\n[stderr]\\nConnection to k8s.gcr.io 443 port [tcp/https] succeeded!\\nConnection to gcr.io 443 port [tcp/https] succeeded!\\nConnection to docker.io 443 port [tcp/https] succeeded!\\n\\\".\"\r\n }\r\n ]\r\n }\r\n}"}]}

This happens specifically when "enableRbac" is set to false; a cluster with RBAC set to true upgrades successfully. Inside the master VM we can see that the apiserver has crashed, and querying the docker container logs for the apiserver shows the error:

Error: unable to load client CA file: unable to load client CA file: open /etc/kubernetes/certs/proxy-ca.crt: no such file or directory
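For anyone reproducing this, something along these lines surfaces the same error on the master VM (the container name filter is an assumption, since the apiserver container name can vary):

# SSH to the master VM, then look for the (exited) apiserver container.
docker ps -a --format '{{.ID}}\t{{.Names}}\t{{.Status}}' | grep -i apiserver

# Dump the last lines of its logs; <container-id> comes from the command above.
docker logs <container-id> 2>&1 | tail -n 50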

Going through the generated apimodel.json, I have looked at every file path in every parameter and searched for the corresponding file on the VM. In addition to confirming the docker error above, I found that four files in total are listed in the generated apimodel.json but are not present on the VM after an attempted upgrade (a quick existence check is sketched after this list):

  • /var/log/audit.log
  • /etc/kubernetes/certs/proxy.crt
  • /etc/kubernetes/certs/proxy.key
  • /etc/kubernetes/certs/proxy-ca.crt

I'm assuming that audit.log isn't mission-critical, but I wanted to be thorough.

I can confirm that the proxy certs are, indeed, on the VM before the upgrade takes place, which suggests that they aren't being copied over correctly during the upgrade when enableRbac is set to false.

What you expected to happen:
I expect the upgrade to complete successfully.

How to reproduce it (as minimally and precisely as possible):

  • Using the provided template, run the acs-engine generate command with acs-engine version 0.18.8.
  • Run az group create to create a clean resource group.
  • Run az group deployment create using the output folder created in step one.
  • Using aks-engine version 0.33.1, run the aks-engine upgrade command with the output folder from above as the deployment-dir and 1.11.8 as the target version (approximate commands are sketched below).
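A rough sketch of those steps as shell commands, assuming the template above is saved as kubernetes.json; the resource group, location, and service principal values are placeholders, and the exact flag names (for example --deployment-dir) should be checked against the help output of the versions in use:

# 1. Generate the ARM templates with acs-engine 0.18.8.
acs-engine generate kubernetes.json

# 2. Create a clean resource group (name and location are placeholders).
az group create --name <resource-group> --location <location>

# 3. Deploy the generated templates (the output folder matches the dnsPrefix k8s-master-foo0327).
az group deployment create \
  --resource-group <resource-group> \
  --template-file _output/k8s-master-foo0327/azuredeploy.json \
  --parameters _output/k8s-master-foo0327/azuredeploy.parameters.json

# 4. Upgrade with aks-engine 0.33.1, targeting Kubernetes 1.11.8.
aks-engine upgrade \
  --subscription-id <subscription-id> \
  --resource-group <resource-group> \
  --location <location> \
  --deployment-dir _output/k8s-master-foo0327 \
  --upgrade-version 1.11.8 \
  --client-id <sp-client-id> \
  --client-secret <sp-client-secret>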

Anything else we need to know:
I have included the cloud-init-output.log and the cluster-provision.log as attachments.

cloud-init-output.log
cluster-provision.log

We will continue looking at this issue on our side and will post any further information to this space.

Thank you for your time.

@OldSchooled
Author

OldSchooled commented Mar 28, 2019

We have found further information and a temporary workaround.

Searching through the aks-engine code, we found this section in kubernetesmastercustomdata.yml:

{{if EnableAggregatedAPIs}}
sudo bash /etc/kubernetes/generate-proxy-certs.sh
{{end}}

This led us to manually set "enableAggregatedAPIs": true in our apimodel.json before upgrading, which results in the upgrade completing successfully.
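A sketch of that edit, assuming the flag lives under properties.orchestratorProfile.kubernetesConfig in the generated apimodel.json and that jq is available (the output path matches the dnsPrefix used above):

# Set enableAggregatedAPIs to true in the generated apimodel.json before running aks-engine upgrade.
jq '.properties.orchestratorProfile.kubernetesConfig.enableAggregatedAPIs = true' \
  _output/k8s-master-foo0327/apimodel.json > apimodel.json.tmp \
  && mv apimodel.json.tmp _output/k8s-master-foo0327/apimodel.json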

Also of note, we've noticed that in the Azure portal the ARM input parameters contain the enableAggregatedAPIs flag only for the upgrade deployment, where it is set to false. The original deployment doesn't have it at all.

@dennis-benzinger-hybris
Contributor

I think the problem we are seeing is a side effect of Azure/acs-engine#3813. When we provisioned the clusters with acs-engine 0.18.8, we used enableRbac: false and didn't specify enableAggregatedAPIs, so the (calculated) default of true (for Kubernetes >= 1.9) was used.

Now, after Azure/acs-engine#3813, enableAggregatedAPIs is enabled by default only if enableRbac is true, which is not the case for us.

Ideally the upgrade would use the same defaults as the original acs-engine / aks-engine version.

@jackfrancis
Member

@dennis-benzinger-hybris Thank you so much for summarizing. I think this change will protect clusters built prior to Azure/acs-engine#3813 from being upgraded with the wrong settings:

#946
