
aks-engine upgrade command fails on cluster deployed with acs-engine v0.18.8 and enableRbac set to false #905

Closed
OldSchooled opened this issue Mar 28, 2019 · 3 comments · Fixed by #946
Labels
bug Something isn't working

Comments

@OldSchooled

Is this a request for help?:
Yes.

Is this an ISSUE or FEATURE REQUEST?
Issue.

What version of aks-engine?:
acs-engine 0.18.8 & aks-engine 0.33.1

The template used is here:

{
    "apiVersion": "vlabs",
    "properties": {
        "orchestratorProfile": {
            "orchestratorType": "Kubernetes",
            "orchestratorVersion": "1.10.4",
            "kubernetesConfig": {
                "enableRbac": false
            }
        },
        "agentPoolProfiles": [
            {
                "storageProfile": "ManagedDisks",
                "name": "mtfoo0327",
                "count": 2,
                "osType": "Linux",
                "vmSize": "Standard_A2m_v2",
                "availabilityProfile": "VirtualMachineScaleSets"
            }
        ],
        "servicePrincipalProfile": {
            "clientId": "",
            "secret": ""
        },
        "linuxProfile": {
            "adminUsername": "foo",
            "ssh": {
                "publicKeys": [
                    {
                        "keyData": ""
                    }
                ]
            }
        },
        "masterProfile": {
            "storageProfile": "ManagedDisks",
            "count": 1,
            "dnsPrefix": "k8s-master-foo0327",
            "vmSize": "Standard_F2s"
        }
    }
}

Kubernetes version:
1.10.4 -> 1.11.8

What happened:
I have clusters that were provisioned with acs-engine version 0.18.8 and the flag enableRbac set to false. After switching from acs-engine to aks-engine 0.33.1 and then attempting the aks-engine upgrade command, I receive the following error:

{"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details.","details":[{"code":"Conflict","message":"{\r\n \"status\": \"Failed\",\r\n \"error\": {\r\n \"code\": \"ResourceDeploymentFailure\",\r\n \"message\": \"The resource operation completed with terminal provisioning state 'Failed'.\",\r\n \"details\": [\r\n {\r\n \"code\": \"VMExtensionProvisioningError\",\r\n \"message\": \"VM has reported a failure when processing extension 'cse-master-0'. Error message: \\\"Enable failed: failed to execute command: command terminated with exit status=30\\n[stdout]\\n\\n[stderr]\\nConnection to k8s.gcr.io 443 port [tcp/https] succeeded!\\nConnection to gcr.io 443 port [tcp/https] succeeded!\\nConnection to docker.io 443 port [tcp/https] succeeded!\\n\\\".\"\r\n }\r\n ]\r\n }\r\n}"}]}

This happens specifically when "enableRbac" is set to false; a cluster with RBAC set to true upgrades successfully. Inside the master VM we can see that the apiserver has crashed, and querying the docker container logs for the apiserver shows the error:

Error: unable to load client CA file: unable to load client CA file: open /etc/kubernetes/certs/proxy-ca.crt: no such file or directory
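For anyone reproducing this, something along these lines surfaces the same error on the master VM (the container name filter is an assumption, since the apiserver container name can vary):

# SSH to the master VM, then look for the (exited) apiserver container.
docker ps -a --format '{{.ID}}\t{{.Names}}\t{{.Status}}' | grep -i apiserver

# Dump the last lines of its logs; <container-id> comes from the command above.
docker logs <container-id> 2>&1 | tail -n 50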

Going through the generated apimodel.json, I have looked at every file path in every parameter and searched for the corresponding file on the VM. In addition to confirming the docker error above, I found that four files in total are listed in the generated apimodel.json but are not present on the VM after an attempted upgrade (a quick existence check is sketched after this list):

  • /var/log/audit.log
  • /etc/kubernetes/certs/proxy.crt
  • /etc/kubernetes/certs/proxy.key
  • /etc/kubernetes/certs/proxy-ca.crt

I'm assuming that audit.log isn't mission-critical, but I wanted to be thorough.

I can confirm that the proxy certs are, indeed, on the VM before the upgrade takes place, which suggests that they aren't being copied over correctly during the upgrade when enableRbac is set to false.

What you expected to happen:
I expect the upgrade to complete successfully.

How to reproduce it (as minimally and precisely as possible):

  • Using the provided template, run the acs-engine generate command with acs-engine version 0.18.8.
  • Run az group create to create a clean resource group.
  • Run az group deployment create using the output folder created in step one.
  • Using aks-engine version 0.33.1, run the aks-engine upgrade command with the output folder from above as the deployment-dir and 1.11.8 as the target version (approximate commands are sketched below).
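A rough sketch of those steps as shell commands, assuming the template above is saved as kubernetes.json; the resource group, location, and service principal values are placeholders, and the exact flag names (for example --deployment-dir) should be checked against the help output of the versions in use:

# 1. Generate the ARM templates with acs-engine 0.18.8.
acs-engine generate kubernetes.json

# 2. Create a clean resource group (name and location are placeholders).
az group create --name <resource-group> --location <location>

# 3. Deploy the generated templates (the output folder matches the dnsPrefix k8s-master-foo0327).
az group deployment create \
  --resource-group <resource-group> \
  --template-file _output/k8s-master-foo0327/azuredeploy.json \
  --parameters _output/k8s-master-foo0327/azuredeploy.parameters.json

# 4. Upgrade with aks-engine 0.33.1, targeting Kubernetes 1.11.8.
aks-engine upgrade \
  --subscription-id <subscription-id> \
  --resource-group <resource-group> \
  --location <location> \
  --deployment-dir _output/k8s-master-foo0327 \
  --upgrade-version 1.11.8 \
  --client-id <sp-client-id> \
  --client-secret <sp-client-secret>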

Anything else we need to know:
I have included the cloud-init-output.log and the cluster-provision.log as attachments.

cloud-init-output.log
cluster-provision.log

We will continue looking at this issue on our side and will post any further information to this space.

Thank you for your time.

@OldSchooled
Author

OldSchooled commented Mar 28, 2019

We have found further information and a temporary workaround.

Searching through the aks-engine code, we found this section in kubernetesmastercustomdata.yml:

{{if EnableAggregatedAPIs}}
sudo bash /etc/kubernetes/generate-proxy-certs.sh
{{end}}

This led us to manually set "enableAggregatedAPIs": true in our apimodel.json before upgrading, which results in the upgrade completing successfully.
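A sketch of that edit, assuming the flag lives under properties.orchestratorProfile.kubernetesConfig in the generated apimodel.json and that jq is available (the output path matches the dnsPrefix used above):

# Set enableAggregatedAPIs to true in the generated apimodel.json before running aks-engine upgrade.
jq '.properties.orchestratorProfile.kubernetesConfig.enableAggregatedAPIs = true' \
  _output/k8s-master-foo0327/apimodel.json > apimodel.json.tmp \
  && mv apimodel.json.tmp _output/k8s-master-foo0327/apimodel.json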

Also of note, we've noticed that in the Azure portal the ARM input parameters contain the enableAggregatedAPIs flag only for the upgrade deployment, where it is set to false. The original deployment doesn't have it at all.

@dennis-benzinger-hybris
Contributor

I think the problem we are seeing is a side effect of Azure/acs-engine#3813. When we provisioned the clusters with acs-engine 0.18.8, we used enableRbac: false and didn't specify enableAggregatedAPIs, so the (calculated) default of true (for Kubernetes >= 1.9) was used.

Now, after Azure/acs-engine#3813, enableAggregatedAPIs is enabled by default only if enableRbac is true, which is not the case for us.

Ideally the upgrade would use the same defaults as the original acs-engine / aks-engine version.

@jackfrancis
Member

@dennis-benzinger-hybris Thank you so much for summarizing. I think this change will protect clusters built prior to Azure/acs-engine#3813 from being upgraded with the wrong settings:

#946
