az aks create with --enable-managed-identity breaks auth with ACR #1428

Closed
wasker opened this issue Feb 5, 2020 · 31 comments
Comments

@wasker

wasker commented Feb 5, 2020

What happened:
My newly created clusters fail to pull images from a private ACR. First I noticed it on 1.17.0, but it reproduces with 1.15.7.

What you expected to happen:
Images successfully pulled from ACR.

How to reproduce it (as minimally and precisely as possible):

The following assumes that you have an image built from servercore:ltsc2019 and pushed to your private ACR (e.g. myacr.azurecr.io/myrepo:tag). You're deploying that image to Windows nodes in both scenarios.

1.15.7 w/ managed identity:

az aks create --name aks-test-westus2 --resource-group rg-name --location westus2 --kubernetes-version 1.15.7 --generate-ssh-keys --windows-admin-username azureuser --windows-admin-password "P@ssword!1234" --vm-set-type VirtualMachineScaleSets --enable-addons monitoring --network-plugin azure --enable-managed-identity --node-vm-size Standard_D2_v3 --node-count 2 --enable-cluster-autoscaler --min-count 2 --max-count 5

az aks nodepool add --name win1 --cluster-name aks-test-westus2 --resource-group rg-name --kubernetes-version 1.15.7 --os-type Windows --node-vm-size Standard_D2_v3 --node-count 2 --enable-cluster-autoscaler --min-count 2 --max-count 10

Both autocreated xxx-agentpool managed identity and cluster SP are role-assigned AcrPull to myacr.
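For context, the two role assignments described above can be sketched with the CLI along these lines (a sketch only; the cluster/registry/resource-group names are the ones from this repro, and the SP app ID is a placeholder):

```shell
# Resource ID of the registry that the cluster should pull from.
ACR_ID=$(az acr show --name myacr --resource-group rg-name --query id --output tsv)

# Client ID of the auto-created "<cluster>-agentpool" kubelet identity
# (assumes a CLI recent enough to expose identityProfile on the cluster).
AGENTPOOL_CLIENT_ID=$(az aks show --name aks-test-westus2 --resource-group rg-name \
  --query identityProfile.kubeletidentity.clientId --output tsv)

# Grant AcrPull on the registry to the managed identity...
az role assignment create --assignee "$AGENTPOOL_CLIENT_ID" --scope "$ACR_ID" --role AcrPull

# ...and to the cluster service principal (replace with your SP's appId).
az role assignment create --assignee "<cluster-sp-app-id>" --scope "$ACR_ID" --role AcrPull
```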

pod/test-7699fdb4c4-mw48f      Failed to pull image "myacr.azurecr.io/myrepo:tag": rpc error: code = Unknown desc = unauthorized: authentication required

1.15.7 w/o managed identity:

az aks create --name aks-test-eastus2 --resource-group rg-name --location eastus2 --kubernetes-version 1.15.7 --generate-ssh-keys --windows-admin-username azureuser --windows-admin-password "P@ssword!1234" --vm-set-type VirtualMachineScaleSets --enable-addons monitoring --network-plugin azure --node-vm-size Standard_D2_v3 --node-count 2 --enable-cluster-autoscaler --min-count 2 --max-count 5

az aks nodepool add --name win1 --cluster-name aks-test-eastus2 --resource-group rg-name --kubernetes-version 1.15.7 --os-type Windows --node-vm-size Standard_D2_v3 --node-count 2 --enable-cluster-autoscaler --min-count 2 --max-count 10

Cluster SP is role-assigned AcrPull to myacr.

pod/test-7699fdb4c4-ctb98      Successfully pulled image "myacr.azurecr.io/myrepo:tag"

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.15.7 (but the same occurs on 1.17.0).
@jluk jluk added azure/acr Azure Container Registry Managed Identity labels Feb 6, 2020
@triage-new-issues triage-new-issues bot removed the triage label Feb 6, 2020
@sauryadas
Contributor

sauryadas commented Feb 6, 2020

@wasker Cluster SP is AcrPull to myacr. <- by this you mean you performed a role assignment?

@norshtein

@wasker
Author

wasker commented Feb 6, 2020

@sauryadas Correct. Let me update my comment to make it clear.

@norshtein
Member

Acknowledged. By design, kubelet should be able to pull the image from ACR as long as the identity is associated with the VMSS and the identity has the AcrPull permission. That appears to be broken now. We're investigating what happened.

@norshtein
Member

Is the ACR in a different subscription? Please check kubernetes/kubernetes#87579.

@sauryadas
Contributor

@wasker Is the ACR in a different subscription?

@wasker
Author

wasker commented Feb 12, 2020

No, ACR is in the same sub.

@heoelri

heoelri commented Apr 9, 2020

I am affected by the same issue. I deployed an AKS cluster today using the Terraform azurerm 2.5 provider.

resource "azurerm_kubernetes_cluster" "deployment" {
  name                = "${lower(random_pet.deployment.id)}aks"
  location            = azurerm_resource_group.deployment.location
  resource_group_name = azurerm_resource_group.deployment.name
  dns_prefix          = "${lower(random_pet.deployment.id)}aks"

  role_based_access_control {
    enabled = true
  }

  default_node_pool {
    name                = "default"
    vm_size             = "Standard_D2_v2"
    enable_auto_scaling = true
    min_count           = 1
    max_count           = 8
    vnet_subnet_id      = azurerm_subnet.kubernetes.id
    type                = "VirtualMachineScaleSets"
  }

  identity {
      type = "SystemAssigned"
  }

  addon_profile {
    oms_agent {
      enabled                    = true
      log_analytics_workspace_id = azurerm_log_analytics_workspace.deployment.id
    }
  }

  tags = {
    Environment = var.environment
  }
}

And assigned the cluster identity to the AcrPull role:

resource "azurerm_role_assignment" "acrpull_role" {
  scope                            = data.azurerm_subscription.primary.id
  role_definition_name             = "AcrPull"
  principal_id                     = azurerm_kubernetes_cluster.deployment.identity.0.principal_id
  skip_service_principal_aad_check = true
}

It works; my AKS identity is now assigned the AcrPull role at subscription level. But the pods are unable to pull images from an Azure Container Registry in the same subscription:

  Warning  Failed   11m (x596 over 121m)    kubelet, aks-default-12078679-vmss000000  Error: ImagePullBackOff
  Normal   BackOff  6m45s (x620 over 121m)  kubelet, aks-default-12078679-vmss000000  Back-off pulling image "actualtroutcr.azurecr.io/frontend:latest"

Even attaching the ACR directly does not work:

az aks update -n adjustedtreefrogaks -g adjustedtreefrog --attach-acr actualtroutcr

Waiting for AAD role to propagate[################################    ]  90.0000%
Could not create a role assignment for ACR. Are you an Owner on this subscription?

Even though I am an Owner on this subscription. Switching back to a new cluster w/o MI (same deployment with small modifications):

  service_principal {
    client_id     = var.client_id
    client_secret = var.client_secret
  }

  #identity {
  #    type = "SystemAssigned"
  #}

Works like a charm, as does the manual process of attaching an ACR:

az aks update -n helpedowlaks -g helpedowl  --attach-acr actualtroutcr
{
  ...
}

@TomGeske

TomGeske commented Apr 9, 2020

Ensure you have at least version 0.4.42 of the aks-preview extension installed.
Docs issue is reported here: MicrosoftDocs/azure-docs#51672.

@sauryadas sauryadas removed their assignment Apr 9, 2020
@heoelri

heoelri commented Apr 9, 2020

Thanks @TomGeske. Works for me with aks-preview 0.4.42 and azure-cli 2.3.1.

az --version

azure-cli                          2.3.1
...

Extensions:
aks-preview                       0.4.42

@mathieu-benoit

Hi @TomGeske and @heoelri, the fix on the azure-cli side will work with 2.3.1, but you shouldn't need the aks-preview extension for this to work, since the --enable-managed-identity feature is now GA. Just making sure.

@heoelri

heoelri commented Apr 9, 2020

@mathieu-benoit I've tried it with "plain" azure-cli v2.3.1, without the aks-preview extension.

az aks update -n apparentmonitoraks -g apparentmonitor --attach-acr actualtroutcr

AAD role propagation done[############################################] 100.0000%
Operation failed with status: 'Bad Request'. Details: UnmarshalEntity encountered error: json: cannot unmarshal bool into Go struct field Properties.properties.autoScalerProfile of type string.

@mathieu-benoit

mathieu-benoit commented Apr 9, 2020

Interesting @heoelri, thanks for confirming. FYI: there is another thread on this: MicrosoftDocs/azure-docs#51672. It looks like Azure CLI 2.3.1 did work for az aks update --attach-acr, but I just asked whether they have the aks-preview extension installed or not. Let's see.

@aristosvo

aristosvo commented Apr 9, 2020

And assigned the cluster identity to the AcrPull role:

@heoelri: You are probably assigning the pull permissions to the wrong identity. The role assignment should use the kubelet identity, not the managed identity of AKS itself.

So, for az cli:

KUBELET_IDENTITY_ID=$(az aks show -g $RESOURCE_GROUP -n $CLUSTER_NAME --query identityProfile.kubeletidentity.clientId -o tsv)
ACR_ID=$(az acr show --resource-group $RESOURCE_GROUP --name $ACR_NAME --query id --output tsv)
az role assignment create --assignee $KUBELET_IDENTITY_ID --scope $ACR_ID --role acrpull

This worked for me!

See this PR for the Azure Terraform provider, which exposes the kubelet_identity so we can finally configure it in Terraform: hashicorp/terraform-provider-azurerm#6393 🚀

@damienpontifex

Thanks @aristosvo 🎉

I was doing this with an ARM template and the valid line for role assignment principal id ended up being

"principalId": "[reference(variables('aksResourceId'), '2020-02-01').identityProfile.kubeletidentity.objectId]"

@garethmorris78

I'm having the same issue; I created a new AKS 1.16.7 cluster with a system-assigned identity. I have a system node pool, a Linux node pool and a Windows node pool.

A single managed identity (assigned as the kubelet identity) was created, suffixed with -agentpool.

azure cli version:

az version
{
  "azure-cli": "2.5.1",
  "azure-cli-command-modules-nspkg": "2.0.3",
  "azure-cli-core": "2.5.1",
  "azure-cli-nspkg": "3.0.4",
  "azure-cli-telemetry": "1.0.4",
  "extensions": {
    "azure-devops": "0.18.0"
  }
}

Ran the az command to integrate ACR and AKS as per: https://docs.microsoft.com/en-us/azure/aks/cluster-container-registry-integration

The kubelet identity was successfully added to the ACR RBAC with AcrPull permissions.
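One way to double-check that assignment (a sketch; the registry and resource-group names are placeholders):

```shell
# List role assignments on the registry and confirm AcrPull for the kubelet identity.
ACR_ID=$(az acr show -n myacr -g my-rg --query id -o tsv)
az role assignment list --scope "$ACR_ID" --role AcrPull \
  --query "[].{principal:principalName,role:roleDefinitionName}" -o table
```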

I'm able to pull ACR containers on the Linux nodes with no errors. The Windows nodes, however, produce the following error:

rpc error: code = Unknown desc = Error response from daemon: Get https://xxxxxx.azurecr.io/v2/xx/xx/xx: unauthorized: authentication required

Did the node pool add process for the Windows pool fail to create an additional kubelet identity for it? The deployment process completed smoothly, without any errors.

@TomGeske

@garethmorris78 this is a bug with MSI for Windows in AKS. We are currently fixing it. Could you open a separate issue and tag @mikkelhegn and myself?

I'm going ahead and closing this one here. Please, feel free to continue commenting.

@wasker
Author

wasker commented May 19, 2020

@TomGeske Is there a bug tracking MSI on Windows issues?

@TomGeske

@wasker no. Feel free to create one.

@wasker
Author

wasker commented May 19, 2020

I'm the original bug opener and I'm confused: I opened this bug to track the issue with MSI when it's used with Windows containers. Why did you close it and advise me to open another one that would look exactly the same?

@TomGeske

@wasker: Sorry, you're right. I misread it. I'll reopen this one and we'll keep you posted about the status.

@TomGeske TomGeske reopened this May 19, 2020
@wasker
Author

wasker commented May 19, 2020

Thank you!

@jjindrich

I have the same problem.

@mikkelhegn
Contributor

The fix is currently rolling out: https://github.com/Azure/AKS/releases/tag/2020-05-25

@TomGeske

The fix has been deployed.
Going ahead and closing this one. Feel free to continue commenting.

@bohlenc

bohlenc commented Jul 22, 2020

I am seeing some strange behavior on an AKS cluster configured to use managed identities which may be related to this issue:

Whenever a pod is scheduled on a node where one of its container images is not yet cached (or imagePullPolicy is Always), the pod goes into the ErrImagePull state (and finally into ImagePullBackOff) with the Docker daemon responding Failed to pull image "<IMAGE>": rpc error: code = Unknown desc = Error response from daemon: Get https://<ACR_NAME>.azurecr.io/v2/<REPO_NAME>/manifests/<IMAGE_TAG>: unauthorized: authentication required, visit https://aka.ms/acr/authorization for more information..

The curious thing is: after a few minutes (usually about 1-2 minutes) the issue resolves itself and the daemon is finally able to pull the image and create the pod.

When the image is cached on the node where the pod is scheduled, the pod can be created immediately and without issues.

Additional Information:
Kubernetes version: v1.18.4 (but also observed on v1.17.7)
CRI version: docker://3.0.10+azure
OS & Kernel version: Ubuntu 16.04.6 LTS, 4.15.0-1089-azure
ACR in different resource group but same subscription
kubelet identity has AcrPull role scoped to the ACR

Does the observed behavior fit here, or should I report it in a new issue?

@TomGeske

@ch-bohlen: Are you using windows or linux node pools?

@bohlenc

bohlenc commented Jul 22, 2020

@TomGeske I am using a Linux node pool. I updated my comment above.

@TomGeske

@ch-bohlen: Did you notice failed auth attempts in ACR? Did you try the steps mentioned here?

@bohlenc

bohlenc commented Jul 23, 2020

@TomGeske No, I am seeing neither failed auth nor failed image pulls in the ACR logs.
I just ran the proposed health checks, they all complete successfully.

@TomGeske

I would suggest opening a support ticket with us.

@nohajc

nohajc commented Aug 7, 2020

And assigned the cluster identity to the AcrPull role:

@heoelri: You are probably assigning the pull permissions to the wrong identity. The role assignment should use the kubelet identity, not the managed identity of AKS itself.

So, for az cli:

KUBELET_IDENTITY_ID=$(az aks show -g $RESOURCE_GROUP -n $CLUSTER_NAME --query identityProfile.kubeletidentity.clientId -o tsv)
ACR_ID=$(az acr show --resource-group $RESOURCE_GROUP --name $ACR_NAME --query id --output tsv)
az role assignment create --assignee $KUBELET_IDENTITY_ID --scope $ACR_ID --role acrpull

This worked for me!

See this PR for the Azure Terraform provider, which exposes the kubelet_identity so we can finally configure it in Terraform: terraform-providers/terraform-provider-azurerm#6393 🚀

OK, this advice is a bit confusing. For me, identityProfile is empty, so the aks query command doesn't return anything.
Anyway, the kubelet identity's name is <AKS cluster name>-agentpool (https://docs.microsoft.com/en-us/azure/aks/use-managed-identity#summary-of-managed-identities), and that's the one you have to use when assigning roles manually.

My issue was that I wanted to use a larger scope, but I assigned the role to identity.principalId. Then I used az aks update -n $CLUSTER_NAME -g $RESOURCE_GROUP --attach-acr $ACR_NAME instead and everything was set up correctly.
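For reference, looking up the auto-created kubelet identity by its conventional name and assigning a role manually can be sketched like this (placeholder names; assumes the identity lives in the cluster's node resource group, per the docs linked above):

```shell
# The kubelet identity is named "<cluster>-agentpool" and sits in the node resource group.
NODE_RG=$(az aks show -n my-aks -g my-rg --query nodeResourceGroup -o tsv)
KUBELET_OBJECT_ID=$(az identity show -n my-aks-agentpool -g "$NODE_RG" \
  --query principalId -o tsv)

# Assign AcrPull at whatever scope you need (here: a whole resource group).
RG_ID=$(az group show -n my-rg --query id -o tsv)
az role assignment create --assignee-object-id "$KUBELET_OBJECT_ID" \
  --assignee-principal-type ServicePrincipal --scope "$RG_ID" --role AcrPull
```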

@ghost ghost locked as resolved and limited conversation to collaborators Sep 7, 2020