Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(azure): add support for workload identity using azidentity #3906

Conversation

weisdd
Copy link
Contributor

@weisdd weisdd commented Sep 1, 2023

Description

In Release 2022-10-10, AKS has brought support for Workload Identity Federation as a replacement for Pod-Managed Identities. The latter now is considered deprecated and is about to reach its end of life.

This PR adds support for Workload Identity using azidentity (which replaced adal in #3040).

Fixes #2724

NOTE: This PR comes to replace #3111 where I implemented workload identity using adal in a bit hacky way.

Checklist

  • Unit tests updated
  • End user documentation updated

Signed-off-by: Igor Beliakov <demtis.register@gmail.com>
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 1, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @weisdd. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 1, 2023
Comment on lines +423 to +427
{
"subscriptionId": "$(az account show --query id -o tsv)",
"resourceGroup": "$AZURE_DNS_ZONE_RESOURCE_GROUP",
"useWorkloadIdentityExtension": true
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a pro at Azure, but wonder if it is possible to avoid this mount..

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not the PR creator, but as long as Azure-workload-identity is installed in the Cluster, the Mutating Admission Webhook will take care of the projection based on just the podLabels & the service Account annotations.

So, perhaps the secretConfiguration section in the helm values can be done away with.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly, this is what I am thinking about

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the mutating webhook inserts environment variables that help to get an access token:

image

Whereas Subscription ID (subscriptionId) and Resource Group (resourceGroup) are needed for Azure provider (in-tree plugin in external-dns) to understand where a DNS zone is maintained. It's used to construct proper API calls against ARM.

As for the last option called useWorkloadIdentityExtension, it's needed to instruct provider which authentication method to use. Strictly speaking, azidentity offers DefaultAzureCredential function that can automatically detect authentication method:

image

https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/azidentity#section-readme
But I don't think it should get enabled under this PR. So, this option is also required, at least for now.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I think DefaultAzureCredential should always be preferred and if it had been in most projects azwi would've been easier to get going. But I don't know enough about this project to judge if that change should or should not be made in this pr or as it's own change so if you think that's not appropriate you probably know better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weisdd I don't have a strong opinion here as I haven't used DefaultAzureCredential in the past and I'm not familiar with the tradeoffs that come with it. Can you please clarify pros and cons of the two approaches?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Raffo Here's what I can think of:

  • Pros:
    • Easy-to-use;
    • No need to explicitly maintain configuration settings related to any of the authentication methods (e.g. when I looked into how to add support for Workload Identity to FluxCD, it was a matter of bumping deps in those controllers that relied on DefaultAzureCredential);
  • Cons:
    • If your environment satisfies multiple authentication methods, you can't directly override the order of preference;
    • You can't use it in a complex multi-tenant scenario when a token is not mounted, but referenced (multiple MSIs delegate access to different k8s service-accounts, so the app needs to have a more complex flow for getting tokens. It's used it external-secrets operator, docs).

I think none of the cons are applicable to external-dns at this point:

  • A container environment is unlikely to satisfy multiple authentication methods (why would anyone configure all of them?);
  • external-dns currently works with just one resource group (so, you don't have complex multi-tenant setups in the same instance).

Order of preference

DefaultAzureCredential has this order of preference when it come to authentication methods:

  • Environment - DefaultAzureCredential will read account information specified via environment variables and use it to authenticate.
  • Workload Identity - If the app is deployed on Kubernetes with environment variables set by the workload identity webhook, DefaultAzureCredential will authenticate the configured identity.
  • Managed Identity - If the app is deployed to an Azure host with managed identity enabled, DefaultAzureCredential will authenticate with it.
  • Azure CLI - If a user or service principal has authenticated via the Azure CLI az login command, DefaultAzureCredential will authenticate that identity.

Source: https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/azidentity#section-readme

I think this order is well thought-out and is unlikely to cause any issues:

  • Managed Identity will soon be gone (or is already gone) in most clusters;
  • Azure CLI is quite heavy for a container image and will only be pre-installed in local development environments. Devs can make the library use an SPN through environment variables if there's any need for that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the context, this is very helpful. If the switch to DefaultAzureCredential is trivial, then I wouldn't be opposed in having this in this PR. It would align us with other controllers and make this feature easier to adopt. At the end of the day, we are adding a new functionality, so it is not bad to make now the change that we need to make.
The only real consideration that I want to make is that I want to cut a release soon and I'm waiting for this one to merge. It's fine to do so, but if adding DefaultAzureCredential means waiting a lot of time or extensive testing due to lack of confidence in the change, then I'd prefer having this one in and handle the change in a follow up PR. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Raffo At a first glance, code-wise, migration to DefaultAzureCredentialt would be simple, something along the lines:

func getCredentials(cfg config) (azcore.TokenCredential, error) {
	cloudCfg, err := getCloudConfiguration(cfg.Cloud)
	if err != nil {
		return nil, fmt.Errorf("failed to get cloud configuration: %w", err)
	}

	opts := azidentity.DefaultAzureCredentialOptions{
		ClientOptions: azcore.ClientOptions{
			Cloud: cloudCfg,
		},
	}

	cred, err := azidentity.NewDefaultAzureCredential(&opts)
	if err != nil {
		return nil, fmt.Errorf("failed to obtain a token: %w", err)
	}

	return cred, nil
}

(+ remove old configuration fields here and there)

Testing is unlikely to be complex.

When it comes to documentation, it'll be more time-consuming:

  • We'll need to add some references to the docs for the new authentication logic and highlight important pieces there;
  • I think there are plenty of things to be simplified in the docs (including the part that I prepared in this PR), and we should probably share some code snippets (e.g. for terraform).

So, I'd prefer to go with the second option you suggested - to merge the current PR and then handle other changes in a follow-up PR. If I don't get busy at work, I'll do it this week.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good to me, thank you for your careful explanation and to be mindful to separate work appropriately, that is very appreciated.

Copy link
Contributor

@Raffo Raffo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR looks good to me and I am very thankful for the amount of documentation that is included.

What I request the community is to provide feedback on the instructions as I don't have an immediate way to test (I don't have an azure cluster ready, although I should be able to create one if someone can't provide information).

I'm also gonna ask the Azure folks what they think about this so that we can get some direct cloud provider feedback.

I'll keep an eye on this so it won't be delayed long.

Comment on lines +423 to +427
{
"subscriptionId": "$(az account show --query id -o tsv)",
"resourceGroup": "$AZURE_DNS_ZONE_RESOURCE_GROUP",
"useWorkloadIdentityExtension": true
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weisdd I don't have a strong opinion here as I haven't used DefaultAzureCredential in the past and I'm not familiar with the tradeoffs that come with it. Can you please clarify pros and cons of the two approaches?

@weisdd
Copy link
Contributor Author

weisdd commented Sep 2, 2023

@Raffo Thanks for your time and the review!

Important context for the documentation:

  • I tried to make it look close to the section around "Managed identity using AKS Kubelet identity" (present in the same document);
  • In a standard scenario, Workload Identity is configured in the same way for different apps, so I guess nobody would really look into how to enable the Workload Identity and OIDC issuer add-ons on their clusters, and people who have heard anything about Workload Identity are likely to be already familiar with where to set certain labels/annotations;
  • Microsoft offers a tool called azwi that can help to create service accounts and federated credentials. I haven't referred to it, but it can potentially be added.

I'll be happy to improve docs should any suggestions arise from community or Microsoft :)

@johngmyers
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 2, 2023
@bigfleet
Copy link

bigfleet commented Sep 3, 2023

As one Azure community member, what I know is that the Azure has finally hard-deprecated the pod identity mechanism, so that's not an option for use with external-dns. With this not yet landed, the only option is to create, manage, and inject your own special purpose service principal. Not impossible, but not very good either. I don't speak for MSFT/Azure but usage wise, it's pretty clear they created this functionality specifically so you wouldn't have to do that (and not just for external-dns).

Personally, I appreciate the constant consideration of backward compatibility, but external factors have broken this use case pretty thoroughly, and a release with perhaps some special handling/mention in changelog would be in order. Only one opinion.

@Raffo
Copy link
Contributor

Raffo commented Sep 4, 2023

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 4, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Raffo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 4, 2023
@k8s-ci-robot k8s-ci-robot merged commit add5e92 into kubernetes-sigs:master Sep 4, 2023
12 checks passed
'{"spec": {"template": {"metadata": {"labels": {\"azure.workload.identity/use\": \"true\"}}}}}'
```

NOTE: it's also possible to specify (or override) ClientID through `UserAssignedIdentityID` field in `azure.json`.
Copy link

@Sturgelose Sturgelose Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: according this line, I kinda understand that "if I do not set the ENV variables right through annotations, I can override/specify the clientId through the value in azure.json"

However, here, it feels it is the opposite: if the azure.json does not have the values, we default to the ENV variables.

I tried setting all the values via the azure.json and not using annotations but, the clientId is not fetching properly. May there be a problem in the config somewhere?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from the code, it seems the property to set is ClientID instead of UserAssignedIdentityID:

ClientID: cfg.ClientID,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, my bad, we should correct either docs or code. Sorry for that.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reply, I think this provides quite some insights on why it didn't work. So... which of both options do you prefer? Updating docs or the code?

I don't mind opening a PR, but I don't want to make a decision without an agreement on what we want.

From my side I wouldn't mind opening a PR to update the docs. Also I can test that ClientID and TenantID work as expected before reflecting that in the docs.

Copy link

@janpfischer janpfischer Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do encounter the same issue. For me it seems like the default way with annotation of client-id and pod label is not working. I get the error message that the client id is not present.
To set the clientId in the json file you'll have to set it in the key aadClientId and not in UserAssignedIdentityID like in the docs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sturgelose sorry for the delay in replies, intense period at work and some personal matters.
To be honest, not sure what's the best approach here. On one hand, changing code to UserAssignedIdentityID would make more sense in terms of namings. I think I used it in the initial implementation on the old library, but when preparing the new PR, somehow lost the fact that there are two ways to pass client ID in external-dns.
On the other, aadClientId is an "easier" change as it doesn't change behaviour.
At the same time, as the documentation is not fully accurate, it's unlikely that many people used overrides. By the way, could you expand on why you need to use explicit override through config instead of k8s annotations?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my point of view, it's better to stick with clientID as it's the environment variable the mutating webhook will present it in the pod through AZURE_CLIENT_ID as described in https://azure.github.io/azure-workload-identity/docs/installation/mutating-admission-webhook.html#mutating-admission-webhook
More other, UserAssignedIdentityID was just used for the old pod identity feature wich is deprecated so once it will be removed, it will make a weird sitution where UserAssignedIdentityID is used to populate AZURE_CLIENT_ID. I think it makes more sense to use the same vocabulary everywhere.

@janpfischer
Copy link

Hey @weisdd,

I've got problems with getting this to work with Azure China. I've linked the issue I've created in the azure-sdk-for-go above because I think the problem is maybe there but could you have a quick look into this, too to identity if maybe there is a problem in the external-dns implementation?
I am willing to provide test support if needed.

Thank you very much!

@jbpaux
Copy link
Contributor

jbpaux commented Oct 20, 2023

Hi @janpfischer it's been fix with this PR #3942 but it's not yet in a release. You'll have to use an image compiled from master.

@janpfischer
Copy link

janpfischer commented Oct 20, 2023

Thank you very much. I will test this!
EDIT: tested, works now!
For anyone who encounters this problem in the future: Make sure to use a release from external-dns which includes the Bugfix from #3942.

@jonasnorlund
Copy link

Gr8 work with this (long) one @weisdd, waiting for it... :-)

@jbpaux
Copy link
Contributor

jbpaux commented Oct 25, 2023

It's already released in https://github.com/kubernetes-sigs/external-dns/releases/tag/v0.13.6 @jonasnorlund (even if a small bug impacts sovereign tenants)

@jonasnorlund
Copy link

Hi, yes I saw it later, I just looked at Azure Private DNS and the docs wasn't updated there.

https://github.com/kubernetes-sigs/external-dns/blob/master/docs/tutorials/azure-private-dns.md

@weisdd
Copy link
Contributor Author

weisdd commented Oct 27, 2023

@jonasnorlund I thought of merging docs for public and private dns (the difference in external-dns settings is tiny), improve some of the sections, and experiment with migration from manual selection of authentication methods to an automated one, but some personal matters got into my way (load at work and some construction defects in a newly bought property). It's still in my plans, I just haven't had mental capacity for that yet. :/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Azure Workload Identities