
Linkerd CNI pods not aware of the OIDC signing key auto-rotation by AKS #12573

Closed
Peeyush1989 opened this issue May 8, 2024 · 8 comments · Fixed by linkerd/linkerd2-proxy-init#440
Labels
bug env/aks Microsoft AKS

Comments

@Peeyush1989

What is the issue?

We are using a private AKS cluster on version 1.26.x, with linkerd stable version 2.14.2 configured and linkerd-cni enabled.

The AKS cluster has OIDC enabled, which auto-rotates the signing keys periodically.

After the OIDC keys were auto-rotated, all new pods got stuck with the following error:

```
FailedCreatePodSandBox (x556 over ) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "3756782430d4016076288c700b871e4325ca2d5d6bdd7a422697c7d3b54d23e6": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized
```

  • We found that the issue started after an automatic RotateServiceAccountSigningKeys operation.
  • We tried reconciling the cluster by running `az aks update`, but the issue persisted.
  • We tried creating a new token for the default service account in the default namespace and creating a new pod with it, but the issue persisted.
  • We then tried running `az aks oidc-issuer rotate-signing-keys` twice, but the issue persisted.
  • Lastly, since the new pods were failing with an unauthorized linkerd error, we reasoned that the issue originated in the linkerd pods. We therefore deleted the linkerd-cni daemonset pods, which gave the new pods a fresh token and resolved the issue (see the restart sketch below).
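
A minimal sketch of that daemonset restart, assuming the default `linkerd-cni` namespace and daemonset name from a standard install-cni setup:

```
# Restart linkerd-cni so each pod re-renders its kubeconfig with a fresh
# serviceaccount token (namespace and name assume the default install).
kubectl -n linkerd-cni rollout restart daemonset/linkerd-cni
kubectl -n linkerd-cni rollout status daemonset/linkerd-cni
```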

After restarting the linkerd-cni daemonset we were able to deploy new pods, but the existing pods in the linkerd-meshed namespaces started giving invalid certificate errors and inter-pod communication was impacted.

We checked the issuer certificate and it was valid. We had to redeploy linkerd to get rid of this issue.

We need your help troubleshooting linkerd issues with OIDC.

How can it be reproduced?

We need to manually rotate the OIDC signing keys on new infrastructure to reproduce this issue.
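
A minimal sketch of that manual rotation with the Azure CLI (resource group and cluster names are placeholders):

```
# Manually rotate the OIDC issuer signing keys.
az aks oidc-issuer rotate-signing-keys \
  --resource-group myResourceGroup \
  --name myAKSCluster
```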

Logs, error output, etc

Linkerd control plane

```
[ 0.105506s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 0.306969s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 0.710647s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 1.211775s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 1.713047s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 2.215585s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 2.716391s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 3.217705s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
```

Output of `linkerd check -o short`

N/A

Environment

  • Kubernetes version: 1.26
  • Environment: AKS

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

yes

@Peeyush1989 Peeyush1989 added the bug label May 8, 2024
@olix0r olix0r added the env/aks Microsoft AKS label May 16, 2024

stale bot commented Aug 17, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Aug 17, 2024
@rootik

rootik commented Aug 20, 2024

We are experiencing the same issue on AKS 1.30.3 running linkerd stable-2.14.9.
The linkerd-cni tokens have started to expire every hour, so new pods are unable to start with the error mentioned above.
We've found out that Istio had the same issue and it was eventually addressed.
We've applied a temporary workaround of restarting the linkerd-cni daemonset every 50 minutes.
There's probably a better workaround, but I agree it's a bug and it has to be fixed.
It turned out that another workaround is to create a daemonset that mounts /host/etc/cni/net.d/ as a volume and then does the simplest thing, `sed -i 's/info/warn/g' /host/etc/cni/net.d/10-azure.conflist`, every 50 minutes and then sleeps.
Here's the result of the POC:

```
[2024-08-21 14:37:42] Detected change in /host/etc/cni/net.d/: MOVED_TO 10-azure.conflist
[2024-08-21 14:37:42] New file [10-azure.conflist] detected; re-installing
```
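
A minimal sketch of that workaround container command, assuming the hostPath mount described above; alternating the info/warn flip is an assumption here, so the file keeps changing on every pass:

```
# Periodically modify the Azure CNI conflist so linkerd-cni's file watcher
# re-installs its config and regenerates the kubeconfig with a fresh token.
# Assumes /etc/cni/net.d on the host is mounted at /host/etc/cni/net.d.
CONF=/host/etc/cni/net.d/10-azure.conflist
while true; do
  if grep -q 'info' "$CONF"; then
    sed -i 's/info/warn/g' "$CONF"
  else
    sed -i 's/warn/info/g' "$CONF"
  fi
  sleep 3000  # ~50 minutes, comfortably under the 1h token lifetime
done
```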

@stale stale bot removed the wontfix label Aug 20, 2024
@linkessgit

linkessgit commented Aug 23, 2024

We also have this issue on AKS clusters upgraded to Kubernetes version >=1.30 with the OIDC feature enabled. We contacted Microsoft and they confirmed that the service account token lifetime for such clusters is set to a 1-hour expiration. They also said that the 1-year token behavior is legacy, and that the 1-hour behavior is recommended and will be applied to all clusters, even without the OIDC feature enabled.
The problem, as described by @rootik and @Peeyush1989, is that the install-cni pod does not monitor service account token changes, so the new token is not reflected in the cni config, causing:

```
Unknown desc = failed to setup network for sandbox "3756782430d4016076288c700b871e4325ca2d5d6bdd7a422697c7d3b54d23e6": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized
```
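
One way to confirm the short lifetime on an affected cluster is to inspect the projected token volume in the linkerd-cni pod spec; a sketch, assuming the default namespace and the `k8s-app=linkerd-cni` pod label:

```
# Show the projected serviceaccount token source, including expirationSeconds
# (~3607 on clusters with the 1h behavior).
kubectl -n linkerd-cni get pod -l k8s-app=linkerd-cni -o yaml \
  | grep -B1 -A2 serviceAccountToken
```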

@amedinagar

Experiencing the same issue on AKS 1.30.3 running linkerd stable-2.14.9.
The install-cni pod does not monitor service account token changes, which causes the `Unknown desc ... Unauthorized` error described above.

@rootik

rootik commented Sep 5, 2024

A pull request with a possible fix was raised: linkerd/linkerd2-proxy-init#416

@oskarm93

Reproduced on linkerd edge-24.8.2 (stable-2.16) and AKS 1.30.4.

Added a daemonset with busybox running a script from this suggestion:

> It turned out that another workaround is to create a daemonset that mounts /host/etc/cni/net.d/ as a volume and then does the simplest thing, `sed -i 's/info/warn/g' /host/etc/cni/net.d/10-azure.conflist`, every 50 minutes and then sleeps. Here's the result of the POC:
>
> ```
> [2024-08-21 14:37:42] Detected change in /host/etc/cni/net.d/: MOVED_TO 10-azure.conflist
> [2024-08-21 14:37:42] New file [10-azure.conflist] detected; re-installing
> ```

https://gist.github.com/oskarm93/a6941dafc0a2af52f35af794c939f20f
https://gist.github.com/oskarm93/0f7a901f5ec5db5ae4e00e42176c98a4

```
[2024-11-13 10:22:12] Created CNI config /host/etc/cni/net.d/99-cni-fix.conflist
Setting up watches.
Watches established.
[2024-11-13 10:28:10] Detected change in /host/etc/cni/net.d/: MODIFY 99-cni-fix.conflist
[2024-11-13 10:28:10] New file [99-cni-fix.conflist] detected; re-installing
[2024-11-13 10:28:10] Using CNI config template from CNI_NETWORK_CONFIG environment variable.
      "k8s_api_root": "https://__KUBERNETES_SERVICE_HOST__:__KUBERNETES_SERVICE_PORT__",
      "k8s_api_root": "https://10.0.0.1:__KUBERNETES_SERVICE_PORT__",
[2024-11-13 10:28:10] CNI config: {
  "name": "linkerd-cni",
  "type": "linkerd-cni",
  "log_level": "info",
  "policy": {
      "type": "k8s",
      "k8s_api_root": "https://10.0.0.1:443",
      "k8s_auth_token": "__SERVICEACCOUNT_TOKEN__"
  },
  "kubernetes": {
      "kubeconfig": "/etc/cni/net.d/ZZZ-linkerd-cni-kubeconfig"
  },
  "linkerd": {
    "incoming-proxy-port": 4143,
    "outgoing-proxy-port": 4140,
    "proxy-uid": 2102,
    "ports-to-redirect": [],
    "inbound-ports-to-ignore": ["4191","4190"],
    "simulate": false,
    "use-wait-flag": false,
    "iptables-mode": "legacy",
    "ipv6": false
  }
}
[2024-11-13 10:28:10] Created CNI config /host/etc/cni/net.d/99-cni-fix.conflist
[2024-11-13 10:28:10] Detected change in /host/etc/cni/net.d/: MODIFY 99-cni-fix.conflist
[2024-11-13 10:28:10] Ignoring event: MODIFY /host/etc/cni/net.d/99-cni-fix.conflist; no real changes detected
[2024-11-13 10:28:10] Detected change in /host/etc/cni/net.d/: DELETE 99-cni-fix.conflist
[2024-11-13 10:28:10] Detected change in /host/etc/cni/net.d/: CREATE 99-cni-fix.conflist
[2024-11-13 10:28:10] Ignoring event: CREATE /host/etc/cni/net.d/99-cni-fix.conflist; no real changes detected
[2024-11-13 10:28:10] Detected change in /host/etc/cni/net.d/: MODIFY 99-cni-fix.conflist
[2024-11-13 10:28:10] Ignoring event: MODIFY /host/etc/cni/net.d/99-cni-fix.conflist; no real changes detected
```

alpeb added a commit that referenced this issue Nov 29, 2024
This change removes the `policy` entry from the cni config template,
which isn't used. That contained a `__SERVICEACCOUNT_TOKEN__`
placeholder, which was coupling this config file with the
`ZZZ-linkerd-cni-kubeconfig` file generated by linkerd-cni. An upcoming
PR will add support for detecting changes in the mounted serviceaccount
token file (see #12573), and the current change will facilitate that
effort.
alpeb added a commit to linkerd/linkerd2-proxy-init that referenced this issue Nov 29, 2024
Fixes linkerd/linkerd2#12573

## Problem

When deployed, the linkerd-cni pod gets its service account token mounted automatically by k8s:
```yaml
  - name: kube-api-access-729gv
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
```
According to this, the token is set to expire after an hour.
When the linkerd-cni pod starts, it deploys the file `ZZZ-linkerd-cni-kubeconfig` into the **host** file system.
That config contains the token sourced from `/var/run/secrets/kubernetes.io/serviceaccount` (mounted by the pod).
When the token gets rotated after an hour, the token file is updated but `ZZZ-linkerd-cni-kubeconfig` is not.
The `linkerd-cni` binary uses that token to connect to the kube-api, so an outdated token should prevent it from functioning properly, which would manifest as new pods in the data plane not being able to acquire a proper network config.
However, that failure isn't usually observed, except for the cases pointed out in linkerd/linkerd2#12573. The reason is that the token's actual lifetime is one year, due to kube-api's `--service-account-extend-token-expiration` [flag](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/#options), which is usually set to `true` to avoid breaking instances not yet adapted to use tokens with short expirations:

> Turns on projected service account expiration extension during token generation, which helps safe transition from legacy token to bound service account token feature. If this flag is enabled, admission injected tokens would be extended up to 1 year to prevent unexpected failure during transition, ignoring value of service-account-max-token-expiration.

## Repro

### AKS

The issue currently affects AKS clusters using OIDC keys. To reproduce, create a new cluster in AKS, making sure "Enable OIDC" and "Workload Identity" are ticked in the UI.
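
Equivalently from the CLI, a sketch with placeholder names (the two feature flags are the Azure CLI equivalents of those checkboxes):

```
# Create a test cluster with the OIDC issuer and workload identity enabled.
az aks create \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --enable-oidc-issuer \
  --enable-workload-identity \
  --generate-ssh-keys
```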

Then install the linkerd-cni plugin, labelling the linkerd-cni DaemonSet so that its ServiceAccount token is provided via OIDC:
```
linkerd install-cni --set-string "podLabels.azure\.workload\.identity/use"="true" | kubectl apply -f -
```

Then install linkerd with cni enabled, along with an injected instance of emojivoto.
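
A sketch of those two steps, assuming the standard CLI flags and the public emojivoto manifest:

```
# Install the control plane in CNI mode, then an injected emojivoto instance.
linkerd install --crds | kubectl apply -f -
linkerd install --linkerd-cni-enabled | kubectl apply -f -
curl -sSfL https://run.linkerd.io/emojivoto.yml | linkerd inject - | kubectl apply -f -
```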

The secret token is rotated after an hour, but the old one remains valid for 24h. Manually rotating the key as detailed in the [docs](https://learn.microsoft.com/en-us/azure/aks/use-oidc-issuer#rotate-the-oidc-key) should invalidate the old key.

After that, bouncing any emojivoto pod will prove unsuccessful with the following event being raised:

```
Warning  FailedCreatePodSandBox  15s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "8121291446642b272cea9ee5f083958a37bab0dd7060c4d9c06bb05fecf911d2": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized
```

## Fix

This change adds a new function `monitor_service_account_token()` that monitors the rollout of the token file, which is a symlink whose target changes as a new token is deployed. When it detects a new token file, this function calls the new `create_kubeconfig()` function.

This change also removes the existing logic around the DELETE event, which is a leftover from previous changes and is now a no-op.

Also, as detailed in linkerd/linkerd2#13407, the ServiceAccount token has been removed from the cni config template because it's not used, simplifying things: we can regenerate the kubeconfig file without having to touch the cni config file.

Finally, the file `linkerd-cni.conf.default` has been removed as it is not used.
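
A minimal sketch of that monitoring pattern (not the actual linkerd-cni implementation; the function name follows the description above). Projected token rotation swaps the `..data` symlink in the mount directory, which surfaces as directory events:

```
# Watch the serviceaccount mount; when the ..data symlink is swapped in for a
# new token, regenerate the kubeconfig. Requires inotify-tools.
TOKEN_DIR=/var/run/secrets/kubernetes.io/serviceaccount
inotifywait -m -e create,moved_to "$TOKEN_DIR" |
while read -r _dir _event file; do
  if [ "$file" = "..data" ]; then
    echo "token rotated; regenerating kubeconfig"
    create_kubeconfig  # hypothetical helper that re-renders ZZZ-linkerd-cni-kubeconfig
  fi
done
```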

## Test

Same as with the repro above, but use the cni-plugin image that contains the fix:

```
linkerd install-cni --set-string "podLabels.azure\.workload\.identity/use"="true" --set image.name="ghcr.io/alpeb/cni-plugin" --set image.version="v1.5.3" | kubectl apply -f -
```

After an hour, when the token gets rotated, you should see the event in the linkerd-cni pod logs.
alpeb added a commit to linkerd/linkerd2-proxy-init that referenced this issue Nov 29, 2024
alpeb added a commit to linkerd/linkerd2-proxy-init that referenced this issue Dec 6, 2024
alpeb added a commit that referenced this issue Dec 10, 2024
This change removes the `policy` entry from the cni config template, which isn't used. That contained a `__SERVICEACCOUNT_TOKEN__` placeholder, which was coupling this config file with the `ZZZ-linkerd-cni-kubeconfig` file generated by linkerd-cni. In linkerd/linkerd2-proxy-init#440 we add support for detecting changes in the mounted serviceaccount token file (see #12573), and the current change facilitates that effort.

Co-authored-by: Oliver Gould <ver@buoyant.io>
@alpeb
Member

alpeb commented Dec 12, 2024

Hi folks, thank you all for the continued feedback!
We finally released a new linkerd-cni version fixing this issue:
https://github.com/linkerd/linkerd2-proxy-init/releases/tag/cni-plugin%2Fv1.6.0
It's not yet the default version in the linkerd2-cni chart, but it'd be great if you could give it a try, setting `image.version: v1.6.0` in the values.yaml. Let me know how it goes! 🙂
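
For example, via the Helm chart (release and namespace names are placeholders; the value path follows the comment above):

```
# Point the linkerd2-cni chart at the fixed plugin image version.
helm upgrade --install linkerd2-cni linkerd/linkerd2-cni \
  --namespace linkerd-cni \
  --set image.version=v1.6.0
```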

@pdefreitas

pdefreitas commented Dec 16, 2024

@alpeb I've tried this with linkerd-cni chart 24.11.8 and the issue still occurs:

```
Image:          acrmecremote.azurecr.io/third-party/linkerd/cni-plugin:v1.6.0
Warning  FailedCreatePodSandBox  12s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized
```

AKS 1.30.5 running Azure CNI Node Subnet.

EDIT: I had to re-image all cluster nodes; it seems the cluster got into an impaired network state. I will retest the scenario above and update this comment.

EDIT 2: Just confirmed the issue still persists when the OIDC token expires.

New pods fail to launch:

```
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "***": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized
```

The CNI pod also fails to start on the node where the pod fails to launch:

```
linkerd-cni-5lngg 0/1 ContainerCreating 0 4m27s
```

Describing the CNI pod shows the same error:

```
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "***": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized
```
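
For reference, re-imaging the nodes as mentioned in the first edit can be done per node pool with the Azure CLI; a sketch with placeholder names:

```
# Re-image a node pool to the latest node image without a Kubernetes upgrade.
az aks nodepool upgrade \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool1 \
  --node-image-only
```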
