Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify cni config #13407

Merged
merged 2 commits into from
Dec 10, 2024
Merged

Simplify cni config #13407

merged 2 commits into from
Dec 10, 2024

Conversation

alpeb
Copy link
Member

@alpeb alpeb commented Nov 29, 2024

This change removes the policy entry from the cni config template, which isn't used. That contained a __SERVICEACCOUNT_TOKEN__ placeholder, which was coupling this config file with the ZZZ-linkerd-cni-kubeconfig file generated by linkerd-cni. In linkerd/linkerd2-proxy-init#440 we add support for detecting changes in the mounted serviceaccount token file (see #12573), and the current change facilitates that effort.

This change removes the `policy` entry from the cni config template,
which isn't used. That contained a `__SERVICEACCOUNT_TOKEN__`
placeholder, which was coupling this config file with the
`ZZZ-linkerd-cni-kubeconfig` file generated by linkerd-cni. An upcoming
PR will add support for detecting changes in the mounted serviceaccount
token file (see #12573), and the current change will facilitate that
effort.
@alpeb alpeb requested a review from a team as a code owner November 29, 2024 13:48
@alpeb
Copy link
Member Author

alpeb commented Nov 29, 2024

Currently blocked by a rustc version issue when doing cargo install cargo-nextest in the policy tests.

alpeb added a commit to linkerd/linkerd2-proxy-init that referenced this pull request Nov 29, 2024
Fixes linkerd/linkerd2#12573

## Problem

When deployed, the linkerd-cni pod gets its service account token mounted automatically by k8s:
```yaml
  - name: kube-api-access-729gv
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
```
According to this, the token is set to expire after an hour.
When the linkerd-cni pod starts it deploys the file `ZZZ-linkerd-cni-kubeconfig` in to the **host** file system.
That config contains the token sourced from `/var/run/secrets/kubernetes.io/serviceaccount` (mounted by the pod).
When the token gets rotated after an hour, that token file is updated but `ZZZ-linkerd-cni-kubeconfig` is not updated.
The `linkerd-cni` binary uses that token to connect to the kube-api, so having an outdated token should forbid it from functioning properly, which would manifest as new pods in the data plane not being able to acquire a proper network config.
However, that failure isn't usually observed, except for the cases pointed out in linkerd/linkerd2#12573. The reason is that the token's actual lifetime is one year, due to kube-api's `--service-account-extend-token-expiration` [flag](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/#options) which is usually set as `true` to avoid breaking too many instances not yet adapted to use tokens with short expirations:

> Turns on projected service account expiration extension during token generation, which helps safe transition from legacy token to bound service account token feature. If this flag is enabled, admission injected tokens would be extended up to 1 year to prevent unexpected failure during transition, ignoring value of service-account-max-token-expiration.

## Repro

### AKS

The issue currently affects AKS clusters using OIDC keys. To reproduce, create a new cluster in AKS, making sure "Enable OIDC" and "Workload Identity" is ticked in the UI.

Then install the linkerd-cni plugin, labelling the linkerd-cni DaemonSet so that its ServiceAccount token is provided via OIDC:
```
linkerd install-cni --set-string "podLabels.azure\.workload\.identity/use"="true" | kubectl apply -f -
```

And install linkerd with cni enabled, and an injected instance of emojivoto.

The secret token is rotated after an hour, but the old one remains valid for a 24h. Manually rotating the key as detailed in the [docs](https://learn.microsoft.com/en-us/azure/aks/use-oidc-issuer#rotate-the-oidc-key) should invalidate the old key.

After that, bouncing any emojivoto pod will prove unsuccessful with the following event being raised:

```
Warning  FailedCreatePodSandBox  15s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "8121291446642b272cea9ee5f083958a37bab0dd7060c4d9c06bb05fecf911d2": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized
```

## Fix

This change adds a new function `monitor_service_account_token()` that monitors the rollout of the token file; which is a symlink whose target changes as a new token is deployed. When detecting a new token file, this function calls the new `create_kubeconfig()` function.

This change also removes the existing logic around the DELETE event, which is a leftover from previous changes and is now a no-op.

Also, as detailed in linkerd/linkerd2#13407, the ServiceAccount token has been removed from the cni config template because it's not used, simplifying things as we can regenerate the kubeconfig file without having to touch the cni config file.

Finally, the file `linkerd-cni.conf.default` has been removed as is not used.

## Test

Same as with the repro above, but use the cni-plugin image that contains the fix:

```
linkerd install-cni --set-string "podLabels.azure\.workload\.identity/use"="true" --set image.name="ghcr.io/alpeb/cni-plugin" --set image.version="v1.5.3" | kubectl apply -f -
```

After an hour when the token gets rotated you should see the event in the linkerd-cni pod logs.
alpeb added a commit to linkerd/linkerd2-proxy-init that referenced this pull request Nov 29, 2024
Fixes linkerd/linkerd2#12573

## Problem

When deployed, the linkerd-cni pod gets its service account token mounted automatically by k8s:
```yaml
  - name: kube-api-access-729gv
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
```
According to this, the token is set to expire after an hour.
When the linkerd-cni pod starts it deploys the file `ZZZ-linkerd-cni-kubeconfig` in to the **host** file system.
That config contains the token sourced from `/var/run/secrets/kubernetes.io/serviceaccount` (mounted by the pod).
When the token gets rotated after an hour, that token file is updated but `ZZZ-linkerd-cni-kubeconfig` is not updated.
The `linkerd-cni` binary uses that token to connect to the kube-api, so having an outdated token should forbid it from functioning properly, which would manifest as new pods in the data plane not being able to acquire a proper network config.
However, that failure isn't usually observed, except for the cases pointed out in linkerd/linkerd2#12573. The reason is that the token's actual lifetime is one year, due to kube-api's `--service-account-extend-token-expiration` [flag](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/#options) which is usually set as `true` to avoid breaking too many instances not yet adapted to use tokens with short expirations:

> Turns on projected service account expiration extension during token generation, which helps safe transition from legacy token to bound service account token feature. If this flag is enabled, admission injected tokens would be extended up to 1 year to prevent unexpected failure during transition, ignoring value of service-account-max-token-expiration.

## Repro

### AKS

The issue currently affects AKS clusters using OIDC keys. To reproduce, create a new cluster in AKS, making sure "Enable OIDC" and "Workload Identity" is ticked in the UI.

Then install the linkerd-cni plugin, labelling the linkerd-cni DaemonSet so that its ServiceAccount token is provided via OIDC:
```
linkerd install-cni --set-string "podLabels.azure\.workload\.identity/use"="true" | kubectl apply -f -
```

And install linkerd with cni enabled, and an injected instance of emojivoto.

The secret token is rotated after an hour, but the old one remains valid for a 24h. Manually rotating the key as detailed in the [docs](https://learn.microsoft.com/en-us/azure/aks/use-oidc-issuer#rotate-the-oidc-key) should invalidate the old key.

After that, bouncing any emojivoto pod will prove unsuccessful with the following event being raised:

```
Warning  FailedCreatePodSandBox  15s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "8121291446642b272cea9ee5f083958a37bab0dd7060c4d9c06bb05fecf911d2": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized
```

## Fix

This change adds a new function `monitor_service_account_token()` that monitors the rollout of the token file; which is a symlink whose target changes as a new token is deployed. When detecting a new token file, this function calls the new `create_kubeconfig()` function.

This change also removes the existing logic around the DELETE event, which is a leftover from previous changes and is now a no-op.

Also, as detailed in linkerd/linkerd2#13407, the ServiceAccount token has been removed from the cni config template because it's not used, simplifying things as we can regenerate the kubeconfig file without having to touch the cni config file.

Finally, the file `linkerd-cni.conf.default` has been removed as is not used.

## Test

Same as with the repro above, but use the cni-plugin image that contains the fix:

```
linkerd install-cni --set-string "podLabels.azure\.workload\.identity/use"="true" --set image.name="ghcr.io/alpeb/cni-plugin" --set image.version="v1.5.3" | kubectl apply -f -
```

After an hour when the token gets rotated you should see the event in the linkerd-cni pod logs.
@olix0r
Copy link
Member

olix0r commented Dec 2, 2024

Currently blocked by a rustc version issue when doing cargo install cargo-nextest in the policy tests.

#13411

@olix0r olix0r self-assigned this Dec 4, 2024
alpeb added a commit to linkerd/linkerd2-proxy-init that referenced this pull request Dec 6, 2024
Fixes linkerd/linkerd2#12573

## Problem

When deployed, the linkerd-cni pod gets its service account token mounted automatically by k8s:
```yaml
  - name: kube-api-access-729gv
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
```
According to this, the token is set to expire after an hour.
When the linkerd-cni pod starts it deploys the file `ZZZ-linkerd-cni-kubeconfig` in to the **host** file system.
That config contains the token sourced from `/var/run/secrets/kubernetes.io/serviceaccount` (mounted by the pod).
When the token gets rotated after an hour, that token file is updated but `ZZZ-linkerd-cni-kubeconfig` is not updated.
The `linkerd-cni` binary uses that token to connect to the kube-api, so having an outdated token should forbid it from functioning properly, which would manifest as new pods in the data plane not being able to acquire a proper network config.
However, that failure isn't usually observed, except for the cases pointed out in linkerd/linkerd2#12573. The reason is that the token's actual lifetime is one year, due to kube-api's `--service-account-extend-token-expiration` [flag](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/#options) which is usually set as `true` to avoid breaking too many instances not yet adapted to use tokens with short expirations:

> Turns on projected service account expiration extension during token generation, which helps safe transition from legacy token to bound service account token feature. If this flag is enabled, admission injected tokens would be extended up to 1 year to prevent unexpected failure during transition, ignoring value of service-account-max-token-expiration.

## Repro

### AKS

The issue currently affects AKS clusters using OIDC keys. To reproduce, create a new cluster in AKS, making sure "Enable OIDC" and "Workload Identity" is ticked in the UI.

Then install the linkerd-cni plugin, labelling the linkerd-cni DaemonSet so that its ServiceAccount token is provided via OIDC:
```
linkerd install-cni --set-string "podLabels.azure\.workload\.identity/use"="true" | kubectl apply -f -
```

And install linkerd with cni enabled, and an injected instance of emojivoto.

The secret token is rotated after an hour, but the old one remains valid for a 24h. Manually rotating the key as detailed in the [docs](https://learn.microsoft.com/en-us/azure/aks/use-oidc-issuer#rotate-the-oidc-key) should invalidate the old key.

After that, bouncing any emojivoto pod will prove unsuccessful with the following event being raised:

```
Warning  FailedCreatePodSandBox  15s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "8121291446642b272cea9ee5f083958a37bab0dd7060c4d9c06bb05fecf911d2": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized
```

## Fix

This change adds a new function `monitor_service_account_token()` that monitors the rollout of the token file; which is a symlink whose target changes as a new token is deployed. When detecting a new token file, this function calls the new `create_kubeconfig()` function.

This change also removes the existing logic around the DELETE event, which is a leftover from previous changes and is now a no-op.

Also, as detailed in linkerd/linkerd2#13407, the ServiceAccount token has been removed from the cni config template because it's not used, simplifying things as we can regenerate the kubeconfig file without having to touch the cni config file.

Finally, the file `linkerd-cni.conf.default` has been removed as is not used.

## Test

Same as with the repro above, but use the cni-plugin image that contains the fix:

```
linkerd install-cni --set-string "podLabels.azure\.workload\.identity/use"="true" --set image.name="ghcr.io/alpeb/cni-plugin" --set image.version="v1.5.3" | kubectl apply -f -
```

After an hour when the token gets rotated you should see the event in the linkerd-cni pod logs.
alpeb added a commit to linkerd/linkerd2-proxy-init that referenced this pull request Dec 6, 2024
Fixes linkerd/linkerd2#12573

## Problem

When deployed, the linkerd-cni pod gets its service account token mounted automatically by k8s:
```yaml
  - name: kube-api-access-729gv
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
```
According to this, the token is set to expire after an hour.
When the linkerd-cni pod starts it deploys the file `ZZZ-linkerd-cni-kubeconfig` in to the **host** file system.
That config contains the token sourced from `/var/run/secrets/kubernetes.io/serviceaccount` (mounted by the pod).
When the token gets rotated after an hour, that token file is updated but `ZZZ-linkerd-cni-kubeconfig` is not updated.
The `linkerd-cni` binary uses that token to connect to the kube-api, so having an outdated token should forbid it from functioning properly, which would manifest as new pods in the data plane not being able to acquire a proper network config.
However, that failure isn't usually observed, except for the cases pointed out in linkerd/linkerd2#12573. The reason is that the token's actual lifetime is one year, due to kube-api's `--service-account-extend-token-expiration` [flag](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/#options) which is usually set as `true` to avoid breaking too many instances not yet adapted to use tokens with short expirations:

> Turns on projected service account expiration extension during token generation, which helps safe transition from legacy token to bound service account token feature. If this flag is enabled, admission injected tokens would be extended up to 1 year to prevent unexpected failure during transition, ignoring value of service-account-max-token-expiration.

## Repro

The issue currently affects AKS clusters using OIDC keys. To reproduce, create a new cluster in AKS, making sure "Enable OIDC" and "Workload Identity" is ticked in the UI.

Then install the linkerd-cni plugin, labelling the linkerd-cni DaemonSet so that its ServiceAccount token is provided via OIDC:
```
linkerd install-cni --set-string "podLabels.azure\.workload\.identity/use"="true" | kubectl apply -f -
```

And install linkerd with cni enabled, and an injected instance of emojivoto.

The secret token is rotated after an hour, but the old one remains valid for a 24h. Manually rotating the key as detailed in the [docs](https://learn.microsoft.com/en-us/azure/aks/use-oidc-issuer#rotate-the-oidc-key) should invalidate the old key.

After that, bouncing any emojivoto pod will prove unsuccessful with the following event being raised:

```
Warning  FailedCreatePodSandBox  15s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "8121291446642b272cea9ee5f083958a37bab0dd7060c4d9c06bb05fecf911d2": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized
```

## Fix

This change adds a new function `monitor_service_account_token()` that monitors the rollout of the token file; which is a symlink whose target changes as a new token is deployed. When detecting a new token file, this function calls the new `create_kubeconfig()` function.

Also, as detailed in linkerd/linkerd2#13407, the ServiceAccount token has been removed from the cni config template because it's not used, simplifying things as we can regenerate the kubeconfig file without having to touch the cni config file.

## Test

Same as with the repro above, but use the cni-plugin image that contains the fix:

```
linkerd install-cni --set-string "podLabels.azure\.workload\.identity/use"="true" --set image.name="ghcr.io/alpeb/cni-plugin" --set image.version="v1.5.3" | kubectl apply -f -
```

After an hour when the token gets rotated you should see the event in the linkerd-cni pod logs.
@alpeb alpeb merged commit 81eb2c8 into main Dec 10, 2024
36 checks passed
@alpeb alpeb deleted the alpeb/simplify-cni-config branch December 10, 2024 19:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants