
aws-eks: KubernetesManifest Overwrite option invalid now that ServerSideApply is defaulted #31697

Open
diranged opened this issue Oct 8, 2024 · 6 comments
Labels
@aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service bug This issue is a bug. p3 potential-regression Marking this issue as a potential regression to be checked by team member

Comments


diranged commented Oct 8, 2024

Describe the bug

At some point recently, the kubectl CLI started using server-side apply as the default behavior. The problem is that with kubernetes/kubernetes#44165, kubectl apply -f ... no longer works the way you'd expect: with server-side apply, Kubernetes seems to refuse to update a resource that already exists, and the Lambda function then reports an AlreadyExists error.

Regression Issue

  • Select this option if this issue appears to be a regression.

Last Known Working CDK Version

unknown

Expected Behavior

I would expect kubectl apply -f ... to just work (that behavior is configured by setting overwrite: true on the KubernetesManifest resource), but instead it's failing.

Current Behavior

Here are the logs from the Lambda function trying to run kubectl apply -f on a resource that happens to already exist in the cluster:

[INFO]	2024-10-05T18:12:44.871Z	bc961513-6915-4749-a32f-a787912469b1	Running command: ['kubectl', 'apply', '--kubeconfig', '/tmp/kubeconfig', '-f', '/tmp/manifest.yaml']
[INFO]	2024-10-05T18:12:44.871Z	bc961513-6915-4749-a32f-a787912469b1	manifest written to: /tmp/manifest.yaml
[INFO]	2024-10-05T18:12:42.741Z	bc961513-6915-4749-a32f-a787912469b1	Running command: ['aws', 'eks', 'update-kubeconfig', '--role-arn', 'arn:aws:iam::...:role/...-3bVbyuZfXXf4', '--name', '....', '--kubeconfig', '/tmp/kubeconfig']
[INFO]	2024-10-05T18:12:42.740Z	bc961513-6915-4749-a32f-a787912469b1	{"RequestType": "Create", "ServiceToken": "arn:aws:lambda:us-west-2:...:function:INFRA...-oNQV5X2TDIjj", "ResponseURL": "...", "StackId": "arn:aws:cloudformation:us-west-2:...:stack/...-ContinuousDeploymentNestedStackContinuousDeploymentNes-8QONVDS1QSK3/4fb3cbe0-8345-11ef-afa7-067d0aea149f", "RequestId": "ec9880d5-f9bf-498d-9780-12134c81f17d", "LogicalResourceId": "ArgoCDSystemPostHelmResources073D6BA8", "ResourceType": "Custom::AWSCDK-EKS-KubernetesResource", "ResourceProperties": {"ServiceToken": "arn:aws:lambda:us-west-2:...:function:...-oNQV5X2TDIjj", "Overwrite": "true", "PruneLabel": "aws.cdk.eks/prune-c8c67f619695e93fb41d90faa4dabab90eb2bca3ea", "ClusterName": "...", "Manifest": "[{\"apiVersion\":\"argoproj.io/v1alpha1\",\"kind\":\"AppProject\",\"metadata\":{\"name\":\"default\",\"namespace\":\"argocd-system\",\"labels\":{\"aws.cdk.eks/prune-c8c67f619695e93fb41d90faa4dabab90eb2bca3ea\":\"\"}},\"spec\":{\"clusterResourceWhitelist\":[{\"group\":\"*\",\"kind\":\"*\"}],\"destinations\":[{\"namespace\":\"*\",\"server\":\"https://kubernetes.default.svc\"}],\"sourceRepos\":[\"*\"]}}]", "RoleArn": "arn:aws:iam::...:role/...-3bVbyuZfXXf4"}}
{
  "RequestType": "Create",
  "ServiceToken": "arn:aws:lambda:us-west-2:...:function:...-oNQV5X2TDIjj",
  "ResponseURL": "...",
  "StackId": "arn:aws:cloudformation:us-west-2:...:stack/...-ContinuousDeploymentNestedStackContinuousDeploymentNes-8QONVDS1QSK3/4fb3cbe0-8345-11ef-afa7-067d0aea149f",
  "RequestId": "ec9880d5-f9bf-498d-9780-12134c81f17d",
  "LogicalResourceId": "ArgoCDSystemPostHelmResources073D6BA8",
  "ResourceType": "Custom::AWSCDK-EKS-KubernetesResource",
  "ResourceProperties": {
    "ServiceToken": "arn:aws:lambda:us-west-2:...:function:...-oNQV5X2TDIjj",
    "Overwrite": "true",
    "PruneLabel": "aws.cdk.eks/prune-c8c67f619695e93fb41d90faa4dabab90eb2bca3ea",
    "ClusterName": "...",
    "Manifest": "[{\"apiVersion\":\"argoproj.io/v1alpha1\",\"kind\":\"AppProject\",\"metadata\":{\"name\":\"default\",\"namespace\":\"argocd-system\",\"labels\":{\"aws.cdk.eks/prune-c8c67f619695e93fb41d90faa4dabab90eb2bca3ea\":\"\"}},\"spec\":{\"clusterResourceWhitelist\":[{\"group\":\"*\",\"kind\":\"*\"}],\"destinations\":[{\"namespace\":\"*\",\"server\":\"https://kubernetes.default.svc\"}],\"sourceRepos\":[\"*\"]}}]",
    "RoleArn": "arn:aws:iam::...:role/...-3bVbyuZfXXf4"
  }
}
[ERROR] Exception: b'Error from server (AlreadyExists): error when creating "/tmp/manifest.yaml": appprojects.argoproj.io "default" already exists\n'
Traceback (most recent call last):
  File "/var/task/index.py", line 14, in handler
    return apply_handler(event, context)
  File "/var/task/apply/__init__.py", line 60, in apply_handler
    kubectl('apply', manifest_file, *kubectl_opts)
  File "/var/task/apply/__init__.py", line 91, in kubectl
    raise Exception(output)

If we look at the Kubernetes audit logs, we can see that ArgoCD first creates this default resource, and then the subsequent create call fails with a 409:

Here's the first create call (made by ArgoCD, and not under our control):

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "21c5b6a7-eeee-4676-b7e7-2764d16517f1",
  "stage": "ResponseComplete",
  "requestURI": "/apis/argoproj.io/v1alpha1/namespaces/argocd-system/appprojects",
  "verb": "create",
  "user": {
    "username": "system:serviceaccount:argocd-system:argocd-server",
    "uid": "ec844b9d-8ce1-4325-905a-479df42f0aed",
    "groups": [
      "system:serviceaccounts",
      "system:serviceaccounts:argocd-system",
      "system:authenticated"
    ],
    "extra": {
...
    }
  },
  "sourceIPs": [
    "..."
  ],
  "userAgent": "argocd-server/v0.0.0 (linux/arm64) kubernetes/$Format",
  "objectRef": {
    "resource": "appprojects",
    "namespace": "argocd-system",
    "name": "default",
    "apiGroup": "argoproj.io",
    "apiVersion": "v1alpha1"
  },
  "responseStatus": {
    "metadata": {},
    "code": 201
  },
  "requestReceivedTimestamp": "2024-10-05T18:12:49.590120Z",
  "stageTimestamp": "2024-10-05T18:12:49.594167Z",
  "annotations": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "RBAC: allowed by RoleBinding \"argocd-system-server/argocd-system\" of Role \"argocd-system-server\" to ServiceAccount \"argocd-server/argocd-system\""
  }
}

Then we see a second call, this time via kubectl. Note the fieldManager=kubectl-client-side-apply in the requestURI:

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "64c914a4-e184-4079-a6b6-8c5630e473fd",
  "stage": "ResponseComplete",
  "requestURI": "/apis/argoproj.io/v1alpha1/namespaces/argocd-system/appprojects?fieldManager=kubectl-client-side-apply&fieldValidation=Strict",
  "verb": "create",
  "user": {
    "username": "arn:aws:sts::...:assumed-role/.../EKSGetTokenAuth",
    "uid": "aws-iam-authenticator:...:...",
    "groups": [
      "system:authenticated"
    ],
    "extra": {
...
    }
  },
  "sourceIPs": [
    "..."
  ],
  "userAgent": "kubectl/v1.28.3 (linux/amd64) kubernetes/a8a1abc",
  "objectRef": {
    "resource": "appprojects",
    "namespace": "argocd-system",
    "name": "default",
    "apiGroup": "argoproj.io",
    "apiVersion": "v1alpha1"
  },
  "responseStatus": {
    "metadata": {},
    "status": "Failure",
    "message": "appprojects.argoproj.io \"default\" already exists",
    "reason": "AlreadyExists",
    "details": {
      "name": "default",
      "group": "argoproj.io",
      "kind": "appprojects"
    },
    "code": 409
  },
  "requestReceivedTimestamp": "2024-10-05T18:12:49.605439Z",
  "stageTimestamp": "2024-10-05T18:12:49.614296Z",
  "annotations": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "EKS Access Policy: allowed by ClusterRoleBinding \"arn:aws:iam::...:role/...+arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy\" of ClusterRole \"arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy\" to User \"...\""
  }
}

Reproduction Steps

N/A

Possible Solution

I think that when overwrite: true is set, the --server-side=false flag should also be passed to the kubectl apply command.
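The custom-resource apply handler (the Python code visible in the traceback above) would be the natural place for that change. Here's a minimal sketch of the idea, assuming the handler builds a list of extra kubectl options from the custom-resource properties; the function and property handling below are hypothetical, not the actual CDK source:

```python
def build_kubectl_opts(props: dict) -> list:
    """Hypothetical: translate Custom::AWSCDK-EKS-KubernetesResource
    properties into extra flags for `kubectl apply`."""
    opts = []
    if props.get("Overwrite") == "true":
        # Proposed fix: force client-side apply so an existing object is
        # patched rather than failing with AlreadyExists under
        # server-side apply.
        opts.append("--server-side=false")
    if props.get("PruneLabel"):
        # Prune objects carrying the CDK-managed label, mirroring the
        # existing prune behavior.
        opts += ["--prune", "-l", props["PruneLabel"]]
    return opts
```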

Additional Information/Context

No response

CDK CLI Version

2.161.1

Framework Version

No response

Node.js Version

18

OS

linux

Language

TypeScript

Language Version

No response

Other information

No response

@diranged diranged added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Oct 8, 2024
@github-actions github-actions bot added @aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service potential-regression Marking this issue as a potential regression to be checked by team member labels Oct 8, 2024
ashishdhingra (Contributor) commented

@diranged Good morning. Could you please confirm whether this is a CDK issue and share minimal code to reproduce it? Or was this issue originally intended for the https://github.com/kubernetes/kubectl/ repo?

Thanks,
Ashish

@ashishdhingra ashishdhingra added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. p3 and removed needs-triage This issue or PR still needs to be triaged. labels Oct 8, 2024

diranged commented Oct 8, 2024

Honestly I think this is a CDK issue because the overwrite: true flag is no longer behaving the way the user expects...


pahud commented Oct 8, 2024

Hi

@diranged

Can you share a minimal CDK app that we can test and reproduce this issue in our account?

And please let us know which version was working and which version is broken now with exactly the same code.

Thanks.


diranged commented Oct 8, 2024

@pahud,
I can try, but I don't know when I'll have time to get that done. I will note, though, that rolling back to the V27 Lambda kubectl function resolves the issue.


pahud commented Oct 9, 2024

The error message "appprojects.argoproj.io "default" already exists" indicates that you're trying to create an Argo CD AppProject resource named "default" in the "argocd-system" namespace, but a resource with the same name already exists in your cluster. [1]

This situation typically occurs when:

You've previously created an AppProject named "default" in the same namespace.

The "default" AppProject was automatically created during the Argo CD installation process.

Argo CD typically creates a "default" AppProject during its initial setup, which is why you're encountering this error when trying to apply your manifest.

To resolve this issue:

Update instead of create: If you want to modify the existing "default" AppProject, you can use kubectl apply with the --force flag:

kubectl apply -f your-manifest.yaml --force

It looks like there's already a default AppProject and you are installing another one with the same name? The issue kubernetes/kubernetes#44165 you mentioned is from 2017, so I am not sure this is related to CDK.

I am not an expert on ArgoCD, but I hope this helps.


diranged commented Oct 9, 2024

@pahud,
So this code has been in place and untouched (other than updates to aws-cdk-lib, aws-cdk, and the @aws-cdk/lambda-layer-kubectl-v28 TypeScript libraries) for two years now. It started breaking after some recent update, though it's hard for me to pinpoint which one because we don't run integration tests 100% of the time. Yes, ArgoCD auto-creates the default AppProject object on startup; it's a foot-gun being discussed at argoproj/argo-cd#11058. However, that behavior has been in place for several years.

Given the following code:

const cdk8sPostChart = new cdk8s.Chart(new cdk8s.App(), 'PostManifestBuilder', {
  namespace: this.namespace,
});
new AppProject(cdk8sPostChart, 'DefaultProject', {
  metadata: { name: 'default' },
  spec: props.defaultProjectSpec ?? DEFAULT_PROJECT_SPEC,
});
new KubernetesManifest(this, 'PostHelmResources', {
  cluster: this.cluster,
  overwrite: true, // the argo controller creates a default 'appProject' on startup, we overwrite it
  prune: true,
  manifest: cdk8sPostChart.toJson(),
}).node.addDependency(helmChart);

One would expect that, regardless of whether the default object already exists, it would be overwritten via kubectl apply -f manifest.yaml, but it seems that in some cases that does not happen. I've tried to replicate this in a local kind environment using pure kubectl commands, and for some reason I cannot, which leads me to believe there's actually a race condition happening that is made worse by the server-side apply setup. The first create call is at 2024-10-05T18:12:49.590120Z and the second one comes in a hair later at 2024-10-05T18:12:49.605439Z.

Do you have specific objections to any of the following?
a) exposing --server-side=<bool> as an option on the KubernetesManifest resource
b) using --server-side=false when overwrite == true
c) using --force-conflicts=true when overwrite == true
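For illustration, here is how the three options could map onto the flag construction in the apply handler. This is only a sketch with hypothetical names, not the actual handler code:

```python
def kubectl_apply_flags(overwrite: bool, server_side=None) -> list:
    """Hypothetical flag builder illustrating options (a), (b), and (c)."""
    flags = []
    if server_side is not None:
        # (a) expose --server-side=<bool> directly on KubernetesManifest.
        flags.append("--server-side=%s" % str(server_side).lower())
        if server_side and overwrite:
            # (c) with server-side apply, overwrite maps to taking
            # ownership of conflicting fields.
            flags.append("--force-conflicts")
    elif overwrite:
        # (b) fall back to client-side apply whenever overwrite is set.
        flags.append("--server-side=false")
    return flags
```

Note that kubectl only accepts --force-conflicts together with --server-side, which is why option (c) pairs them here.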

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Oct 11, 2024