Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vpc cni is getting failed to assign IPs and throwing invalid spec errors with release v1.13.2 #2654

Closed
ramandeepsharma opened this issue Nov 7, 2023 · 7 comments
Labels

Comments

@ramandeepsharma
Copy link

What happened:

We noticed that pods are getting failed intermittently in our infrastructure and logs are revealing that it's related to vpc cni as it's getting failed to assign the IPs.

Event logs of such pods.

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "d456c3f7056f7574708efdcb161e676949989c33411c13eba09bb78502a9a7c3": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container

Currently we have to delete such pods manually so that it can re-schedule on the node.

We have diagnosed this issue at our end and our findings are mentioned below:

  • We checked that sufficient amount of IPs are available.

  • ipamd logs are mentioned below, these errors are being repeated during the pod failure incident:
    { "level": "debug", "ts": "2023-11-01T18:43:09.746Z", "caller": "rpc/rpc.pb.go:713", "msg": "AddNetworkRequest: K8S_POD_NAME:\"command-b7aeaa61-6361-4011-b9b7-f0226622a2e3-28314385-sd4xt\" K8S_POD_NAMESPACE:\"sas\" K8S_POD_INFRA_CONTAINER_ID:\"71833840eab0352825f05c91669c1f1684bd61f0d28346efb6ec519e9dbb28a6\" ContainerID:\"71833840eab0352825f05c91669c1f1684bd61f0d28346efb6ec519e9dbb28a6\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/var/run/netns/cni-82499489-f3c4-abcb-c5a4-fb570bbf300b\"" } { "level": "error", "ts": "2023-11-01T18:43:10.391Z", "caller": "retry/util.go:51", "msg": "Failed to annotate vpc.amazonaws.com/pod-ips the pod with 10.17.76.32, error Pod \"command-b7aeaa61-6361-4011-b9b7-f0226622a2e3-28314385-sd4xt\" is invalid: spec: Forbidden: pod updates may not change fields other than spec.containers[].image, spec.initContainers[].image, spec.activeDeadlineSeconds, spec.tolerations(only additions to existing tolerations) orspec.terminationGracePeriodSeconds(allow it to be set to 1 if it was previously negative)\n  core.PodSpec{\n  \t... // 15 identical fields\n  \tSubdomain: \"\",\n  \tSetHostnameAsFQDN: nil,\n- \tAffinity: nil,\n+ \tAffinity: &core.Affinity{\n+ \t\tNodeAffinity: &core.NodeAffinity{\n+ \t\t\tRequiredDuringSchedulingIgnoredDuringExecution: &core.NodeSelector{NodeSelectorTerms: []core.NodeSelectorTerm{...}},\n+ \t\t},\n+ \t},\n  \tSchedulerName: \"default-scheduler\",\n  \tTolerations: {{Key: \"kaas.acquia.io/pool\", Operator: \"Equal\", Value: \"jobs\"}, {Key: \"node.kubernetes.io/not-ready\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}, {Key: \"node.kubernetes.io/unreachable\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}},\n  \t... // 11 identical fields\n  }\n" } { "level": "error", "ts": "2023-11-01T18:43:10.391Z", "caller": "rpc/rpc.pb.go:713", "msg": "Failed to add the pod annotation: Pod \"command-b7aeaa61-6361-4011-b9b7-f0226622a2e3-28314385-sd4xt\" is invalid: spec: Forbidden: pod updates may not change fields other thanspec.containers[].image, spec.initContainers[].image, spec.activeDeadlineSeconds, spec.tolerations(only additions to existing tolerations) orspec.terminationGracePeriodSeconds(allow it to be set to 1 if it was previously negative)\n  core.PodSpec{\n  \t... // 15 identical fields\n  \tSubdomain: \"\",\n  \tSetHostnameAsFQDN: nil,\n- \tAffinity: nil,\n+ \tAffinity: &core.Affinity{\n+ \t\tNodeAffinity: &core.NodeAffinity{\n+ \t\t\tRequiredDuringSchedulingIgnoredDuringExecution: &core.NodeSelector{NodeSelectorTerms: []core.NodeSelectorTerm{...}},\n+ \t\t},\n+ \t},\n  \tSchedulerName: \"default-scheduler\",\n  \tTolerations: {{Key: \"kaas.acquia.io/pool\", Operator: \"Equal\", Value: \"jobs\"}, {Key: \"node.kubernetes.io/not-ready\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}, {Key: \"node.kubernetes.io/unreachable\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}},\n  \t... // 11 identical fields\n  }\n" } { "level": "info", "ts": "2023-11-01T18:43:10.391Z", "caller": "rpc/rpc.pb.go:713", "msg": "Send AddNetworkReply: IPv4Addr 10.17.76.32, IPv6Addr: , DeviceNumber: 0, err: Pod \"command-b7aeaa61-6361-4011-b9b7-f0226622a2e3-28314385-sd4xt\" is invalid: spec: Forbidden: pod updates may not change fields other thanspec.containers[].image, spec.initContainers[].image, spec.activeDeadlineSeconds, spec.tolerations(only additions to existing tolerations) orspec.terminationGracePeriodSeconds(allow it to be set to 1 if it was previously negative)\n  core.PodSpec{\n  \t... // 15 identical fields\n  \tSubdomain: \"\",\n  \tSetHostnameAsFQDN: nil,\n- \tAffinity: nil,\n+ \tAffinity: &core.Affinity{\n+ \t\tNodeAffinity: &core.NodeAffinity{\n+ \t\t\tRequiredDuringSchedulingIgnoredDuringExecution: &core.NodeSelector{NodeSelectorTerms: []core.NodeSelectorTerm{...}},\n+ \t\t},\n+ \t},\n  \tSchedulerName: \"default-scheduler\",\n  \tTolerations: {{Key: \"kaas.acquia.io/pool\", Operator: \"Equal\", Value: \"jobs\"}, {Key: \"node.kubernetes.io/not-ready\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}, {Key: \"node.kubernetes.io/unreachable\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}},\n  \t... // 11 identical fields\n  }\n" } { "level": "debug", "ts": "2023-11-01T18:43:10.428Z", "caller": "rpc/rpc.pb.go:731", "msg": "DelNetworkRequest: K8S_POD_NAME:\"command-b7aeaa61-6361-4011-b9b7-f0226622a2e3-28314385-sd4xt\" K8S_POD_NAMESPACE:\"sas\" K8S_POD_INFRA_CONTAINER_ID:\"71833840eab0352825f05c91669c1f1684bd61f0d28346efb6ec519e9dbb28a6\" Reason:\"PodDeleted\" ContainerID:\"71833840eab0352825f05c91669c1f1684bd61f0d28346efb6ec519e9dbb28a6\" IfName:\"eth0\" NetworkName:\"aws-cni\"" } { "level": "error", "ts": "2023-11-01T18:43:11.142Z", "caller": "retry/util.go:51", "msg": "Failed to annotate vpc.amazonaws.com/pod-ips the pod with , error Pod \"command-b7aeaa61-6361-4011-b9b7-f0226622a2e3-28314385-sd4xt\" is invalid: spec: Forbidden: pod updates may not change fields other thanspec.containers[].image, spec.initContainers[].image, spec.activeDeadlineSeconds, spec.tolerations(only additions to existing tolerations) orspec.terminationGracePeriodSeconds(allow it to be set to 1 if it was previously negative)\n  core.PodSpec{\n  \t... // 15 identical fields\n  \tSubdomain: \"\",\n  \tSetHostnameAsFQDN: nil,\n- \tAffinity: nil,\n+ \tAffinity: &core.Affinity{\n+ \t\tNodeAffinity: &core.NodeAffinity{\n+ \t\t\tRequiredDuringSchedulingIgnoredDuringExecution: &core.NodeSelector{NodeSelectorTerms: []core.NodeSelectorTerm{...}},\n+ \t\t},\n+ \t},\n  \tSchedulerName: \"default-scheduler\",\n  \tTolerations: {{Key: \"kaas.acquia.io/pool\", Operator: \"Equal\", Value: \"jobs\"}, {Key: \"node.kubernetes.io/not-ready\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}, {Key: \"node.kubernetes.io/unreachable\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}},\n  \t... // 11 identical fields\n  }\n" } { "level": "error", "ts": "2023-11-01T18:43:11.142Z", "caller": "rpc/rpc.pb.go:731", "msg": "Failed to delete the pod annotation: Pod \"command-b7aeaa61-6361-4011-b9b7-f0226622a2e3-28314385-sd4xt\" is invalid: spec: Forbidden: pod updates may not change fields other thanspec.containers[].image, spec.initContainers[].image, spec.activeDeadlineSeconds, spec.tolerations(only additions to existing tolerations) orspec.terminationGracePeriodSeconds(allow it to be set to 1 if it was previously negative)\n  core.PodSpec{\n  \t... // 15 identical fields\n  \tSubdomain: \"\",\n  \tSetHostnameAsFQDN: nil,\n- \tAffinity: nil,\n+ \tAffinity: &core.Affinity{\n+ \t\tNodeAffinity: &core.NodeAffinity{\n+ \t\t\tRequiredDuringSchedulingIgnoredDuringExecution: &core.NodeSelector{NodeSelectorTerms: []core.NodeSelectorTerm{...}},\n+ \t\t},\n+ \t},\n  \tSchedulerName: \"default-scheduler\",\n  \tTolerations: {{Key: \"kaas.acquia.io/pool\", Operator: \"Equal\", Value: \"jobs\"}, {Key: \"node.kubernetes.io/not-ready\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}, {Key: \"node.kubernetes.io/unreachable\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}},\n  \t... // 11 identical fields\n  }\n" } { "level": "info", "ts": "2023-11-01T18:43:11.142Z", "caller": "rpc/rpc.pb.go:731", "msg": "Send DelNetworkReply: IPv4Addr 10.17.76.32, DeviceNumber: 0, err: Pod \"command-b7aeaa61-6361-4011-b9b7-f0226622a2e3-28314385-sd4xt\" is invalid: spec: Forbidden: pod updates may not change fields other thanspec.containers[].image, spec.initContainers[].image, spec.activeDeadlineSeconds, spec.tolerations(only additions to existing tolerations) orspec.terminationGracePeriodSeconds (allow it to be set to 1 if it was previously negative)\n  core.PodSpec{\n  \t... // 15 identical fields\n  \tSubdomain: \"\",\n  \tSetHostnameAsFQDN: nil,\n- \tAffinity: nil,\n+ \tAffinity: &core.Affinity{\n+ \t\tNodeAffinity: &core.NodeAffinity{\n+ \t\t\tRequiredDuringSchedulingIgnoredDuringExecution: &core.NodeSelector{NodeSelectorTerms: []core.NodeSelectorTerm{...}},\n+ \t\t},\n+ \t},\n  \tSchedulerName: \"default-scheduler\",\n  \tTolerations: {{Key: \"kaas.acquia.io/pool\", Operator: \"Equal\", Value: \"jobs\"}, {Key: \"node.kubernetes.io/not-ready\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}, {Key: \"node.kubernetes.io/unreachable\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}},\n  \t... // 11 identical fields\n  }\n" }

  • vpc cni logs are mentioned below, these errors are being repeated during the pod failure incident:
    { "level": "info", "ts": "2023-11-01T18:44:10.737Z", "caller": "routed-eni-cni-plugin/cni.go:126", "msg": "Received CNI add request: ContainerID(ae26c3b9a86fbfc79df9e7c2c82caf2daf3252f4f68ca6f3355f421256339952) Netns(/var/run/netns/cni-201a3b5a-9493-48ce-7204-9f3ddde85429) IfName(eth0) Args(K8S_POD_INFRA_CONTAINER_ID=ae26c3b9a86fbfc79df9e7c2c82caf2daf3252f4f68ca6f3355f421256339952;K8S_POD_UID=a296181d-cb58-45ef-be09-5ee8ee5b1070;IgnoreUnknown=1;K8S_POD_NAMESPACE=sas;K8S_POD_NAME=command-b7aeaa61-6361-4011-b9b7-f0226622a2e3-28314385-sd4xt) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.4.0\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"podSGEnforcingMode\":\"strict\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})" } { "level": "info", "ts": "2023-11-01T18:44:11.425Z", "caller": "routed-eni-cni-plugin/cni.go:284", "msg": "Received CNI del request: ContainerID(ae26c3b9a86fbfc79df9e7c2c82caf2daf3252f4f68ca6f3355f421256339952) Netns(/var/run/netns/cni-201a3b5a-9493-48ce-7204-9f3ddde85429) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=sas;K8S_POD_NAME=command-b7aeaa61-6361-4011-b9b7-f0226622a2e3-28314385-sd4xt;K8S_POD_INFRA_CONTAINER_ID=ae26c3b9a86fbfc79df9e7c2c82caf2daf3252f4f68ca6f3355f421256339952;K8S_POD_UID=a296181d-cb58-45ef-be09-5ee8ee5b1070) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.4.0\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"podSGEnforcingMode\":\"strict\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})" } { "level": "error", "ts": "2023-11-01T18:44:12.318Z", "caller": "routed-eni-cni-plugin/cni.go:284", "msg": "Error received from DelNetwork gRPC call for container ae26c3b9a86fbfc79df9e7c2c82caf2daf3252f4f68ca6f3355f421256339952: rpc error: code = Unknown desc = Pod \"command-b7aeaa61-6361-4011-b9b7-f0226622a2e3-28314385-sd4xt\" is invalid: spec: Forbidden: pod updates may not change fields other than spec.containers[].image, spec.initContainers[].image, spec.activeDeadlineSeconds, spec.tolerations(only additions to existing tolerations) orspec.terminationGracePeriodSeconds (allow it to be set to 1 if it was previously negative)\n  core.PodSpec{\n  \t... // 15 identical fields\n  \tSubdomain: \"\",\n  \tSetHostnameAsFQDN: nil,\n- \tAffinity: nil,\n+ \tAffinity: &core.Affinity{\n+ \t\tNodeAffinity: &core.NodeAffinity{\n+ \t\t\tRequiredDuringSchedulingIgnoredDuringExecution: &core.NodeSelector{NodeSelectorTerms: []core.NodeSelectorTerm{...}},\n+ \t\t},\n+ \t},\n  \tSchedulerName: \"default-scheduler\",\n  \tTolerations: {{Key: \"kaas.acquia.io/pool\", Operator: \"Equal\", Value: \"jobs\"}, {Key: \"node.kubernetes.io/not-ready\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}, {Key: \"node.kubernetes.io/unreachable\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}},\n  \t... // 11 identical fields\n  }\n" } { "level": "info", "ts": "2023-11-01T18:44:12.318Z", "caller": "routed-eni-cni-plugin/cni.go:284", "msg": "Could not teardown pod using prevResult: ContainerID(ae26c3b9a86fbfc79df9e7c2c82caf2daf3252f4f68ca6f3355f421256339952) Netns(/var/run/netns/cni-201a3b5a-9493-48ce-7204-9f3ddde85429) IfName(eth0) PodNamespace(sas) PodName(command-b7aeaa61-6361-4011-b9b7-f0226622a2e3-28314385-sd4xt)" }

It seems to be going into a loop while sending a API request and failing over with invalid spec. hence the requests are not getting completed.

-->

What you expected to happen:

Pods should get the IPs.

How to reproduce it (as minimally and precisely as possible):

N/A

Anything else we need to know?:

It looks like this is related to recent change: #2328 which got introduced in https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.13.0 and we are using 'v1.13.2' release.

Environment:

  • Kubernetes version (use kubectl version): v1.24.17-eks
  • CNI Version: v1.13.2
  • OS (e.g: cat /etc/os-release): Amazon Linux
  • Kernel (e.g. uname -a): 5.10.192-183.736.amzn2.x86_64
@jdn5126
Copy link
Contributor

jdn5126 commented Nov 7, 2023

@ramandeepsharma are you setting the ANNOTATE_POD_IP environment variable: https://github.com/aws/amazon-vpc-cni-k8s#annotate_pod_ip-v193?

If so, how are you deploying the VPC CNI with this change? When set, this feature also requires the ClusterRole to be updated, as the README mentions. If you are deploying with helm or the EKS managed addon, this should be taken care of: https://github.com/aws/amazon-vpc-cni-k8s/blob/v1.13.2/charts/aws-vpc-cni/templates/clusterrole.yaml#L17

Can you check the output of k get clusterrole aws-node -n kube-system -o yaml and look at the pods permissions?

@gesarki
Copy link

gesarki commented Nov 7, 2023

Hello, I'm working on this with Ramandeep. Here are the details you asked for while he's out:

We have the ANNOTATE_POD_IP environment variable set:

$ k get ds -n kube-system aws-node -o yaml | yq '.spec.template.spec.containers[0].env' | grep -i annotate -A2
- name: ANNOTATE_POD_IP
  value: "true"
- name: AWS_VPC_CNI_NODE_PORT_SUPPORT

And here's the aws-node clusterrole permissions:

$ k get clusterrole aws-node -n kube-system -o yaml
[...]
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - list
  - watch
  - get
  - patch
[...]

I believe this is a race condition because this error comes up intermittently, usually during periods of high pod churn. Every one or two days we get about a dozen pods that get stuck in this state all at once across multiple different nodes, with the above logs repeating constantly. New pods that spin up later don't get into this state and start up correctly.

This log specifically is interesting, from https://github.com/aws/amazon-vpc-cni-k8s/blob/v1.13.2/pkg/ipamd/ipamd.go#L2103:

Failed to annotate vpc.amazonaws.com/pod-ips the pod with 10.17.76.32, error Pod \"command-b7aeaa61-[...]\" is invalid: spec: Forbidden: pod updates may not change fields other than [...]

We traced the code but we couldn't see how this could be changing any fields other than the annotation. Do you have any ideas?

@jdn5126
Copy link
Contributor

jdn5126 commented Nov 7, 2023

@gesarki that is very strange, as that error log is implying a permissions issue that shouldn't occur given that you have patch permissions set for pods. Are you adding any special tolerations to these pods?

The only other thing that comes to mind is that we migrated from a raw client with an older controller-runtime to a cached client with newer version in v1.15.x, so perhaps the underlying k8s call changed. Is it possible for you to test with the latest VPC CNI version?

@jdn5126
Copy link
Contributor

jdn5126 commented Nov 7, 2023

Other than that, I think we would need an EKS support ticket to dig further into this, as we'll need to look at the API server audit logs

@gesarki
Copy link

gesarki commented Nov 14, 2023

Thanks for your help @jdn5126, after checking the API server audit logs we found that it was due to a MutatingAdmissionWebhook (that we installed via kyverno) modifying the patches that the VPC CNI sent.

We can actually see the illegal patch in the error logs, they were just hard to notice because the formatting was lost in the VPC CNI logs. The API server audit logs have the correct formatting:

Pod "command-b6743717-ad50-4576-ad8c-fc5723e2e898-28322220-kmt8g" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds`, `spec.tolerations` (only additions to existing tolerations) or `spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)
  core.PodSpec{
  	... // 15 identical fields
  	Subdomain:         "",
  	SetHostnameAsFQDN: nil,
- 	Affinity:          nil,
+ 	Affinity: &core.Affinity{
+ 		NodeAffinity: &core.NodeAffinity{
+ 			RequiredDuringSchedulingIgnoredDuringExecution: &core.NodeSelector{NodeSelectorTerms: []core.NodeSelectorTerm{...}},
+ 		},
+ 	},
  	SchedulerName: "default-scheduler",
  	Tolerations:   {{Key: "kaas.acquia.io/pool", Operator: "Equal", Value: "jobs"}, {Key: "node.kubernetes.io/not-ready", Operator: "Exists", Effect: "NoExecute", TolerationSeconds: &300}, {Key: "node.kubernetes.io/unreachable", Operator: "Exists", Effect: "NoExecute", TolerationSeconds: &300}},
  	... // 11 identical fields

Posting in case someone else runs into this error 🙂.

@jdn5126
Copy link
Contributor

jdn5126 commented Nov 14, 2023

@gesarki nice debugging, thank you for updating! Hopefully something like this is hit by future searches 🤞🏻

@jdn5126 jdn5126 closed this as completed Nov 16, 2023
Copy link

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants