-
Notifications
You must be signed in to change notification settings - Fork 748
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vpc cni is getting failed to assign IPs and throwing invalid spec errors with release v1.13.2 #2654
Comments
@ramandeepsharma are you setting the If so, how are you deploying the VPC CNI with this change? When set, this feature also requires the ClusterRole to be updated, as the README mentions. If you are deploying with helm or the EKS managed addon, this should be taken care of: https://github.com/aws/amazon-vpc-cni-k8s/blob/v1.13.2/charts/aws-vpc-cni/templates/clusterrole.yaml#L17 Can you check the output of |
Hello, I'm working on this with Ramandeep. Here are the details you asked for while he's out: We have the
And here's the aws-node clusterrole permissions:
I believe this is a race condition because this error comes up intermittently, usually during periods of high pod churn. Every one or two days we get about a dozen pods that get stuck in this state all at once across multiple different nodes, with the above logs repeating constantly. New pods that spin up later don't get into this state and start up correctly. This log specifically is interesting, from https://github.com/aws/amazon-vpc-cni-k8s/blob/v1.13.2/pkg/ipamd/ipamd.go#L2103:
We traced the code but we couldn't see how this could be changing any fields other than the annotation. Do you have any ideas? |
@gesarki that is very strange, as that error log is implying a permissions issue that shouldn't occur given that you have The only other thing that comes to mind is that we migrated from a raw client with an older |
Other than that, I think we would need an EKS support ticket to dig further into this, as we'll need to look at the API server audit logs |
Thanks for your help @jdn5126, after checking the API server audit logs we found that it was due to a MutatingAdmissionWebhook (that we installed via kyverno) modifying the patches that the VPC CNI sent. We can actually see the illegal patch in the error logs, they were just hard to notice because the formatting was lost in the VPC CNI logs. The API server audit logs have the correct formatting:
Posting in case someone else runs into this error 🙂. |
@gesarki nice debugging, thank you for updating! Hopefully something like this is hit by future searches 🤞🏻 |
|
What happened:
We noticed that pods are getting failed intermittently in our infrastructure and logs are revealing that it's related to vpc cni as it's getting failed to assign the IPs.
Event logs of such pods.
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "d456c3f7056f7574708efdcb161e676949989c33411c13eba09bb78502a9a7c3": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
Currently we have to delete such pods manually so that it can re-schedule on the node.
We have diagnosed this issue at our end and our findings are mentioned below:
We checked that sufficient amount of IPs are available.
ipamd logs are mentioned below, these errors are being repeated during the pod failure incident:
{ "level": "debug", "ts": "2023-11-01T18:43:09.746Z", "caller": "rpc/rpc.pb.go:713", "msg": "AddNetworkRequest: K8S_POD_NAME:\"command-b7aeaa61-6361-4011-b9b7-f0226622a2e3-28314385-sd4xt\" K8S_POD_NAMESPACE:\"sas\" K8S_POD_INFRA_CONTAINER_ID:\"71833840eab0352825f05c91669c1f1684bd61f0d28346efb6ec519e9dbb28a6\" ContainerID:\"71833840eab0352825f05c91669c1f1684bd61f0d28346efb6ec519e9dbb28a6\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/var/run/netns/cni-82499489-f3c4-abcb-c5a4-fb570bbf300b\"" } { "level": "error", "ts": "2023-11-01T18:43:10.391Z", "caller": "retry/util.go:51", "msg": "Failed to annotate vpc.amazonaws.com/pod-ips the pod with 10.17.76.32, error Pod \"command-b7aeaa61-6361-4011-b9b7-f0226622a2e3-28314385-sd4xt\" is invalid: spec: Forbidden: pod updates may not change fields other than
spec.containers[].image,
spec.initContainers[].image,
spec.activeDeadlineSeconds,
spec.tolerations(only additions to existing tolerations) or
spec.terminationGracePeriodSeconds(allow it to be set to 1 if it was previously negative)\n core.PodSpec{\n \t... // 15 identical fields\n \tSubdomain: \"\",\n \tSetHostnameAsFQDN: nil,\n- \tAffinity: nil,\n+ \tAffinity: &core.Affinity{\n+ \t\tNodeAffinity: &core.NodeAffinity{\n+ \t\t\tRequiredDuringSchedulingIgnoredDuringExecution: &core.NodeSelector{NodeSelectorTerms: []core.NodeSelectorTerm{...}},\n+ \t\t},\n+ \t},\n \tSchedulerName: \"default-scheduler\",\n \tTolerations: {{Key: \"kaas.acquia.io/pool\", Operator: \"Equal\", Value: \"jobs\"}, {Key: \"node.kubernetes.io/not-ready\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}, {Key: \"node.kubernetes.io/unreachable\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}},\n \t... // 11 identical fields\n }\n" } { "level": "error", "ts": "2023-11-01T18:43:10.391Z", "caller": "rpc/rpc.pb.go:713", "msg": "Failed to add the pod annotation: Pod \"command-b7aeaa61-6361-4011-b9b7-f0226622a2e3-28314385-sd4xt\" is invalid: spec: Forbidden: pod updates may not change fields other than
spec.containers[].image,
spec.initContainers[].image,
spec.activeDeadlineSeconds,
spec.tolerations(only additions to existing tolerations) or
spec.terminationGracePeriodSeconds(allow it to be set to 1 if it was previously negative)\n core.PodSpec{\n \t... // 15 identical fields\n \tSubdomain: \"\",\n \tSetHostnameAsFQDN: nil,\n- \tAffinity: nil,\n+ \tAffinity: &core.Affinity{\n+ \t\tNodeAffinity: &core.NodeAffinity{\n+ \t\t\tRequiredDuringSchedulingIgnoredDuringExecution: &core.NodeSelector{NodeSelectorTerms: []core.NodeSelectorTerm{...}},\n+ \t\t},\n+ \t},\n \tSchedulerName: \"default-scheduler\",\n \tTolerations: {{Key: \"kaas.acquia.io/pool\", Operator: \"Equal\", Value: \"jobs\"}, {Key: \"node.kubernetes.io/not-ready\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}, {Key: \"node.kubernetes.io/unreachable\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}},\n \t... // 11 identical fields\n }\n" } { "level": "info", "ts": "2023-11-01T18:43:10.391Z", "caller": "rpc/rpc.pb.go:713", "msg": "Send AddNetworkReply: IPv4Addr 10.17.76.32, IPv6Addr: , DeviceNumber: 0, err: Pod \"command-b7aeaa61-6361-4011-b9b7-f0226622a2e3-28314385-sd4xt\" is invalid: spec: Forbidden: pod updates may not change fields other than
spec.containers[].image,
spec.initContainers[].image,
spec.activeDeadlineSeconds,
spec.tolerations(only additions to existing tolerations) or
spec.terminationGracePeriodSeconds(allow it to be set to 1 if it was previously negative)\n core.PodSpec{\n \t... // 15 identical fields\n \tSubdomain: \"\",\n \tSetHostnameAsFQDN: nil,\n- \tAffinity: nil,\n+ \tAffinity: &core.Affinity{\n+ \t\tNodeAffinity: &core.NodeAffinity{\n+ \t\t\tRequiredDuringSchedulingIgnoredDuringExecution: &core.NodeSelector{NodeSelectorTerms: []core.NodeSelectorTerm{...}},\n+ \t\t},\n+ \t},\n \tSchedulerName: \"default-scheduler\",\n \tTolerations: {{Key: \"kaas.acquia.io/pool\", Operator: \"Equal\", Value: \"jobs\"}, {Key: \"node.kubernetes.io/not-ready\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}, {Key: \"node.kubernetes.io/unreachable\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}},\n \t... // 11 identical fields\n }\n" } { "level": "debug", "ts": "2023-11-01T18:43:10.428Z", "caller": "rpc/rpc.pb.go:731", "msg": "DelNetworkRequest: K8S_POD_NAME:\"command-b7aeaa61-6361-4011-b9b7-f0226622a2e3-28314385-sd4xt\" K8S_POD_NAMESPACE:\"sas\" K8S_POD_INFRA_CONTAINER_ID:\"71833840eab0352825f05c91669c1f1684bd61f0d28346efb6ec519e9dbb28a6\" Reason:\"PodDeleted\" ContainerID:\"71833840eab0352825f05c91669c1f1684bd61f0d28346efb6ec519e9dbb28a6\" IfName:\"eth0\" NetworkName:\"aws-cni\"" } { "level": "error", "ts": "2023-11-01T18:43:11.142Z", "caller": "retry/util.go:51", "msg": "Failed to annotate vpc.amazonaws.com/pod-ips the pod with , error Pod \"command-b7aeaa61-6361-4011-b9b7-f0226622a2e3-28314385-sd4xt\" is invalid: spec: Forbidden: pod updates may not change fields other than
spec.containers[].image,
spec.initContainers[].image,
spec.activeDeadlineSeconds,
spec.tolerations(only additions to existing tolerations) or
spec.terminationGracePeriodSeconds(allow it to be set to 1 if it was previously negative)\n core.PodSpec{\n \t... // 15 identical fields\n \tSubdomain: \"\",\n \tSetHostnameAsFQDN: nil,\n- \tAffinity: nil,\n+ \tAffinity: &core.Affinity{\n+ \t\tNodeAffinity: &core.NodeAffinity{\n+ \t\t\tRequiredDuringSchedulingIgnoredDuringExecution: &core.NodeSelector{NodeSelectorTerms: []core.NodeSelectorTerm{...}},\n+ \t\t},\n+ \t},\n \tSchedulerName: \"default-scheduler\",\n \tTolerations: {{Key: \"kaas.acquia.io/pool\", Operator: \"Equal\", Value: \"jobs\"}, {Key: \"node.kubernetes.io/not-ready\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}, {Key: \"node.kubernetes.io/unreachable\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}},\n \t... // 11 identical fields\n }\n" } { "level": "error", "ts": "2023-11-01T18:43:11.142Z", "caller": "rpc/rpc.pb.go:731", "msg": "Failed to delete the pod annotation: Pod \"command-b7aeaa61-6361-4011-b9b7-f0226622a2e3-28314385-sd4xt\" is invalid: spec: Forbidden: pod updates may not change fields other than
spec.containers[].image,
spec.initContainers[].image,
spec.activeDeadlineSeconds,
spec.tolerations(only additions to existing tolerations) or
spec.terminationGracePeriodSeconds(allow it to be set to 1 if it was previously negative)\n core.PodSpec{\n \t... // 15 identical fields\n \tSubdomain: \"\",\n \tSetHostnameAsFQDN: nil,\n- \tAffinity: nil,\n+ \tAffinity: &core.Affinity{\n+ \t\tNodeAffinity: &core.NodeAffinity{\n+ \t\t\tRequiredDuringSchedulingIgnoredDuringExecution: &core.NodeSelector{NodeSelectorTerms: []core.NodeSelectorTerm{...}},\n+ \t\t},\n+ \t},\n \tSchedulerName: \"default-scheduler\",\n \tTolerations: {{Key: \"kaas.acquia.io/pool\", Operator: \"Equal\", Value: \"jobs\"}, {Key: \"node.kubernetes.io/not-ready\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}, {Key: \"node.kubernetes.io/unreachable\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}},\n \t... // 11 identical fields\n }\n" } { "level": "info", "ts": "2023-11-01T18:43:11.142Z", "caller": "rpc/rpc.pb.go:731", "msg": "Send DelNetworkReply: IPv4Addr 10.17.76.32, DeviceNumber: 0, err: Pod \"command-b7aeaa61-6361-4011-b9b7-f0226622a2e3-28314385-sd4xt\" is invalid: spec: Forbidden: pod updates may not change fields other than
spec.containers[].image,
spec.initContainers[].image,
spec.activeDeadlineSeconds,
spec.tolerations(only additions to existing tolerations) or
spec.terminationGracePeriodSeconds(allow it to be set to 1 if it was previously negative)\n core.PodSpec{\n \t... // 15 identical fields\n \tSubdomain: \"\",\n \tSetHostnameAsFQDN: nil,\n- \tAffinity: nil,\n+ \tAffinity: &core.Affinity{\n+ \t\tNodeAffinity: &core.NodeAffinity{\n+ \t\t\tRequiredDuringSchedulingIgnoredDuringExecution: &core.NodeSelector{NodeSelectorTerms: []core.NodeSelectorTerm{...}},\n+ \t\t},\n+ \t},\n \tSchedulerName: \"default-scheduler\",\n \tTolerations: {{Key: \"kaas.acquia.io/pool\", Operator: \"Equal\", Value: \"jobs\"}, {Key: \"node.kubernetes.io/not-ready\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}, {Key: \"node.kubernetes.io/unreachable\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}},\n \t... // 11 identical fields\n }\n" }
vpc cni logs are mentioned below, these errors are being repeated during the pod failure incident:
{ "level": "info", "ts": "2023-11-01T18:44:10.737Z", "caller": "routed-eni-cni-plugin/cni.go:126", "msg": "Received CNI add request: ContainerID(ae26c3b9a86fbfc79df9e7c2c82caf2daf3252f4f68ca6f3355f421256339952) Netns(/var/run/netns/cni-201a3b5a-9493-48ce-7204-9f3ddde85429) IfName(eth0) Args(K8S_POD_INFRA_CONTAINER_ID=ae26c3b9a86fbfc79df9e7c2c82caf2daf3252f4f68ca6f3355f421256339952;K8S_POD_UID=a296181d-cb58-45ef-be09-5ee8ee5b1070;IgnoreUnknown=1;K8S_POD_NAMESPACE=sas;K8S_POD_NAME=command-b7aeaa61-6361-4011-b9b7-f0226622a2e3-28314385-sd4xt) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.4.0\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"podSGEnforcingMode\":\"strict\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})" } { "level": "info", "ts": "2023-11-01T18:44:11.425Z", "caller": "routed-eni-cni-plugin/cni.go:284", "msg": "Received CNI del request: ContainerID(ae26c3b9a86fbfc79df9e7c2c82caf2daf3252f4f68ca6f3355f421256339952) Netns(/var/run/netns/cni-201a3b5a-9493-48ce-7204-9f3ddde85429) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=sas;K8S_POD_NAME=command-b7aeaa61-6361-4011-b9b7-f0226622a2e3-28314385-sd4xt;K8S_POD_INFRA_CONTAINER_ID=ae26c3b9a86fbfc79df9e7c2c82caf2daf3252f4f68ca6f3355f421256339952;K8S_POD_UID=a296181d-cb58-45ef-be09-5ee8ee5b1070) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.4.0\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"podSGEnforcingMode\":\"strict\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})" } { "level": "error", "ts": "2023-11-01T18:44:12.318Z", "caller": "routed-eni-cni-plugin/cni.go:284", "msg": "Error received from DelNetwork gRPC call for container ae26c3b9a86fbfc79df9e7c2c82caf2daf3252f4f68ca6f3355f421256339952: rpc error: code = Unknown desc = Pod \"command-b7aeaa61-6361-4011-b9b7-f0226622a2e3-28314385-sd4xt\" is invalid: spec: Forbidden: pod updates may not change fields other than
spec.containers[].image,
spec.initContainers[].image,
spec.activeDeadlineSeconds,
spec.tolerations(only additions to existing tolerations) or
spec.terminationGracePeriodSeconds(allow it to be set to 1 if it was previously negative)\n core.PodSpec{\n \t... // 15 identical fields\n \tSubdomain: \"\",\n \tSetHostnameAsFQDN: nil,\n- \tAffinity: nil,\n+ \tAffinity: &core.Affinity{\n+ \t\tNodeAffinity: &core.NodeAffinity{\n+ \t\t\tRequiredDuringSchedulingIgnoredDuringExecution: &core.NodeSelector{NodeSelectorTerms: []core.NodeSelectorTerm{...}},\n+ \t\t},\n+ \t},\n \tSchedulerName: \"default-scheduler\",\n \tTolerations: {{Key: \"kaas.acquia.io/pool\", Operator: \"Equal\", Value: \"jobs\"}, {Key: \"node.kubernetes.io/not-ready\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}, {Key: \"node.kubernetes.io/unreachable\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}},\n \t... // 11 identical fields\n }\n" } { "level": "info", "ts": "2023-11-01T18:44:12.318Z", "caller": "routed-eni-cni-plugin/cni.go:284", "msg": "Could not teardown pod using prevResult: ContainerID(ae26c3b9a86fbfc79df9e7c2c82caf2daf3252f4f68ca6f3355f421256339952) Netns(/var/run/netns/cni-201a3b5a-9493-48ce-7204-9f3ddde85429) IfName(eth0) PodNamespace(sas) PodName(command-b7aeaa61-6361-4011-b9b7-f0226622a2e3-28314385-sd4xt)" }
It seems to be going into a loop while sending a API request and failing over with invalid spec. hence the requests are not getting completed.
-->
What you expected to happen:
Pods should get the IPs.
How to reproduce it (as minimally and precisely as possible):
N/A
Anything else we need to know?:
It looks like this is related to recent change: #2328 which got introduced in https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.13.0 and we are using 'v1.13.2' release.
Environment:
kubectl version
): v1.24.17-ekscat /etc/os-release
): Amazon Linuxuname -a
): 5.10.192-183.736.amzn2.x86_64The text was updated successfully, but these errors were encountered: