
vpc.amazonaws.com/pod-ips annotation unset by aws-node for statefulset pods #2281

Closed
sahilbadla opened this issue Feb 23, 2023 · 9 comments

@sahilbadla

What happened:

I am running a Druid StatefulSet with 2 replicas that are always spawned with the same names, historicals-0 and historicals-1. While testing, I have seen network calls to another service in the same namespace fail during graceful shutdown via a preStop hook. Example errors:

Resolving druid-tiny-cluster-coordinators (druid-tiny-cluster-coordinators)... failed: Temporary failure in name resolution.
wget: unable to resolve host address 'druid-tiny-cluster-coordinators'

java.util.concurrent.ExecutionException: java.net.UnknownHostException: druid-exporter.dev-druid-usw2-qal.svc.cluster.local

The graceful shutdown time is being honored, but it felt like the pod network was being removed before the preStop hook had finished. Digging into the aws-node logs, I found that my pods have an empty vpc.amazonaws.com/pod-ips annotation even though ANNOTATE_POD_IP is set to true in the AWS CNI config:

      containers:
        - name: aws-node
          image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.11.2
          env:
            - name: AWS_VPC_ENI_MTU
              value: '8400'
            - name: AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER
              value: 'false'
            - name: AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG
              value: 'true'
            - name: ENI_CONFIG_LABEL_DEF
              value: failure-domain.beta.kubernetes.io/zone
            - name: AWS_VPC_K8S_CNI_LOGLEVEL
              value: DEBUG
            - name: AWS_VPC_K8S_CNI_RANDOMIZESNAT
              value: prng
            - name: AWS_VPC_K8S_CNI_VETHPREFIX
              value: eni
            - name: ANNOTATE_POD_IP
              value: 'true'
            - name: AWS_VPC_K8S_PLUGIN_LOG_LEVEL
              value: DEBUG
            - name: MY_NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: WARM_IP_TARGET
              value: '3'
            - name: MINIMUM_IP_TARGET
              value: '5'
            - name: ENABLE_IPv4
              value: 'true'
            - name: ENABLE_IPv6
              value: 'false'

That led to further investigation into why the annotation is missing on my StatefulSet pods while other ReplicaSet pods have it set. I looked at the Kubernetes audit logs and realized the annotation is set by aws-node (via a patch) when the pod is spawned, but within a couple of seconds it gets unset again by aws-node.

patch call where annotation is set:
[screenshot]

Patch call where annotation is unset:
[screenshot]

Digging deeper into the CNI logs to see why the annotation was unset, I found some delete-network request logs from around the same time:

msg: Received CNI del request: ContainerID(b92a369fa0a0390676d31bdb9e7e79bff66ed823309c4286ccc4465beafadd8e) Netns() IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=dev-druid-usw2-qal;K8S_POD_NAME=druid-tiny-cluster-historicals-1;K8S_POD_INFRA_CONTAINER_ID=b92a369fa0a0390676d31bdb9e7e79bff66ed823309c4286ccc4465beafadd8e;K8S_POD_UID=8b82f8ab-49f5-4a1c-aec6-cd6a581a8ba0) Path(/opt/cni/bin) argsStdinData({"cniVersion":"0.4.0","mtu":"8400","name":"aws-cni","pluginLogFile":"/var/log/aws-routed-eni/plugin.log","pluginLogLevel":"DEBUG","podSGEnforcingMode":"strict","type":"aws-cni","vethPrefix":"eni"})
   ts: 2023-02-22T18:58:46.111Z
caller: rpc/rpc.pb.go:731
   level: debug
   msg: DelNetworkRequest: K8S_POD_NAME:"druid-tiny-cluster-historicals-1" K8S_POD_NAMESPACE:"dev-druid-usw2-qal" K8S_POD_INFRA_CONTAINER_ID:"b92a369fa0a0390676d31bdb9e7e79bff66ed823309c4286ccc4465beafadd8e" Reason:"PodDeleted" ContainerID:"b92a369fa0a0390676d31bdb9e7e79bff66ed823309c4286ccc4465beafadd8e" IfName:"eth0" NetworkName:"aws-cni"
   ts: 2023-02-22T18:58:46.113Z
caller: ipamd/rpc_handler.go:226
   level: debug
   msg: UnassignPodIPAddress: IP address pool stats: total:8, assigned 5, sandbox aws-cni/b92a369fa0a0390676d31bdb9e7e79bff66ed823309c4286ccc4465beafadd8e/eth0
   ts: 2023-02-22T18:58:46.158Z

Now, looking at this rpc_handler code, https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/rpc_handler.go#L276, I understand the logic that removes the annotation, but the question of why a delete-network request was issued at all remains. I then looked closely at the container ID in the CNI logs: it points to b92a369fa0a0390676d31bdb9e7e79bff66ed823309c4286ccc4465beafadd8e, which does not exist anymore. It was the container ID of the previous historicals-1 pod that I had deleted. The new container ID for the historicals-1 pod is 6d0e8a780f636f40438eee7e1371a195f14980428814e1ecffb82b40a1d00d2d (as seen in the audit log below):

[screenshot]

It looks like there is a caching/sync problem with StatefulSet pods that are recreated quickly or with the same name. The CNI keeps referring to the old container ID that was deleted, and the new pod spec never gets annotated with vpc.amazonaws.com/pod-ips set to its IP. That is why the preStop hook fails: with this annotation empty, the CNI removes the pod's network as soon as the pod enters the Terminating state. This is making our pipelines unstable and causing problems in our Druid workloads. Please help.
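
To make the suspected race concrete, here is a minimal, self-contained Go sketch. It is not the actual VPC CNI code: the pod and namespace names are taken from the logs above, the container IDs are truncated, and the IP is made up. It only illustrates how keying the annotation on namespace/name alone lets a late DEL for the old sandbox wipe out the value that the ADD for the new sandbox just wrote.

package main

import "fmt"

type podKey struct{ namespace, name string }

// annotations models the vpc.amazonaws.com/pod-ips annotation per pod,
// keyed only by namespace/name, which is the crux of the problem.
var annotations = map[podKey]string{}

// handleAdd models CNI ADD: annotate the pod with its assigned IP.
func handleAdd(ns, name, sandboxID, ip string) {
	annotations[podKey{ns, name}] = ip
	fmt.Printf("ADD %s/%s (sandbox %s): pod-ips=%q\n", ns, name, sandboxID, ip)
}

// handleDel models CNI DEL as observed here: it clears the annotation based
// on namespace/name alone, ignoring which sandbox the DEL was issued for.
func handleDel(ns, name, sandboxID string) {
	annotations[podKey{ns, name}] = ""
	fmt.Printf("DEL %s/%s (sandbox %s): pod-ips cleared\n", ns, name, sandboxID)
}

func main() {
	ns, pod := "dev-druid-usw2-qal", "druid-tiny-cluster-historicals-1"

	// The recreated historicals-1 pod comes up and is annotated with its IP.
	handleAdd(ns, pod, "6d0e8a78...", "10.195.0.10") // example IP, not from the logs

	// A late DEL for the previous historicals-1 sandbox then arrives and
	// wipes the annotation of the freshly created pod with the same name.
	handleDel(ns, pod, "b92a369f...")

	fmt.Printf("final pod-ips annotation: %q\n", annotations[podKey{ns, pod}])
}

The final annotation ends up empty even though the new sandbox is healthy, which matches what the audit logs show.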

What you expected to happen:

StatefulSet pods should be recreated with the correct pod annotations so they work properly. The preStop hook should have network access until its script succeeds/returns.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): EKS v1.22.16-eks-ffeb93d
  • CNI Version: v1.11.2
  • OS (e.g: cat /etc/os-release):
    NAME="Amazon Linux"
    VERSION="2"
    ID="amzn"
    ID_LIKE="centos rhel fedora"
  • Kernel (e.g. uname -a):
    Linux ip-xx 5.4.228-131.415.amzn2.x86_64 #1 SMP Tue Dec 20 12:51:02 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
@sahilbadla sahilbadla added the bug label Feb 23, 2023
@jdn5126
Contributor

jdn5126 commented Feb 23, 2023

@sahilbadla thank you for the thorough debugging! On CNI ADD, we patch vpc.amazonaws.com/pod-ips with the pod IP, while on CNI DEL, we patch vpc.amazonaws.com/pod-ips with an empty string (to help with termination cases). Both cases call the same function: https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L2029 , and you can see that we do a GET request using pod namespace and name: https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L2021
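
For reference, here is a minimal, standard-library-only Go illustration (not the CNI's actual code path) of what such a merge patch body might look like for the ADD and DEL cases:

package main

import (
	"encoding/json"
	"fmt"
)

// buildPodIPsPatch builds a JSON merge patch body that sets (or, with an
// empty string, blanks out) the vpc.amazonaws.com/pod-ips annotation.
// This mirrors the ADD/DEL behavior described above; it is an illustration
// only, not the CNI's actual implementation.
func buildPodIPsPatch(podIPs string) ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]string{
				"vpc.amazonaws.com/pod-ips": podIPs,
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	add, _ := buildPodIPsPatch("10.0.0.5") // ADD: annotate with the pod IP (example value)
	del, _ := buildPodIPsPatch("")         // DEL: patch with an empty string
	fmt.Println(string(add))
	fmt.Println(string(del))
}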

I was under the impression that kubelet ordered ADDs and DELs for the same pod name such that the DEL would always be delivered first, but your findings contradict that. Essentially, our decision to patch is not granular enough.

In order to fix this, I think we should make sure on DEL that the pod IP we are deallocating matches the annotation we are deleting (or something similar). This is a much simpler and lighter-weight solution than adding any caching.
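
As a rough sketch of the idea (hypothetical helper, not an actual implementation), the DEL path would only clear the annotation when it still holds the exact IP being released:

package main

import "fmt"

// shouldClearPodIPAnnotation is a hypothetical guard sketching the proposal:
// on CNI DEL, only clear vpc.amazonaws.com/pod-ips if the annotation still
// holds the IP being deallocated. If a newer sandbox for the same pod name
// has already re-annotated the pod with a different IP, the stale DEL leaves
// the annotation alone.
func shouldClearPodIPAnnotation(currentAnnotation, deallocatedIP string) bool {
	return currentAnnotation != "" && currentAnnotation == deallocatedIP
}

func main() {
	// Stale DEL for an old sandbox: the annotation already points at the new IP.
	fmt.Println(shouldClearPodIPAnnotation("10.0.0.7", "10.0.0.5")) // false -> keep annotation

	// Normal DEL: the annotation still matches the IP being released.
	fmt.Println(shouldClearPodIPAnnotation("10.0.0.5", "10.0.0.5")) // true -> clear annotation
}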

I will mark this as a good first issue and add it to our internal backlog. If you are interested and willing to implement and test the fix, please let me know! Otherwise, no worries, we will find someone.

@sahilbadla
Author

@jdn5126 Looks like there's more to it. Even if I start fresh, I see this problem on my StatefulSet pods. The vpc.amazonaws.com/pod-ips annotation gets set only to be removed a few seconds later. Earlier I thought it only happened when the pods shuffle, but that's not the case anymore. Here are the logs from this new pod, historicals-0:

[screenshots of aws-node logs showing the annotation being set and then unset]

As you can see in the logs, the annotation is set at 2:29:29:01 and unset at 2:31:49:655. This behavior is pretty consistent; I tried multiple times. I do see the log message when the annotation gets set, "Annotates pod dev-druid-usw2-qal with vpc.amazonaws.com/pod-ips: 10.195.15.31", but nothing when it gets unset.

I feel it has something to do with these log statements from data_store.go:

caller: datastore/data_store.go:713
msg: Unable to get IP address from CIDR: no free IP available in the prefix - 10.195.15.132/ffffffff

Please help us answer these two questions: 1) why does the annotation get unset and never reconciled, and 2) where does it get unset and how can I find the relevant logs?

@jdn5126
Contributor

jdn5126 commented Feb 28, 2023

Well that is very strange, as the only way that annotation gets set is via the patch API call that I linked to. Is it possible that you are running a custom webhook in your cluster that rejects or reverts this? Otherwise, there has been some refactoring in this area, so would you be willing to try reproducing this with the latest VPC CNI (currently v1.12.5)?

I have not been able to reproduce this, but I still think there is an issue when pod DELs and ADDs for the same pod name are ordered with the DEL last. We are planning to fix that issue, but this new information is quite confusing. If it is not reproducible with v1.12.5, that would imply that the refactoring resolved this; however, that refactoring was done to prevent unnecessary retries, and I do not see any retries here (otherwise we would see the relevant logs).

@sahilbadla
Author

@jdn5126 I am able to reproduce it with v1.12.5 as well.

@jdn5126
Contributor

jdn5126 commented Mar 7, 2023

> @jdn5126 I am able to reproduce it with v1.12.5 as well.

Thank you, we will work on the fix soon and target it for a future release

@sahilbadla
Author

@jdn5126 Thanks for getting the ball rolling on this. Will there be a release anytime soon to pick up this change?

@jdn5126
Contributor

jdn5126 commented Apr 3, 2023

@sahilbadla no release date planned currently, I will have to check on this before making any promises

@jdn5126
Contributor

jdn5126 commented May 12, 2023

Closing as fixed by #2328

@jdn5126 jdn5126 closed this as completed May 12, 2023
@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
