- Troubleshooting Windows
- Troubleshooting Security Group for Pods
- Troubleshooting Prefix Delegation for Windows
- Verify Windows prefix delegation is enabled in the ConfigMap
- Check both pod events and node events for any specific error
- Verify Node has the required Resource Capacity
- Verify Pod has the required resource limits
- Verify Pod has the required IPv4 Address Annotation
- Verify the configuration options set for windows prefix delegation
- Look for networking issues on the Windows Host
- List of Common Issues
Follow the steps in this guide sequentially to debug issues with Windows Nodes and Pods.
To get the Platform Version of your EKS cluster
aws eks describe-cluster --name cluster-name --region us-west-2 | jq .cluster.platformVersion
Your Platform Version should be equal to or greater than Platform Version specified here.
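The comparison above can be scripted. The following is a minimal sketch, assuming the platform version is reported in the form `eks.N` with an integer `N`; the function name `platform_version_at_least` and the `eks.3` value in the usage comment are hypothetical placeholders, not the actual minimum version.

```shell
# Succeeds when the platform version in $1 (e.g. "eks.7") is at least
# the platform version in $2. Assumes both use the "eks.N" form.
platform_version_at_least() {
  current="${1#eks.}"    # strip the "eks." prefix, leaving the number
  required="${2#eks.}"
  [ "$current" -ge "$required" ]
}

# Usage (jq -r prints the value without surrounding quotes):
#   current="$(aws eks describe-cluster --name cluster-name --region us-west-2 | jq -r .cluster.platformVersion)"
#   platform_version_at_least "$current" eks.3 && echo "platform version OK"
```

Stripping the prefix and comparing numerically avoids the string-comparison trap where `eks.10` would sort before `eks.9`.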
Resolution
If your Platform Version is lower, you can
- Create a new EKS Cluster or
- Update to the new K8s Version if possible or
- Enable legacy controller support on your EKS Cluster using this guide.
To get the ConfigMap and the data field
kubectl get configmaps -n kube-system amazon-vpc-cni -o custom-columns=":data"
You should have the ConfigMap with the following data,
enable-windows-ipam:true
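To check for the field in a script, a small helper along these lines may be useful. It is a sketch that assumes the command prints the data as a single string of `key:value` pairs, as shown above; the function name `configmap_flag_enabled` is hypothetical.

```shell
# Reads the ConfigMap data string on stdin and succeeds when "$1:true"
# appears as a whole key, not as part of a longer key name.
configmap_flag_enabled() {
  grep -Eq "(^|[^A-Za-z0-9-])$1:true([^A-Za-z0-9-]|\$)"
}

# Usage:
#   kubectl get configmaps -n kube-system amazon-vpc-cni -o custom-columns=":data" \
#     | configmap_flag_enabled enable-windows-ipam && echo "Windows IPAM enabled"
```

The same helper can be reused for other boolean keys in the ConfigMap by passing a different key name.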
Resolution
If the ConfigMap is missing or doesn't have the above field, you can
- Create or Update ConfigMap with the required fields by following this guide.
Describe the Windows Node,
kubectl describe node node-name
You should see a non-zero capacity for resource vpc.amazonaws.com/PrivateIPv4Address
Capacity:
vpc.amazonaws.com/PrivateIPv4Address: 9
Allocatable:
vpc.amazonaws.com/PrivateIPv4Address: 9
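The capacity check can also be automated. This is a sketch that assumes the `kubectl describe node` output lists the resource as `name: value` lines like the ones above; the function name `node_has_resource` is hypothetical.

```shell
# Reads `kubectl describe node` output on stdin. Succeeds only if the
# resource named in $1 appears at least once and every listed value
# (Capacity and Allocatable) is greater than zero.
node_has_resource() {
  awk -v res="$1:" '
    $1 == res { seen = 1; if ($2 + 0 <= 0) zero = 1 }
    END { if (seen && !zero) exit 0; exit 1 }
  '
}

# Usage:
#   kubectl describe node node-name \
#     | node_has_resource vpc.amazonaws.com/PrivateIPv4Address \
#     || echo "resource capacity missing or zero"
```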
Resolution
If the node doesn't have the resource capacity validate the following,
- Windows Node has the label kubernetes.io/os: windows or beta.kubernetes.io/os: windows.
- There are Sufficient ENI/IP.
- Sufficient permissions in the Cluster Role.
Describe the Windows Pod,
kubectl describe pod windows-pod
You should see 1 limit and request for the resource vpc.amazonaws.com/PrivateIPv4Address
Limits:
vpc.amazonaws.com/PrivateIPv4Address: 1
Requests:
vpc.amazonaws.com/PrivateIPv4Address: 1
Resolution
If limit/request is missing,
- Validate Pod has nodeSelector.
nodeSelector:
  kubernetes.io/os: windows
- Validate Mutating Webhook Configuration is not accidentally deleted.
kubectl get mutatingwebhookconfigurations.admissionregistration.k8s.io vpc-resource-mutating-webhook
NAME                            WEBHOOKS   AGE
vpc-resource-mutating-webhook   1          59d
Describe the Windows Pod,
kubectl describe pod windows-pod
The Pod should have an annotation similar to the following.
Annotations: vpc.amazonaws.com/PrivateIPv4Address: 192.168.25.15/19
Resolution
If the Annotation is missing,
- Check the Pod Events for errors emitted by the vpc-resource-controller
- There are no PSP Blocking the annotation.
- There are Sufficient ENI/IP.
- Sufficient permissions in the Cluster Role.
Resolution
If the Pod is still stuck in ContainerCreating, you can:
- Fetch more detailed logs on the Host using the EKS Log collector script
- Check the CNI Logs from the collected logs.
- Open an Issue in this repository if the logs do not point to a cause.
Follow the steps in this guide sequentially to debug issues with Security Group for Pods.
Describe the aws-node daemonset
kubectl get ds -n kube-system aws-node -o yaml
The following environment variable must be set.
containers:
  - name: aws-node
    env:
      - name: ENABLE_POD_ENI
        value: "true"
If you are using ConfigMaps that are referenced from the VPC CNI containers' env, you need to have the same key/value pair set in the referenced ConfigMap.
Resolution
If the environment variable is not set,
- Follow the guide to enable SGP feature.
Get the EKS managed CRD CNINode
kubectl get cninode <NODE_NAME>
The CNINode's FEATURE column should have
[{"name":"SecurityGroupsForPods"}]
Alternatively, for further confirmation, describe the Node
kubectl describe node <NODE_NAME>
The following resource will be added to the node's Capacity and Allocatable if the Trunk ENI is created successfully
vpc.amazonaws.com/pod-eni: 9 (could be other values depending on your instance type)
Your node should also receive an event like the following:
Normal NodeTrunkInitiated 5m12s vpc-resource-controller The node has trunk interface initialized successfully
Resolution
If the label is missing or set to false, check the following:
- Instance type supports ENI Trunking. Only Nitro instances support this feature. See the list of supported instance types.
On nodes created before feature was enabled,
- Check if there's capacity to create one more ENI.
aws ec2 describe-network-interfaces --filters Name=attachment.instance-id,Values=instance-id
On nodes created after feature was enabled,
- There are Sufficient ENI/IP.
- Sufficient permissions in the Cluster Role.
Describe the SGP Pod
kubectl describe pod sgp-pod
You should see 1 limit and request for the resource vpc.amazonaws.com/pod-eni
Limits:
vpc.amazonaws.com/pod-eni: 1
Requests:
vpc.amazonaws.com/pod-eni: 1
Resolution
If limit/request is missing,
- Validate you have Security Group Policy that matches labels/service account with the Pod.
- Validate the RBAC Role and RoleBindings are not accidentally deleted.
kubectl get rolebindings.rbac.authorization.k8s.io -n kube-system eks-vpc-resource-controller-rolebinding
kubectl get roles.rbac.authorization.k8s.io -n kube-system eks-vpc-resource-controller-role
NAME                                      ROLE                                    AGE
eks-vpc-resource-controller-rolebinding   Role/eks-vpc-resource-controller-role   59d
NAME                               CREATED AT
eks-vpc-resource-controller-role   2021-11-08T07:40:41Z
- Validate Mutating Webhook Configuration is not accidentally deleted.
kubectl get mutatingwebhookconfigurations.admissionregistration.k8s.io vpc-resource-mutating-webhook
NAME                            WEBHOOKS   AGE
vpc-resource-mutating-webhook   1          59d
Describe the SGP Pod,
kubectl describe pod sgp-pod
The Pod should have the following annotation.
Annotations: vpc.amazonaws.com/pod-eni: [Branch ENI Details]
Resolution
If the Annotation is missing,
- Check the Pod Events for errors emitted by the vpc-resource-controller
- There are no PSP Blocking the annotation.
- There are Sufficient ENI/IP.
- Sufficient permissions in the Cluster Role.
Resolution
If the Pod is still stuck in ContainerCreating, you can:
- Fetch more detailed logs on the Host using the EKS Log collector script
- Check the CNI Logs from the collected logs.
- Open an Issue in this repository if the problem still persists.
If you observe connection failures such as intermittent DNS timeouts on pods using security groups, you might need to update the branch ENI cooldown period or the kernel ARP cache timeout so that the two values are equal. Otherwise, a new pod could reuse the IP address of a recently terminated pod before the kernel's ARP cache is updated, which causes DNS failures or general packet drops.
The branch ENI cooldown period is the period of time to wait before deleting the branch ENI for propagation of iptables rules for the deleted pod. This can be set on the amazon-vpc-cni configmap. See more details here.
To update the kernel ARP cache timeout, set the following parameters for each existing interface on the node. If the branch ENI cooldown period is 30s, set:
sudo sysctl -w net.ipv4.neigh.eth0.gc_stale_time=30
sudo sysctl -w net.ipv4.neigh.eth0.base_reachable_time_ms=15000
Also set the default so all new interfaces created are configured with these values:
sudo sysctl -w net.ipv4.neigh.default.gc_stale_time=30
sudo sysctl -w net.ipv4.neigh.default.base_reachable_time_ms=15000
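Since the parameters must be set on every existing interface, it can help to generate the commands rather than type them. The sketch below is an illustration only: the function name `arp_sysctl_commands` is hypothetical, and it assumes `base_reachable_time_ms` should be half the cooldown period in milliseconds, which is generalized from the 30s and 15000ms example above and should be confirmed before use with other values.

```shell
# Prints the sysctl commands for a cooldown period in seconds ($1) and a
# list of interface names (remaining arguments), followed by the defaults.
# Assumption: base_reachable_time_ms = cooldown * 1000 / 2, generalized
# from the 30s -> 15000ms example.
arp_sysctl_commands() {
  cooldown="$1"
  shift
  reachable_ms=$((cooldown * 1000 / 2))
  for iface in "$@" default; do
    echo "sysctl -w net.ipv4.neigh.$iface.gc_stale_time=$cooldown"
    echo "sysctl -w net.ipv4.neigh.$iface.base_reachable_time_ms=$reachable_ms"
  done
}

# Usage: review the printed commands, then run them with sudo.
#   arp_sysctl_commands 30 eth0 eth1
```

The function only prints commands so that they can be inspected before anything on the node is changed.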
If pods are not Running because IP addresses are unavailable, but only a few pods are running and you expect IP addresses to be available, tune the branch ENI cooldown period accordingly.
The branch ENI cooldown period is the period of time to wait before deleting the branch ENI for propagation of iptables rules for the deleted pod. The default value is 60s, so IP addresses are not released for at least 60s. This can be configured via the amazon-vpc-cni configmap as described here. Note that the minimum cooldown period is 30s.
Be sure to also update the kernel ARP cache timeouts if you notice DNS issues as outlined in the above section.
Follow the troubleshooting steps here for issues with Windows Nodes and Pods when using prefix delegation mode.
Work through the following steps sequentially to find any issues with the workflow.
To get the ConfigMap and the data field
kubectl get configmaps -n kube-system amazon-vpc-cni -o custom-columns=":data"
You should have the ConfigMap with the following data in the string,
enable-windows-ipam:true enable-windows-prefix-delegation:true
Resolution
If the ConfigMap is missing or doesn't have the above fields, you can create or update the amazon-vpc-cni ConfigMap with the required fields:
enable-windows-ipam: "true"
enable-windows-prefix-delegation: "true"
Note: Windows IPAM needs to be enabled in order to use windows prefix delegation feature.
If the controller encounters an error during its prefix delegation workflow that the customer needs to act on, it emits the error as pod events and/or node events. Checking these events is therefore a good starting point for finding the root cause.
You can obtain the pod events using the following command.
kubectl get events --all-namespaces
If there is an explicit error, investigate it first.
For example, if the error states that there is insufficient space in the subnet to carve out a /28 prefix, check that the subnet has free /28 ranges that can be allocated as prefixes.
Same as Verify Node has the Resource Capacity
Same as Verify Pod has the resource limits
Same as Verify Pod has the IPv4 Address Annotation
Configuration options can be used to fine-tune the behaviour of prefix delegation on Windows. The details about the options are available here.
To get the ConfigMap and the data field
kubectl get configmaps -n kube-system amazon-vpc-cni -o custom-columns=":data"
If you see any of the following keys in the data:
minimum-ip-target
warm-ip-target
warm-prefix-target
Then the configuration options have been set.
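To list which of these options are present, a filter like the following can be applied to the ConfigMap output. It is a sketch that assumes the data is printed as a single string of space-separated `key:value` pairs; the function name `prefix_delegation_options` is hypothetical, and it relies on `grep -o`, which is available in GNU and BSD grep.

```shell
# Reads the ConfigMap data string on stdin and prints each prefix
# delegation option that is set, one "key:value" pair per line.
# Prints nothing (and returns non-zero) when none of the keys are set.
prefix_delegation_options() {
  grep -Eo '(minimum-ip-target|warm-ip-target|warm-prefix-target):[^] ]*'
}

# Usage:
#   kubectl get configmaps -n kube-system amazon-vpc-cni -o custom-columns=":data" \
#     | prefix_delegation_options
```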
Resolution
Verify if the configuration is correct as mentioned in the documentation.
Alternatively, to isolate the issue, try removing the above keys from the config map.
Same as Look for Issues on the Windows Host
If you have a PSP that blocks annotations on Pods, you must allow annotations from the user eks:vpc-resource-controller
subjects:
  - kind: Group
    apiGroup: rbac.authorization.k8s.io
    name: system:authenticated
  - kind: User
    name: eks:vpc-resource-controller
    apiGroup: rbac.authorization.k8s.io
  - kind: ServiceAccount
    name: eks-vpc-resource-controller
To get cluster role for your EKS Cluster
aws eks describe-cluster --name cluster-name --region us-west-2 | jq .cluster.roleArn
To find the policies attached to the cluster role
aws iam list-attached-role-policies --role-name role-name-from-above
The policy arn:aws:iam::aws:policy/AmazonEKSVPCResourceController must be present for the Windows/SGP feature to work. If it's missing, attach the policy to the cluster role.
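The two lookups can be chained in a script. The sketch below assumes the AWS CLI's default JSON output; the function names are hypothetical, as are the account ID and role name in the usage comment. Note that `--role-name` expects the role name, which is the last path segment of the role ARN, not the full ARN.

```shell
# Prints the role name for a role ARN, i.e. everything after the last "/".
role_name_from_arn() {
  printf '%s\n' "${1##*/}"
}

# Reads `aws iam list-attached-role-policies` JSON output on stdin and
# succeeds when the AmazonEKSVPCResourceController policy is attached.
# The trailing quote keeps longer policy names from matching.
has_vpc_resource_controller_policy() {
  grep -q 'arn:aws:iam::aws:policy/AmazonEKSVPCResourceController"'
}

# Usage (hypothetical ARN):
#   role="$(role_name_from_arn arn:aws:iam::123456789012:role/eksClusterRole)"
#   aws iam list-attached-role-policies --role-name "$role" \
#     | has_vpc_resource_controller_policy || echo "policy missing"
```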
Creating a new ENI or assigning a secondary IPv4 address can fail if the Subnet does not have enough free IPv4 addresses.
To find the number of available IPv4 addresses
aws ec2 describe-subnets --subnet-ids subnet-id-here
In the response, the field AvailableIpAddressCount shows how many IPv4 addresses are available in the Subnet.
You should check if the feature is enabled via ConfigMap. To get the ConfigMap and the data field
kubectl get configmaps -n kube-system amazon-vpc-cni -o custom-columns=":data"
If the ConfigMap has the following data in the string,
enable-windows-prefix-delegation:true
then the feature is enabled.
Resolution
You can disable the feature by editing your config map and setting enable-windows-prefix-delegation to "false".