Unresponsive aws-node CNI 1.7.5 #1372
Comments
@nmckeown - Can you please share CNI logs by running the log collector script?
Thanks @jayanthvn for responding. We don't have access to these nodes per our security policy. Do you know if there is any other way to collect those logs, e.g. via CloudWatch, kubectl proxy, etc.?
Hi @nmckeown Sorry for the delayed response. These logs come from IPAMD on the node, so customers have to run the script on the node itself to collect them. If you can open a support ticket and share the cluster ARN with us, we can help retrieve the logs from aws-node. Thanks.
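For reference, running the collector typically looks something like the sketch below. The script path is an assumption based on older CNI/AMI releases and may differ on your nodes; verify before running.

```sh
# On the affected node (path is an assumption; adjust to your CNI/AMI version).
# The AWS VPC CNI ships a log collector that bundles ipamd logs, CNI config,
# iptables state, etc. into a tarball under /var/log.
sudo bash /opt/cni/bin/aws-cni-support.sh

# The resulting archive (e.g. a .tar.gz under /var/log) can then be attached
# to a support case.
```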
Hi @jayanthvn, no worries. We actually found a fix for this, and it ended up being EBS performance. We were exhausting all our I/O burst credits, so write operations were failing. Increasing our root volume to 1 TB resolved this. Thanks for responding.
Thanks for letting me know :) Glad it's fixed.
Looks like we are facing the same problem with CNI 1.7.5 and EKS 1.18. In Prometheus I found that nodes where the exec_sync metric equals 10 seconds have this problem, and pods cannot be started there.
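Not from the original report, but a minimal sketch of how such nodes could be spotted, assuming kubelet runtime-operation metrics are scraped by Prometheus. The Prometheus URL is a placeholder, and the exact grouping label (instance vs. node) depends on your scrape config.

```sh
# Query Prometheus for slow exec_sync CRI operations per node.
# kubelet_runtime_operations_duration_seconds is the standard kubelet
# histogram for CRI operations such as exec_sync.
PROM_URL=http://prometheus.example.com:9090   # placeholder
curl -sG "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.99, sum by (instance, le) (rate(kubelet_runtime_operations_duration_seconds_bucket{operation_type="exec_sync"}[5m])))'
```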
Hi @DmitriyStoyanov yeah, if your burst balance goes to zero, it takes some time to recover, and until then operations will fail. The easiest/cheapest option for us was to increase the volume size to 1 TB so you don't rely on I/O credits. Other options were to move from gp2 to gp3, or to look at provisioned IOPS.
For the moment we just increased the disk size for the different instances; previously we had only 50 GB for all instances, now the volumes are larger.
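For anyone hitting the same wall, here is a sketch of how the burst-credit theory can be checked and how a volume can be moved off gp2. The volume ID is a placeholder; the commands are standard AWS CLI, but verify against your own account and region.

```sh
# Check the BurstBalance CloudWatch metric for a gp2 volume (placeholder ID).
# A balance near 0% matches the symptoms described above. (GNU date syntax.)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --start-time "$(date -u -d '-3 hours' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time   "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Minimum

# Option: migrate the volume from gp2 to gp3 so throughput/IOPS no longer
# depend on burst credits (can be done while the volume is in use).
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --volume-type gp3
```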
What happened:
We have a service that talks out to hundreds of AWS accounts from certain pods. We are losing network connectivity to these pods when registering new accounts, roughly every day or so. We notice that aws-node on these nodes is crashing and is unresponsive when we try to connect to it or retrieve its logs.
Readiness and Liveness probes for aws-node on these nodes are constantly failing:
rpc error: code = DeadlineExceeded desc = context deadline exceeded
A broken pipe error is observed in the node syslogs for aws-node pods:
dockerd: time="2021-01-31T18:30:11.456914780Z" level=error msg="Handler for GET /containers/4bea941465f4455916f49313de3109c3b4e56dca7c033f8ebb1bfb3c536bbad9/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
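Not part of the original report, but for illustration, the probe failures and restarts can usually be inspected with standard kubectl commands. The pod name and the k8s-app=aws-node label selector below are assumptions based on the stock aws-node DaemonSet manifest.

```sh
# List aws-node pods and pick the one on the affected node.
kubectl -n kube-system get pods -l k8s-app=aws-node -o wide

# Show probe-failure events and restart counts (pod name is a placeholder).
kubectl -n kube-system describe pod aws-node-xxxxx

# Logs from the previous (crashed) container instance, if it was restarted.
kubectl -n kube-system logs aws-node-xxxxx --previous
```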
What you expected to happen:
Is there any known issue with the CNI and pods that are very busy on the network? Multiple GB could be transferring to these pods at any given second.
Environment:
CNI: 1.7.5
Kube-Proxy: v1.15.11
CoreDNS: v1.6.6
EKS: 1.15.12-eks-31566f
OS: 5.4.80-40.140.amzn2.x86_64
Docker Engine: 19.3.6