Unresponsive aws-node CNI 1.7.5 #1372
Comments
@nmckeown - Can you please share CNI logs by running the log collector script?
Thanks @jayanthvn for responding. We don't have access to these nodes per our security policy. Do you know if there is any other way to collect those logs, e.g. via CloudWatch, kubectl proxy, etc.?
Hi @nmckeown Sorry for the delayed response. These logs come from IPAMD on the node, so customers have to run the script on the node itself to collect them. If you can open a support ticket and share the cluster ARN with us, we can help retrieve the logs from aws-node. Thanks.
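For reference, running the collector typically looks something like the sketch below. The script path is an assumption based on older CNI/AMI releases and may differ on your nodes; verify before running.

```sh
# On the affected node (path is an assumption; adjust to your CNI/AMI version).
# The AWS VPC CNI ships a log collector that bundles ipamd logs, CNI config,
# iptables state, etc. into a tarball under /var/log.
sudo bash /opt/cni/bin/aws-cni-support.sh

# The resulting archive (e.g. a .tar.gz under /var/log) can then be attached
# to a support case.
```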
Hi @jayanthvn, no worries. We actually found a fix for this, and it ended up being EBS performance. We were exhausting all our I/O burst credits, so write operations were failing. Increasing our root volume to 1 TB resolved this. Thanks for responding.
Thanks for letting me know :) Glad it's fixed.
Looks like we are facing the same problem with CNI 1.7.5 and EKS 1.18. In Prometheus I found that nodes where the exec_sync metric equals 10 seconds have this problem, and pods cannot be started there.
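Not from the original report, but a minimal sketch of how such nodes could be spotted, assuming kubelet runtime-operation metrics are scraped by Prometheus. The Prometheus URL is a placeholder, and the exact grouping label (instance vs. node) depends on your scrape config.

```sh
# Query Prometheus for slow exec_sync CRI operations per node.
# kubelet_runtime_operations_duration_seconds is the standard kubelet
# histogram for CRI operations such as exec_sync.
PROM_URL=http://prometheus.example.com:9090   # placeholder
curl -sG "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.99, sum by (instance, le) (rate(kubelet_runtime_operations_duration_seconds_bucket{operation_type="exec_sync"}[5m])))'
```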
Hi @DmitriyStoyanov yeah, if your burst balance goes to zero, it takes some time to recover, and until then operations will fail. The easiest/cheapest option for us was to increase the volume size to 1 TB so you don't rely on I/O credits. Other options were to move from gp2 to gp3, or to look at provisioned IOPS.
For the moment we just increased the disk size for the different instances; previously we had only 50 GB for all instances, now the volumes are larger.
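For anyone hitting the same wall, here is a sketch of how the burst-credit theory can be checked and how a volume can be moved off gp2. The volume ID is a placeholder; the commands are standard AWS CLI, but verify against your own account and region.

```sh
# Check the BurstBalance CloudWatch metric for a gp2 volume (placeholder ID).
# A balance near 0% matches the symptoms described above. (GNU date syntax.)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --start-time "$(date -u -d '-3 hours' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time   "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Minimum

# Option: migrate the volume from gp2 to gp3 so throughput/IOPS no longer
# depend on burst credits (can be done while the volume is in use).
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --volume-type gp3
```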
What happened:
We have a service that talks out to hundreds of AWS accounts from certain pods. We are losing network connectivity to these pods when registering new accounts, roughly every day or so. We notice that aws-node on these nodes is crashing and is unresponsive when we try to connect to it or retrieve its logs.
Readiness and Liveness probes for aws-node on these nodes are constantly failing:
rpc error: code = DeadlineExceeded desc = context deadline exceeded
A broken pipe error is observed in the node syslogs for aws-node pods:
dockerd: time="2021-01-31T18:30:11.456914780Z" level=error msg="Handler for GET /containers/4bea941465f4455916f49313de3109c3b4e56dca7c033f8ebb1bfb3c536bbad9/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
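Not part of the original report, but for illustration, the probe failures and restarts can usually be inspected with standard kubectl commands. The pod name and the k8s-app=aws-node label selector below are assumptions based on the stock aws-node DaemonSet manifest.

```sh
# List aws-node pods and pick the one on the affected node.
kubectl -n kube-system get pods -l k8s-app=aws-node -o wide

# Show probe-failure events and restart counts (pod name is a placeholder).
kubectl -n kube-system describe pod aws-node-xxxxx

# Logs from the previous (crashed) container instance, if it was restarted.
kubectl -n kube-system logs aws-node-xxxxx --previous
```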
What you expected to happen:
Is there any known issue with the CNI and pods that are very busy on the network? Multiple GB could be transferring to these pods at any given second.
Environment:
CNI: 1.7.5
Kube-Proxy: v1.15.11
CoreDNS: v1.6.6
EKS: 1.15.12-eks-31566f
OS: 5.4.80-40.140.amzn2.x86_64
Docker Engine: 19.3.6