Internal kubernetes cluster communication issues using AWS CNI 1.3.0 and m4/m5/c4/c5/r5/i3 instances #318
Comments
This sounds a lot like #263, is Calico being used for network policies? One thing to try is to disable the source/destination check to prove whether the issue is related to packets exiting on a different adapter than the one they came in on.
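For reference, disabling that check can be scripted. A minimal boto3 sketch; the region, the placeholder instance ID, and the choice to touch every attached ENI are assumptions here, not something prescribed in this thread:

```python
# Minimal sketch: disable the source/destination check on every ENI attached to an
# instance. Requires credentials allowed to call ec2:DescribeNetworkInterfaces and
# ec2:ModifyNetworkInterfaceAttribute. Region and instance ID are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

def disable_src_dst_check(instance_id: str) -> None:
    enis = ec2.describe_network_interfaces(
        Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}]
    )["NetworkInterfaces"]
    for eni in enis:
        ec2.modify_network_interface_attribute(
            NetworkInterfaceId=eni["NetworkInterfaceId"],
            SourceDestCheck={"Value": False},
        )
        print(f"source/dest check disabled on {eni['NetworkInterfaceId']}")

disable_src_dst_check("i-0123456789abcdef0")  # placeholder instance ID
```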
AWS support wanted me to add the following information to the ticket:
The issue was uncovered on a cluster that is running aws cni 1.3.0, when we wanted to add a new instance type (c5) to the cluster.
The issue is continuous, and it is reproducible each time on new instances.
I have now tested with version 1.2.1 (downgraded the cluster to it), and the issue also exists in 1.2.1.
The issue also exists in 1.3.2, though to test it I had to create my own images; I don't see any publicly available images for 1.3.2. Out of curiosity, I have also tried the current master branch of the aws cni plugin, and there it seems to work. That is, m4/m5/r5/c4/c5/i3 instances don't have the communication problem and seem to work like the r4 instances.
@nickdgriffin No, we don't use Calico. As for "disable the source/destination check", do you mean https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html#change_source_dest_check, or do you refer to the srcdst app (which we have removed)? I can give it a try to disable the check on the ENIs of the c5 instances. Why would this be different and not matter on the working r4 instances (the check is enabled there)? As for further tests, I have compiled and tested with the current master branch. A quick test showed that the previously broken instances are working with master. I need to verify this. If this is the case, it would be nice to understand which commit changed this and why it is needed for the cX, mX and r5 instances, as the r4 instances are working with previous versions of the plugin. Thanks.
I do mean that, yes. The issue I am referring to comes about when a pod is allocated an IP on a secondary adapter, so you can check if that is the case across your various tests - it might be that your r4 test only had pods being allocated IPs from the primary interface, which is why it worked. If you still have problems with pods that are allocated IPs from the primary adapter then it cannot be the issue I am referring to.
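One way to confirm which ENI a pod's IP was allocated from is to query the EC2 instance metadata service from the node. A minimal sketch, assuming IMDSv1 is reachable at 169.254.169.254 (IMDSv2-only nodes would need a token); the pod IP at the bottom is a placeholder:

```python
# Sketch, run on the node itself: list every attached ENI and its IPs via instance
# metadata, then report whether a given pod IP sits on the primary interface
# (device-number 0) or on a secondary ENI. Assumes IMDSv1 is available.
import urllib.request

MD = "http://169.254.169.254/latest/meta-data/network/interfaces/macs/"

def fetch(path: str) -> str:
    with urllib.request.urlopen(MD + path, timeout=2) as resp:
        return resp.read().decode()

def eni_for_ip(pod_ip: str) -> None:
    for mac in fetch("").split():
        mac = mac.rstrip("/")
        device = fetch(f"{mac}/device-number").strip()
        ips = fetch(f"{mac}/local-ipv4s").split()
        if pod_ip in ips:
            kind = "primary" if device == "0" else "secondary"
            print(f"{pod_ip} is on the {kind} ENI (mac {mac}, device {device})")
            return
    print(f"{pod_ip} not found on any ENI of this instance")

eni_for_ip("10.0.42.17")  # placeholder pod IP
```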
For the tests, we saturated the instance with as many pods as the total number of IPs the instance supported, to make sure that we had pods on all ENIs.
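As a rough sanity check of that saturation approach: with the AWS CNI, the pod IP capacity of an instance type is the number of ENIs times the per-ENI IPv4 count minus one, since each ENI keeps its first address for itself. A small sketch; the per-type limits shown are examples and should be verified against the EC2 documentation:

```python
# Rough sketch of how many pod IPs an instance type can hold with the AWS CNI:
# pods get ENIs * (IPv4 addresses per ENI - 1). The numbers below are examples
# taken from the EC2 ENI limits table and should be double-checked.
ENI_LIMITS = {
    # instance type: (max ENIs, IPv4 addresses per ENI)
    "c5.large": (3, 10),
    "m5.2xlarge": (4, 15),
    "c5.4xlarge": (8, 30),
}

def pod_ip_capacity(instance_type: str) -> int:
    enis, ips_per_eni = ENI_LIMITS[instance_type]
    return enis * (ips_per_eni - 1)

for t, (enis, _) in ENI_LIMITS.items():
    print(f"{t}: {pod_ip_capacity(t)} pod IPs across {enis} ENIs")
```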
I have now rerun the tests we did to verify internal cluster communication of the pods assigned to the non-primary ENIs on r5/c4/c5/m4/m5/i3, this time with the aws cni built from master (commit 4a51ed4). No errors now. All the pods on the non-primary ENIs can talk to kubernetes.default (as well as resolve it). This clearly did not work with versions 1.2.1, 1.3.0 and 1.3.2. I wonder if commits 6be0029 and/or 96a86f5 are fixing the issue we have seen? I also wonder if we are the only ones seeing this? When can we expect a new release of the aws cni?
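The per-pod check described above can be reduced to a small script run inside each pod: resolve kubernetes.default via kube-dns and open a TCP connection to the API server's service port. A minimal sketch using only the standard library; the service name and port are the cluster defaults:

```python
# Minimal in-pod check: resolve the kubernetes.default service (exercises kube-dns)
# and open a TCP connection to its service port. Run inside a pod on each ENI.
import socket

def check_cluster_api(host: str = "kubernetes.default.svc.cluster.local", port: int = 443) -> None:
    try:
        ip = socket.gethostbyname(host)          # DNS resolution via kube-dns
        print(f"resolved {host} -> {ip}")
        with socket.create_connection((ip, port), timeout=3):
            print(f"TCP connect to {ip}:{port} OK")
    except OSError as err:
        print(f"FAILED: {err}")

check_cluster_api()
```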
Any news on this?
Hi, sorry for the late reply. We will investigate this issue, and if the current master works, that's great. I've created a 1.4 Milestone and we will start working on that soon.
@mogren We saw there is a 1.4.0-rc1 candidate out there. We have tested it and it seems that it solves this issue. Wondering why this issue is not in the milestone for 1.4 and/or why it was not addressed with any update here?
Hey @recollir, thanks a lot for testing this. I didn't want to include this issue since I was not sure the changes solved your problems. I'll try to get this release out as soon as possible.
v1.4.0 has been released! Has this issue been resolved with v1.4.0? @recollir
Currently testing it.
I have now tested the 1.4.0 version with c5, r4, r5, m5 and i3 instances - for all those: large, 2xlarge and 4xlarge. It seems that the issue is now resolved. We still don't know why though. It would be nice to get an explanation of this. @Jeffwan
Hi @recollir, The PR you commented on a while ago, #346, is the most probable cause for this that I can think of. #305 might affect the Debian images, but should not cause issues for the AL ones. Aside from that, since you're not using Calico, not that many changes were made to the CNI in regard to setting up routes on secondary ENIs before 4a51ed4. To figure out what happened to those ENIs I'd have to take another look at the logs for the v1.3.0 that failed.
Looking at the iptables files you provided, there are a lot of rules set up by some kubernetes firewall script, and a lot of old rules left in the cluster from a few days earlier. Are you sure nothing has changed with that script? Also, did you use any other tool like
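For comparing a failing c5 node with a working r4 node, the relevant state is the policy-routing rules, the per-ENI route tables the CNI sets up, and the iptables rules. A rough collection sketch; the route table numbers queried below are an assumption (typically one table per secondary ENI) and may differ per node, and the script needs root:

```python
# Rough node-side dump of routing and iptables state for side-by-side comparison.
# Table numbers 2 and 3 are an assumption; adjust to the tables present on the node.
import subprocess

COMMANDS = [
    ["ip", "rule", "show"],
    ["ip", "route", "show", "table", "main"],
    ["ip", "route", "show", "table", "2"],
    ["ip", "route", "show", "table", "3"],
    ["iptables-save"],
]

for cmd in COMMANDS:
    print(f"===== {' '.join(cmd)} =====")
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout or result.stderr)
```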
Internal kubernetes cluster communication issues using AWS CNI 1.3.0 and m4/m5/c4/c5/r5/i3 instances
Kubernetes cluster
What are the issues
r5/m4/m5/c4/c5/i3 instances:
When pods get an IP associated with the secondary, tertiary or quaternary ENI of an instance, cluster internal communication is not working, e.g. pods can not communicate with kube-dns, the Kubernetes service in the default namespace, or any other cluster-ip service. Pods on the primary ENI have no problems talking to those internal services. All pods on all ENIs can talk to the "internet". This happens both with the kops default jessie/stretch AMIs as well as with the latest Amazon Linux 2 AMI. (See the pod-listing sketch after this section for splitting pods by ENI.)
r4 instances:
All pods on all ENIs can talk to the cluster internal services.
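A hedged sketch of how the pods on a node can be enumerated (using the official kubernetes Python client) so their IPs can be split into primary-ENI and secondary-ENI groups before running the connectivity tests; the node name below is a placeholder:

```python
# Sketch: list the pods scheduled on one node together with their IPs, so they can be
# matched against the node's ENI addresses (see the metadata sketch earlier in the thread).
from kubernetes import client, config

def pod_ips_on_node(node_name: str) -> None:
    config.load_kube_config()  # or load_incluster_config() when run inside the cluster
    v1 = client.CoreV1Api()
    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node_name}"
    ).items
    for pod in pods:
        print(f"{pod.metadata.namespace}/{pod.metadata.name}: {pod.status.pod_ip}")

pod_ips_on_node("ip-10-0-1-23.eu-west-1.compute.internal")  # placeholder node name
```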
How we reproduce the issue:
Result
The result of the above experiment is a CSV file.