
Pod requests to in-cluster Service fail with kube-proxy when the Pod's ARP entry is garbage-collected on the node #3370

Closed
Jexf opened this issue Feb 28, 2022 · 6 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


Jexf commented Feb 28, 2022

Describe the bug

Pod requests to an in-cluster Service fail with kube-proxy when the Pod's ARP entries are garbage-collected on the node.

  1. When Pod A accesses Service B for the first time (assuming the Pod has just started and its gateway ARP entry is empty), it first requests the gateway's MAC address. When the request reaches antrea-gw0 on the node, the node caches Pod A's ARP entry and sends an ARP reply; after Pod A receives the reply, it caches the gateway's ARP entry.

  2. Pod A then has no cross-node or Service traffic for a certain period of time (for example, 5 minutes).

  3. ARP aging and reclamation on the node use the default, untuned parameters: gc_stale_time is 60s, base_reachable_time_ms is 30000ms, and gc_thresh1 is 128, so normal ARP aging takes place. After about 2 minutes, the ARP entry for Pod A on the node has entered the stale state, and because the total number of ARP entries on the node exceeds 128 (gc_thresh1), ARP GC is triggered and Pod A's entry is reclaimed.

  4. Although the gateway ARP entry inside the Pod has also entered the stale state, the Pod holds very few ARP entries, so the count never reaches gc_thresh1 and ARP GC is not triggered; the gateway ARP entry in Pod A therefore persists.

  5. After 5 minutes, when Pod A accesses Service B again, the reply packets from the backend endpoint arrive at Pod A's node. The node's neighbor table (ip neigh) has no entry for Pod A, so the node sends an ARP request to Pod A. Because the request's source IP is the Service's ClusterIP while its source MAC is the gateway MAC, the request is dropped by the SpoofGuard table, and the connection fails.

The detailed packet capture log (Pod A IP: 10.224.1.170, Service B ClusterIP: 10.10.0.10):

16:36:26.193507 62:82:15:bd:0d:84 > fa:5c:47:de:06:02, ethertype IPv4 (0x0800), length 66: 10.224.1.170.47898 > 10.10.0.10.domain: Flags [S], seq 2589761518, win 28200, options [mss 1410,nop,nop,sackOK,nop,wscale 7], length 0
16:36:26.193549 fa:5c:47:de:06:02 > f2:39:bf:f0:38:c8, ethertype IPv4 (0x0800), length 66: 10.224.1.170.47898 > 10.224.1.90.domain: Flags [S], seq 2589761518, win 28200, options [mss 1410,nop,nop,sackOK,nop,wscale 7], length 0
16:36:26.193838 f2:39:bf:f0:38:c8 > Broadcast, ethertype ARP (0x0806), length 42: Request who-has 10.224.1.170 tell 10.224.1.90, length 28
16:36:26.194216 f2:39:bf:f0:38:c8 > fa:5c:47:de:06:02, ethertype IPv4 (0x0800), length 66: 10.224.1.90.domain > 10.224.1.170.47898: Flags [S.], seq 3058005556, ack 2589761519, win 28200, options [mss 1410,nop,nop,sackOK,nop,wscale 7], length 0
16:36:26.194238 fa:5c:47:de:06:02 > Broadcast, ethertype ARP (0x0806), length 42: Request who-has 10.224.1.170 tell 10.10.0.10, length 28
16:36:27.196254 fa:5c:47:de:06:02 > Broadcast, ethertype ARP (0x0806), length 42: Request who-has 10.224.1.170 tell 10.10.0.10, length 28

To Reproduce
1. Use Antrea with kube-proxy
2. Create Pod A and Service B
3. Pod A requests Service B
4. The Pod's ARP entry is garbage-collected on the node
5. The gateway ARP entry in Pod A still exists
6. Pod A requests Service B again

Versions:
Antrea 1.5.0

@Jexf Jexf added the kind/bug Categorizes issue or PR as related to a bug. label Feb 28, 2022

Jexf commented Feb 28, 2022

@tnqn @antoninbas @jianjuns Can I take it?

@Jexf Jexf changed the title Pod requests inner-cluster svc failed with kube-proxy when the pod arp entry will be gc on the node Pod requests inner-cluster svc failed with kube-proxy when the pod arp entry GC on the node Feb 28, 2022

tnqn commented Feb 28, 2022

@Jexf Thanks for the report, and feel free to take the issue. I assume you are using kube-proxy IPVS mode? Otherwise the kernel would have used antrea-gw0's IP as the source IP of the ARP request; it keeps the original source IP that triggered the ARP request only when that IP is owned by the host and arp_announce is 0.
What's your plan for fixing it? A potential solution is to raise the ARP announce restriction level (arp_announce) of antrea-gw0 to 1, but I am not sure whether this requires extra privileges for the agent Pod.
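The effect of arp_announce on source-IP selection can be sketched as follows. This is a simplified illustrative model of the Linux ip-sysctl semantics (it ignores the subnet check that level 1 also performs), not kernel code; the IPs are the ones from this issue's capture:

```go
package main

import "fmt"

// arpRequestSource models which source IP the kernel announces in an ARP
// request, depending on the arp_announce restriction level.
// pktSrcIP: source IP of the packet that triggered the ARP request.
// ifaceIP: an IP configured on the outgoing interface (e.g. antrea-gw0).
// onOutgoingIface: whether pktSrcIP is configured on that interface.
func arpRequestSource(arpAnnounce int, pktSrcIP, ifaceIP string, onOutgoingIface bool) string {
	switch arpAnnounce {
	case 0:
		// Level 0: any local address may be used, including the
		// triggering packet's source IP; here that is the Service
		// ClusterIP, which SpoofGuard later drops.
		return pktSrcIP
	default:
		// Level 1 prefers the triggering source IP only when it is
		// configured on the outgoing interface; otherwise it falls
		// back to level 2, which always picks the interface's own IP.
		if arpAnnounce == 1 && onOutgoingIface {
			return pktSrcIP
		}
		return ifaceIP
	}
}

func main() {
	// arp_announce=0: the ClusterIP 10.10.0.10 leaks into the ARP request.
	fmt.Println(arpRequestSource(0, "10.10.0.10", "10.224.1.1", false))
	// arp_announce=1: antrea-gw0's own IP is announced instead.
	fmt.Println(arpRequestSource(1, "10.10.0.10", "10.224.1.1", false))
}
```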


Jexf commented Feb 28, 2022

@tnqn Thanks for the reply. Yes, we are using kube-proxy IPVS mode. Setting arp_announce to 1 for the antrea-gw0 device sounds good to me, but I'm not sure whether the same problem exists on Windows.

@antoninbas
Contributor

Would running kube-proxy with --ipvs-strict-arp solve the issue, since kube-proxy would take care of setting arp_announce to 2?
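For reference, in IPVS mode with --ipvs-strict-arp set, kube-proxy applies arp_ignore=1 and arp_announce=2 via sysctl. A small sketch making the effect explicit (the map keys are the standard /proc/sys paths, shown here purely for illustration):

```go
package main

import "fmt"

// strictARPSysctls returns the sysctl values kube-proxy sets when
// --ipvs-strict-arp is enabled in IPVS mode.
func strictARPSysctls() map[string]int {
	return map[string]int{
		// Reply only to ARP requests for IPs configured on the
		// receiving interface.
		"net/ipv4/conf/all/arp_ignore": 1,
		// Always announce the best local source IP for ARP requests
		// (avoids announcing, e.g., a Service ClusterIP).
		"net/ipv4/conf/all/arp_announce": 2,
	}
}

func main() {
	for k, v := range strictARPSysctls() {
		fmt.Printf("%s = %d\n", k, v)
	}
}
```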


tnqn commented Mar 1, 2022

@antoninbas That sounds better; it seems designed for exactly this scenario. Then the only thing we need to do is document this requirement for working with kube-proxy IPVS mode. Does that work for you @Jexf?
Another thing to note is that Egress needs arp_ignore to be 0 in order to respond to ARP requests received on external interfaces. Since --ipvs-strict-arp changes it to 1, we need to ensure the userspace ARP responder is activated:

// Start the ARP responder only when the dummy device is not created. The kernel will handle ARP requests
// for IPs assigned to the dummy devices by default.
// TODO: Check the arp_ignore sysctl parameter of the transport interface to determine whether to start
// the ARP responder or not.
if a.dummyDevice == nil && a.arpResponder != nil {
	go a.arpResponder.Run(ch)
}
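The decision the TODO hints at could look like the sketch below. This is a hypothetical helper, not Antrea code: with --ipvs-strict-arp, arp_ignore is 1, so the kernel no longer answers ARP requests for IPs on the dummy device when the request arrives on another (external) interface, and the userspace responder must take over.

```go
package main

import "fmt"

// needARPResponder reports whether the userspace ARP responder must run,
// given whether the kernel dummy device exists and the arp_ignore value
// of the transport interface.
func needARPResponder(haveDummyDevice bool, arpIgnore int) bool {
	if !haveDummyDevice {
		// No dummy device: nothing in the kernel answers for the
		// Egress IPs, so the userspace responder is always required.
		return true
	}
	// Dummy device exists: the kernel answers ARP requests arriving on
	// other interfaces only when arp_ignore is 0.
	return arpIgnore != 0
}

func main() {
	fmt.Println(needARPResponder(true, 0))  // kernel answers: responder not needed
	fmt.Println(needARPResponder(true, 1))  // strict ARP: responder needed
	fmt.Println(needARPResponder(false, 0)) // no dummy device: responder needed
}
```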


Jexf commented Mar 2, 2022

@tnqn @antoninbas Thanks for the reply, --ipvs-strict-arp works well for me.
