
DNS fails on gVisor using netstack on EKS #3301

Closed
moehajj opened this issue Jul 20, 2020 · 48 comments · Fixed by #6803 or #6815
Labels: area: integration (Issue related to third party integrations), area: networking (Issue related to networking), type: bug (Something isn't working)

Comments

@moehajj

moehajj commented Jul 20, 2020

Description

I'm deploying Pods on my EKS cluster using the gVisor runtime; however, outbound network requests fail while inbound requests succeed. The issue is mitigated by using network=host in the runsc config options.

Steps to reproduce

  1. I created a two-node EKS cluster, configured one node to use containerd as its container runtime (CRI), and configured the gVisor runtime with containerd (following this tutorial). I also labeled the node I selected for gVisor with app=gvisor.

EKS cluster nodes (you can see the first node using containerd as its container runtime):

kubectl get nodes -o wide
NAME                                           STATUS   ROLES    AGE    VERSION                INTERNAL-IP      EXTERNAL-IP     OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
ip-192-168-31-136.us-west-2.compute.internal   Ready    <none>   3d1h   v1.16.12-eks-904af05   192.168.31.136   35.161.102.17   Amazon Linux 2   4.14.181-142.260.amzn2.x86_64   containerd://1.3.2
ip-192-168-60-139.us-west-2.compute.internal   Ready    <none>   3d1h   v1.16.12-eks-904af05   192.168.60.139   44.230.198.56   Amazon Linux 2   4.14.181-142.260.amzn2.x86_64   docker://19.3.6

runsc config on gVisor node:

[ec2-user@ip-192-168-31-136 ~]$ ls /etc/containerd/
config.toml  runsc.toml
[ec2-user@ip-192-168-31-136 ~]$ cat /etc/containerd/config.toml 
disabled_plugins = ["restart"]
[plugins.linux]
  shim_debug = true
[plugins.cri.containerd.runtimes.runsc]
  runtime_type = "io.containerd.runsc.v1"
[plugins.cri.containerd.runtimes.runsc.options]
  TypeUrl = "io.containerd.runsc.v1.options"
  ConfigPath = "/etc/containerd/runsc.toml"
[ec2-user@ip-192-168-31-136 ~]$ cat /etc/containerd/runsc.toml 
[runsc_config]
  debug="true"
  strace="true"
  log-packets="true"
  debug-log="/tmp/runsc/%ID%/"
  2. I applied a gVisor runtime class to my cluster:
cat << EOF | tee gvisor-runtime.yaml 
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
EOF

kubectl apply -f gvisor-runtime.yaml
  3. And ran a simple nginx Pod using the gvisor runtime:
cat << EOF | tee nginx-gvisor.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: nginx-gvisor
spec:
  containers:
  - name: my-nginx
    image: nginx
    ports:                    
    - containerPort: 80
  nodeSelector:
    app: gvisor
  runtimeClassName: gvisor
EOF

kubectl create -f nginx-gvisor.yaml

To verify the Pod is running with gVisor:

# Get the container ID
kubectl get pod nginx-gvisor -o jsonpath='{.status.containerStatuses[0].containerID}' 
containerd://9f71a133fc27c3a305710552489c16977d5c48cd40f31810c2010dac393c5ba7

# List containers running with runsc on the gVisor node
[ec2-user@ip-192-168-31-136 gvisor]$ sudo env "PATH=$PATH" runsc --root /run/containerd/runsc/k8s.io list -quiet
9411dfee3811da9dd45e8681f697bcf5326173d6510238ce70beb02ffe00f444
9f71a133fc27c3a305710552489c16977d5c48cd40f31810c2010dac393c5ba7 
  4. To test the inbound network traffic of the Pod, I simply curled port 80 of the Pod and it succeeded.
    To test the outbound network traffic of the Pod, I did the following:
kubectl exec --stdin --tty nginx-gvisor -- /bin/bash
root@nginx-gvisor:/# apt-get update
Err:1 http://security.debian.org/debian-security buster/updates InRelease
  Temporary failure resolving 'security.debian.org'
Err:2 http://deb.debian.org/debian buster InRelease
  Temporary failure resolving 'deb.debian.org'
Err:3 http://deb.debian.org/debian buster-updates InRelease
  Temporary failure resolving 'deb.debian.org'
Reading package lists... Done
W: Failed to fetch http://deb.debian.org/debian/dists/buster/InRelease  Temporary failure resolving 'deb.debian.org'
W: Failed to fetch http://security.debian.org/debian-security/dists/buster/updates/InRelease  Temporary failure resolving 'security.debian.org'
W: Failed to fetch http://deb.debian.org/debian/dists/buster-updates/InRelease  Temporary failure resolving 'deb.debian.org'
W: Some index files failed to download. They have been ignored, or old ones used instead.

You can see that it fails. Other attempts such as wget www.google.com fail as well.
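As an aside, a minimal standalone Go sketch (illustrative only; it assumes the 10.100.0.10 nameserver from /etc/resolv.conf below) that queries the cluster resolver directly helps separate a DNS failure from a wider connectivity failure:

// Hedged sketch: query the cluster DNS server directly from inside the Pod to
// distinguish "resolver unreachable" from "HTTP unreachable".
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	r := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			// Force all lookups to the cluster resolver from /etc/resolv.conf.
			return d.DialContext(ctx, "udp", "10.100.0.10:53")
		},
	}
	addrs, err := r.LookupHost(context.Background(), "deb.debian.org")
	fmt.Println(addrs, err) // with netstack this times out; with network=host it resolves
}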

For debugging purposes, these are the DNS and routing tables (without net-tools, since I couldn't install it) in the Pod container:

root@nginx-gvisor:/# cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local us-west-2.compute.internal
nameserver 10.100.0.10
options ndots:5
root@nginx-gvisor:/# cat /proc/net/route
Iface   Destination     Gateway Flags   RefCnt  Use     Metric  Mask    MTU     Window  IRTT
eth0    0101FEA9        00000000        0001    0       0       0       FFFFFFFF        0       0       0
eth0    00000000        0101FEA9        0003    0       0       0       00000000        0       0       0  
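For reference, the Destination and Gateway columns are little-endian hex, so 0101FEA9 is 169.254.1.1: the table has a /32 host route to 169.254.1.1 plus a default route via 169.254.1.1. A throwaway Go snippet (purely illustrative) to decode such values:

// Decode the little-endian hex addresses printed by /proc/net/route.
package main

import (
	"encoding/binary"
	"fmt"
	"net"
	"strconv"
)

func main() {
	for _, h := range []string{"0101FEA9", "00000000"} {
		v, _ := strconv.ParseUint(h, 16, 32)
		ip := make(net.IP, 4)
		binary.LittleEndian.PutUint32(ip, uint32(v))
		fmt.Println(h, "->", ip) // 0101FEA9 -> 169.254.1.1
	}
}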

I also captured the tcpdump packets on the ENI network interface for the Pod allocated by EKS:
eni567d651201a.nohost.tcpdump.tar.gz.
Details about the network interface:

[ec2-user@ip-192-168-31-136 ~]$ ifconfig
eni567d651201a: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet6 fe80::4cfa:44ff:fe5d:9495  prefixlen 64  scopeid 0x20<link>
        ether 4e:fa:44:5d:94:95  txqueuelen 0  (Ethernet)
        RX packets 3  bytes 270 (270.0 B)
        RX errors 0  dropped 2  overruns 0  frame 0
        TX packets 5  bytes 446 (446.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

I also captured runsc debug information for the containers in the Pod:
9f71a133fc27c3a305710552489c16977d5c48cd40f31810c2010dac393c5ba7.tar.gz
9411dfee3811da9dd45e8681f697bcf5326173d6510238ce70beb02ffe00f444.tar.gz

  5. Now, to verify that it works when the Pod uses the host network, I added network="host" to /etc/containerd/runsc.toml and restarted containerd. I reran the same experiment as above, with the following results:

Verify running Pod:

# Get the container ID
kubectl get pod nginx-gvisor -o jsonpath='{.status.containerStatuses[0].containerID}' 
containerd://e4ec52fdad3e889bf386b1eca03e231ad53e0452e4bc623282732eba0d2da720

# List containers running with runsc on the gVisor node
[ec2-user@ip-192-168-31-136 gvisor]$ sudo env "PATH=$PATH" runsc --root /run/containerd/runsc/k8s.io list -quiet
96198907b56174067a1aa2b9c0fa3644670675b25fa28a7b44234fc232cccd5d
e4ec52fdad3e889bf386b1eca03e231ad53e0452e4bc623282732eba0d2da720 

Successful inbound with curl, and successful outbound as follows:

kubectl exec --stdin --tty nginx-gvisor -- /bin/bash
root@nginx-gvisor:/# apt-get update
Get:1 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Get:2 http://security.debian.org/debian-security buster/updates/main amd64 Packages [213 kB]
Get:3 http://deb.debian.org/debian buster InRelease [121 kB]
Get:4 http://deb.debian.org/debian buster-updates InRelease [51.9 kB]
Get:5 http://deb.debian.org/debian buster/main amd64 Packages [7905 kB]
Get:6 http://deb.debian.org/debian buster-updates/main amd64 Packages [7868 B]
Fetched 8364 kB in 6s (1462 kB/s)
Reading package lists... Done

DNS and routing table (with net-tools this time) on Pod:

root@nginx-gvisor:/# cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local us-west-2.compute.internal
nameserver 10.100.0.10
options ndots:5
root@nginx-gvisor:/# cat /proc/net/route
Iface   Destination     Gateway Flags   RefCnt  Use     Metric  Mask    MTU     Window  IRTT
eth0    00000000        0101FEA9        0003    0       0       0       00000000        0       0       0
eth0    0101FEA9        00000000        0001    0       0       0       FFFFFFFF        0       0       0
eth0    751FA8C0        00000000        0001    0       0       0       FFFFFFFF        0       0       0
root@nginx-gvisor:/# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         169.254.1.1     0.0.0.0         UG    0      0        0 eth0
169.254.1.1     0.0.0.0         255.255.255.255 U     0      0        0 eth0
192.168.31.117  0.0.0.0         255.255.255.255 U     0      0        0 eth0

TCPDump file:
eni567d651201a.host.tcpdump.tar.gz
Details about the network interface:

[ec2-user@ip-192-168-31-136 ~]$ ifconfig
eni567d651201a: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet6 fe80::58a9:b5ff:feda:27e5  prefixlen 64  scopeid 0x20<link>
        ether 5a:a9:b5:da:27:e5  txqueuelen 0  (Ethernet)
        RX packets 10  bytes 796 (796.0 B)
        RX errors 0  dropped 2  overruns 0  frame 0
        TX packets 5  bytes 446 (446.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

runsc debug files:
96198907b56174067a1aa2b9c0fa3644670675b25fa28a7b44234fc232cccd5d.tar.gz
e4ec52fdad3e889bf386b1eca03e231ad53e0452e4bc623282732eba0d2da720.tar.gz

Environment

Please include the following details of your environment:

  • runsc -version
[ec2-user@ip-192-168-31-136 ~]$ runsc -version
runsc version release-20200622.1-171-gc66991ad7de6
spec: 1.0.1-dev
  • kubectl version and kubectl get nodes -o wide
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.6-beta.0", GitCommit:"e7f962ba86f4ce7033828210ca3556393c377bcc", GitTreeState:"clean", BuildDate:"2020-01-15T08:26:26Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.8-eks-fd1ea7", GitCommit:"fd1ea7c64d0e3ccbf04b124431c659f65330562a", GitTreeState:"clean", BuildDate:"2020-05-28T19:06:00Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}

$ kubectl get nodes -o wide                                                            
NAME                                           STATUS   ROLES    AGE    VERSION                INTERNAL-IP      EXTERNAL-IP     OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
ip-192-168-31-136.us-west-2.compute.internal   Ready    <none>   3d3h   v1.16.12-eks-904af05   192.168.31.136   35.161.102.17   Amazon Linux 2   4.14.181-142.260.amzn2.x86_64   containerd://1.3.2
ip-192-168-60-139.us-west-2.compute.internal   Ready    <none>   3d3h   v1.16.12-eks-904af05   192.168.60.139   44.230.198.56   Amazon Linux 2   4.14.181-142.260.amzn2.x86_64   docker://19.3.6
  • uname -a
$ uname -a
Darwin moehajj-C02CJ1ARML7M 19.6.0 Darwin Kernel Version 19.6.0: Sun Jul  5 00:43:10 PDT 2020; root:xnu-6153.141.1~9/RELEASE_X86_64 x86_64
@hbhasker
Contributor

Offhand, looking at the tcpdump, it looks like the runsc routing table/lookup is somehow incorrect: Netstack is trying to resolve 169.254.1.1 by sending an ARP query and not getting anything back. I will have to set up a cluster to really see what might be going on.

But looking at /proc/net/route, I see that runsc may not be sorting the routes correctly.

@moehajj
Author

moehajj commented Jul 22, 2020

That was my initial thought as well, given the state of the routing table, but when I looked at the routing table for the Pod running with network=host, the entries seemed identical, apart from an extra entry for the Pod's IP (and I don't see how that would solve the issue).

I examined the tcpdump further, and I noticed that when I run 'apt-get update', the first messages that are sent are:

With network=host: (file eni567d651201a.host.tcpdump.tar.gz, starting line 16)

16:47:57.559711 76:97:f8:00:6c:ab > 5a:a9:b5:da:27:e5, ethertype IPv4 (0x0800), length 100: 192.168.31.117.46138 > 10.100.0.10.53: 64553+ A? deb.debian.org.default.svc.cluster.local. (58)
16:47:57.559769 76:97:f8:00:6c:ab > 5a:a9:b5:da:27:e5, ethertype IPv4 (0x0800), length 100: 192.168.31.117.46138 > 10.100.0.10.53: 41010+ AAAA? deb.debian.org.default.svc.cluster.local. (58)
16:47:57.560625 5a:a9:b5:da:27:e5 > 76:97:f8:00:6c:ab, ethertype IPv4 (0x0800), length 193: 10.100.0.10.53 > 192.168.31.117.46138: 64553 NXDomain*- 0/1/0 (151)
16:47:57.560634 5a:a9:b5:da:27:e5 > 76:97:f8:00:6c:ab, ethertype IPv4 (0x0800), length 193: 10.100.0.10.53 > 192.168.31.117.46138: 41010 NXDomain*- 0/1/0 (151)
[...]

You can see that the Pod contacts the DNS server directly, without sending any ARP requests.

Using netstack, and without network=host (file eni567d651201a.nohost.tcpdump.tar.gz, starting line 16)

16:25:05.990433 e6:fb:d5:ec:b4:08 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 169.254.1.1 tell 192.168.16.114, length 28
16:25:06.990595 e6:fb:d5:ec:b4:08 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 169.254.1.1 tell 192.168.16.114, length 28
16:25:07.990751 e6:fb:d5:ec:b4:08 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 169.254.1.1 tell 192.168.16.114, length 28

ARP messages are sent out from the eth0 interface (hwaddr: e6:fb:d5:ec:b4:08) on the Pod.
I would expect that, since both have the same DNS resolver configuration (/etc/resolv.conf) and very similar routing tables (/proc/net/route), they should behave the same and the requests to the DNS server would be sent out. This is why I think it might be an issue with netstack DNS.

Note that I tried it on a single-node kubernetes cluster running on my local machine and things worked fine, but when I ran the same setup on EKS it broke.

@ianlewis ianlewis added area: networking Issue related to networking type: bug Something isn't working labels Jul 22, 2020
@hbhasker
Contributor

hbhasker commented Jul 22, 2020

If you look at the routes, the order is different. With host networking the default route is first, but I think runsc is printing it second.

I believe what is happening is that we scrape the routes per interface below and send them over urpc to the Sentry:

Routes: routes,

The Sentry then proceeds to install the routes without sorting them in any particular order, which means the routes end up installed in interface order:

n.Stack.SetRouteTable(routes)
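A very rough sketch of one possible fix (not actual runsc code, just illustrating the idea against the []tcpip.Route slice passed above) would be to sort the scraped routes by prefix length so that more-specific routes are matched before the default route:

// Sketch only: order routes most-specific first so the default route
// (prefix length 0) is consulted last, approximating Linux longest-prefix match.
sort.SliceStable(routes, func(i, j int) bool {
	return routes[i].Destination.Prefix() > routes[j].Destination.Prefix()
})
n.Stack.SetRouteTable(routes)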

@hbhasker
Contributor

That said, I am curious why that 169.254.1.1 route exists. I am going to have to run this myself and poke around.

@ianlewis
Contributor

EKS has some interesting bits regarding its CNI plugin implementation. I'm not sure it's relevant yet, but they may be making assumptions that don't hold true for gVisor sandboxes.
https://github.com/aws/amazon-vpc-cni-k8s

@moehajj
Author

moehajj commented Jul 22, 2020

@hbhasker I believe the 169.254... route exists because EKS has a metadata service running at 169.254.169.254; you can read more about it here.
@ianlewis could you highlight which bits/assumptions you're referring to?

@iangudger
Contributor

Netstack uses route order to determine priority. Linux uses a more complicated algorithm. We have talked about implementing it in runsc and having runsc generate the netstack routing table.

@ianlewis
Contributor

@moehajj I was mostly speculating. I just know they use ENIs and the ipamd daemon to assign addresses, which is a bit different from most CNI plugins, and this is the first I've heard of someone running runsc on EKS.

It sounds, though, like the ordering/priority of the routes is the more likely culprit.

@hbhasker
Contributor

Actually I am not so sure. Netstack seems to be doing the right thing. It picked the default route and is trying to resolve the link address of the default gateway. I think EKS might be adding an arp entry for 169.254.1.1 as part of the setup.

Could you dump the state of the ARP table in the container's namespace? My guess is we will find an entry there which netstack is not aware of.

@hbhasker
Contributor

Looking at the Calico docs, for example:

Why can’t I see the 169.254.1.1 address mentioned above on my host?
Calico tries hard to avoid interfering with any other configuration on the host. Rather than adding the gateway address to the host side of each workload interface, Calico sets the proxy_arp flag on the interface. This makes the host behave like a gateway, responding to ARPs for 169.254.1.1 without having to actually allocate the IP address to the interface.

I wonder if EKS does something similar.
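For reference, that Calico-style approach boils down to a single sysctl on the host side of the workload interface; a hypothetical standalone illustration (the interface name is just the ENI from the ifconfig output above):

// Hypothetical host-side step: enable proxy ARP so the host answers ARP for
// 169.254.1.1 on the workload's veth without owning that address.
package main

import (
	"log"
	"os"
)

func main() {
	if err := os.WriteFile("/proc/sys/net/ipv4/conf/eni567d651201a/proxy_arp", []byte("1"), 0644); err != nil {
		log.Fatal(err)
	}
}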

@hbhasker
Contributor

https://www.slideshare.net/AmazonWebServices/kubernetes-networking-in-amazon-eks-con412-aws-reinvent-2018
That presentation mentions static arp for 169.254.1.1.

I think that's why host mode works but netstack doesn't.

@hbhasker
Contributor

I believe runsc needs to add support for the RTM_NEWNEIGH netlink command.

https://man7.org/linux/man-pages/man7/rtnetlink.7.html

@hbhasker
Contributor

Or scrape any ARP table entries in the namespace and forward them to runsc at startup so that it installs the same ones in its internal ARP cache.
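A rough sketch of what that scraping could look like (this is not runsc code; it only parses the namespace's /proc/net/arp for permanent entries):

// Sketch only: collect permanent (static) ARP entries from the namespace's
// /proc/net/arp so they could be forwarded to the sandbox at startup.
package main

import (
	"fmt"
	"os"
	"strings"
)

func scrapeStaticARP(path string) (map[string]string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	entries := map[string]string{} // IP address -> link address
	for _, line := range strings.Split(string(data), "\n")[1:] { // skip header row
		f := strings.Fields(line)
		// Columns: IP, HW type, Flags, HW address, Mask, Device.
		// Flags 0x6 == ATF_COM|ATF_PERM, i.e. a complete, permanent entry.
		if len(f) >= 6 && f[2] == "0x6" {
			entries[f[0]] = f[3]
		}
	}
	return entries, nil
}

func main() {
	entries, err := scrapeStaticARP("/proc/net/arp")
	fmt.Println(entries, err)
}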

@moehajj
Author

moehajj commented Jul 22, 2020

I checked the ARP table in /proc/net/arp and with arp -a on a Pod running on netstack and a Pod using the host network, and both ARP tables are empty. Frankly, the files under /proc/net are quite similar; I'll include the output below.

network=host

$ cat /proc/net/*
IP address       HW type     Flags       HW address            Mask     Device
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
  eth0: 5350623    2007    0    0    0     0          0         0    92085    1324    0    1    0     0       0          0
fe800000000000003017edfffe008536 03 40 00 c0     eth0
00000000000000000000000000000001 01 80 00 80       lo
sk       Eth Pid    Groups   Rmem     Wmem     Dump     Locks     Drops     Inode
TcpExt: SyncookiesSent SyncookiesRecv SyncookiesFailed EmbryonicRsts PruneCalled RcvPruned OfoPruned OutOfWindowIcmps LockDroppedIcmps ArpFilter TW TWRecycled TWKilled PAWSPassive PAWSActive PAWSEstab DelayedACKs DelayedACKLocked DelayedACKLost ListenOverflows ListenDrops TCPPrequeued TCPDirectCopyFromBacklog TCPDirectCopyFromPrequeue TCPPrequeueDropped TCPHPHits TCPHPHitsToUser TCPPureAcks TCPHPAcks TCPRenoRecovery TCPSackRecovery TCPSACKReneging TCPFACKReorder TCPSACKReorder TCPRenoReorder TCPTSReorder TCPFullUndo TCPPartialUndo TCPDSACKUndo TCPLossUndo TCPLostRetransmit TCPRenoFailures TCPSackFailures TCPLossFailures TCPFastRetrans TCPForwardRetrans TCPSlowStartRetrans TCPTimeouts TCPLossProbes TCPLossProbeRecovery TCPRenoRecoveryFail TCPSackRecoveryFail TCPSchedulerFailed TCPRcvCollapsed TCPDSACKOldSent TCPDSACKOfoSent TCPDSACKRecv TCPDSACKOfoRecv TCPAbortOnData TCPAbortOnClose TCPAbortOnMemory TCPAbortOnTimeout TCPAbortOnLinger TCPAbortFailed TCPMemoryPressures TCPSACKDiscard TCPDSACKIgnoredOld TCPDSACKIgnoredNoUndo TCPSpuriousRTOs TCPMD5NotFound TCPMD5Unexpected TCPMD5Failure TCPSackShifted TCPSackMerged TCPSackShiftFallback TCPBacklogDrop TCPMinTTLDrop TCPDeferAcceptDrop IPReversePathFilter TCPTimeWaitOverflow TCPReqQFullDoCookies TCPReqQFullDrop TCPRetransFail TCPRcvCoalesce TCPOFOQueue TCPOFODrop TCPOFOMerge TCPChallengeACK TCPSYNChallenge TCPFastOpenActive TCPFastOpenActiveFail TCPFastOpenPassive TCPFastOpenPassiveFail TCPFastOpenListenOverflow TCPFastOpenCookieReqd TCPSpuriousRtxHostQueues BusyPollRxPackets TCPAutoCorking TCPFromZeroWindowAdv TCPToZeroWindowAdv TCPWantZeroWindowAdv TCPSynRetrans TCPOrigDataSent TCPHystartTrainDetect TCPHystartTrainCwnd TCPHystartDelayDetect TCPHystartDelayCwnd TCPACKSkippedSynRecv TCPACKSkippedPAWS TCPACKSkippedSeq TCPACKSkippedFinWait2 TCPACKSkippedTimeWait TCPACKSkippedChallenge TCPWinProbe TCPKeepAlive TCPMTUPFail TCPMTUPSuccess
sk       RefCnt Type Proto  Iface R Rmem   User   Inode
protocol  size sockets  memory press maxhdr  slab module     cl co di ac io in de sh ss gs se re sp bi br ha uh gp em
000003e8 00000040 000f4240 3b9aca00
Type Device      Function
Iface	Destination	Gateway	Flags	RefCnt	Use	Metric	Mask	MTU	Window	IRTT                                                         
eth0	00000000	0101FEA9	0003	0	0	0	00000000	0	0	0                                                                               
eth0	0101FEA9	00000000	0001	0	0	0	FFFFFFFF	0	0	0                                                                               
eth0	E412A8C0	00000000	0001	0	0	0	FFFFFFFF	0	0	0                                                                               
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 1 255 2000 0 0 0 0 0 2000 1311 0 0 0 0 0 0 0 0 0
Icmp: InMsgs InErrors InCsumErrors InDestUnreachs InTimeExcds InParmProbs InSrcQuenchs InRedirects InEchos InEchoReps InTimestamps InTimestampReps InAddrMasks InAddrMaskReps OutMsgs OutErrors OutDestUnreachs OutTimeExcds OutParmProbs OutSrcQuenchs OutRedirects OutEchos OutEchoReps OutTimestamps OutTimestampReps OutAddrMasks OutAddrMaskReps
Icmp: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
IcmpMsg:
IcmpMsg:
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts InCsumErrors
Tcp: 1 200 120000 -1 2 0 0 0 0 1977 1288 0 0 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
Udp: 23 0 0 23 0 0 0 0
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
UdpLite: 0 0 0 0 0 0 0 0
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode                                                     
  sl  local_address                         remote_address                        st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops             
  sl  local_address                         remote_address                        st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
Num       RefCount Protocol Flags    Type St Inode Path

network=netstack

$ cat /proc/net/*
IP address       HW type     Flags       HW address            Mask     Device
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
  eth0:     304       4    0    0    0     0          0         0        0       0    0    0    0     0       0          0
00000000000000000000000000000001 01 80 00 00       lo
fe80000000000000f47f03fffe459b4d 02 80 00 00     eth0
sk       Eth Pid    Groups   Rmem     Wmem     Dump     Locks     Drops     Inode
TcpExt: SyncookiesSent SyncookiesRecv SyncookiesFailed EmbryonicRsts PruneCalled RcvPruned OfoPruned OutOfWindowIcmps LockDroppedIcmps ArpFilter TW TWRecycled TWKilled PAWSPassive PAWSActive PAWSEstab DelayedACKs DelayedACKLocked DelayedACKLost ListenOverflows ListenDrops TCPPrequeued TCPDirectCopyFromBacklog TCPDirectCopyFromPrequeue TCPPrequeueDropped TCPHPHits TCPHPHitsToUser TCPPureAcks TCPHPAcks TCPRenoRecovery TCPSackRecovery TCPSACKReneging TCPFACKReorder TCPSACKReorder TCPRenoReorder TCPTSReorder TCPFullUndo TCPPartialUndo TCPDSACKUndo TCPLossUndo TCPLostRetransmit TCPRenoFailures TCPSackFailures TCPLossFailures TCPFastRetrans TCPForwardRetrans TCPSlowStartRetrans TCPTimeouts TCPLossProbes TCPLossProbeRecovery TCPRenoRecoveryFail TCPSackRecoveryFail TCPSchedulerFailed TCPRcvCollapsed TCPDSACKOldSent TCPDSACKOfoSent TCPDSACKRecv TCPDSACKOfoRecv TCPAbortOnData TCPAbortOnClose TCPAbortOnMemory TCPAbortOnTimeout TCPAbortOnLinger TCPAbortFailed TCPMemoryPressures TCPSACKDiscard TCPDSACKIgnoredOld TCPDSACKIgnoredNoUndo TCPSpuriousRTOs TCPMD5NotFound TCPMD5Unexpected TCPMD5Failure TCPSackShifted TCPSackMerged TCPSackShiftFallback TCPBacklogDrop TCPMinTTLDrop TCPDeferAcceptDrop IPReversePathFilter TCPTimeWaitOverflow TCPReqQFullDoCookies TCPReqQFullDrop TCPRetransFail TCPRcvCoalesce TCPOFOQueue TCPOFODrop TCPOFOMerge TCPChallengeACK TCPSYNChallenge TCPFastOpenActive TCPFastOpenActiveFail TCPFastOpenPassive TCPFastOpenPassiveFail TCPFastOpenListenOverflow TCPFastOpenCookieReqd TCPSpuriousRtxHostQueues BusyPollRxPackets TCPAutoCorking TCPFromZeroWindowAdv TCPToZeroWindowAdv TCPWantZeroWindowAdv TCPSynRetrans TCPOrigDataSent TCPHystartTrainDetect TCPHystartTrainCwnd TCPHystartDelayDetect TCPHystartDelayCwnd TCPACKSkippedSynRecv TCPACKSkippedPAWS TCPACKSkippedSeq TCPACKSkippedFinWait2 TCPACKSkippedTimeWait TCPACKSkippedChallenge TCPWinProbe TCPKeepAlive TCPMTUPFail TCPMTUPSuccess
sk       RefCnt Type Proto  Iface R Rmem   User   Inode
protocol  size sockets  memory press maxhdr  slab module     cl co di ac io in de sh ss gs se re sp bi br ha uh gp em
000003e8 00000040 000f4240 3b9aca00
Type Device      Function
Iface	Destination	Gateway	Flags	RefCnt	Use	Metric	Mask	MTU	Window	IRTT                                                         
eth0	0101FEA9	00000000	0001	0	0	0	FFFFFFFF	0	0	0                                                                               
eth0	00000000	0101FEA9	0003	0	0	0	00000000	0	0	0                                                                               
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 0 0 4 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Icmp: InMsgs InErrors InCsumErrors InDestUnreachs InTimeExcds InParmProbs InSrcQuenchs InRedirects InEchos InEchoReps InTimestamps InTimestampReps InAddrMasks InAddrMaskReps OutMsgs OutErrors OutDestUnreachs OutTimeExcds OutParmProbs OutSrcQuenchs OutRedirects OutEchos OutEchoReps OutTimestamps OutTimestampReps OutAddrMasks OutAddrMaskReps
Icmp: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
IcmpMsg:
IcmpMsg:
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts InCsumErrors
Tcp: 1 200 120000 -1 0 0 0 0 0 0 0 0 0 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
Udp: 0 0 0 0 0 0 0 0
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
UdpLite: 0 0 0 0 0 0 0 0
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode                                                     
  sl  local_address                         remote_address                        st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops             
  sl  local_address                         remote_address                        st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
Num       RefCount Protocol Flags    Type St Inode Path

The output follows the order of the files on the Pod:

$ ls /proc/net
arp  dev  if_inet6  ipv6_route  netlink  netstat  packet  protocols  psched  ptype  route  snmp  tcp  tcp6  udp  udp6  unix

@hbhasker
Contributor

That is rather strange, because without ARP I am not sure how the host-network case is working. The routes say that 169.254.1.1 is the default gateway, which means it needs the link address of the gateway before it can send packets to non-local destinations.

@moehajj
Author

moehajj commented Jul 22, 2020

Yeah, I find it strange as well; it seems like a runsc issue, since Pods running with runc have the 169.254.1.1 static ARP entry. I was following the AWS CNI proposal to debug further and get a better understanding of what's going on. I reached a dead end, but I noticed a few things that might be useful.

So I set up 2 nodes, both using runsc, but one configured with network=host and the other with the default netstack network. I ran two instances of the same Ubuntu image with net-tools installed (robertxie/ubuntu-nettools), one on each configuration.

kubectl get pods -o wide
NAME                  READY   STATUS    RESTARTS   AGE   IP               NODE                                           NOMINATED NODE   READINESS GATES
ubuntu-net-host       1/1     Running   0          91s   192.168.53.193   ip-192-168-60-139.us-west-2.compute.internal   <none>           <none>
ubuntu-net-netstack   1/1     Running   0          3s    192.168.1.182    ip-192-168-31-136.us-west-2.compute.internal   <none>           <none>
  1. I noticed that the pod that is using the netstack network does not get an egress ip rule assigned to it, while pods running with host network (using runsc or runc) do.

On node with runsc using host network

ip route show
default via 192.168.32.1 dev eth0 
169.254.169.254 dev eth0 
192.168.32.0/19 dev eth0 proto kernel scope link src 192.168.60.139 
192.168.32.217 dev enid6c09ee496d scope link 
192.168.35.91 dev enidcb5860247f scope link 
192.168.37.181 dev eni564327aa972 scope link 
192.168.49.88 dev enia9ad1fc6e5f scope link 
192.168.53.193 dev enid9cb3177b0e scope link 		<——
192.168.54.153 dev enibdd59383046 scope link 
192.168.55.67 dev enie9afb6b6f81 scope link 
192.168.56.254 dev eni19639745f02 scope link 
192.168.58.249 dev eni48d32331e45 scope link 

ip rule list
0:      from all lookup local 
512:    from all to 192.168.56.235 lookup main 
512:    from all to 192.168.62.219 lookup main 
512:    from all to 192.168.63.206 lookup main 
512:    from all to 192.168.34.124 lookup main 
512:    from all to 192.168.45.140 lookup main 
512:    from all to 192.168.43.82 lookup main 
512:    from all to 192.168.49.190 lookup main 
512:    from all to 192.168.51.111 lookup main 
512:    from all to 192.168.55.67 lookup main 
512:    from all to 192.168.49.88 lookup main 
512:    from all to 192.168.54.153 lookup main 
512:    from all to 192.168.58.249 lookup main 
512:    from all to 192.168.35.91 lookup main 
512:    from all to 192.168.56.254 lookup main 
512:    from all to 192.168.32.217 lookup main 
512:    from all to 192.168.37.181 lookup main 
512:    from all to 192.168.53.193 lookup main 		<—— 
1024:   from all fwmark 0x80/0x80 lookup main 
1536:   from 192.168.49.88 to 192.168.0.0/16 lookup 2 
1536:   from 192.168.54.153 to 192.168.0.0/16 lookup 2 
1536:   from 192.168.58.249 to 192.168.0.0/16 lookup 2 
1536:   from 192.168.35.91 to 192.168.0.0/16 lookup 2 
1536:   from 192.168.56.254 to 192.168.0.0/16 lookup 2 
1536:   from 192.168.32.217 to 192.168.0.0/16 lookup 2 
1536:   from 192.168.37.181 to 192.168.0.0/16 lookup 2 
1536:   from 192.168.53.193 to 192.168.0.0/16 lookup 2 		<—— 
32766:  from all lookup main 
32767:  from all lookup default 

ip route show table 2
default via 192.168.32.1 dev eth1 
192.168.32.1 dev eth1 scope link 

On node with runsc using netstack network

ip route show
default via 192.168.0.1 dev eth0 
169.254.169.254 dev eth0 
192.168.0.0/19 dev eth0 proto kernel scope link src 192.168.31.136 
192.168.1.182 dev enif3e00791c23 scope link 		<—-
192.168.18.89 dev enic3634746f05 scope link 
192.168.25.50 dev eni8c7e75e6afd scope link 
192.168.27.227 dev enia205ab59220 scope link 

ip rule list
0:      from all lookup local 
512:    from all to 192.168.26.139 lookup main 
512:    from all to 192.168.27.227 lookup main 
512:    from all to 192.168.25.50 lookup main 
512:    from all to 192.168.18.89 lookup main 
512:    from all to 192.168.1.182 lookup main  		<—-
1024:   from all fwmark 0x80/0x80 lookup main 
1536:   from 192.168.25.50 to 192.168.0.0/16 lookup 2 
1536:   from 192.168.18.89 to 192.168.0.0/16 lookup 2 
1536:   from 192.168.27.57 to 192.168.0.0/16 lookup 2 
32766:  from all lookup main 
32767:  from all lookup default 


ip route show table 2
default via 192.168.0.1 dev eth1 
192.168.0.1 dev eth1 scope link 

However, to my disappointment, when I added a rule using sudo ip rule add from 192.168.1.182 to 192.168.0.0/16 lookup 2 priority 1536, egress network access on the Pod still failed.

  2. Another thing I noticed is that I get a lot of ioctl(..) failed messages on the Pod using the host network, but I don't see these messages on the other Pod, specifically when I'm doing things like ifconfig. Also, the eth0 interface on netstack does not show the BROADCAST,MULTICAST flags.

On pod using host network

root@ubuntu-net-host:/# ifconfig
SIOCGIFCONF: Inappropriate ioctl for device
root@ubuntu-net-host:/# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 0 ioctl(SIOCGIFTXQLEN) failed: Inappropriate ioctl for device

    link/loopback 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
    inet 127.0.0.1/8 scope global dynamic 
    inet6 ::1/128 scope global dynamic 
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 0 ioctl(SIOCGIFTXQLEN) failed: Inappropriate ioctl for device

    link/ether 9a:fa:5b:6b:4e:f2 brd ff:ff:ff:ff:ff:ff
    inet 192.168.53.193/32 scope global dynamic 
    inet6 fe80::98fa:5bff:fe6b:4ef2/64 scope global dynamic 

On pod using netstack network

root@ubuntu-net-netstack:/# ifconfig
eth0      Link encap:Ethernet  HWaddr ae:57:05:3f:4d:03  
          inet addr:192.168.1.182  Mask:255.255.255.255
          inet6 addr: fe80::ac57:5ff:fe3f:4d03/128 Scope:Global
          UP RUNNING  MTU:9001  Metric:1
          RX packets:6 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:9001 
          RX bytes:452 (452.0 B)  TX bytes:0 (0.0 B)
          Memory:34d3f0500002329-0 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.255.255.255
          inet6 addr: ::1/128 Scope:Global
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:65536 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
          Memory:10000-0 

root@ubuntu-net-netstack:/# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/32 scope global dynamic 
    inet6 ::1/128 scope global dynamic 
2: eth0: <UP,LOWER_UP> mtu 9001 
    link/ether ae:57:05:3f:4d:03 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.182/32 scope global dynamic 
    inet6 fe80::ac57:5ff:fe3f:4d03/128 scope global dynamic 

I hope this is at all relevant to solving this issue.

@hbhasker were you able to reproduce these results on your own cluster?

@hbhasker
Contributor

I haven't yet gotten around to setting up my own EKS cluster. It will take me some time, as I am not very familiar with EKS or AWS in general. That said, --network=host does not forward all ioctls, and that's probably why you see some failures. Netstack implements some of the ioctls needed for ifconfig, and that's why it works.

All netstack interfaces do support multicast/broadcast but I think we don't set flags appropriately or don't return them correctly for ifconfig to show them.

runsc does a few other things at startup as well: it steals the routes from the host for the interface being handed to runsc and passes them to the sandbox instead. So if you inspect the routes in the namespace in which runsc is running, you may not see all the rules, as some of them have been stolen and handed to runsc at startup (runsc also removes the IP address from the host; otherwise the host would respond to TCP SYNs etc. with RSTs, since it isn't aware of any listening sockets in Netstack).

I will see if I can figure out how to set up EKS and post if I find something. But mostly it looks like we may need to scrape any ARP entries from the namespace and pass them to runsc at startup. From what I can see, the static ARP entry is being installed in the namespace rather than by running a command inside the container (e.g. via docker exec). In that case the ARP cache on the host is updated, but it is invisible to runsc.

@moehajj
Author

moehajj commented Jul 30, 2020

@hbhasker Any follow-up on this? Anything I can help with?

@hbhasker
Contributor

@moehajj Sorry, I haven't been able to work on this yet. That said, if you already have an EKS cluster that I can get access to, it would make my life a lot simpler than having to set one up. I spent some time reading up on EKS but didn't get to the point of actually setting up a cluster.

@moehajj
Author

moehajj commented Jul 31, 2020

@hbhasker I won't be able to give you access to an EKS cluster, but if you can quickly set up an AWS account and spin up a cluster (I found this guide very helpful when I started) I can give you the scripts that do the rest (e.g. install gVisor on nodes).

  1. Create a cluster with SSH access; you need a key pair (eks_key.pem) for your EC2 instances (follow this guide) to use when you SSH:
eksctl create cluster --name gvisor-demo --nodes 2 --region us-west-2 --ssh-access --ssh-public-key eks_key
  2. Then just SSH into the nodes and set up gVisor with containerd. The default Amazon Linux AMI already has containerd installed, so all you need to do is configure the kubelet to use containerd and configure containerd with a gVisor runtime handler.

SSH into first node

export n0_EIP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="ExternalIP")].address}')
ssh -i /path/to/eks_key.pem ec2-user@$n0_EIP

On the node, here is how I configure gVisor with netstack networking:

# Install dependencies
sudo yum install -y git # needed to build gvisor-containerd-shim below

# Install Golang
wget https://dl.google.com/go/go1.14.4.linux-amd64.tar.gz
sudo tar -C /usr/local -xzf go1.14.4.linux-amd64.tar.gz

GOROOT=/usr/local/go
GOPATH=$HOME/go
PATH=/usr/local/go/bin:$HOME/go/bin:$PATH

## Create systemd drop-in for containerd
sudo sed -i 's;--container-runtime=docker;--container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock;' /etc/systemd/system/kubelet.service.d/10-eksclt.al2.conf 
sudo systemctl daemon-reload
sudo systemctl restart kubelet

# Install gVisor runsc

set -e
wget https://storage.googleapis.com/gvisor/releases/nightly/latest/runsc
sudo mv runsc /usr/local/bin
sudo chown root:root /usr/local/bin/runsc
sudo chmod 0755 /usr/local/bin/runsc

# Install gvisor-containerd-shim
git clone https://github.com/google/gvisor-containerd-shim.git
cd gvisor-containerd-shim
make
sudo make install


# Configure the gVisor runtime in containerd (you will still need to create a RuntimeClass and assign pods to runsc)
cat <<EOF | sudo tee /etc/containerd/config.toml
disabled_plugins = ["restart"]
[plugins.linux]
  shim_debug = true
[plugins.cri.containerd.runtimes.runsc]
  runtime_type = "io.containerd.runsc.v1"
[plugins.cri.containerd.runtimes.runsc.options]
  TypeUrl = "io.containerd.runsc.v1.options"
  ConfigPath = "/etc/containerd/runsc.toml"
EOF
#Runsc options config
cat <<EOF | sudo tee /etc/containerd/runsc.toml
[runsc_config]
  debug="true"
  strace="true"
  debug-log="/tmp/runsc/%ID%/"
EOF

# Restart containerd
sudo systemctl restart containerd
  3. From here it's standard gVisor setup. I like to label the nodes I've selected for the gVisor handler and use a nodeSelector in my RuntimeClass:

Label node

export n0_name=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl label node $n0_name runtime=gvisor

Deploy RuntimeClass

cat <<EOF | tee gvisor-runtime.yaml
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
scheduling:
  nodeSelector:
    runtime: gvisor
EOF

kubectl apply -f gvisor-runtime.yaml

I hope this helps reduce the cluster setup overhead!

@moehajj
Author

moehajj commented Jul 31, 2020

@hbhasker Looks like things have started working as of a newer runsc commit.

[ec2-user@ip-192-168-69-37 ~]$ runsc --version
runsc version release-20200622.1-236-g112eb0c5b9e6
spec: 1.0.1-dev

But something broke such that I can't kubectl port-forward a pod running on gVisor. I'll do some more testing to check at which commit things got fixed and when port-forwarding broke (and I'll raise a new issue if needed).

@hbhasker
Contributor

Glad to hear that the latest version worked. We did recently make some forwarding fixes, I think; I will have to go through our commit history and see. Please let me know if you identify the commit causing the regression.

@fvoznika
Member

Re: kubectl port-forward, it doesn't work with runsc because containerd makes assumptions about the container's network that are not true for sandboxes. There are more details here: kubernetes/enhancements#1846

@hbhasker
Contributor

hbhasker commented Aug 3, 2020

@moehajj Can we mark this issue fixed as it looks like your initial issue is now resolved?

@moehajj
Author

moehajj commented Aug 3, 2020

@hbhasker, unfortunately, I've been looking into why things suddenly worked, and now I'm no longer able to reproduce a working version. I tried different runsc commits, different Kubernetes versions on EKS (1.16, 1.17), and different CNI plugin versions (0.7.5, 0.8.6), but using netstack has not been successful. I'm not sure what happened, so I apologize for the false hope. It would be great if we could resume looking into this issue, and if you could try reproducing it on your end.

@moehajj
Author

moehajj commented Aug 6, 2020

I've figured out why things had worked all of a sudden: it was because I had deployed a Calico DaemonSet to enable network policies, and having Calico nodes intercept packets seems to fix the issue. Do you think this might be an issue with the EKS CNI that Calico somehow mended?

@ianlewis ianlewis added the area: integration Issue related to third party integrations label Aug 14, 2020
@amscanne
Contributor

Hey Mohammed, that's a great write-up! Just one small point -- the write-up uses the unmaintained containerd-shim from https://github.com/google/gvisor-containerd-shim.git (see the warning at the top of the repository, and the fact that it is an archived repository).

Since about a year ago (3bb5f71), the shim has been built and shipped with the core repository and is included in releases as well. You can actually just install it directly from the bucket, like runsc itself, e.g. wget https://storage.googleapis.com/gvisor/releases/release/latest/containerd-shim-runsc-v1. This also saves you from needing the Go toolchain for the installation.

@pkit
Contributor

pkit commented Oct 28, 2021

fwiw, I've written a guide on setting up an EKS cluster with gVisor, and a custom runsc version of your choice, as the container runtime. I hope it serves as a helpful starting point :smile:

I'm not sure that PRing your article that has nothing to do with the problem described in this repo is a good idea. Sorry.

@pkit
Contributor

pkit commented Oct 28, 2021

What happens is this: EKS relies on static ARP entries for 169.254.1.1 being present.
A vanilla namespace for the containerd CNI looks like this:

$ sudo ip netns exec cni-661976d9-58c3-ce5e-b781-37ad4d95628f arp -a
gateway (169.254.1.1) at 12:cf:1e:29:a2:df [ether] PERM on eth0
gateway (169.254.1.1) at 12:cf:1e:29:a2:df [ether] PERM on eth0

For gVisor, the ARP table is empty, because nothing regarding ARP is copied from the namespace here.
More than that, gVisor's ARP neighbor handling, described here, is used only in tests.
Bottom line: gVisor does not really expose any static ARP handling API, either to the CNI or to the container itself.
A quick fix would probably be to use that "testing" code to copy the static entries at runsc boot and be done with it.
Will try to do a PoC on that.

From what I can see the static arp is being installed in the namespace rather than by running a command inside the container by doing a docker exec. In such case the arp cache on the host will be updated but it is invisible to runsc.

That assessment was correct. But the "running a command inside the container" part is pretty funny, as that's the first thing I tried.
Namely:

bash-5.1# arp -i eth0 -s 169.254.1.1 be:b2:bf:4c:f9:8d
SIOCSARP: Not a tty
bash-5.1# ip neighbor add 169.254.1.1 lladdr be:b2:bf:4c:f9:8d dev eth0 nud permanent
RTNETLINK answers: Permission denied
bash-5.1# arp -a
bash-5.1# ip neighbor show
RTNETLINK answers: Not supported
Dump terminated

Oops.

This issue dragging on for over a year is pretty interesting, as it means nobody has ever tried to use gVisor on EKS, and a lot of CNI implementations rely on either static ARP entries or proxy ARP (neither of which is supported).
I wonder if Google uses the same gVisor in GKE Sandbox...

@hbhasker
Contributor

@pkit gVisor does not support ARP table manipulation, as the required ioctls and netlink commands are not implemented. GKE uses the same gVisor, but it's not an issue there because GKE does not rely on static ARP entries for such things. At some point we will support ARP table commands, but it's not been a priority for us. That said, we are always open to contributions, and it looks like the netlink commands required to make ip neighbor add work are the following:

RTM_NEWNEIGH, RTM_DELNEIGH, RTM_GETNEIGH

Today we only implement a few of the netlink commands

} else if hdr.Flags&linux.NLM_F_REQUEST == linux.NLM_F_REQUEST {

Also we have no visibility into people using gVisor on EKS. That said proxy ARP should work? As long as there is something on the host that responds to the link address for 169.254.1.1 gVisor should be able to connect to it?
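For whoever picks this up, the request that ip neighbor add generates is small. A standalone illustration of its shape (assuming the NdMsg type and constants from golang.org/x/sys/unix; this is not the sentry's handler):

// Illustration only: the RTM_NEWNEIGH payload is a struct ndmsg followed by
// NDA_DST (the IP) and NDA_LLADDR (the MAC) attributes; the sentry's netlink
// handler would need to accept this message type.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	msg := unix.NdMsg{
		Family:  unix.AF_INET,
		Ifindex: 2,                  // eth0 inside the sandbox (example value)
		State:   unix.NUD_PERMANENT, // static entry, as installed by the CNI
	}
	fmt.Printf("type=RTM_NEWNEIGH(%d) ndmsg=%+v attrs=[NDA_DST(%d), NDA_LLADDR(%d)]\n",
		unix.RTM_NEWNEIGH, msg, unix.NDA_DST, unix.NDA_LLADDR)
}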

@pkit
Contributor

pkit commented Oct 28, 2021

@hbhasker I don't think "online" ARP table manipulation is needed.
Just fetching the static ARP entries (set up by the CNI) somewhere here:

allAddrs, err := iface.Addrs()

And then passing it up here:
func (n *Network) CreateLinksAndRoutes(args *CreateLinksAndRoutesArgs, _ *struct{}) error {

For actual setup using
func (n *neighborCache) addStaticEntry(addr tcpip.Address, linkAddr tcpip.LinkAddress) {

Looks like that should do the trick.
But that's just a theory for now; I've only been reading the gVisor code for 2-3 hours or so.

P.S. Implementing the netlink commands seems like a good idea too, at least to improve visibility. To make the first idea concrete, see the rough sketch below.
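A rough sketch of the boot-time plumbing (names and signatures are approximate; it assumes the stack exposes something like Stack.AddStaticNeighbor, so this is not the eventual PR):

// Sketch only: install a scraped static entry (e.g. 169.254.1.1 -> the veth MAC
// from the arp dump above) into the sandbox's neighbor table while links are created.
gw := tcpip.Address(net.ParseIP("169.254.1.1").To4())
mac := tcpip.LinkAddress("\x12\xcf\x1e\x29\xa2\xdf") // 12:cf:1e:29:a2:df
if err := n.Stack.AddStaticNeighbor(nicID, ipv4.ProtocolNumber, gw, mac); err != nil {
	return fmt.Errorf("adding static neighbor: %v", err)
}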

@hbhasker
Contributor

While this is doable, I am not sure we want to support these one-offs. I have been reviewing the CNI spec (https://github.com/containernetworking/cni/blob/master/SPEC.md), and from what I can see it does not provide for any ARP table manipulation directly. In the case of EKS, I am guessing this is done by a CNI plugin that simply executes arbitrary commands in the namespace to set up the ARP entries.

Supporting the exact commands would be the right way to solve this, rather than doing a one-off for this specific use case. EKS could also support this by properly responding to ARP requests for that IP instead of statically inserting an entry. That is what GKE does, for example, for things like the metadata server, which is usually reached via a link-local address (169.254.169.254; see: https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity).

@pkit
Contributor

pkit commented Oct 29, 2021

I'm not sure why that's a problem, as it's clear to me that gVisor should (or even must) copy the full networking config from the namespace it claims to run in.
Otherwise it should drop the claim of supporting real-world workloads.
We had the exact same problem in ZeroVM, and we were bold enough to state that we could not support it.

TL;DR: not copying the ARP config from a namespace seems like a pretty big compatibility problem to me.

@hbhasker
Contributor

I would not phrase it as a big problem, as it's clearly not a common use case. But that said, maybe it's worth doing just to make gVisor work better with EKS. I will take a stab at implementing it.

@pkit
Contributor

pkit commented Oct 29, 2021

I'm OK with implementing it too, if you don't have time or incentive.

@hbhasker
Contributor

@pkit I will be happy to review if you have cycles to implement it, as I have quite a few higher-priority things on my plate at the moment.

@pkit
Contributor

pkit commented Oct 29, 2021

Cool. We've already started working on it anyway.
I hope to submit a PR soon.

@hbhasker
Contributor

Thanks!

@crappycrypto

@pkit gVisor does not support ARP table manipulation as the required IOCTLs and NETLINK commands are not implemented. GKE uses the same gVisor but its not an issue as GKE does not rely on static ARP entries for such things. At some point we will support ARP table commands but its not been a priority for us. But we are always open to contributions and looks like the required NETLINK commands to make ip neighbor add work will be the following ones

RTM_NEWNEIGH, RTM_DELNEIGH, RTM_GETNEIGH

There is a pull request for RTM_*NEIGH #6623 which is basically finished.

@pkit
Contributor

pkit commented Oct 31, 2021

@crappycrypto that's good to hear. But I think the interface-level ioctls need to be implemented too for the arp command to work as expected.

@crappycrypto

crappycrypto commented Oct 31, 2021

The pull request fixes the iproute2-based ip neigh commands for adding and removing ARP entries. The net-tools-based arp command, however, does indeed require SIOCDARP, SIOCSARP and /proc/net/arp.

UPDATE: removed the "latest release" qualifier, since the arp command has required these two ioctls and /proc since before 2000.

@pkit
Contributor

pkit commented Oct 31, 2021

Unfortunately "running the latest release" is not an option if we want to run existing code. Otherwise you're right indeed.

pkit pushed a commit to pkit/gvisor that referenced this issue Oct 31, 2021
copy and setup PERMANENT (static) ARP entries
from CNI namespace to the sandbox

Fixes google#3301
@pkit
Contributor

pkit commented Oct 31, 2021

See #6803
I checked it on an actual amazon-vpc-cni-k8s setup, and it indeed fixes the problem described here.
