
DNS fails on gVisor using netstack on EKS #3301

Closed
moehajj opened this issue Jul 20, 2020 · 48 comments · Fixed by #6803 or #6815
Labels: area: integration (Issue related to third party integrations), area: networking (Issue related to networking), type: bug (Something isn't working)

Comments

@moehajj

moehajj commented Jul 20, 2020

Description

I'm deploying Pods on my EKS cluster using the gVisor runtime; however, outbound network requests fail while inbound requests succeed. The issue is mitigated by using network=host in the runsc config options.

Steps to reproduce

  1. I created a two-node EKS cluster, configured one node to use containerd as its container runtime (CRI), and configured the gVisor runtime with containerd (following this tutorial). I also labeled the node I selected for gVisor with app=gvisor.

EKS cluster nodes (you can see the first node using containerd as its container runtime):

kubectl get nodes -o wide
NAME                                           STATUS   ROLES    AGE    VERSION                INTERNAL-IP      EXTERNAL-IP     OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
ip-192-168-31-136.us-west-2.compute.internal   Ready    <none>   3d1h   v1.16.12-eks-904af05   192.168.31.136   35.161.102.17   Amazon Linux 2   4.14.181-142.260.amzn2.x86_64   containerd://1.3.2
ip-192-168-60-139.us-west-2.compute.internal   Ready    <none>   3d1h   v1.16.12-eks-904af05   192.168.60.139   44.230.198.56   Amazon Linux 2   4.14.181-142.260.amzn2.x86_64   docker://19.3.6

runsc config on gVisor node:

[ec2-user@ip-192-168-31-136 ~]$ ls /etc/containerd/
config.toml  runsc.toml
[ec2-user@ip-192-168-31-136 ~]$ cat /etc/containerd/config.toml 
disabled_plugins = ["restart"]
[plugins.linux]
  shim_debug = true
[plugins.cri.containerd.runtimes.runsc]
  runtime_type = "io.containerd.runsc.v1"
[plugins.cri.containerd.runtimes.runsc.options]
  TypeUrl = "io.containerd.runsc.v1.options"
  ConfigPath = "/etc/containerd/runsc.toml"
[ec2-user@ip-192-168-31-136 ~]$ cat /etc/containerd/runsc.toml 
[runsc_config]
  debug="true"
  strace="true"
  log-packets="true"
  debug-log="/tmp/runsc/%ID%/"
  2. I applied a gVisor runtime class to my cluster:
cat << EOF | tee gvisor-runtime.yaml 
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
EOF

kubectl apply -f gvisor-runtime.yaml
  3. And ran a simple nginx Pod using the gvisor runtime:
cat << EOF | tee nginx-gvisor.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: nginx-gvisor
spec:
  containers:
  - name: my-nginx
    image: nginx
    ports:                    
    - containerPort: 80
  nodeSelector:
    app: gvisor
  runtimeClassName: gvisor
EOF

kubectl create -f nginx-gvisor.yaml

To verify the Pod is running with gVisor:

# Get the container ID
kubectl get pod nginx-gvisor -o jsonpath='{.status.containerStatuses[0].containerID}' 
containerd://9f71a133fc27c3a305710552489c16977d5c48cd40f31810c2010dac393c5ba7

# List containers running with runsc on the gVisor node
[ec2-user@ip-192-168-31-136 gvisor]$ sudo env "PATH=$PATH" runsc --root /run/containerd/runsc/k8s.io list -quiet
9411dfee3811da9dd45e8681f697bcf5326173d6510238ce70beb02ffe00f444
9f71a133fc27c3a305710552489c16977d5c48cd40f31810c2010dac393c5ba7 
  4. To test the inbound network traffic of the Pod, I simply curled port 80 of the Pod and it succeeded.
    To test the outbound network traffic of the Pod, I did the following:
kubectl exec --stdin --tty nginx-gvisor -- /bin/bash
root@nginx-gvisor:/# apt-get update
Err:1 http://security.debian.org/debian-security buster/updates InRelease
  Temporary failure resolving 'security.debian.org'
Err:2 http://deb.debian.org/debian buster InRelease
  Temporary failure resolving 'deb.debian.org'
Err:3 http://deb.debian.org/debian buster-updates InRelease
  Temporary failure resolving 'deb.debian.org'
Reading package lists... Done
W: Failed to fetch http://deb.debian.org/debian/dists/buster/InRelease  Temporary failure resolving 'deb.debian.org'
W: Failed to fetch http://security.debian.org/debian-security/dists/buster/updates/InRelease  Temporary failure resolving 'security.debian.org'
W: Failed to fetch http://deb.debian.org/debian/dists/buster-updates/InRelease  Temporary failure resolving 'deb.debian.org'
W: Some index files failed to download. They have been ignored, or old ones used instead.

You can see that it fails. Other attempts such as wget www.google.com fail as well.
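As an aside, a minimal standalone Go sketch (illustrative only; it assumes the 10.100.0.10 nameserver from /etc/resolv.conf below) that queries the cluster resolver directly helps separate a DNS failure from a wider connectivity failure:

// Hedged sketch: query the cluster DNS server directly from inside the Pod to
// distinguish "resolver unreachable" from "HTTP unreachable".
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	r := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			// Force all lookups to the cluster resolver from /etc/resolv.conf.
			return d.DialContext(ctx, "udp", "10.100.0.10:53")
		},
	}
	addrs, err := r.LookupHost(context.Background(), "deb.debian.org")
	fmt.Println(addrs, err) // with netstack this times out; with network=host it resolves
}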

For debugging purposes, these are the DNS and routing tables (without net-tools, since I couldn't install it) in the Pod container:

root@nginx-gvisor:/# cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local us-west-2.compute.internal
nameserver 10.100.0.10
options ndots:5
root@nginx-gvisor:/# cat /proc/net/route
Iface   Destination     Gateway Flags   RefCnt  Use     Metric  Mask    MTU     Window  IRTT
eth0    0101FEA9        00000000        0001    0       0       0       FFFFFFFF        0       0       0
eth0    00000000        0101FEA9        0003    0       0       0       00000000        0       0       0  
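For reference, the Destination and Gateway columns are little-endian hex, so 0101FEA9 is 169.254.1.1: the table has a /32 host route to 169.254.1.1 plus a default route via 169.254.1.1. A throwaway Go snippet (purely illustrative) to decode such values:

// Decode the little-endian hex addresses printed by /proc/net/route.
package main

import (
	"encoding/binary"
	"fmt"
	"net"
	"strconv"
)

func main() {
	for _, h := range []string{"0101FEA9", "00000000"} {
		v, _ := strconv.ParseUint(h, 16, 32)
		ip := make(net.IP, 4)
		binary.LittleEndian.PutUint32(ip, uint32(v))
		fmt.Println(h, "->", ip) // 0101FEA9 -> 169.254.1.1
	}
}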

I also captured the tcpdump packets on the ENI network interface for the Pod allocated by EKS:
eni567d651201a.nohost.tcpdump.tar.gz.
Details about the network interface:

[ec2-user@ip-192-168-31-136 ~]$ ifconfig
eni567d651201a: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet6 fe80::4cfa:44ff:fe5d:9495  prefixlen 64  scopeid 0x20<link>
        ether 4e:fa:44:5d:94:95  txqueuelen 0  (Ethernet)
        RX packets 3  bytes 270 (270.0 B)
        RX errors 0  dropped 2  overruns 0  frame 0
        TX packets 5  bytes 446 (446.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

I also captured runsc debug information for the containers in the Pod:
9f71a133fc27c3a305710552489c16977d5c48cd40f31810c2010dac393c5ba7.tar.gz
9411dfee3811da9dd45e8681f697bcf5326173d6510238ce70beb02ffe00f444.tar.gz

  5. Now, to verify that it works when the Pod uses the host network, I added network="host" to /etc/containerd/runsc.toml and restarted containerd. I reran the same experiment as above, with the following results:

Verify running Pod:

# Get the container ID
kubectl get pod nginx-gvisor -o jsonpath='{.status.containerStatuses[0].containerID}' 
containerd://e4ec52fdad3e889bf386b1eca03e231ad53e0452e4bc623282732eba0d2da720

# List containers running with runsc on the gVisor node
[ec2-user@ip-192-168-31-136 gvisor]$ sudo env "PATH=$PATH" runsc --root /run/containerd/runsc/k8s.io list -quiet
96198907b56174067a1aa2b9c0fa3644670675b25fa28a7b44234fc232cccd5d
e4ec52fdad3e889bf386b1eca03e231ad53e0452e4bc623282732eba0d2da720 

Successful inbound with curl, and successful outbound as follows:

kubectl exec --stdin --tty nginx-gvisor -- /bin/bash
root@nginx-gvisor:/# apt-get update
Get:1 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Get:2 http://security.debian.org/debian-security buster/updates/main amd64 Packages [213 kB]
Get:3 http://deb.debian.org/debian buster InRelease [121 kB]
Get:4 http://deb.debian.org/debian buster-updates InRelease [51.9 kB]
Get:5 http://deb.debian.org/debian buster/main amd64 Packages [7905 kB]
Get:6 http://deb.debian.org/debian buster-updates/main amd64 Packages [7868 B]
Fetched 8364 kB in 6s (1462 kB/s)
Reading package lists... Done

DNS and routing table (with net-tools this time) on Pod:

root@nginx-gvisor:/# cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local us-west-2.compute.internal
nameserver 10.100.0.10
options ndots:5
root@nginx-gvisor:/# cat /proc/net/route
Iface   Destination     Gateway Flags   RefCnt  Use     Metric  Mask    MTU     Window  IRTT
eth0    00000000        0101FEA9        0003    0       0       0       00000000        0       0       0
eth0    0101FEA9        00000000        0001    0       0       0       FFFFFFFF        0       0       0
eth0    751FA8C0        00000000        0001    0       0       0       FFFFFFFF        0       0       0
root@nginx-gvisor:/# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         169.254.1.1     0.0.0.0         UG    0      0        0 eth0
169.254.1.1     0.0.0.0         255.255.255.255 U     0      0        0 eth0
192.168.31.117  0.0.0.0         255.255.255.255 U     0      0        0 eth0

TCPDump file:
eni567d651201a.host.tcpdump.tar.gz
Details about the network interface:

[ec2-user@ip-192-168-31-136 ~]$ ifconfig
eni567d651201a: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet6 fe80::58a9:b5ff:feda:27e5  prefixlen 64  scopeid 0x20<link>
        ether 5a:a9:b5:da:27:e5  txqueuelen 0  (Ethernet)
        RX packets 10  bytes 796 (796.0 B)
        RX errors 0  dropped 2  overruns 0  frame 0
        TX packets 5  bytes 446 (446.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

runsc debug files:
96198907b56174067a1aa2b9c0fa3644670675b25fa28a7b44234fc232cccd5d.tar.gz
e4ec52fdad3e889bf386b1eca03e231ad53e0452e4bc623282732eba0d2da720.tar.gz

Environment

Please include the following details of your environment:

  • runsc -version
[ec2-user@ip-192-168-31-136 ~]$ runsc -version
runsc version release-20200622.1-171-gc66991ad7de6
spec: 1.0.1-dev
  • kubectl version and kubectl get nodes -o wide
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.6-beta.0", GitCommit:"e7f962ba86f4ce7033828210ca3556393c377bcc", GitTreeState:"clean", BuildDate:"2020-01-15T08:26:26Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.8-eks-fd1ea7", GitCommit:"fd1ea7c64d0e3ccbf04b124431c659f65330562a", GitTreeState:"clean", BuildDate:"2020-05-28T19:06:00Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}

$ kubectl get nodes -o wide                                                            
NAME                                           STATUS   ROLES    AGE    VERSION                INTERNAL-IP      EXTERNAL-IP     OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
ip-192-168-31-136.us-west-2.compute.internal   Ready    <none>   3d3h   v1.16.12-eks-904af05   192.168.31.136   35.161.102.17   Amazon Linux 2   4.14.181-142.260.amzn2.x86_64   containerd://1.3.2
ip-192-168-60-139.us-west-2.compute.internal   Ready    <none>   3d3h   v1.16.12-eks-904af05   192.168.60.139   44.230.198.56   Amazon Linux 2   4.14.181-142.260.amzn2.x86_64   docker://19.3.6
  • uname -a
$ uname -a
Darwin moehajj-C02CJ1ARML7M 19.6.0 Darwin Kernel Version 19.6.0: Sun Jul  5 00:43:10 PDT 2020; root:xnu-6153.141.1~9/RELEASE_X86_64 x86_64
@hbhasker
Contributor

Offhand, looking at the tcpdump, it looks like the runsc routing table/lookup is somehow incorrect: Netstack is trying to resolve 169.254.1.1 by sending an ARP query and not getting anything back. I will have to set up a cluster to really see what might be going on.

But looking at /proc/net/route, I see that runsc may not be sorting the routes correctly.

@moehajj
Author

moehajj commented Jul 22, 2020

That was my initial thought as well, given the state of the routing table, but when I looked at the routing table for the Pod running with network=host, the entries seemed identical, apart from an extra entry for the Pod's IP (and I don't see how that would solve the issue).

I examined the tcpdump further, and I noticed that when I run 'apt-get update', the first messages that are sent are:

With network=host: (file eni567d651201a.host.tcpdump.tar.gz, starting line 16)

16:47:57.559711 76:97:f8:00:6c:ab > 5a:a9:b5:da:27:e5, ethertype IPv4 (0x0800), length 100: 192.168.31.117.46138 > 10.100.0.10.53: 64553+ A? deb.debian.org.default.svc.cluster.local. (58)
16:47:57.559769 76:97:f8:00:6c:ab > 5a:a9:b5:da:27:e5, ethertype IPv4 (0x0800), length 100: 192.168.31.117.46138 > 10.100.0.10.53: 41010+ AAAA? deb.debian.org.default.svc.cluster.local. (58)
16:47:57.560625 5a:a9:b5:da:27:e5 > 76:97:f8:00:6c:ab, ethertype IPv4 (0x0800), length 193: 10.100.0.10.53 > 192.168.31.117.46138: 64553 NXDomain*- 0/1/0 (151)
16:47:57.560634 5a:a9:b5:da:27:e5 > 76:97:f8:00:6c:ab, ethertype IPv4 (0x0800), length 193: 10.100.0.10.53 > 192.168.31.117.46138: 41010 NXDomain*- 0/1/0 (151)
[...]

You can see that the Pod contacts the DNS server directly, without sending any ARP requests.

Using netstack, and without network=host (file eni567d651201a.nohost.tcpdump.tar.gz, starting line 16)

16:25:05.990433 e6:fb:d5:ec:b4:08 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 169.254.1.1 tell 192.168.16.114, length 28
16:25:06.990595 e6:fb:d5:ec:b4:08 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 169.254.1.1 tell 192.168.16.114, length 28
16:25:07.990751 e6:fb:d5:ec:b4:08 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 169.254.1.1 tell 192.168.16.114, length 28

ARP messages are sent out from the eth0 interface (hwaddr: e6:fb:d5:ec:b4:08) on the Pod.
I would expect that, since both have the same DNS resolver configuration (/etc/resolv.conf) and very similar routing tables (/proc/net/route), they should behave the same and the requests to the DNS server would be sent out. This is why I think it might be an issue with netstack DNS.

Note that I tried it on a single-node kubernetes cluster running on my local machine and things worked fine, but when I ran the same setup on EKS it broke.

@ianlewis ianlewis added area: networking Issue related to networking type: bug Something isn't working labels Jul 22, 2020
@hbhasker
Contributor

hbhasker commented Jul 22, 2020

If you look at the routes, the order is different. With host networking the default route is first, but I think runsc is printing it second.

I believe what is happening is that we scrape the routes per interface below and send them over urpc to the Sentry:

Routes: routes,

The Sentry then proceeds to install the routes without sorting them in any particular order, which means the routes end up installed in interface order:

n.Stack.SetRouteTable(routes)
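A very rough sketch of one possible fix (not actual runsc code, just illustrating the idea against the []tcpip.Route slice passed above) would be to sort the scraped routes by prefix length so that more-specific routes are matched before the default route:

// Sketch only: order routes most-specific first so the default route
// (prefix length 0) is consulted last, approximating Linux longest-prefix match.
sort.SliceStable(routes, func(i, j int) bool {
	return routes[i].Destination.Prefix() > routes[j].Destination.Prefix()
})
n.Stack.SetRouteTable(routes)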

@hbhasker
Contributor

That said, I am curious why that 169.254.1.1 route exists. I am going to have to run this myself and poke around.

@ianlewis
Contributor

EKS has some interesting bits regarding its CNI plugin implementation. I'm not sure it's relevant yet, but they may be making assumptions that don't hold true for gVisor sandboxes.
https://github.com/aws/amazon-vpc-cni-k8s

@moehajj
Author

moehajj commented Jul 22, 2020

@hbhasker I believe the 169.254... route exists because EKS has a metadata service running at 169.254.169.254; you can read more about it here.
@ianlewis could you highlight which bits/assumptions you're referring to?

@iangudger
Contributor

Netstack uses route order to determine priority. Linux uses a more complicated algorithm. We have talked about implementing it in runsc and having runsc generate the netstack routing table.

@ianlewis
Contributor

@moehajj I was mostly speculating. I just know they use ENIs and the ipamd daemon to assign addresses, which is a bit different from most CNI plugins, and this is the first I've heard of someone running runsc on EKS.

It sounds, though, like the ordering/priority of the routes is the more likely culprit.

@hbhasker
Contributor

Actually I am not so sure. Netstack seems to be doing the right thing. It picked the default route and is trying to resolve the link address of the default gateway. I think EKS might be adding an arp entry for 169.254.1.1 as part of the setup.

Could you dump the state of the ARP table in the container's namespace? My guess is we will find an entry there which netstack is not aware of.

@hbhasker
Contributor

Looking at the Calico docs, for example:

Why can’t I see the 169.254.1.1 address mentioned above on my host?
Calico tries hard to avoid interfering with any other configuration on the host. Rather than adding the gateway address to the host side of each workload interface, Calico sets the proxy_arp flag on the interface. This makes the host behave like a gateway, responding to ARPs for 169.254.1.1 without having to actually allocate the IP address to the interface.

I wonder if EKS does something similar.
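For reference, that Calico-style approach boils down to a single sysctl on the host side of the workload interface; a hypothetical standalone illustration (the interface name is just the ENI from the ifconfig output above):

// Hypothetical host-side step: enable proxy ARP so the host answers ARP for
// 169.254.1.1 on the workload's veth without owning that address.
package main

import (
	"log"
	"os"
)

func main() {
	if err := os.WriteFile("/proc/sys/net/ipv4/conf/eni567d651201a/proxy_arp", []byte("1"), 0644); err != nil {
		log.Fatal(err)
	}
}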

@hbhasker
Contributor

https://www.slideshare.net/AmazonWebServices/kubernetes-networking-in-amazon-eks-con412-aws-reinvent-2018
That presentation mentions static arp for 169.254.1.1.

I think that's why host mode works but netstack doesn't.

@hbhasker
Contributor

I believe runsc needs to add support for the RTM_NEWNEIGH netlink command.

https://man7.org/linux/man-pages/man7/rtnetlink.7.html

@hbhasker
Contributor

Or scrape any ARP table entries in the namespace and forward them to runsc at startup so that it installs the same ones in its internal ARP cache.
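A rough sketch of what that scraping could look like (this is not runsc code; it only parses the namespace's /proc/net/arp for permanent entries):

// Sketch only: collect permanent (static) ARP entries from the namespace's
// /proc/net/arp so they could be forwarded to the sandbox at startup.
package main

import (
	"fmt"
	"os"
	"strings"
)

func scrapeStaticARP(path string) (map[string]string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	entries := map[string]string{} // IP address -> link address
	for _, line := range strings.Split(string(data), "\n")[1:] { // skip header row
		f := strings.Fields(line)
		// Columns: IP, HW type, Flags, HW address, Mask, Device.
		// Flags 0x6 == ATF_COM|ATF_PERM, i.e. a complete, permanent entry.
		if len(f) >= 6 && f[2] == "0x6" {
			entries[f[0]] = f[3]
		}
	}
	return entries, nil
}

func main() {
	entries, err := scrapeStaticARP("/proc/net/arp")
	fmt.Println(entries, err)
}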

@moehajj
Author

moehajj commented Jul 22, 2020

I checked the ARP table in /proc/net/arp and with arp -a on a Pod running on netstack and a Pod using the host network, and both ARP tables are empty. Frankly, the files under /proc/net are quite similar; I'll include the output below.

network=host

$ cat /proc/net/*
IP address       HW type     Flags       HW address            Mask     Device
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
  eth0: 5350623    2007    0    0    0     0          0         0    92085    1324    0    1    0     0       0          0
fe800000000000003017edfffe008536 03 40 00 c0     eth0
00000000000000000000000000000001 01 80 00 80       lo
sk       Eth Pid    Groups   Rmem     Wmem     Dump     Locks     Drops     Inode
TcpExt: SyncookiesSent SyncookiesRecv SyncookiesFailed EmbryonicRsts PruneCalled RcvPruned OfoPruned OutOfWindowIcmps LockDroppedIcmps ArpFilter TW TWRecycled TWKilled PAWSPassive PAWSActive PAWSEstab DelayedACKs DelayedACKLocked DelayedACKLost ListenOverflows ListenDrops TCPPrequeued TCPDirectCopyFromBacklog TCPDirectCopyFromPrequeue TCPPrequeueDropped TCPHPHits TCPHPHitsToUser TCPPureAcks TCPHPAcks TCPRenoRecovery TCPSackRecovery TCPSACKReneging TCPFACKReorder TCPSACKReorder TCPRenoReorder TCPTSReorder TCPFullUndo TCPPartialUndo TCPDSACKUndo TCPLossUndo TCPLostRetransmit TCPRenoFailures TCPSackFailures TCPLossFailures TCPFastRetrans TCPForwardRetrans TCPSlowStartRetrans TCPTimeouts TCPLossProbes TCPLossProbeRecovery TCPRenoRecoveryFail TCPSackRecoveryFail TCPSchedulerFailed TCPRcvCollapsed TCPDSACKOldSent TCPDSACKOfoSent TCPDSACKRecv TCPDSACKOfoRecv TCPAbortOnData TCPAbortOnClose TCPAbortOnMemory TCPAbortOnTimeout TCPAbortOnLinger TCPAbortFailed TCPMemoryPressures TCPSACKDiscard TCPDSACKIgnoredOld TCPDSACKIgnoredNoUndo TCPSpuriousRTOs TCPMD5NotFound TCPMD5Unexpected TCPMD5Failure TCPSackShifted TCPSackMerged TCPSackShiftFallback TCPBacklogDrop TCPMinTTLDrop TCPDeferAcceptDrop IPReversePathFilter TCPTimeWaitOverflow TCPReqQFullDoCookies TCPReqQFullDrop TCPRetransFail TCPRcvCoalesce TCPOFOQueue TCPOFODrop TCPOFOMerge TCPChallengeACK TCPSYNChallenge TCPFastOpenActive TCPFastOpenActiveFail TCPFastOpenPassive TCPFastOpenPassiveFail TCPFastOpenListenOverflow TCPFastOpenCookieReqd TCPSpuriousRtxHostQueues BusyPollRxPackets TCPAutoCorking TCPFromZeroWindowAdv TCPToZeroWindowAdv TCPWantZeroWindowAdv TCPSynRetrans TCPOrigDataSent TCPHystartTrainDetect TCPHystartTrainCwnd TCPHystartDelayDetect TCPHystartDelayCwnd TCPACKSkippedSynRecv TCPACKSkippedPAWS TCPACKSkippedSeq TCPACKSkippedFinWait2 TCPACKSkippedTimeWait TCPACKSkippedChallenge TCPWinProbe TCPKeepAlive TCPMTUPFail TCPMTUPSuccess
sk       RefCnt Type Proto  Iface R Rmem   User   Inode
protocol  size sockets  memory press maxhdr  slab module     cl co di ac io in de sh ss gs se re sp bi br ha uh gp em
000003e8 00000040 000f4240 3b9aca00
Type Device      Function
Iface	Destination	Gateway	Flags	RefCnt	Use	Metric	Mask	MTU	Window	IRTT                                                         
eth0	00000000	0101FEA9	0003	0	0	0	00000000	0	0	0                                                                               
eth0	0101FEA9	00000000	0001	0	0	0	FFFFFFFF	0	0	0                                                                               
eth0	E412A8C0	00000000	0001	0	0	0	FFFFFFFF	0	0	0                                                                               
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 1 255 2000 0 0 0 0 0 2000 1311 0 0 0 0 0 0 0 0 0
Icmp: InMsgs InErrors InCsumErrors InDestUnreachs InTimeExcds InParmProbs InSrcQuenchs InRedirects InEchos InEchoReps InTimestamps InTimestampReps InAddrMasks InAddrMaskReps OutMsgs OutErrors OutDestUnreachs OutTimeExcds OutParmProbs OutSrcQuenchs OutRedirects OutEchos OutEchoReps OutTimestamps OutTimestampReps OutAddrMasks OutAddrMaskReps
Icmp: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
IcmpMsg:
IcmpMsg:
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts InCsumErrors
Tcp: 1 200 120000 -1 2 0 0 0 0 1977 1288 0 0 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
Udp: 23 0 0 23 0 0 0 0
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
UdpLite: 0 0 0 0 0 0 0 0
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode                                                     
  sl  local_address                         remote_address                        st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops             
  sl  local_address                         remote_address                        st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
Num       RefCount Protocol Flags    Type St Inode Path

network=netstack

$ cat /proc/net/*
IP address       HW type     Flags       HW address            Mask     Device
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
  eth0:     304       4    0    0    0     0          0         0        0       0    0    0    0     0       0          0
00000000000000000000000000000001 01 80 00 00       lo
fe80000000000000f47f03fffe459b4d 02 80 00 00     eth0
sk       Eth Pid    Groups   Rmem     Wmem     Dump     Locks     Drops     Inode
TcpExt: SyncookiesSent SyncookiesRecv SyncookiesFailed EmbryonicRsts PruneCalled RcvPruned OfoPruned OutOfWindowIcmps LockDroppedIcmps ArpFilter TW TWRecycled TWKilled PAWSPassive PAWSActive PAWSEstab DelayedACKs DelayedACKLocked DelayedACKLost ListenOverflows ListenDrops TCPPrequeued TCPDirectCopyFromBacklog TCPDirectCopyFromPrequeue TCPPrequeueDropped TCPHPHits TCPHPHitsToUser TCPPureAcks TCPHPAcks TCPRenoRecovery TCPSackRecovery TCPSACKReneging TCPFACKReorder TCPSACKReorder TCPRenoReorder TCPTSReorder TCPFullUndo TCPPartialUndo TCPDSACKUndo TCPLossUndo TCPLostRetransmit TCPRenoFailures TCPSackFailures TCPLossFailures TCPFastRetrans TCPForwardRetrans TCPSlowStartRetrans TCPTimeouts TCPLossProbes TCPLossProbeRecovery TCPRenoRecoveryFail TCPSackRecoveryFail TCPSchedulerFailed TCPRcvCollapsed TCPDSACKOldSent TCPDSACKOfoSent TCPDSACKRecv TCPDSACKOfoRecv TCPAbortOnData TCPAbortOnClose TCPAbortOnMemory TCPAbortOnTimeout TCPAbortOnLinger TCPAbortFailed TCPMemoryPressures TCPSACKDiscard TCPDSACKIgnoredOld TCPDSACKIgnoredNoUndo TCPSpuriousRTOs TCPMD5NotFound TCPMD5Unexpected TCPMD5Failure TCPSackShifted TCPSackMerged TCPSackShiftFallback TCPBacklogDrop TCPMinTTLDrop TCPDeferAcceptDrop IPReversePathFilter TCPTimeWaitOverflow TCPReqQFullDoCookies TCPReqQFullDrop TCPRetransFail TCPRcvCoalesce TCPOFOQueue TCPOFODrop TCPOFOMerge TCPChallengeACK TCPSYNChallenge TCPFastOpenActive TCPFastOpenActiveFail TCPFastOpenPassive TCPFastOpenPassiveFail TCPFastOpenListenOverflow TCPFastOpenCookieReqd TCPSpuriousRtxHostQueues BusyPollRxPackets TCPAutoCorking TCPFromZeroWindowAdv TCPToZeroWindowAdv TCPWantZeroWindowAdv TCPSynRetrans TCPOrigDataSent TCPHystartTrainDetect TCPHystartTrainCwnd TCPHystartDelayDetect TCPHystartDelayCwnd TCPACKSkippedSynRecv TCPACKSkippedPAWS TCPACKSkippedSeq TCPACKSkippedFinWait2 TCPACKSkippedTimeWait TCPACKSkippedChallenge TCPWinProbe TCPKeepAlive TCPMTUPFail TCPMTUPSuccess
sk       RefCnt Type Proto  Iface R Rmem   User   Inode
protocol  size sockets  memory press maxhdr  slab module     cl co di ac io in de sh ss gs se re sp bi br ha uh gp em
000003e8 00000040 000f4240 3b9aca00
Type Device      Function
Iface	Destination	Gateway	Flags	RefCnt	Use	Metric	Mask	MTU	Window	IRTT                                                         
eth0	0101FEA9	00000000	0001	0	0	0	FFFFFFFF	0	0	0                                                                               
eth0	00000000	0101FEA9	0003	0	0	0	00000000	0	0	0                                                                               
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 0 0 4 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Icmp: InMsgs InErrors InCsumErrors InDestUnreachs InTimeExcds InParmProbs InSrcQuenchs InRedirects InEchos InEchoReps InTimestamps InTimestampReps InAddrMasks InAddrMaskReps OutMsgs OutErrors OutDestUnreachs OutTimeExcds OutParmProbs OutSrcQuenchs OutRedirects OutEchos OutEchoReps OutTimestamps OutTimestampReps OutAddrMasks OutAddrMaskReps
Icmp: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
IcmpMsg:
IcmpMsg:
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts InCsumErrors
Tcp: 1 200 120000 -1 0 0 0 0 0 0 0 0 0 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
Udp: 0 0 0 0 0 0 0 0
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
UdpLite: 0 0 0 0 0 0 0 0
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode                                                     
  sl  local_address                         remote_address                        st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops             
  sl  local_address                         remote_address                        st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
Num       RefCount Protocol Flags    Type St Inode Path

The output follows the order of the files on the Pod:

$ ls /proc/net
arp  dev  if_inet6  ipv6_route  netlink  netstat  packet  protocols  psched  ptype  route  snmp  tcp  tcp6  udp  udp6  unix

@hbhasker
Contributor

That is rather strange, because without ARP I am not sure how the host-network case is working. The routes say that 169.254.1.1 is the default gateway, which means it needs the link address of the gateway before it can send packets to non-local destinations.

@moehajj
Author

moehajj commented Jul 22, 2020

Yeah, I find it strange as well; it seems like a runsc issue, since Pods running with runc have the 169.254.1.1 static ARP entry. I was following the AWS CNI proposal to debug further and get a better understanding of what's going on. I reached a dead end, but I noticed a few things that might be useful.

So I set up 2 nodes, both using runsc, but one configured with network=host and the other with the default netstack network. I ran two instances of the same Ubuntu image with net-tools installed (robertxie/ubuntu-nettools), one on each configuration.

kubectl get pods -o wide
NAME                  READY   STATUS    RESTARTS   AGE   IP               NODE                                           NOMINATED NODE   READINESS GATES
ubuntu-net-host       1/1     Running   0          91s   192.168.53.193   ip-192-168-60-139.us-west-2.compute.internal   <none>           <none>
ubuntu-net-netstack   1/1     Running   0          3s    192.168.1.182    ip-192-168-31-136.us-west-2.compute.internal   <none>           <none>
  1. I noticed that the pod that is using the netstack network does not get an egress ip rule assigned to it, while pods running with host network (using runsc or runc) do.

On node with runsc using host network

ip route show
default via 192.168.32.1 dev eth0 
169.254.169.254 dev eth0 
192.168.32.0/19 dev eth0 proto kernel scope link src 192.168.60.139 
192.168.32.217 dev enid6c09ee496d scope link 
192.168.35.91 dev enidcb5860247f scope link 
192.168.37.181 dev eni564327aa972 scope link 
192.168.49.88 dev enia9ad1fc6e5f scope link 
192.168.53.193 dev enid9cb3177b0e scope link 		<——
192.168.54.153 dev enibdd59383046 scope link 
192.168.55.67 dev enie9afb6b6f81 scope link 
192.168.56.254 dev eni19639745f02 scope link 
192.168.58.249 dev eni48d32331e45 scope link 

ip rule list
0:      from all lookup local 
512:    from all to 192.168.56.235 lookup main 
512:    from all to 192.168.62.219 lookup main 
512:    from all to 192.168.63.206 lookup main 
512:    from all to 192.168.34.124 lookup main 
512:    from all to 192.168.45.140 lookup main 
512:    from all to 192.168.43.82 lookup main 
512:    from all to 192.168.49.190 lookup main 
512:    from all to 192.168.51.111 lookup main 
512:    from all to 192.168.55.67 lookup main 
512:    from all to 192.168.49.88 lookup main 
512:    from all to 192.168.54.153 lookup main 
512:    from all to 192.168.58.249 lookup main 
512:    from all to 192.168.35.91 lookup main 
512:    from all to 192.168.56.254 lookup main 
512:    from all to 192.168.32.217 lookup main 
512:    from all to 192.168.37.181 lookup main 
512:    from all to 192.168.53.193 lookup main 		<—— 
1024:   from all fwmark 0x80/0x80 lookup main 
1536:   from 192.168.49.88 to 192.168.0.0/16 lookup 2 
1536:   from 192.168.54.153 to 192.168.0.0/16 lookup 2 
1536:   from 192.168.58.249 to 192.168.0.0/16 lookup 2 
1536:   from 192.168.35.91 to 192.168.0.0/16 lookup 2 
1536:   from 192.168.56.254 to 192.168.0.0/16 lookup 2 
1536:   from 192.168.32.217 to 192.168.0.0/16 lookup 2 
1536:   from 192.168.37.181 to 192.168.0.0/16 lookup 2 
1536:   from 192.168.53.193 to 192.168.0.0/16 lookup 2 		<—— 
32766:  from all lookup main 
32767:  from all lookup default 

ip route show table 2
default via 192.168.32.1 dev eth1 
192.168.32.1 dev eth1 scope link 

On node with runsc using netstack network

ip route show
default via 192.168.0.1 dev eth0 
169.254.169.254 dev eth0 
192.168.0.0/19 dev eth0 proto kernel scope link src 192.168.31.136 
192.168.1.182 dev enif3e00791c23 scope link 		<—-
192.168.18.89 dev enic3634746f05 scope link 
192.168.25.50 dev eni8c7e75e6afd scope link 
192.168.27.227 dev enia205ab59220 scope link 

ip rule list
0:      from all lookup local 
512:    from all to 192.168.26.139 lookup main 
512:    from all to 192.168.27.227 lookup main 
512:    from all to 192.168.25.50 lookup main 
512:    from all to 192.168.18.89 lookup main 
512:    from all to 192.168.1.182 lookup main  		<—-
1024:   from all fwmark 0x80/0x80 lookup main 
1536:   from 192.168.25.50 to 192.168.0.0/16 lookup 2 
1536:   from 192.168.18.89 to 192.168.0.0/16 lookup 2 
1536:   from 192.168.27.57 to 192.168.0.0/16 lookup 2 
32766:  from all lookup main 
32767:  from all lookup default 


ip route show table 2
default via 192.168.0.1 dev eth1 
192.168.0.1 dev eth1 scope link 

However, to my disappointment, when I added a rule using sudo ip rule add from 192.168.1.182 to 192.168.0.0/16 lookup 2 priority 1536, egress network access on the Pod still failed.

  2. Another thing I noticed is that I get a lot of ioctl(..) failed messages on the Pod using the host network, but I don't see these messages on the other Pod, specifically when I'm doing things like ifconfig. Also, the eth0 interface on netstack does not show the BROADCAST,MULTICAST flags.

On pod using host network

root@ubuntu-net-host:/# ifconfig
SIOCGIFCONF: Inappropriate ioctl for device
root@ubuntu-net-host:/# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 0 ioctl(SIOCGIFTXQLEN) failed: Inappropriate ioctl for device

    link/loopback 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
    inet 127.0.0.1/8 scope global dynamic 
    inet6 ::1/128 scope global dynamic 
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 0 ioctl(SIOCGIFTXQLEN) failed: Inappropriate ioctl for device

    link/ether 9a:fa:5b:6b:4e:f2 brd ff:ff:ff:ff:ff:ff
    inet 192.168.53.193/32 scope global dynamic 
    inet6 fe80::98fa:5bff:fe6b:4ef2/64 scope global dynamic 

On pod using netstack network

root@ubuntu-net-netstack:/# ifconfig
eth0      Link encap:Ethernet  HWaddr ae:57:05:3f:4d:03  
          inet addr:192.168.1.182  Mask:255.255.255.255
          inet6 addr: fe80::ac57:5ff:fe3f:4d03/128 Scope:Global
          UP RUNNING  MTU:9001  Metric:1
          RX packets:6 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:9001 
          RX bytes:452 (452.0 B)  TX bytes:0 (0.0 B)
          Memory:34d3f0500002329-0 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.255.255.255
          inet6 addr: ::1/128 Scope:Global
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:65536 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
          Memory:10000-0 

root@ubuntu-net-netstack:/# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/32 scope global dynamic 
    inet6 ::1/128 scope global dynamic 
2: eth0: <UP,LOWER_UP> mtu 9001 
    link/ether ae:57:05:3f:4d:03 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.182/32 scope global dynamic 
    inet6 fe80::ac57:5ff:fe3f:4d03/128 scope global dynamic 

I hope this is at all relevant to solving this issue.

@hbhasker were you able to reproduce these results on your own cluster?

@hbhasker
Contributor

I haven't yet gotten around to setting up my own EKS cluster. It will take me some time, as I am not very familiar with EKS or AWS in general. That said, --network=host does not forward all ioctls, and that's probably why you see some failures. Netstack implements some of the ioctls needed for ifconfig, and that's why it works.

All netstack interfaces do support multicast/broadcast but I think we don't set flags appropriately or don't return them correctly for ifconfig to show them.

runsc does a few other things at startup as well: it steals the routes from the host for the interface being handed to runsc and passes them to the sandbox instead. So if you inspect the routes in the namespace in which runsc is running, you may not see all the rules, as some of them have been stolen and handed to runsc at startup (runsc also removes the IP address from the host; otherwise the host would respond to TCP SYNs etc. with RSTs, since it isn't aware of any listening sockets in Netstack).

I will see if I can figure out how to set up EKS and post if I find something. But mostly it looks like we may need to scrape any ARP entries from the namespace and pass them to runsc at startup. From what I can see, the static ARP entry is being installed in the namespace rather than by running a command inside the container (e.g. via docker exec). In that case the ARP cache on the host is updated, but it is invisible to runsc.

@moehajj
Author

moehajj commented Jul 30, 2020

@hbhasker Any follow-up on this? Anything I can help with?

@hbhasker
Contributor

@moehajj Sorry, I haven't been able to work on this yet. That said, if you already have an EKS cluster that I can get access to, it would make my life a lot simpler than having to set one up. I spent some time reading up on EKS but didn't get to the point of actually setting up a cluster.

@moehajj
Author

moehajj commented Jul 31, 2020

@hbhasker I won't be able to give you access to an EKS cluster, but if you can quickly set up an AWS account and spin up a cluster (I found this guide very helpful when I started) I can give you the scripts that do the rest (e.g. install gVisor on nodes).

  1. Create a cluster with SSH access; you need a key pair (eks_key.pem) for your EC2 instances (follow this guide) to use when you SSH:
eksctl create cluster --name gvisor-demo --nodes 2 --region us-west-2 --ssh-access --ssh-public-key eks_key
  2. Then just SSH into the nodes and set up gVisor with containerd. The default Amazon Linux AMI already has containerd installed, so all you need to do is configure the kubelet to use containerd and configure containerd with a gVisor runtime handler.

SSH into first node

export n0_EIP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="ExternalIP")].address}')
ssh -i /path/to/eks_key.pem ec2-user@$n0_EIP

On the node, here is how I configure gVisor with netstack networking:

# Install dependencies
sudo yum install -y git # needed to build gvisor-containerd-shim below

# Install Golang
wget https://dl.google.com/go/go1.14.4.linux-amd64.tar.gz
sudo tar -C /usr/local -xzf go1.14.4.linux-amd64.tar.gz

GOROOT=/usr/local/go
GOPATH=$HOME/go
PATH=/usr/local/go/bin:$HOME/go/bin:$PATH

## Create systemd drop-in for containerd
sudo sed -i 's;--container-runtime=docker;--container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock;' /etc/systemd/system/kubelet.service.d/10-eksclt.al2.conf 
sudo systemctl daemon-reload
sudo systemctl restart kubelet

# Install gVisor runsc

set -e
wget https://storage.googleapis.com/gvisor/releases/nightly/latest/runsc
sudo mv runsc /usr/local/bin
sudo chown root:root /usr/local/bin/runsc
sudo chmod 0755 /usr/local/bin/runsc

# Install gvisor-containerd-shim
git clone https://github.com/google/gvisor-containerd-shim.git
cd gvisor-containerd-shim
make
sudo make install


# Configure the gVisor runtime in containerd (you will still need to create a RuntimeClass and assign pods to runsc)
cat <<EOF | sudo tee /etc/containerd/config.toml
disabled_plugins = ["restart"]
[plugins.linux]
  shim_debug = true
[plugins.cri.containerd.runtimes.runsc]
  runtime_type = "io.containerd.runsc.v1"
[plugins.cri.containerd.runtimes.runsc.options]
  TypeUrl = "io.containerd.runsc.v1.options"
  ConfigPath = "/etc/containerd/runsc.toml"
EOF
#Runsc options config
cat <<EOF | sudo tee /etc/containerd/runsc.toml
[runsc_config]
  debug="true"
  strace="true"
  debug-log="/tmp/runsc/%ID%/"
EOF

# Restart containerd
sudo systemctl restart containerd
  3. From here it's standard gVisor setup. I like to label the nodes I've selected for the gVisor handler and use a nodeSelector in my RuntimeClass:

Label node

export n0_name=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl label node $n0_name runtime=gvisor

Deploy RuntimeClass

cat <<EOF | tee gvisor-runtime.yaml
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
scheduling:
  nodeSelector:
    runtime: gvisor
EOF

kubectl apply -f gvisor-runtime.yaml

I hope this helps reduce the cluster setup overhead!

@moehajj
Author

moehajj commented Jul 31, 2020

@hbhasker Looks like things have started working as of a newer runsc commit.

[ec2-user@ip-192-168-69-37 ~]$ runsc --version
runsc version release-20200622.1-236-g112eb0c5b9e6
spec: 1.0.1-dev

But something broke such that I can't kubectl port-forward a pod running on gVisor. I'll do some more testing to check at which commit things got fixed and when port-forwarding broke (and I'll raise a new issue if needed).

@hbhasker
Contributor

Glad to hear that the latest version worked. We did recently make some forwarding fixes, I think; I will have to go through our commit history and see. Please let me know if you identify the commit causing the regression.

@fvoznika
Member

Re: kubectl port-forward, it doesn't work with runsc because containerd makes assumptions about the container's network that are not true for sandboxes. There are more details here: kubernetes/enhancements#1846

@hbhasker
Contributor

hbhasker commented Aug 3, 2020

@moehajj Can we mark this issue fixed as it looks like your initial issue is now resolved?

@moehajj
Author

moehajj commented Aug 3, 2020

@hbhasker, unfortunately, I've been looking into why things suddenly worked, and now I'm no longer able to reproduce a working version. I tried different runsc commits, different Kubernetes versions on EKS (1.16, 1.17), and different CNI plugin versions (0.7.5, 0.8.6), but using netstack has not been successful. I'm not sure what happened, so I apologize for the false hope. It would be great if we could resume looking into this issue, and if you could try reproducing it on your end.

@moehajj
Author

moehajj commented Aug 6, 2020

I've figured out why things had worked all of a sudden: it was because I had deployed a Calico DaemonSet to enable network policies, and having Calico nodes intercept packets seems to fix the issue. Do you think this might be an issue with the EKS CNI that Calico somehow mended?

@ianlewis ianlewis added the area: integration Issue related to third party integrations label Aug 14, 2020
@amscanne
Contributor

Hey Mohammed, that's a great write-up! Just one small point -- the write-up uses the unmaintained containerd-shim from https://github.com/google/gvisor-containerd-shim.git (see the warning at the top of the repository, and the fact that it is an archived repository).

Since about a year ago (3bb5f71), the shim has been built and shipped with the core repository and is included in releases as well. You can actually just install it directly from the bucket, like runsc itself, e.g. wget https://storage.googleapis.com/gvisor/releases/release/latest/containerd-shim-runsc-v1. This also saves you from needing the Go toolchain for the installation.

@pkit
Contributor

pkit commented Oct 28, 2021

fwiw, I've written a guide on setting up an EKS cluster with gVisor, and a custom runsc version of your choice, as the container runtime. I hope it serves as a helpful starting point :smile:

I'm not sure that PRing your article that has nothing to do with the problem described in this repo is a good idea. Sorry.

@pkit
Contributor

pkit commented Oct 28, 2021

What happens is this: EKS relies on static ARP entries for 169.254.1.1 being present.
A vanilla namespace for the containerd CNI looks like this:

$ sudo ip netns exec cni-661976d9-58c3-ce5e-b781-37ad4d95628f arp -a
gateway (169.254.1.1) at 12:cf:1e:29:a2:df [ether] PERM on eth0
gateway (169.254.1.1) at 12:cf:1e:29:a2:df [ether] PERM on eth0

For gVisor, the ARP table is empty, because nothing regarding ARP is copied from the namespace here.
More than that, gVisor's ARP neighbor handling, described here, is used only in tests.
Bottom line: gVisor does not really expose any static ARP handling API, either to the CNI or to the container itself.
A quick fix would probably be to use that "testing" code to copy the static entries at runsc boot and be done with it.
Will try to do a PoC on that.

From what I can see the static arp is being installed in the namespace rather than by running a command inside the container by doing a docker exec. In such case the arp cache on the host will be updated but it is invisible to runsc.

That assessment was correct. But the "running a command inside the container" part is pretty funny, as that's the first thing I tried.
Namely:

bash-5.1# arp -i eth0 -s 169.254.1.1 be:b2:bf:4c:f9:8d
SIOCSARP: Not a tty
bash-5.1# ip neighbor add 169.254.1.1 lladdr be:b2:bf:4c:f9:8d dev eth0 nud permanent
RTNETLINK answers: Permission denied
bash-5.1# arp -a
bash-5.1# ip neighbor show
RTNETLINK answers: Not supported
Dump terminated

Oops.

This issue dragging on for over a year is pretty interesting, as it means nobody has ever tried to use gVisor on EKS, and a lot of CNI implementations rely on either static ARP entries or proxy ARP (neither of which is supported).
I wonder if Google uses the same gVisor in GKE Sandbox...

@hbhasker
Contributor

@pkit gVisor does not support ARP table manipulation, as the required ioctls and netlink commands are not implemented. GKE uses the same gVisor, but it's not an issue there because GKE does not rely on static ARP entries for such things. At some point we will support ARP table commands, but it's not been a priority for us. That said, we are always open to contributions, and it looks like the netlink commands required to make ip neighbor add work are the following:

RTM_NEWNEIGH, RTM_DELNEIGH, RTM_GETNEIGH

Today we only implement a few of the netlink commands

} else if hdr.Flags&linux.NLM_F_REQUEST == linux.NLM_F_REQUEST {

Also we have no visibility into people using gVisor on EKS. That said proxy ARP should work? As long as there is something on the host that responds to the link address for 169.254.1.1 gVisor should be able to connect to it?
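For whoever picks this up, the request that ip neighbor add generates is small. A standalone illustration of its shape (assuming the NdMsg type and constants from golang.org/x/sys/unix; this is not the sentry's handler):

// Illustration only: the RTM_NEWNEIGH payload is a struct ndmsg followed by
// NDA_DST (the IP) and NDA_LLADDR (the MAC) attributes; the sentry's netlink
// handler would need to accept this message type.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	msg := unix.NdMsg{
		Family:  unix.AF_INET,
		Ifindex: 2,                  // eth0 inside the sandbox (example value)
		State:   unix.NUD_PERMANENT, // static entry, as installed by the CNI
	}
	fmt.Printf("type=RTM_NEWNEIGH(%d) ndmsg=%+v attrs=[NDA_DST(%d), NDA_LLADDR(%d)]\n",
		unix.RTM_NEWNEIGH, msg, unix.NDA_DST, unix.NDA_LLADDR)
}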

@pkit
Contributor

pkit commented Oct 28, 2021

@hbhasker I don't think "online" ARP table manipulation is needed.
Just fetching the static ARP entries (set up by the CNI) somewhere here:

allAddrs, err := iface.Addrs()

And then passing it up here:
func (n *Network) CreateLinksAndRoutes(args *CreateLinksAndRoutesArgs, _ *struct{}) error {

For actual setup using
func (n *neighborCache) addStaticEntry(addr tcpip.Address, linkAddr tcpip.LinkAddress) {

Looks like that should do the trick.
But that's just a theory for now; I've only been reading the gVisor code for 2-3 hours or so.

P.S. Implementing the netlink commands seems like a good idea too, at least to improve visibility. To make the first idea concrete, see the rough sketch below.
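A rough sketch of the boot-time plumbing (names and signatures are approximate; it assumes the stack exposes something like Stack.AddStaticNeighbor, so this is not the eventual PR):

// Sketch only: install a scraped static entry (e.g. 169.254.1.1 -> the veth MAC
// from the arp dump above) into the sandbox's neighbor table while links are created.
gw := tcpip.Address(net.ParseIP("169.254.1.1").To4())
mac := tcpip.LinkAddress("\x12\xcf\x1e\x29\xa2\xdf") // 12:cf:1e:29:a2:df
if err := n.Stack.AddStaticNeighbor(nicID, ipv4.ProtocolNumber, gw, mac); err != nil {
	return fmt.Errorf("adding static neighbor: %v", err)
}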

@hbhasker
Contributor

While this is doable, I am not sure we want to support these one-offs. I have been reviewing the CNI spec (https://github.com/containernetworking/cni/blob/master/SPEC.md), and from what I can see it does not provide for any ARP table manipulation directly. In the case of EKS, I am guessing this is done by a CNI plugin that simply executes arbitrary commands in the namespace to set up the ARP entries.

Supporting the exact commands would be the right way to solve this, rather than doing a one-off for this specific use case. EKS could also support this by properly responding to ARP requests for that IP instead of statically inserting an entry. That is what GKE does, for example, for things like the metadata server, which is usually reached via a link-local address (169.254.169.254; see: https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity).

@pkit
Contributor

pkit commented Oct 29, 2021

I'm not sure why that's a problem, as it's clear to me that gVisor should (or even must) copy the full networking config from the namespace it claims to run in.
Otherwise it should drop the claim of supporting real-world workloads.
We had the exact same problem in ZeroVM, and we were bold enough to state that we could not support it.

TL;DR: not copying the ARP config from a namespace seems like a pretty big compatibility problem to me.

@hbhasker
Contributor

I would not phrase it as a big problem, as it's clearly not a common use case. But that said, maybe it's worth doing just to make gVisor work better with EKS. I will take a stab at implementing it.

@pkit
Contributor

pkit commented Oct 29, 2021

I'm OK with implementing it too, if you don't have time or incentive.

@hbhasker
Contributor

@pkit I will be happy to review if you have cycles to implement it, as I have quite a few higher-priority things on my plate at the moment.

@pkit
Contributor

pkit commented Oct 29, 2021

Cool. We've already started working on it anyway.
I hope to submit a PR soon.

@hbhasker
Contributor

Thanks!

@crappycrypto

@pkit gVisor does not support ARP table manipulation as the required IOCTLs and NETLINK commands are not implemented. GKE uses the same gVisor but its not an issue as GKE does not rely on static ARP entries for such things. At some point we will support ARP table commands but its not been a priority for us. But we are always open to contributions and looks like the required NETLINK commands to make ip neighbor add work will be the following ones

RTM_NEWNEIGH, RTM_DELNEIGH, RTM_GETNEIGH

There is a pull request for RTM_*NEIGH #6623 which is basically finished.

@pkit
Contributor

pkit commented Oct 31, 2021

@crappycrypto that's good to hear. But I think the interface-level ioctls need to be implemented too for the arp command to work as expected.

@crappycrypto

crappycrypto commented Oct 31, 2021

The pull request fixes the iproute2-based ip neigh commands for adding and removing ARP entries. The net-tools-based arp command, however, does indeed require SIOCDARP, SIOCSARP and /proc/net/arp.

UPDATE: removed the "latest release" qualifier, since the arp command has required these two ioctls and /proc since before 2000.

@pkit
Contributor

pkit commented Oct 31, 2021

Unfortunately "running the latest release" is not an option if we want to run existing code. Otherwise you're right indeed.

pkit pushed a commit to pkit/gvisor that referenced this issue Oct 31, 2021
copy and setup PERMANENT (static) ARP entries
from CNI namespace to the sandbox

Fixes google#3301
@pkit
Contributor

pkit commented Oct 31, 2021

See #6803
I checked it on an actual amazon-vpc-cni-k8s setup, and it indeed fixes the problem described here.
