DNS fails on gVisor using netstack on EKS #3301
Offhand, looking at the tcpdump, it looks like the runsc routing table/lookup is somehow incorrect: netstack is trying to resolve 169.254.1.1 by sending an ARP query and not getting anything back. I will have to set up a cluster to really see what might be going on. But looking at /proc/net/route, I suspect runsc is not sorting the routes correctly.
That was my initial thought as well, given the state of the routing table. I examined the tcpdump further and noticed that when I run 'apt-get update', the behavior differs between the two modes. With network=host, the Pod contacts the DNS server directly, without sending any ARP requests. Using netstack, ARP requests are sent out from the eth0 interface (hwaddr: e6:fb:d5:ec:b4:08) on the Pod but never answered. Note that I tried this on a single-node Kubernetes cluster running on my local machine and things worked fine, but when I ran the same setup on EKS it broke.
If you look at the routes, the order is different. With host networking the default route is first, but I think runsc is printing it out second. I believe what is happening is that we scrape the routes per interface (gvisor/runsc/sandbox/network.go, line 201 in a75d9f7) and send them over urpc to the sentry, which then proceeds to install the routes without sorting them in any way (line 295 in a75d9f7). That means the order of routes installed in the sentry will be the order of the interfaces.
That said, I am curious why that 169... route exists. I am going to have to run this myself and poke around.
EKS has some interesting bits in its CNI plugin implementation. I'm not sure it's relevant yet, but they may be making assumptions that don't hold true with gVisor sandboxes.
Netstack uses route order to determine priority. Linux uses a more complicated algorithm. We have talked about implementing it in runsc and having runsc generate the netstack routing table.
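As a quick illustration of the difference (a sketch, assuming it is run in a namespace that has both the default route and the 169.254.1.1/32 route shown later in this thread): Linux picks the longest-prefix match regardless of where a route appears in /proc/net/route, while netstack walks its table in order and takes the first match.
ip route get 8.8.8.8         # matched by the default route
ip route get 169.254.1.1     # matched by the more specific /32 route, wherever it is listed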
@moehajj I was mostly speculating. I just know they use ENI and the ipamd daemon to assign addresses, which is a bit different from most CNI plugins, and this is the first I've heard of someone running runsc on EKS. It sounds, though, like the ordering/priority of the routes is the more likely culprit.
Actually, I am not so sure. Netstack seems to be doing the right thing: it picked the default route and is trying to resolve the link address of the default gateway. I think EKS might be adding an ARP entry for 169.254.1.1 as part of the setup. Could you dump the state of the ARP table in the container's namespace? My guess is we will find an entry for it which netstack is not aware of.
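For reference, one way to dump that table from the EKS node itself, assuming Docker still manages the pause container (the container ID below is a placeholder; with containerd, crictl can provide the PID instead):
# Find the pause container's PID, then enter only its network namespace.
PAUSE_PID=$(docker inspect --format '{{.State.Pid}}' <pause-container-id>)
sudo nsenter --net=/proc/$PAUSE_PID/ns/net ip neigh show
sudo nsenter --net=/proc/$PAUSE_PID/ns/net cat /proc/net/arp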
Looking at the Calico docs, for example: "Why can't I see the 169.254.1.1 address mentioned above on my host?" I wonder if EKS does something similar.
https://www.slideshare.net/AmazonWebServices/kubernetes-networking-in-amazon-eks-con412-aws-reinvent-2018 I think that's why host mode works but netstack doesn't.
I believe runsc needs to add support for the RTM_NEWNEIGH netlink command.
Or scrape any ARP table entries in the namespace and forward them to runsc at startup so that it installs the same ones in its internal ARP cache.
I checked the ARP table with network=host:
$ cat /proc/net/*
IP address HW type Flags HW address Mask Device
Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
lo: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
eth0: 5350623 2007 0 0 0 0 0 0 92085 1324 0 1 0 0 0 0
fe800000000000003017edfffe008536 03 40 00 c0 eth0
00000000000000000000000000000001 01 80 00 80 lo
sk Eth Pid Groups Rmem Wmem Dump Locks Drops Inode
TcpExt: SyncookiesSent SyncookiesRecv SyncookiesFailed EmbryonicRsts PruneCalled RcvPruned OfoPruned OutOfWindowIcmps LockDroppedIcmps ArpFilter TW TWRecycled TWKilled PAWSPassive PAWSActive PAWSEstab DelayedACKs DelayedACKLocked DelayedACKLost ListenOverflows ListenDrops TCPPrequeued TCPDirectCopyFromBacklog TCPDirectCopyFromPrequeue TCPPrequeueDropped TCPHPHits TCPHPHitsToUser TCPPureAcks TCPHPAcks TCPRenoRecovery TCPSackRecovery TCPSACKReneging TCPFACKReorder TCPSACKReorder TCPRenoReorder TCPTSReorder TCPFullUndo TCPPartialUndo TCPDSACKUndo TCPLossUndo TCPLostRetransmit TCPRenoFailures TCPSackFailures TCPLossFailures TCPFastRetrans TCPForwardRetrans TCPSlowStartRetrans TCPTimeouts TCPLossProbes TCPLossProbeRecovery TCPRenoRecoveryFail TCPSackRecoveryFail TCPSchedulerFailed TCPRcvCollapsed TCPDSACKOldSent TCPDSACKOfoSent TCPDSACKRecv TCPDSACKOfoRecv TCPAbortOnData TCPAbortOnClose TCPAbortOnMemory TCPAbortOnTimeout TCPAbortOnLinger TCPAbortFailed TCPMemoryPressures TCPSACKDiscard TCPDSACKIgnoredOld TCPDSACKIgnoredNoUndo TCPSpuriousRTOs TCPMD5NotFound TCPMD5Unexpected TCPMD5Failure TCPSackShifted TCPSackMerged TCPSackShiftFallback TCPBacklogDrop TCPMinTTLDrop TCPDeferAcceptDrop IPReversePathFilter TCPTimeWaitOverflow TCPReqQFullDoCookies TCPReqQFullDrop TCPRetransFail TCPRcvCoalesce TCPOFOQueue TCPOFODrop TCPOFOMerge TCPChallengeACK TCPSYNChallenge TCPFastOpenActive TCPFastOpenActiveFail TCPFastOpenPassive TCPFastOpenPassiveFail TCPFastOpenListenOverflow TCPFastOpenCookieReqd TCPSpuriousRtxHostQueues BusyPollRxPackets TCPAutoCorking TCPFromZeroWindowAdv TCPToZeroWindowAdv TCPWantZeroWindowAdv TCPSynRetrans TCPOrigDataSent TCPHystartTrainDetect TCPHystartTrainCwnd TCPHystartDelayDetect TCPHystartDelayCwnd TCPACKSkippedSynRecv TCPACKSkippedPAWS TCPACKSkippedSeq TCPACKSkippedFinWait2 TCPACKSkippedTimeWait TCPACKSkippedChallenge TCPWinProbe TCPKeepAlive TCPMTUPFail TCPMTUPSuccess
sk RefCnt Type Proto Iface R Rmem User Inode
protocol size sockets memory press maxhdr slab module cl co di ac io in de sh ss gs se re sp bi br ha uh gp em
000003e8 00000040 000f4240 3b9aca00
Type Device Function
Iface Destination Gateway Flags RefCnt Use Metric Mask MTU Window IRTT
eth0 00000000 0101FEA9 0003 0 0 0 00000000 0 0 0
eth0 0101FEA9 00000000 0001 0 0 0 FFFFFFFF 0 0 0
eth0 E412A8C0 00000000 0001 0 0 0 FFFFFFFF 0 0 0
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 1 255 2000 0 0 0 0 0 2000 1311 0 0 0 0 0 0 0 0 0
Icmp: InMsgs InErrors InCsumErrors InDestUnreachs InTimeExcds InParmProbs InSrcQuenchs InRedirects InEchos InEchoReps InTimestamps InTimestampReps InAddrMasks InAddrMaskReps OutMsgs OutErrors OutDestUnreachs OutTimeExcds OutParmProbs OutSrcQuenchs OutRedirects OutEchos OutEchoReps OutTimestamps OutTimestampReps OutAddrMasks OutAddrMaskReps
Icmp: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
IcmpMsg:
IcmpMsg:
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts InCsumErrors
Tcp: 1 200 120000 -1 2 0 0 0 0 1977 1288 0 0 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
Udp: 23 0 0 23 0 0 0 0
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
UdpLite: 0 0 0 0 0 0 0 0
sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode
sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode
sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode ref pointer drops
sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode
Num RefCount Protocol Flags Type St Inode Path
And with network=netstack:
$ cat /proc/net/*
IP address HW type Flags HW address Mask Device
Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
lo: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
eth0: 304 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0
00000000000000000000000000000001 01 80 00 00 lo
fe80000000000000f47f03fffe459b4d 02 80 00 00 eth0
sk Eth Pid Groups Rmem Wmem Dump Locks Drops Inode
TcpExt: SyncookiesSent SyncookiesRecv SyncookiesFailed EmbryonicRsts PruneCalled RcvPruned OfoPruned OutOfWindowIcmps LockDroppedIcmps ArpFilter TW TWRecycled TWKilled PAWSPassive PAWSActive PAWSEstab DelayedACKs DelayedACKLocked DelayedACKLost ListenOverflows ListenDrops TCPPrequeued TCPDirectCopyFromBacklog TCPDirectCopyFromPrequeue TCPPrequeueDropped TCPHPHits TCPHPHitsToUser TCPPureAcks TCPHPAcks TCPRenoRecovery TCPSackRecovery TCPSACKReneging TCPFACKReorder TCPSACKReorder TCPRenoReorder TCPTSReorder TCPFullUndo TCPPartialUndo TCPDSACKUndo TCPLossUndo TCPLostRetransmit TCPRenoFailures TCPSackFailures TCPLossFailures TCPFastRetrans TCPForwardRetrans TCPSlowStartRetrans TCPTimeouts TCPLossProbes TCPLossProbeRecovery TCPRenoRecoveryFail TCPSackRecoveryFail TCPSchedulerFailed TCPRcvCollapsed TCPDSACKOldSent TCPDSACKOfoSent TCPDSACKRecv TCPDSACKOfoRecv TCPAbortOnData TCPAbortOnClose TCPAbortOnMemory TCPAbortOnTimeout TCPAbortOnLinger TCPAbortFailed TCPMemoryPressures TCPSACKDiscard TCPDSACKIgnoredOld TCPDSACKIgnoredNoUndo TCPSpuriousRTOs TCPMD5NotFound TCPMD5Unexpected TCPMD5Failure TCPSackShifted TCPSackMerged TCPSackShiftFallback TCPBacklogDrop TCPMinTTLDrop TCPDeferAcceptDrop IPReversePathFilter TCPTimeWaitOverflow TCPReqQFullDoCookies TCPReqQFullDrop TCPRetransFail TCPRcvCoalesce TCPOFOQueue TCPOFODrop TCPOFOMerge TCPChallengeACK TCPSYNChallenge TCPFastOpenActive TCPFastOpenActiveFail TCPFastOpenPassive TCPFastOpenPassiveFail TCPFastOpenListenOverflow TCPFastOpenCookieReqd TCPSpuriousRtxHostQueues BusyPollRxPackets TCPAutoCorking TCPFromZeroWindowAdv TCPToZeroWindowAdv TCPWantZeroWindowAdv TCPSynRetrans TCPOrigDataSent TCPHystartTrainDetect TCPHystartTrainCwnd TCPHystartDelayDetect TCPHystartDelayCwnd TCPACKSkippedSynRecv TCPACKSkippedPAWS TCPACKSkippedSeq TCPACKSkippedFinWait2 TCPACKSkippedTimeWait TCPACKSkippedChallenge TCPWinProbe TCPKeepAlive TCPMTUPFail TCPMTUPSuccess
sk RefCnt Type Proto Iface R Rmem User Inode
protocol size sockets memory press maxhdr slab module cl co di ac io in de sh ss gs se re sp bi br ha uh gp em
000003e8 00000040 000f4240 3b9aca00
Type Device Function
Iface Destination Gateway Flags RefCnt Use Metric Mask MTU Window IRTT
eth0 0101FEA9 00000000 0001 0 0 0 FFFFFFFF 0 0 0
eth0 00000000 0101FEA9 0003 0 0 0 00000000 0 0 0
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 0 0 4 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Icmp: InMsgs InErrors InCsumErrors InDestUnreachs InTimeExcds InParmProbs InSrcQuenchs InRedirects InEchos InEchoReps InTimestamps InTimestampReps InAddrMasks InAddrMaskReps OutMsgs OutErrors OutDestUnreachs OutTimeExcds OutParmProbs OutSrcQuenchs OutRedirects OutEchos OutEchoReps OutTimestamps OutTimestampReps OutAddrMasks OutAddrMaskReps
Icmp: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
IcmpMsg:
IcmpMsg:
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts InCsumErrors
Tcp: 1 200 120000 -1 0 0 0 0 0 0 0 0 0 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
Udp: 0 0 0 0 0 0 0 0
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
UdpLite: 0 0 0 0 0 0 0 0
sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode
sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode
sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode ref pointer drops
sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode
Num RefCount Protocol Flags Type St Inode Path
The output is ordered by the order of the files on the Pod:
$ ls /proc/net
arp dev if_inet6 ipv6_route netlink netstat packet protocols psched ptype route snmp tcp tcp6 udp udp6 unix
That is rather strange, because without ARP I am not sure how the host-network case is working. The routes say that 169.254.1.1 is the default gateway, which means it should need the link address of the gateway before it can send packets to non-local destinations.
Yeah, I find it strange as well. So I set up 2 nodes, both using runsc, one with host networking and one with netstack:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ubuntu-net-host 1/1 Running 0 91s 192.168.53.193 ip-192-168-60-139.us-west-2.compute.internal <none> <none>
ubuntu-net-netstack 1/1 Running 0 3s 192.168.1.182 ip-192-168-31-136.us-west-2.compute.internal <none> <none>
On the node with runsc using host network:
ip route show
default via 192.168.32.1 dev eth0
169.254.169.254 dev eth0
192.168.32.0/19 dev eth0 proto kernel scope link src 192.168.60.139
192.168.32.217 dev enid6c09ee496d scope link
192.168.35.91 dev enidcb5860247f scope link
192.168.37.181 dev eni564327aa972 scope link
192.168.49.88 dev enia9ad1fc6e5f scope link
192.168.53.193 dev enid9cb3177b0e scope link <——
192.168.54.153 dev enibdd59383046 scope link
192.168.55.67 dev enie9afb6b6f81 scope link
192.168.56.254 dev eni19639745f02 scope link
192.168.58.249 dev eni48d32331e45 scope link
ip rule list
0: from all lookup local
512: from all to 192.168.56.235 lookup main
512: from all to 192.168.62.219 lookup main
512: from all to 192.168.63.206 lookup main
512: from all to 192.168.34.124 lookup main
512: from all to 192.168.45.140 lookup main
512: from all to 192.168.43.82 lookup main
512: from all to 192.168.49.190 lookup main
512: from all to 192.168.51.111 lookup main
512: from all to 192.168.55.67 lookup main
512: from all to 192.168.49.88 lookup main
512: from all to 192.168.54.153 lookup main
512: from all to 192.168.58.249 lookup main
512: from all to 192.168.35.91 lookup main
512: from all to 192.168.56.254 lookup main
512: from all to 192.168.32.217 lookup main
512: from all to 192.168.37.181 lookup main
512: from all to 192.168.53.193 lookup main <——
1024: from all fwmark 0x80/0x80 lookup main
1536: from 192.168.49.88 to 192.168.0.0/16 lookup 2
1536: from 192.168.54.153 to 192.168.0.0/16 lookup 2
1536: from 192.168.58.249 to 192.168.0.0/16 lookup 2
1536: from 192.168.35.91 to 192.168.0.0/16 lookup 2
1536: from 192.168.56.254 to 192.168.0.0/16 lookup 2
1536: from 192.168.32.217 to 192.168.0.0/16 lookup 2
1536: from 192.168.37.181 to 192.168.0.0/16 lookup 2
1536: from 192.168.53.193 to 192.168.0.0/16 lookup 2 <——
32766: from all lookup main
32767: from all lookup default
ip route show table 2
default via 192.168.32.1 dev eth1
192.168.32.1 dev eth1 scope link
On the node with runsc using netstack network:
ip route show
default via 192.168.0.1 dev eth0
169.254.169.254 dev eth0
192.168.0.0/19 dev eth0 proto kernel scope link src 192.168.31.136
192.168.1.182 dev enif3e00791c23 scope link <—-
192.168.18.89 dev enic3634746f05 scope link
192.168.25.50 dev eni8c7e75e6afd scope link
192.168.27.227 dev enia205ab59220 scope link
ip rule list
0: from all lookup local
512: from all to 192.168.26.139 lookup main
512: from all to 192.168.27.227 lookup main
512: from all to 192.168.25.50 lookup main
512: from all to 192.168.18.89 lookup main
512: from all to 192.168.1.182 lookup main <—-
1024: from all fwmark 0x80/0x80 lookup main
1536: from 192.168.25.50 to 192.168.0.0/16 lookup 2
1536: from 192.168.18.89 to 192.168.0.0/16 lookup 2
1536: from 192.168.27.57 to 192.168.0.0/16 lookup 2
32766: from all lookup main
32767: from all lookup default
ip route show table 2
default via 192.168.0.1 dev eth1
192.168.0.1 dev eth1 scope link However, to my disappointment, when I added a rule using
On the pod using host network:
root@ubuntu-net-host:/# ifconfig
SIOCGIFCONF: Inappropriate ioctl for device
root@ubuntu-net-host:/# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 0 ioctl(SIOCGIFTXQLEN) failed: Inappropriate ioctl for device
link/loopback 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
inet 127.0.0.1/8 scope global dynamic
inet6 ::1/128 scope global dynamic
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 0 ioctl(SIOCGIFTXQLEN) failed: Inappropriate ioctl for device
link/ether 9a:fa:5b:6b:4e:f2 brd ff:ff:ff:ff:ff:ff
inet 192.168.53.193/32 scope global dynamic
inet6 fe80::98fa:5bff:fe6b:4ef2/64 scope global dynamic
On the pod using netstack network:
root@ubuntu-net-netstack:/# ifconfig
eth0 Link encap:Ethernet HWaddr ae:57:05:3f:4d:03
inet addr:192.168.1.182 Mask:255.255.255.255
inet6 addr: fe80::ac57:5ff:fe3f:4d03/128 Scope:Global
UP RUNNING MTU:9001 Metric:1
RX packets:6 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:9001
RX bytes:452 (452.0 B) TX bytes:0 (0.0 B)
Memory:34d3f0500002329-0
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.255.255.255
inet6 addr: ::1/128 Scope:Global
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:65536
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
Memory:10000-0
root@ubuntu-net-netstack:/# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/32 scope global dynamic
inet6 ::1/128 scope global dynamic
2: eth0: <UP,LOWER_UP> mtu 9001
link/ether ae:57:05:3f:4d:03 brd ff:ff:ff:ff:ff:ff
inet 192.168.1.182/32 scope global dynamic
inet6 fe80::ac57:5ff:fe3f:4d03/128 scope global dynamic
I hope this is at all relevant to solving this issue. @hbhasker were you able to reproduce these results on your own cluster?
I haven't yet gotten around to setting up my own EKS pod. It will take me some time as I am not very familiar with EKS or AWS in general. That said, --network=host does not forward all ioctls, and that's probably why you see some failures; netstack implements some of the ioctls needed by ifconfig, which is why it works there. All netstack interfaces do support multicast/broadcast, but I think we don't set or return the flags correctly for ifconfig to show them. runsc also does a few other things at startup: it steals the routes from the host for the interface being handed to runsc and passes them to runsc instead. So if you query the routes in the namespace in which runsc is running, you may not see all of them, as some have been stolen and handed to runsc at startup (runsc also removes the IP address from the host, otherwise the host would respond to TCP SYNs etc. with RSTs, since it is not aware of any listening sockets inside netstack). I will see if I can figure out how to set up EKS and post if I find something. But it mostly looks like we need to scrape any ARP entries from the namespace and pass them to runsc at startup. From what I can see, the static ARP entry is being installed in the namespace itself rather than by running a command inside the container (e.g. via docker exec); in that case the ARP cache on the host is updated but it is invisible to runsc.
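A quick way to observe this host-side behavior (a sketch; PAUSE_PID is the pause container's PID, found as in the earlier nsenter example):
# With runsc running, the scraped address/routes may be gone from the namespace,
# while a static ARP entry installed by the CNI plugin would still be visible here.
sudo nsenter --net=/proc/$PAUSE_PID/ns/net ip addr show eth0
sudo nsenter --net=/proc/$PAUSE_PID/ns/net ip route show
sudo nsenter --net=/proc/$PAUSE_PID/ns/net ip neigh show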
@hbhasker Any follow-up on this? Anything I can help with?
@moehajj Sorry, I haven't been able to work on this yet. That said, one thing that would be great is if you already have an EKS cluster I can get access to; that would make my life a lot simpler than having to set one up. I spent some time reading up on EKS but didn't get to the point of actually setting up a cluster.
@hbhasker I won't be able to give you access to an EKS cluster, but if you can quickly set up an AWS account and spin up a cluster (I found this guide very helpful when I started), I can give you the scripts that do the rest (e.g. install gVisor on the nodes).
eksctl create cluster --name gvisor-demo --nodes 2 --region us-west-2 --ssh-access --ssh-public-key eks_key
SSH into the first node:
export n0_EIP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="ExternalIP")].address}')
ssh -i /path/to/eks_key.pem ec2-user@$n0_EIP
On the node, I configure gVisor to use netstack networking:
# Install dependencies
sudo yum install -y git # needed to clone gvisor-containerd-shim below
# Install Golang
wget https://dl.google.com/go/go1.14.4.linux-amd64.tar.gz
sudo tar -C /usr/local -xzf go1.14.4.linux-amd64.tar.gz
GOROOT=/usr/local/go
GOPATH=$HOME/go
PATH=/usr/local/go/bin:$HOME/go/bin:$PATH
## Switch kubelet to use containerd via its systemd drop-in
sudo sed -i 's;--container-runtime=docker;--container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock;' /etc/systemd/system/kubelet.service.d/10-eksclt.al2.conf
sudo systemctl daemon-reload
sudo systemctl restart kubelet
# Install gVisor runsc
set -e
wget https://storage.googleapis.com/gvisor/releases/nightly/latest/runsc
sudo mv runsc /usr/local/bin
sudo chown root:root /usr/local/bin/runsc
sudo chmod 0755 /usr/local/bin/runsc
# Install gvisor-containerd-shim
git clone https://github.com/google/gvisor-containerd-shim.git
cd gvisor-containerd-shim
make
sudo make install
# Configure containerd with the gVisor (runsc) runtime (a RuntimeClass is created later to assign Pods to runsc)
cat <<EOF | sudo tee /etc/containerd/config.toml
disabled_plugins = ["restart"]
[plugins.linux]
shim_debug = true
[plugins.cri.containerd.runtimes.runsc]
runtime_type = "io.containerd.runsc.v1"
[plugins.cri.containerd.runtimes.runsc.options]
TypeUrl = "io.containerd.runsc.v1.options"
ConfigPath = "/etc/containerd/runsc.toml"
EOF
# runsc options config
cat <<EOF | sudo tee /etc/containerd/runsc.toml
[runsc_config]
debug="true"
strace="true"
debug-log="/tmp/runsc/%ID%/"
EOF
# Restart containerd
sudo systemctl restart containerd
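A sanity check at this point (a sketch; it assumes the config written above) is to confirm containerd actually picked up the runsc runtime handler:
sudo containerd config dump | grep -A 3 'runtimes.runsc'
sudo systemctl is-active containerd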
Label the node:
export n0_name=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl label node $n0_name runtime=gvisor
Deploy the RuntimeClass:
cat <<EOF | tee gvisor-runtime.yaml
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
name: gvisor
handler: runsc
scheduling:
nodeSelector:
runtime: gvisor
EOF
kubectl apply -f gvisor-runtime.yaml
I hope this helps reduce the cluster setup overhead!
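To exercise the runtime end to end, a minimal test Pod could look like this (a sketch assuming the gvisor RuntimeClass above; the pod name and image are arbitrary):
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx-gvisor
spec:
  runtimeClassName: gvisor
  containers:
  - name: nginx
    image: nginx
EOF
kubectl get pod nginx-gvisor -o wide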
@hbhasker Looks like things have worked as of a newer runsc release:
[ec2-user@ip-192-168-69-37 ~]$ runsc --version
runsc version release-20200622.1-236-g112eb0c5b9e6
spec: 1.0.1-dev
But something broke where I can't ...
Glad to hear that the latest version worked. We did recently make some forwarding fixes, I think; I will have to go through our commit history and see. Please let me know if you identify the commit causing the regression.
@moehajj Can we mark this issue fixed, as it looks like your initial issue is now resolved?
@hbhasker, unfortunately, I've been looking into why things suddenly worked, and now I'm no longer able to reproduce a working version. I tried different ...
I've figured out why things had worked all of a sudden: it was because I had deployed a Calico DaemonSet to enable network policies, and having Calico nodes intercepting packets seems to fix the issue. Do you think this might be an issue with the EKS CNI that Calico somehow mended?
Hey Mohammed, that's a great write-up! Just one small point: the write-up uses the unmaintained containerd shim from https://github.com/google/gvisor-containerd-shim.git (see the warning at the top of the repository, and the fact that it is an "archive" repository). Since about a year ago (3bb5f71), the shim has been built and shipped with the core repository and is included in releases as well. You can actually just install it directly from the release bucket, like runsc itself.
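For example, something along these lines should work (a sketch; it assumes the shim is published alongside runsc in the same bucket used earlier in this thread):
wget https://storage.googleapis.com/gvisor/releases/nightly/latest/containerd-shim-runsc-v1
sudo mv containerd-shim-runsc-v1 /usr/local/bin
sudo chown root:root /usr/local/bin/containerd-shim-runsc-v1
sudo chmod 0755 /usr/local/bin/containerd-shim-runsc-v1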
I'm not sure that PRing your article, which has nothing to do with the problem described in this repo, is a good idea. Sorry.
What happens is this: EKS relies on a static ARP entry for the 169.254.1.1 gateway inside the pod's network namespace.
For gVisor, the ARP table is empty because nothing regarding ARP is copied from the namespace when runsc scrapes the network configuration.
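For illustration, the kind of setup the CNI plugin performs inside the pod's namespace looks roughly like this (a sketch; the MAC address is a placeholder for the host-side veth's address):
# Installed by the CNI plugin in the pod's network namespace, not by gVisor:
ip route add 169.254.1.1 dev eth0 scope link
ip route add default via 169.254.1.1 dev eth0
ip neigh add 169.254.1.1 lladdr <host-veth-mac> dev eth0 nud permanent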
That assessment was correct. But the ...
Oops. This issue dragging on for 2 years is pretty interesting, as it means nobody ever really tried to use gVisor on EKS, even though a lot of CNI implementations rely on either static ARP entries or proxy ARP (neither of which is supported).
@pkit gVisor does not support ARP table manipulation, as the required ioctls and netlink commands are not implemented. GKE uses the same gVisor, but it's not an issue there because GKE does not rely on static ARP entries for such things. At some point we will support ARP table commands, but it's not been a priority for us. That said, we are always open to contributions, and it looks like the netlink commands required to make ip neighbor add work are RTM_NEWNEIGH, RTM_DELNEIGH, and RTM_GETNEIGH. Today we only implement a few of the netlink commands.
Also, we have no visibility into people using gVisor on EKS. That said, shouldn't proxy ARP work? As long as there is something on the host that responds to ARP requests for 169.254.1.1, gVisor should be able to connect to it.
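For reference, the host-side alternative mentioned here, proxy ARP in the Calico style, would look roughly like this on the node (the interface name is a placeholder for the pod's host-side veth/ENI device):
# Let the host answer ARP queries for 169.254.1.1 on the pod's host-side interface,
# instead of relying on a static neighbor entry inside the pod's namespace.
sudo sysctl -w net.ipv4.conf.<host-veth>.proxy_arp=1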
@hbhasker I don't think "online" ARP table manipulation is needed. Scraping the ARP entries where the interfaces are scraped in gvisor/runsc/sandbox/network.go (line 148 in d350c95), then passing them up (line 152 in 8b56b6b), with the actual setup done via gvisor/pkg/tcpip/stack/neighbor_cache.go (line 181 in fcad6f9), looks like it should do the trick. But that's just a theory for now; I've only been reading the gVisor code for 2-3 hours or so. P.S. Implementing the netlink commands seems like a good idea too, at least to improve visibility.
While this is doable, I am not sure we want to support these one-offs. I have been reviewing the CNI spec (https://github.com/containernetworking/cni/blob/master/SPEC.md) and from what I can see it does not provide for any ARP table manipulation directly. In the case of EKS, I am guessing this is being done by a CNI plugin that is just executing arbitrary commands in the namespace to set up the ARP entries. Supporting the exact commands would be the right way to solve this, rather than doing a one-off for this specific use case. Also, EKS could support this by properly responding to ARP requests for that IP instead of statically inserting an entry. That is what GKE does, for example, for things like the metadata server, which is usually reached by a link-local address (169.254.169.254; see https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity).
I'm not sure why that's a problem, as it's clear to me that gVisor should (or even must) copy the full networking config from the namespace it claims to run in. TL;DR: not copying the ARP config from a namespace seems like a pretty big compatibility problem to me.
I would not phrase it as a big problem, as it's clearly not a common use case. But that said, maybe it's just worth doing to make gVisor work better with EKS. I will take a stab at implementing it.
I'm OK with implementing it too.
@pkit I will be happy to review if you have cycles to implement it, as I have quite a few higher-priority things on my plate at the moment.
Cool. We already started working on it anyway.
Thanks!
There is a pull request for RTM_*NEIGH, #6623, which is basically finished.
@crappycrypto that's good to hear. But I think the interface-level ioctls need to be implemented too for ...
The pull request fixes the iproute2-based ip neigh commands. UPDATE: removed the "latest release" qualifier, since ...
Unfortunately "running the latest release" is not an option if we want to run existing code. Otherwise you're right indeed. |
copy and setup PERMANENT (static) ARP entries from CNI namespace to the sandbox
Fixes google#3301
See #6803
Description
I'm deploying Pods on my EKS cluster using the gVisor runtime; however, outbound network requests fail while inbound requests succeed. The issue is mitigated when using network=host in the runsc config options.

Steps to reproduce
I'm using containerd as the container CRI and configured the gVisor runtime with containerd (following this tutorial). I also labeled the node I selected for gVisor with app=gvisor.
EKS cluster nodes (you can see the first node using containerd as its container runtime):
runsc config on the gVisor node:
To verify the Pod is running with gVisor:
I curled port 80 of the Pod and it succeeded.
To test the outbound network traffic of the Pod, I did the following:
You can see that it fails. Other attempts, such as wget www.google.com, fail as well.
For debugging purposes, these are the DNS and routing tables (without net-tools, since I couldn't install them) in the Pod container:
I also captured the tcpdump packets on the ENI network interface allocated by EKS for the Pod: eni567d651201a.nohost.tcpdump.tar.gz
Details about the network interface:
I also captured runsc debug information for the containers in the Pod:
9f71a133fc27c3a305710552489c16977d5c48cd40f31810c2010dac393c5ba7.tar.gz
9411dfee3811da9dd45e8681f697bcf5326173d6510238ce70beb02ffe00f444.tar.gz
I then added network="host" to the /etc/containerd/runsc.toml file, restarted containerd, and reran the same experiment above with the following results.
Verify the running Pod:
Successful inbound with curl, and successful outbound as follows:
DNS and routing table (with net-tools this time) on the Pod:
tcpdump file: eni567d651201a.host.tcpdump.tar.gz
Details about the network interface:
runsc debug files:
96198907b56174067a1aa2b9c0fa3644670675b25fa28a7b44234fc232cccd5d.tar.gz
e4ec52fdad3e889bf386b1eca03e231ad53e0452e4bc623282732eba0d2da720.tar.gz

Environment
runsc -version:
[ec2-user@ip-192-168-31-136 ~]$ runsc -version
runsc version release-20200622.1-171-gc66991ad7de6
spec: 1.0.1-dev
kubectl version and kubectl get nodes -o wide:
uname -a:
$ uname -a
Darwin moehajj-C02CJ1ARML7M 19.6.0 Darwin Kernel Version 19.6.0: Sun Jul 5 00:43:10 PDT 2020; root:xnu-6153.141.1~9/RELEASE_X86_64 x86_64