This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

nslookup (dns) for windows services fails after stopping vms and restarting a hybrid kubernetes cluster in azure #1903

Closed
douglaswaights opened this issue Dec 8, 2017 · 24 comments

@douglaswaights

douglaswaights commented Dec 8, 2017

Hi!

Is this a request for help?:

yes


Is this an ISSUE or FEATURE REQUEST? (choose one):

issue


What version of acs-engine?:

0.10.0


Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
kubernetes version 1.8.4 with acs-engine 0.10

What happened:

I have created a hybrid Kubernetes cluster with acs-engine (1 master, 1 Linux, 1 Windows), and to begin with the out-of-the-box Windows IIS 1709 pods can see each other through nslookup after exec'ing into the pods (although I have to use the FQDN as per bug Azure/ACS#94). Everything works as expected.

I then shut down the VMs in the cluster and later restart them. Now nslookup fails to see the Windows pods from one to the other if I exec into them again. If I deploy nginx on the Linux node and expose it with a LoadBalancer, that is visible fine from the outside world.

What you expected to happen:

The cluster should return to its original state, as it was when created, with DNS and service discovery working.

How to reproduce it (as minimally and precisely as possible):

spin up a hybrid cluster in azure and add a couple of iis pods and the corresponding services for each. Confirm they can see each other. Turn off the vms. Turn them back on again and although everything looks ok on the surface dns is now broken.

Anything else we need to know:

Can you explain to me why this might happen? I presume I should be able to spin down the VMs and bring them back up later, i.e. the cluster doesn't have to be always up.

Can you help me get the DNS working again and troubleshoot? It doesn't help if I stagger the re-launch order of the VMs.

Thanks
Doug

kubernetes-hybrid.json below created in Azure NorthEurope

{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "kubernetesConfig": {
        "addons": [
          {
            "name": "tiller",
            "enabled": true
          },
          {
            "name": "kubernetes-dashboard",
            "enabled": true
          }
        ],
        "enableRbac": true
      },
      "orchestratorRelease": "1.8"
    },
    "masterProfile": {
      "count": 1,
      "dnsPrefix": "sdmhybridk8s",
      "vmSize": "Standard_D2_v2"
    },
    "agentPoolProfiles": [
      {
        "name": "linuxpool1",
        "count": 1,
        "vmSize": "Standard_D2_v2",
        "availabilityProfile": "AvailabilitySet"
      },
      {
        "name": "windowspool2",
        "count": 1,
        "vmSize": "Standard_D2_v2",
        "availabilityProfile": "AvailabilitySet",
        "osType": "Windows"
      }
    ],
    "windowsProfile": {
      "adminUsername": "sdm",
      "adminPassword": "redacted"
    },
    "linuxProfile": {
      "adminUsername": "azureuser",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "redacted"
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "redacted",
      "secret": "redacted"
    }
  }
}

@douglaswaights changed the title from "nslookup (dns) for windows services fails after stopping vms and restarting a hybrid cluster in azure" to "nslookup (dns) for windows services fails after stopping vms and restarting a hybrid kubernetes cluster in azure" on Dec 8, 2017
@douglaswaights
Author

@jackfrancis @JiangtianLi
Is this related to #1881 do you think?

Any idea how or when this might be resolved? At the moment a hybrid cluster becomes unusable once the VMs are stopped after the initial creation.

@JiangtianLi
Contributor

@douglaswaights Can you help to collect more info? From the windows container:
Resolve-DnsName www.bing.com
From the linux container:
nslookup windows_service_name
ping windows_pod_ip

@douglaswaights
Author

hi, sure.

I had to create a new cluster though as i removed the old one. The new one created starts up again as expected with the below output.

Name                             Type   TTL  Section  NameHost
www.bing.com                     CNAME  6    Answer   www-bing-com.a-0001.a-msedge.net
www-bing-com.a-0001.a-msedge.net CNAME  6    Answer   a-0001.a-msedge.net

Name : a-0001.a-msedge.net
QueryType : A
TTL : 6
Section : Answer
IP4Address : 204.79.197.200

Name : a-0001.a-msedge.net
QueryType : A
TTL : 6
Section : Answer
IP4Address : 13.107.21.200

Name : a-msedge.net
QueryType : SOA
TTL : 6
Section : Authority
NameAdministrator : msnhst.microsoft.com
SerialNumber : 2016092901
TimeToZoneRefresh : 1800
TimeToZoneFailureRetry : 900
TimeToExpiration : 2419200
DefaultTTL : 240

root@nginx:/# nslookup iis-1-svc
Server: 10.0.0.10
Address: 10.0.0.10#53

Name: iis-1-svc.default.svc.cluster.local
Address: 10.0.134.108

root@nginx:/# ping iis-1-svc
PING iis-1-svc.default.svc.cluster.local (10.0.134.108) 56(84) bytes of data.
^C
--- iis-1-svc.default.svc.cluster.local ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3068ms

root@nginx:/#

Unfortunately, after I stop the VMs and then restart them, the Windows VM (despite showing as running in the Azure portal) never seems to get past the NotReady state in Kubernetes (see below).

C:\git\dm\kubernetes\svc (develop -> origin)
λ kubectl get nodes
NAME                        STATUS     ROLES     AGE       VERSION
28804k8s9010                NotReady   <none>    2h        v1.8.4-20+75463ace30b8ff
k8s-linuxpool1-28804767-0   Ready      agent     2h        v1.8.4
k8s-master-28804767-0       Ready      master    2h        v1.8.4

The original cluster at least managed to have the windows node running after vm restart although the DNS obviously wasn't working.

Anything else i can try?

Earlier today, I went back to acs-engine 0.8 with the previous version of Windows Server and Kubernetes 1.7.9 (I think it was) and this was behaving better after restarting the VMs.

I don't know if the problem is with the new acs-engine, kubernetes 1.8 or Windows RS 3 or what really...

@JiangtianLi
Contributor

@douglaswaights The issue appears to be RS3 Windows and kubelet on that Windows node. If you can RDP to windows, can you check sc query kubelet and kubelet logs in c:\k? Meanwhile I will try a repro here.

@douglaswaights
Author

douglaswaights commented Dec 11, 2017

Can you point me in the direction of how to RDP in? Connect doesn't seem to be available / enabled in the usual manner with a VM running in the portal...

I've tried adding a new Windows VM to the same VNet so I can RDP into that from my local machine; the idea was then to RDP into the Windows node from there. The ports, rules, NSG etc. look OK but it still doesn't want to let me in... I guess I'm missing something.

@ghost

ghost commented Dec 12, 2017

@douglaswaights you can establish an RDP connection to your Windows host with a simple ssh tunnel using the master node, here is how:

Get windows node hostname

kubectl get nodes

NAME                        STATUS    ROLES     AGE       VERSION
25784k8s9010                Ready     <none>    3d        v1.8.4-20+75463ace30b8ff
<snip>

write down the name of the windows node

Establish an ssh tunnel with master node to gain RDP access to the windows node:

From the kubeconfig.<region>.json located in your acs-engine/_output/<cluster>/kubeconfig directory, get the 'server' FQDN entry; it should look like <clustername>.<region>.cloudapp.azure.com

Establish an ssh tunnel with the windows host through the master node:

ssh -L 33890:<windows hostname>:3389 azureuser@<clustername>.<region>.cloudapp.azure.com

(or an equivalent setup if you are using a Windows tool like Putty)

From that point, you should be able to establish an RDP connection to localhost:33890, it will be redirected to your Windows host.
You need to authenticate with the adminUsername and adminPassword specified in your kubernetes-hybrid.json file.

Beware: this is a Windows Server Core machine, so don't expect any fancy GUI out there...
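
If you need to open this tunnel repeatedly, the command can be assembled from the two values collected above. This is only a sketch: the helper name build_rdp_tunnel_command and the example FQDN mycluster.northeurope.cloudapp.azure.com are made up for illustration, and the local port 33890 and user azureuser are simply the ones used in the ssh command above.

```python
import shlex


def build_rdp_tunnel_command(windows_host, master_fqdn,
                             local_port=33890, user="azureuser"):
    """Return the ssh argv that forwards local_port to RDP (3389) on the Windows node.

    The tunnel is established through the master node, which can reach the
    Windows node by its hostname.
    """
    forward = f"{local_port}:{windows_host}:3389"
    return ["ssh", "-L", forward, f"{user}@{master_fqdn}"]


# Hypothetical values: the node name comes from `kubectl get nodes`,
# the FQDN from the 'server' entry of the kubeconfig file.
cmd = build_rdp_tunnel_command(
    "25784k8s9010", "mycluster.northeurope.cloudapp.azure.com"
)
print(shlex.join(cmd))
# ssh -L 33890:25784k8s9010:3389 azureuser@mycluster.northeurope.cloudapp.azure.com
```

The list form can be handed to subprocess.run as-is; once the tunnel is up, point the RDP client at localhost:33890.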

@ghost

ghost commented Dec 12, 2017

@JiangtianLi I have the very same issue as @douglaswaights, except that I don't even have to reboot anything. Windows pods are unable to resolve any public or cluster (service) name.
Linux pods in the same cluster don't have any DNS resolution issue.

Here are my outputs:
sc query kubelet

SERVICE_NAME: kubelet
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 4  RUNNING
                                (STOPPABLE, PAUSABLE, ACCEPTS_SHUTDOWN)
        WIN32_EXIT_CODE    : 0  (0x0)
        SERVICE_EXIT_CODE  : 0  (0x0)
        CHECKPOINT         : 0x0
        WAIT_HINT          : 0x0

And here are my Kubelet logs:
kubelet.log (dated 8 Dec 2017)
kubelet.err.log (dated 12 Dec 2017)

@douglaswaights
Author

Thanks a lot @odauby - very helpful!

@JiangtianLi
Here is sc query kubelet

SERVICE_NAME: kubelet
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 4  RUNNING
                                (STOPPABLE, PAUSABLE, ACCEPTS_SHUTDOWN)
        WIN32_EXIT_CODE    : 0  (0x0)
        SERVICE_EXIT_CODE  : 0  (0x0)
        CHECKPOINT         : 0x0
        WAIT_HINT          : 0x0

and here are the logs
logs.zip

Today I spun the cluster up and the nodes all came up, but again the DNS wasn't working.

Yesterday i spun it up one time and the stars aligned and my services could communicate.

Be great to get this fixed!

@JiangtianLi
Contributor

@douglaswaights @odauby It appears that kubelet is running on the Windows node, and from the kubelet logs I didn't see a failure to start kubelet. So does the "NotReady" issue still exist, or is it only the DNS issue?

For DNS issue, if you run Resolve-DnsName www.bing.com inside the POD, does it show error like:
No DNS servers configured for local system?
Also if you wait for more than 15 minutes, is the DNS issue gone? If so, the issue will likely be the one we fix in Windows in the next update.

@ghost

ghost commented Dec 13, 2017

Looks like the DNS query times out, and no, waiting 15 minutes does not improve the situation on my side.

Below commands have been executed from a windows pod:

PS C:\> Resolve-DnsName www.bing.com
Resolve-DnsName : www.bing.com : This operation returned because the timeout period expired
At line:1 char:1
+ Resolve-DnsName www.bing.com
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : OperationTimeout: (www.bing.com:String) [Resolve-DnsName], Win32Exception
    + FullyQualifiedErrorId : ERROR_TIMEOUT,Microsoft.DnsClient.Commands.ResolveDnsName
PS C:\> Resolve-DnsName www.bing.com -Server 10.0.0.10
Resolve-DnsName : www.bing.com : This operation returned because the timeout period expired
At line:1 char:1
+ Resolve-DnsName www.bing.com -Server 10.0.0.10
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : OperationTimeout: (www.bing.com:String) [Resolve-DnsName], Win32Exception
    + FullyQualifiedErrorId : ERROR_TIMEOUT,Microsoft.DnsClient.Commands.ResolveDnsName
PS C:\> ipconfig /all

Windows IP Configuration

   Host Name . . . . . . . . . . . . : m-web-7b88ddfddc-xctwq
   Primary Dns Suffix  . . . . . . . : 
   Node Type . . . . . . . . . . . . : Hybrid
   IP Routing Enabled. . . . . . . . : No
   WINS Proxy Enabled. . . . . . . . : No

Ethernet adapter vEthernet (acf659d8187a4c45ef24c677020bd22e943321de46e3b4487b25a34502bab57c_l2bridge):

   Connection-specific DNS Suffix  . : 
   Description . . . . . . . . . . . : Hyper-V Virtual Ethernet Adapter #3
   Physical Address. . . . . . . . . : 00-15-5D-A6-C0-A5
   DHCP Enabled. . . . . . . . . . . : No
   Autoconfiguration Enabled . . . . : Yes
   Link-local IPv6 Address . . . . . : fe80::6987:e7:1489:f47b%24(Preferred) 
   IPv4 Address. . . . . . . . . . . : 10.244.2.30(Preferred) 
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : 10.240.0.1
   DNS Servers . . . . . . . . . . . : 10.0.0.10
   NetBIOS over Tcpip. . . . . . . . : Disabled

@douglaswaights
Author

I've fired up the cluster again and this time all the nodes are in ready state. However i have the same problem with the DNS. Waiting 15 mins does not help.
C:\>ipconfig /all

Windows IP Configuration

   Host Name . . . . . . . . . . . . : iis-1709-1
   Primary Dns Suffix  . . . . . . . :
   Node Type . . . . . . . . . . . . : Hybrid
   IP Routing Enabled. . . . . . . . : No
   WINS Proxy Enabled. . . . . . . . : No

Ethernet adapter vEthernet (a5135afeb2fc1e344f71a02117ba9b5ba9b8daa8d7fb469aac3fe2286f8138be_l2bridge):

   Connection-specific DNS Suffix  . :
   Description . . . . . . . . . . . : Hyper-V Virtual Ethernet Adapter #3
   Physical Address. . . . . . . . . : 00-15-5D-97-55-EE
   DHCP Enabled. . . . . . . . . . . : No
   Autoconfiguration Enabled . . . . : Yes
   Link-local IPv6 Address . . . . . : fe80::a87b:77a2:597f:5ee5%24(Preferred)
   IPv4 Address. . . . . . . . . . . : 10.244.2.4(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : 10.240.0.1
   DNS Servers . . . . . . . . . . . : 10.0.0.10
   NetBIOS over Tcpip. . . . . . . . : Disabled

nslookup iis-2-svc.default.svc.cluster.local
DNS request timed out.
timeout was 2 seconds.
Server: UnKnown
Address: 10.0.0.10

DNS request timed out.
timeout was 2 seconds.
DNS request timed out.
timeout was 2 seconds.
DNS request timed out.
timeout was 2 seconds.
DNS request timed out.
timeout was 2 seconds.
*** Request to UnKnown timed-out

PS C:\> Resolve-DnsName www.bing.com
Resolve-DnsName : www.bing.com : This operation returned because the timeout period expired
At line:1 char:1
+ Resolve-DnsName www.bing.com
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : OperationTimeout: (www.bing.com:String) [Resolve-DnsName], Win32Exception
    + FullyQualifiedErrorId : ERROR_TIMEOUT,Microsoft.DnsClient.Commands.ResolveDnsName

PS C:\>

I also tried creating new pods and services after waiting for 20 mins or so to see if that made a difference with something changing in the windows node networking but the same result.

@JiangtianLi
Contributor

This looks like a different issue. Does Test-NetConnection 10.0.0.10 -Port 80 work inside container? What is the output of Get-HnsEndpoint and Get-HNSNetwork from Windows node?

@douglaswaights
Author

Test-NetConnection fails in the container
WARNING: TCP connect to (10.0.0.10 : 80) failed
WARNING: Ping to 10.0.0.10 failed with status: TimedOut

ComputerName : 10.0.0.10
RemoteAddress : 10.0.0.10
RemotePort : 80
InterfaceAlias : vEthernet (9fcd829dc8aee649e5d3a0c5800ed17156c2dd4dd65bd308bc5db9ae1a723412_l2bridge)
SourceAddress : 10.244.2.195
PingSucceeded : False
PingReplyDetails (RTT) : 0 ms
TcpTestSucceeded : False

Inside the windows node

PS C:\Users\sdm> Get-HnsEndpoint

ActivityId : 9b63aa27-a6d2-4ea0-bb30-3da5f4f913e0
DNSServerList : 10.0.0.10
ID : 20c707a8-ac6d-4acb-aef8-8707781a4396
IPAddress : 10.244.0.4
IsRemoteEndpoint : True
MacAddress : 00:11:22:33:44:55
Name : Ethernet
Policies : {@{Type=L2Driver}}
Resources : @{AllocationOrder=1; Allocators=System.Object[]; ID=9b63aa27-a6d2-4ea0-bb30-3da5f4f913e0;
PortOperationTime=0; State=1; SwitchOperationTime=0; VfpOperationTime=0;
parentId=42102725-f95c-4370-b01e-0819bd367057}
SharedContainers : {}
State : 1
Type : l2bridge
Version : 21474836481
VirtualNetwork : e8d2c084-7392-475c-a28c-6b8b3dcf5290
VirtualNetworkName : l2bridge

ActivityId : 98875e14-c13a-4362-93ff-882afee7de45
DNSServerList : 10.0.0.10
ID : 1794d3e6-33fa-4bf3-9aca-b6889558078b
IPAddress : 10.240.255.5
IsRemoteEndpoint : True
MacAddress : 00:11:22:33:44:55
Name : Ethernet
Policies : {@{Type=L2Driver}}
Resources : @{AllocationOrder=1; Allocators=System.Object[]; ID=98875e14-c13a-4362-93ff-882afee7de45;
PortOperationTime=0; State=1; SwitchOperationTime=0; VfpOperationTime=0;
parentId=42102725-f95c-4370-b01e-0819bd367057}
SharedContainers : {}
State : 1
Type : l2bridge
Version : 21474836481
VirtualNetwork : e8d2c084-7392-475c-a28c-6b8b3dcf5290
VirtualNetworkName : l2bridge

ActivityId : dbccdc66-5354-41b7-b6a5-e4a6fd3ea7c3
DNSServerList : 10.0.0.10
ID : 7be8d656-6be2-43f1-b4b0-5bdbd184a3dc
IPAddress : 10.244.0.2
IsRemoteEndpoint : True
MacAddress : 00:11:22:33:44:55
Name : Ethernet
Policies : {@{Type=L2Driver}}
Resources : @{AllocationOrder=1; Allocators=System.Object[]; ID=dbccdc66-5354-41b7-b6a5-e4a6fd3ea7c3;
PortOperationTime=0; State=1; SwitchOperationTime=0; VfpOperationTime=0;
parentId=42102725-f95c-4370-b01e-0819bd367057}
SharedContainers : {}
State : 1
Type : l2bridge
Version : 21474836481
VirtualNetwork : e8d2c084-7392-475c-a28c-6b8b3dcf5290
VirtualNetworkName : l2bridge

ActivityId : 6650c5bf-b80e-4ddb-b00a-d2ecc1bde529
CreateProcessingStartTime : 131576732010281718
DNSServerList : 10.0.0.10
GatewayAddress : 10.240.0.1
ID : 91339e88-722e-43bd-8ba4-6516ab906835
IPAddress : 10.244.2.14
MacAddress : 00-15-5D-97-5A-A0
Name : e3c221952c094124f32bb2af10ac373cae06634e4f3ce07a7ae53ffc91eac00f_l2bridge
Policies : {@{ExceptionList=System.Object[]; Type=OutBoundNAT}, @{DestinationPrefix=10.0.0.0/16;
NeedEncap=True; Type=ROUTE}, @{Type=L2Driver}}
PrefixLength : 24
Resources : @{AllocationOrder=5; Allocators=System.Object[]; ID=6650c5bf-b80e-4ddb-b00a-d2ecc1bde529;
PortOperationTime=0; State=1; SwitchOperationTime=0; VfpOperationTime=0;
parentId=42102725-f95c-4370-b01e-0819bd367057}
SharedContainers : {}
StartTime : 131576732013891740
State : 3
Type : l2bridge
Version : 21474836481
VirtualNetwork : e8d2c084-7392-475c-a28c-6b8b3dcf5290
VirtualNetworkName : l2bridge

ActivityId : 3c37a58f-f7ab-4491-b01b-fcc999c71c80
CreateProcessingStartTime : 131576732010041723
DNSServerList : 10.0.0.10
GatewayAddress : 10.240.0.1
ID : cccc4f0d-aa13-48da-9637-d07b4b39cfe8
IPAddress : 10.244.2.19
MacAddress : 00-15-5D-97-57-77
Name : abf6271465dd90541deda6d72dde572757f2800ab7a81036a2fc405040ded557_l2bridge
Policies : {@{ExceptionList=System.Object[]; Type=OutBoundNAT}, @{DestinationPrefix=10.0.0.0/16;
NeedEncap=True; Type=ROUTE}, @{Type=L2Driver}}
PrefixLength : 24
Resources : @{AllocationOrder=5; Allocators=System.Object[]; ID=3c37a58f-f7ab-4491-b01b-fcc999c71c80;
PortOperationTime=0; State=1; SwitchOperationTime=0; VfpOperationTime=0;
parentId=42102725-f95c-4370-b01e-0819bd367057}
SharedContainers : {}
StartTime : 131576732012531732
State : 3
Type : l2bridge
Version : 21474836481
VirtualNetwork : e8d2c084-7392-475c-a28c-6b8b3dcf5290
VirtualNetworkName : l2bridge

ActivityId : 26777281-0024-4d87-ab6e-69e31e5e5582
DNSServerList : 10.0.0.10
ID : 9f9a8985-852a-4dab-a8b4-f74c17cd9960
IPAddress : 10.244.0.5
IsRemoteEndpoint : True
MacAddress : 00:11:22:33:44:55
Name : Ethernet
Policies : {@{Type=L2Driver}}
Resources : @{AllocationOrder=1; Allocators=System.Object[]; ID=26777281-0024-4d87-ab6e-69e31e5e5582;
PortOperationTime=0; State=1; SwitchOperationTime=0; VfpOperationTime=0;
parentId=42102725-f95c-4370-b01e-0819bd367057}
SharedContainers : {}
State : 1
Type : l2bridge
Version : 21474836481
VirtualNetwork : e8d2c084-7392-475c-a28c-6b8b3dcf5290
VirtualNetworkName : l2bridge

ActivityId : b4171851-4987-4ecb-89ff-94baee7dd576
DNSServerList : 10.0.0.10
ID : cb603daf-0849-4acc-b268-8a2399ac93f9
IPAddress : 10.244.0.3
IsRemoteEndpoint : True
MacAddress : 00:11:22:33:44:55
Name : Ethernet
Policies : {@{Type=L2Driver}}
Resources : @{AllocationOrder=1; Allocators=System.Object[]; ID=b4171851-4987-4ecb-89ff-94baee7dd576;
PortOperationTime=0; State=1; SwitchOperationTime=0; VfpOperationTime=0;
parentId=42102725-f95c-4370-b01e-0819bd367057}
SharedContainers : {}
State : 1
Type : l2bridge
Version : 21474836481
VirtualNetwork : e8d2c084-7392-475c-a28c-6b8b3dcf5290
VirtualNetworkName : l2bridge

ActivityId : 15aef9d6-453a-4ef1-a4db-9c1185f10256
DNSServerList : 10.0.0.10
ID : 46658298-78e2-4ece-affc-392caa8c0e07
IPAddress : 10.244.0.6
IsRemoteEndpoint : True
MacAddress : 00:11:22:33:44:55
Name : Ethernet
Policies : {@{Type=L2Driver}}
Resources : @{AllocationOrder=1; Allocators=System.Object[]; ID=15aef9d6-453a-4ef1-a4db-9c1185f10256;
PortOperationTime=0; State=1; SwitchOperationTime=0; VfpOperationTime=0;
parentId=42102725-f95c-4370-b01e-0819bd367057}
SharedContainers : {}
State : 1
Type : l2bridge
Version : 21474836481
VirtualNetwork : e8d2c084-7392-475c-a28c-6b8b3dcf5290
VirtualNetworkName : l2bridge

ActivityId : f8da6e9f-b5ff-4ff1-ab2f-64dda6087702
CreateProcessingStartTime : 131576732010071713
DNSServerList : 10.0.0.10
GatewayAddress : 10.240.0.1
ID : 03c5b058-4179-4821-a686-b239542db7aa
IPAddress : 10.244.2.69
MacAddress : 00-15-5D-97-54-4E
Name : 8d6872a539f19d1dfd308e147f3c1c5b61880aeb50f34469f7b6a8720cdc2b5a_l2bridge
Policies : {@{ExceptionList=System.Object[]; Type=OutBoundNAT}, @{DestinationPrefix=10.0.0.0/16;
NeedEncap=True; Type=ROUTE}, @{Type=L2Driver}}
PrefixLength : 24
Resources : @{AllocationOrder=5; Allocators=System.Object[]; ID=f8da6e9f-b5ff-4ff1-ab2f-64dda6087702;
PortOperationTime=0; State=1; SwitchOperationTime=0; VfpOperationTime=0;
parentId=42102725-f95c-4370-b01e-0819bd367057}
SharedContainers : {}
StartTime : 131576732011411728
State : 3
Type : l2bridge
Version : 21474836481
VirtualNetwork : e8d2c084-7392-475c-a28c-6b8b3dcf5290
VirtualNetworkName : l2bridge

ActivityId : cf5ad1fb-232a-417b-a114-4ef4aecfdf40
CreateProcessingStartTime : 131576732010551722
DNSServerList : 10.0.0.10
GatewayAddress : 10.240.0.1
ID : 25c749ed-b093-44dc-8e95-51bf3c97800c
IPAddress : 10.244.2.195
MacAddress : 00-15-5D-97-57-F6
Name : 9fcd829dc8aee649e5d3a0c5800ed17156c2dd4dd65bd308bc5db9ae1a723412_l2bridge
Policies : {@{ExceptionList=System.Object[]; Type=OutBoundNAT}, @{DestinationPrefix=10.0.0.0/16;
NeedEncap=True; Type=ROUTE}, @{Type=L2Driver}}
PrefixLength : 24
Resources : @{AllocationOrder=5; Allocators=System.Object[]; ID=cf5ad1fb-232a-417b-a114-4ef4aecfdf40;
PortOperationTime=0; State=1; SwitchOperationTime=0; VfpOperationTime=0;
parentId=42102725-f95c-4370-b01e-0819bd367057}
SharedContainers : {}
StartTime : 131576732016971757
State : 3
Type : l2bridge
Version : 21474836481
VirtualNetwork : e8d2c084-7392-475c-a28c-6b8b3dcf5290
VirtualNetworkName : l2bridge

PS C:\Users\sdm> Get-HNSNetwork

ActivityId : 42102725-f95c-4370-b01e-0819bd367057
CurrentEndpointCount : 4
DNSServerList : 10.0.0.10
Extensions : {@{Id=e7c3b2f0-f3c5-48df-af2b-10fed6d72e7a; IsEnabled=False},
@{Id=e9b59cfa-2be1-4b21-828f-b6fbdbddc017; IsEnabled=True},
@{Id=ea24cd6c-d17a-4348-9190-09f0d5be83dd; IsEnabled=False}}
ID : e8d2c084-7392-475c-a28c-6b8b3dcf5290
LayeredOn : 21ec8194-0b4b-469c-a659-4647aef313ae
MacPools : {@{EndMacAddress=00-15-5D-97-5F-FF; StartMacAddress=00-15-5D-97-50-00}}
ManagementIP : 10.240.0.4
MaxConcurrentEndpoints : 4
Name : l2bridge
Policies : {}
Resources : @{AllocationOrder=0; ID=42102725-f95c-4370-b01e-0819bd367057; PortOperationTime=0; State=1;
SwitchOperationTime=0; VfpOperationTime=0; parentId=3017700c-8262-47e7-a525-b37d4590cb77}
State : 1
Subnets : {@{AddressPrefix=10.244.2.0/24; GatewayAddress=10.240.0.1}}
TotalEndpoints : 12
Type : l2bridge
Version : 21474836481

ActivityId : 8bbb780a-30ce-4c3a-93ef-bcb99e1e3bdb
AutomaticDNS : True
CurrentEndpointCount : 4
Extensions : {@{Id=e7c3b2f0-f3c5-48df-af2b-10fed6d72e7a; IsEnabled=False},
@{Id=e9b59cfa-2be1-4b21-828f-b6fbdbddc017; IsEnabled=False},
@{Id=ea24cd6c-d17a-4348-9190-09f0d5be83dd; IsEnabled=False}}
ID : fb79e97c-0210-466b-8535-9e663a07ce70
LayeredOn : ec438af6-d904-40f5-a63f-90c52ed3e82a
MacPools : {@{EndMacAddress=00-15-5D-12-BF-FF; StartMacAddress=00-15-5D-12-B0-00}}
MaxConcurrentEndpoints : 4
Name : nat
Policies : {}
Resources : @{AllocationOrder=2; Allocators=System.Object[]; ID=8bbb780a-30ce-4c3a-93ef-bcb99e1e3bdb;
PortOperationTime=0; State=1; SwitchOperationTime=0; VfpOperationTime=0;
parentId=60f20dcf-97df-467e-91bd-efb727e6b96a}
State : 1
Subnets : {@{AddressPrefix=172.30.160.0/20; GatewayAddress=172.30.160.1}}
TotalEndpoints : 4
Type : nat
Version : 21474836481

PS C:\Users\sdm>

@ghost

ghost commented Dec 14, 2017

@JiangtianLi Here is my input

TL;DR: I don't think the problem is DNS, but TCP and maybe even IP routing!

  • I spawned a fresh Kubernetes 1.8.4 cluster with 1 master, 1 Windows and 1 Linux nodes.
  • I instantiated Windows and Linux pods on that cluster. The Windows pods were based on these images:
    • microsoft/windowsservercore:1709
    • microsoft/iis:windowsservercore-1709
  • I performed the requested tests (and a bit more)
  • I rebooted the master, Windows and Linux nodes.
  • Re-performed the tests to see the difference.

Facts I observed:

  • Before the reboot, Windows pods could resolve public and cluster DNS names, but only with FQDN. Short names (existing Kubernetes service short names like 'redis' or 'elasticsearch') could not be resolved.
  • Before the reboot, Windows pods could consume these services with their FQDN or IP addresses.
  • After the reboot, Windows pods could not resolve any DNS name (short & FQDN cluster DNS and external DNS).
  • After the reboot, Windows pods could not access the services, even with their IP addresses.
  • Linux pods could always resolve short & FQDN cluster DNS and external DNS.
  • The Windows pods' default gateway is never in their IP subnet! This sounds very wrong.

Assumption:

  • Windows pods never could resolve service short names because they lack DNS suffix.

before the reboot:

Windows pod default gateway is in another subnet, this is weird.
Please also note the lack of DNS suffix:

PS C:\> ipconfig

Windows IP Configuration


Ethernet adapter vEthernet (ad8cb8ae02790020423bdaf52da34a16423046d63ec981715ec5ae4b73f5e515_l2bridge):

   Connection-specific DNS Suffix  . :
   Link-local IPv6 Address . . . . . : fe80::11ab:68f6:8022:4535%24
   IPv4 Address. . . . . . . . . . . : 10.244.2.147
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : 10.240.0.1
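
The off-subnet gateway can be checked mechanically. A minimal sketch using Python's standard ipaddress module, with the address, mask and gateway copied from the ipconfig output above; the helper name gateway_in_subnet is made up for illustration:

```python
import ipaddress


def gateway_in_subnet(ip, mask, gateway):
    """Return True if the gateway lies inside the subnet defined by ip/mask."""
    # strict=False lets us pass a host address; the host bits are masked off.
    network = ipaddress.ip_network(f"{ip}/{mask}", strict=False)
    return ipaddress.ip_address(gateway) in network


# Values taken from the Windows pod above: the pod subnet is 10.244.2.0/24.
print(gateway_in_subnet("10.244.2.147", "255.255.255.0", "10.240.0.1"))  # False
```

This only confirms the observation; whether an off-subnet gateway is actually a problem for an l2bridge network is not established in this thread.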

Port TCP/80 looks closed on 10.0.0.10:

PS C:\> Test-NetConnection 10.0.0.10 -Port 80
WARNING: TCP connect to (10.0.0.10 : 80) failed
WARNING: Ping to 10.0.0.10 failed with status: TimedOut


ComputerName           : 10.0.0.10
RemoteAddress          : 10.0.0.10
RemotePort             : 80
InterfaceAlias         : vEthernet (ad8cb8ae02790020423bdaf52da34a16423046d63ec981715ec5ae4b73f5e515_l2bridge)
SourceAddress          : 10.244.2.147
PingSucceeded          : False
PingReplyDetails (RTT) : 0 ms
TcpTestSucceeded       : False

I assume you wanted to see if port 53 was open, right? (DNS uses UDP/53 and TCP/53)
Windows pod can reach 10.0.0.10 on TCP/53:

PS C:\> Test-NetConnection 10.0.0.10 -Port 53


ComputerName     : 10.0.0.10
RemoteAddress    : 10.0.0.10
RemotePort       : 53
InterfaceAlias   : vEthernet (ad8cb8ae02790020423bdaf52da34a16423046d63ec981715ec5ae4b73f5e515_l2bridge)
SourceAddress    : 10.244.2.147
TcpTestSucceeded : True
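
For reference, the same kind of TCP reachability check can be run from a pod that has Python but no PowerShell. This is a rough equivalent of the TCP part of Test-NetConnection, not a reimplementation of it; the function name tcp_port_open and the 2 second timeout are arbitrary choices for this sketch.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection performs the full TCP handshake, then we close.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Calling tcp_port_open("10.0.0.10", 53) from inside a pod would correspond to the test above. Note that a successful TCP handshake on port 53 says nothing about UDP/53, which is what ordinary DNS lookups normally use, so a passing result here does not rule out a DNS failure over UDP.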

Service short name resolution fails:

PS C:\> Test-NetConnection elasticsearch -Port 9200
WARNING: Name resolution of elasticsearch failed


ComputerName   : elasticsearch
RemoteAddress  :
InterfaceAlias :
SourceAddress  :
PingSucceeded  : False

Service FQDN resolution and TCP handshaking work:

PS C:\> Test-NetConnection elasticsearch.default.svc.cluster.local -Port 9200


ComputerName     : elasticsearch.default.svc.cluster.local
RemoteAddress    : 10.0.98.129
RemotePort       : 9200
InterfaceAlias   : vEthernet (ad8cb8ae02790020423bdaf52da34a16423046d63ec981715ec5ae4b73f5e515_l2bridge)
SourceAddress    : 10.244.2.147
TcpTestSucceeded : True

On Linux pods, service short names do resolve:

# host redis
redis.default.svc.cluster.local has address 10.0.183.202
# host elasticsearch
elasticsearch.default.svc.cluster.local has address 10.0.98.129

This is because they have proper search suffixes:

# cat /etc/resolv.conf
nameserver 10.0.0.10
search default.svc.cluster.local svc.cluster.local cluster.local 0i2h2grqcmgepol0ns24oji1sd.ax.internal.cloudapp.net
options ndots:5
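
To make the effect of the search list concrete, here is a simplified sketch of how a Linux stub resolver turns a short name into candidate FQDNs using the search and ndots settings from the resolv.conf above. It follows the behaviour described in resolv.conf(5) but is not the real resolver; the function name candidate_names is made up for illustration.

```python
def candidate_names(name, search, ndots=5):
    """Return the names a resolver would try, in order, for a lookup of `name`."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search list is applied.
        return [name]
    searched = [f"{name}.{suffix}" for suffix in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [name] + searched
    # Too few dots: walk the search list first, the bare name last.
    return searched + [name]


search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for candidate in candidate_names("redis", search):
    print(candidate)
# redis.default.svc.cluster.local
# redis.svc.cluster.local
# redis.cluster.local
# redis
```

The first candidate is the FQDN that resolves on the Linux pod. With no DNS suffix configured, as on the Windows pods, the search list is effectively empty and only the bare name 'redis' would be tried, which matches the short-name failures observed above.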

Get-HnsEndpoint on Windows host:

PS C:\Windows\system32> Get-HnsEndpoint


ActivityId         : 0d78291a-c291-4b75-a684-84f7b993e3a0
ID                 : 1d570800-567a-40ab-8be1-d28b3f6fa56a
IPAddress          : 10.240.255.5
IsRemoteEndpoint   : True
MacAddress         : 00:11:22:33:44:55
Name               : Ethernet
Policies           : {@{Type=L2Driver}}
Resources          : @{AllocationOrder=1; Allocators=System.Object[];
                     ID=0d78291a-c291-4b75-a684-84f7b993e3a0; PortOperationTime=0; State=1;
                     SwitchOperationTime=0; VfpOperationTime=0;
                     parentId=db5afc5b-2147-4ad9-bdb8-66e44ff4fccb}
SharedContainers   : {}
State              : 1
Type               : L2Bridge
Version            : 21474836481
VirtualNetwork     : e30debd4-0c3b-49d7-adcd-968a1b6742c1
VirtualNetworkName : l2bridge

ActivityId         : ee9686c7-41a1-4e42-a156-7f65720e69a1
ID                 : 81b68d43-e8fd-4f3c-a9ec-7f2dcce10432
IPAddress          : 10.244.1.2
IsRemoteEndpoint   : True
MacAddress         : 00:11:22:33:44:55
Name               : Ethernet
Policies           : {@{Type=L2Driver}}
Resources          : @{AllocationOrder=1; Allocators=System.Object[];
                     ID=ee9686c7-41a1-4e42-a156-7f65720e69a1; PortOperationTime=0; State=1;
                     SwitchOperationTime=0; VfpOperationTime=0;
                     parentId=db5afc5b-2147-4ad9-bdb8-66e44ff4fccb}
SharedContainers   : {}
State              : 1
Type               : L2Bridge
Version            : 21474836481
VirtualNetwork     : e30debd4-0c3b-49d7-adcd-968a1b6742c1
VirtualNetworkName : l2bridge

ActivityId         : 4ae5b438-bc6a-4929-b11d-ae17b605ce89
ID                 : 82e4fb72-6130-4d22-89c0-13ab8eb0222d
IPAddress          : 10.244.1.6
IsRemoteEndpoint   : True
MacAddress         : 00:11:22:33:44:55
Name               : Ethernet
Policies           : {@{Type=L2Driver}}
Resources          : @{AllocationOrder=1; Allocators=System.Object[];
                     ID=4ae5b438-bc6a-4929-b11d-ae17b605ce89; PortOperationTime=0; State=1;
                     SwitchOperationTime=0; VfpOperationTime=0;
                     parentId=db5afc5b-2147-4ad9-bdb8-66e44ff4fccb}
SharedContainers   : {}
State              : 1
Type               : L2Bridge
Version            : 21474836481
VirtualNetwork     : e30debd4-0c3b-49d7-adcd-968a1b6742c1
VirtualNetworkName : l2bridge

ActivityId         : cfd4de44-9aff-4be6-bf62-d0992b497713
ID                 : 8ccd5492-13c3-497c-af08-10746224dc7f
IPAddress          : 10.244.1.7
IsRemoteEndpoint   : True
MacAddress         : 00:11:22:33:44:55
Name               : Ethernet
Policies           : {@{Type=L2Driver}}
Resources          : @{AllocationOrder=1; Allocators=System.Object[];
                     ID=cfd4de44-9aff-4be6-bf62-d0992b497713; PortOperationTime=0; State=1;
                     SwitchOperationTime=0; VfpOperationTime=0;
                     parentId=db5afc5b-2147-4ad9-bdb8-66e44ff4fccb}
SharedContainers   : {}
State              : 1
Type               : L2Bridge
Version            : 21474836481
VirtualNetwork     : e30debd4-0c3b-49d7-adcd-968a1b6742c1
VirtualNetworkName : l2bridge

ActivityId         : 7fb7c276-b6df-4932-8a02-81d6574dcf31
ID                 : 04aafe2d-c347-4970-a20f-d624659f63fa
IPAddress          : 10.244.1.3
IsRemoteEndpoint   : True
MacAddress         : 00:11:22:33:44:55
Name               : Ethernet
Policies           : {@{Type=L2Driver}}
Resources          : @{AllocationOrder=1; Allocators=System.Object[];
                     ID=7fb7c276-b6df-4932-8a02-81d6574dcf31; PortOperationTime=0; State=1;
                     SwitchOperationTime=0; VfpOperationTime=0;
                     parentId=db5afc5b-2147-4ad9-bdb8-66e44ff4fccb}
SharedContainers   : {}
State              : 1
Type               : L2Bridge
Version            : 21474836481
VirtualNetwork     : e30debd4-0c3b-49d7-adcd-968a1b6742c1
VirtualNetworkName : l2bridge

ActivityId         : 714e9422-d98b-4568-8946-c7194a19020a
ID                 : a7039a16-f121-422d-84cc-2ff309980286
IPAddress          : 10.244.1.4
IsRemoteEndpoint   : True
MacAddress         : 00:11:22:33:44:55
Name               : Ethernet
Policies           : {@{Type=L2Driver}}
Resources          : @{AllocationOrder=1; Allocators=System.Object[];
                     ID=714e9422-d98b-4568-8946-c7194a19020a; PortOperationTime=0; State=1;
                     SwitchOperationTime=0; VfpOperationTime=0;
                     parentId=db5afc5b-2147-4ad9-bdb8-66e44ff4fccb}
SharedContainers   : {}
State              : 1
Type               : L2Bridge
Version            : 21474836481
VirtualNetwork     : e30debd4-0c3b-49d7-adcd-968a1b6742c1
VirtualNetworkName : l2bridge

ActivityId                : ecb23e7b-74b0-4c39-be54-4fd16ad7a223
CreateProcessingStartTime : 131577230897120566
DNSServerList             : 10.0.0.10
GatewayAddress            : 10.240.0.1
ID                        : 2747b986-7ea1-444d-a774-f4af77600f4a
IPAddress                 : 10.244.2.147
MacAddress                : 00-15-5D-68-54-F8
Name                      : ad8cb8ae02790020423bdaf52da34a16423046d63ec981715ec5ae4b73f5e515_l2bri
                            dge
Policies                  : {@{ExceptionList=System.Object[]; Type=OutBoundNAT},
                            @{DestinationPrefix=10.0.0.0/16; NeedEncap=True; Type=ROUTE},
                            @{Type=L2Driver}}
PrefixLength              : 24
Resources                 : @{AllocationOrder=5; Allocators=System.Object[];
                            ID=ecb23e7b-74b0-4c39-be54-4fd16ad7a223; PortOperationTime=0;
                            State=1; SwitchOperationTime=0; VfpOperationTime=0;
                            parentId=db5afc5b-2147-4ad9-bdb8-66e44ff4fccb}
SharedContainers          : {}
StartTime                 : 131577230911599467
State                     : 3
Type                      : L2Bridge
Version                   : 21474836481
VirtualNetwork            : e30debd4-0c3b-49d7-adcd-968a1b6742c1
VirtualNetworkName        : l2bridge

ActivityId         : 21452d36-0a2e-498b-8f99-eb6a6b6dbda3
ID                 : 073e2a68-54e2-41a4-97c7-1f3049e14215
IPAddress          : 10.244.1.9
IsRemoteEndpoint   : True
MacAddress         : 00:11:22:33:44:55
Name               : Ethernet
Policies           : {@{Type=L2Driver}}
Resources          : @{AllocationOrder=1; Allocators=System.Object[];
                     ID=21452d36-0a2e-498b-8f99-eb6a6b6dbda3; PortOperationTime=0; State=1;
                     SwitchOperationTime=0; VfpOperationTime=0;
                     parentId=db5afc5b-2147-4ad9-bdb8-66e44ff4fccb}
SharedContainers   : {}
State              : 1
Type               : L2Bridge
Version            : 21474836481
VirtualNetwork     : e30debd4-0c3b-49d7-adcd-968a1b6742c1
VirtualNetworkName : l2bridge

ActivityId                : b216a0ad-595a-4416-8137-23f3943ae145
CreateProcessingStartTime : 131577308537453378
DNSServerList             : 10.0.0.10
GatewayAddress            : 10.240.0.1
ID                        : ade25b42-ca87-4212-aded-dbe13e7833f9
IPAddress                 : 10.244.2.187
MacAddress                : 00-15-5D-68-52-FB
Name                      : 73c36bd96149106b501ecb7a92d0d049147f882ff847156377f61c6851f58649_l2bri
                            dge
Policies                  : {@{ExceptionList=System.Object[]; Type=OutBoundNAT},
                            @{DestinationPrefix=10.0.0.0/16; NeedEncap=True; Type=ROUTE},
                            @{Type=L2Driver}}
PrefixLength              : 24
Resources                 : @{AllocationOrder=5; Allocators=System.Object[];
                            ID=b216a0ad-595a-4416-8137-23f3943ae145; PortOperationTime=0;
                            State=1; SwitchOperationTime=0; VfpOperationTime=0;
                            parentId=db5afc5b-2147-4ad9-bdb8-66e44ff4fccb}
SharedContainers          : {}
StartTime                 : 131577308546143397
State                     : 3
Type                      : L2Bridge
Version                   : 21474836481
VirtualNetwork            : e30debd4-0c3b-49d7-adcd-968a1b6742c1
VirtualNetworkName        : l2bridge

ActivityId         : 9b39f4e4-d8e5-4776-b9e3-c347918e3258
ID                 : 12d0b7fc-1e69-4217-8349-68906b1c6d13
IPAddress          : 10.244.1.8
IsRemoteEndpoint   : True
MacAddress         : 00:11:22:33:44:55
Name               : Ethernet
Policies           : {@{Type=L2Driver}}
Resources          : @{AllocationOrder=1; Allocators=System.Object[];
                     ID=9b39f4e4-d8e5-4776-b9e3-c347918e3258; PortOperationTime=0; State=1;
                     SwitchOperationTime=0; VfpOperationTime=0;
                     parentId=db5afc5b-2147-4ad9-bdb8-66e44ff4fccb}
SharedContainers   : {}
State              : 1
Type               : L2Bridge
Version            : 21474836481
VirtualNetwork     : e30debd4-0c3b-49d7-adcd-968a1b6742c1
VirtualNetworkName : l2bridge

Get-HNSNetwork on Windows host:

PS C:\Windows\system32> Get-HNSNetwork


ActivityId             : a1a3a490-eb67-45dc-ab87-65d8601196b9
AutomaticDNS           : True
CurrentEndpointCount   : 0
Extensions             : {@{Id=e7c3b2f0-f3c5-48df-af2b-10fed6d72e7a; IsEnabled=False},
                         @{Id=e9b59cfa-2be1-4b21-828f-b6fbdbddc017; IsEnabled=False},
                         @{Id=ea24cd6c-d17a-4348-9190-09f0d5be83dd; IsEnabled=False}}
ID                     : f90d81ce-c6f7-4155-8460-32e57a8bc3a0
LayeredOn              : eb80cb89-94b7-482c-ac00-79761fe76399
MacPools               : {@{EndMacAddress=00-15-5D-5C-7F-FF; StartMacAddress=00-15-5D-5C-70-00}}
MaxConcurrentEndpoints : 0
Name                   : nat
Policies               : {}
Resources              : @{AllocationOrder=2; Allocators=System.Object[];
                         ID=a1a3a490-eb67-45dc-ab87-65d8601196b9; PortOperationTime=0; State=1;
                         SwitchOperationTime=0; VfpOperationTime=0;
                         parentId=88aa3fb7-d50c-45f2-bc07-8647eddfad85}
State                  : 1
Subnets                : {@{AddressPrefix=172.21.144.0/20; GatewayAddress=172.21.144.1}}
TotalEndpoints         : 0
Type                   : nat
Version                : 21474836481

ActivityId             : db5afc5b-2147-4ad9-bdb8-66e44ff4fccb
CurrentEndpointCount   : 0
Extensions             : {@{Id=e7c3b2f0-f3c5-48df-af2b-10fed6d72e7a; IsEnabled=False},
                         @{Id=e9b59cfa-2be1-4b21-828f-b6fbdbddc017; IsEnabled=True},
                         @{Id=ea24cd6c-d17a-4348-9190-09f0d5be83dd; IsEnabled=False}}
ID                     : e30debd4-0c3b-49d7-adcd-968a1b6742c1
LayeredOn              : 94dc20f0-5425-4421-be0f-74a619e06a70
MacPools               : {@{EndMacAddress=00-15-5D-68-5F-FF; StartMacAddress=00-15-5D-68-50-00}}
ManagementIP           : 10.240.0.4
MaxConcurrentEndpoints : 2
Name                   : l2bridge
Policies               : {}
Resources              : @{AllocationOrder=0; ID=db5afc5b-2147-4ad9-bdb8-66e44ff4fccb;
                         PortOperationTime=0; State=1; SwitchOperationTime=0; VfpOperationTime=0;
                         parentId=265324da-48d7-42fc-aa8e-2cb60195f4c4}
State                  : 1
Subnets                : {@{AddressPrefix=10.244.2.0/24; GatewayAddress=10.240.0.1}}
TotalEndpoints         : 2
Type                   : L2Bridge
Version                : 21474836481


After the reboot:

Windows pods still have no DNS suffix and a weird gateway:

PS C:\> ipconfig

Windows IP Configuration


Ethernet adapter vEthernet (8081e06f80b31934e8fd81bc92727c1ea608d8a4f14b76491b7fb828d5666e31_l2bridge):

   Connection-specific DNS Suffix  . :
   Link-local IPv6 Address . . . . . : fe80::2dc6:fce9:2bfe:7d4c%29
   IPv4 Address. . . . . . . . . . . : 10.244.2.11
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : 10.240.0.1

Windows pods can't reach TCP/80 on 10.0.0.10 (but I assume this is ok):

PS C:\> Test-NetConnection 10.0.0.10 -Port 80
WARNING: TCP connect to (10.0.0.10 : 80) failed
WARNING: Ping to 10.0.0.10 failed with status: TimedOut


ComputerName           : 10.0.0.10
RemoteAddress          : 10.0.0.10
RemotePort             : 80
InterfaceAlias         : vEthernet (8081e06f80b31934e8fd81bc92727c1ea608d8a4f14b76491b7fb828d5666e31_l2bridge)
SourceAddress          : 10.244.2.11
PingSucceeded          : False
PingReplyDetails (RTT) : 0 ms
TcpTestSucceeded       : False

But now, they can't reach TCP/53 on 10.0.0.10 anymore:

PS C:\> Test-NetConnection 10.0.0.10 -Port 53
WARNING: TCP connect to (10.0.0.10 : 53) failed
WARNING: Ping to 10.0.0.10 failed with status: TimedOut


ComputerName           : 10.0.0.10
RemoteAddress          : 10.0.0.10
RemotePort             : 53
InterfaceAlias         : vEthernet (8081e06f80b31934e8fd81bc92727c1ea608d8a4f14b76491b7fb828d5666e31_l2bridge)
SourceAddress          : 10.244.2.11
PingSucceeded          : False
PingReplyDetails (RTT) : 0 ms
TcpTestSucceeded       : False

Service short names are still an issue:

PS C:\> Test-NetConnection elasticsearch -Port 9200
WARNING: Name resolution of elasticsearch failed


ComputerName   : elasticsearch
RemoteAddress  :
InterfaceAlias :
SourceAddress  :
PingSucceeded  : False

Service long names, too:

PS C:\> Test-NetConnection elasticsearch.default.svc.cluster.local -Port 9200
WARNING: Name resolution of elasticsearch.default.svc.cluster.local failed


ComputerName   : elasticsearch.default.svc.cluster.local
RemoteAddress  :
InterfaceAlias :
SourceAddress  :
PingSucceeded  : False

Even connection to the service ip address fails:

PS C:\> Test-NetConnection 10.0.98.129 -Port 9200
WARNING: TCP connect to (10.0.98.129 : 9200) failed
WARNING: Ping to 10.0.98.129 failed with status: TimedOut


ComputerName           : 10.0.98.129
RemoteAddress          : 10.0.98.129
InterfaceAlias         : vEthernet (8081e06f80b31934e8fd81bc92727c1ea608d8a4f14b76491b7fb828d5666e31_l2bridge)
SourceAddress          : 10.244.2.11
PingSucceeded          : False
PingReplyDetails (RTT) : 0 ms
TcpTestSucceeded       : False

While Linux pods do not have any DNS or IP issue:

# host elasticsearch
elasticsearch.default.svc.cluster.local has address 10.0.98.129
# curl elasticsearch:9200
{
  "name" : "a_XYYv5",
  "cluster_name" : "default",
  "cluster_uuid" : "WFixfDjZTqeCZH_nttpUaQ",
  "version" : {
    "number" : "5.4.3",
    "build_hash" : "eed30a8",
    "build_date" : "2017-06-22T00:34:03.743Z",
    "build_snapshot" : false,
    "lucene_version" : "6.5.1"
  },
  "tagline" : "You Know, for Search"
}

Get-HnsEndpoint on Windows host:

PS C:\Windows\system32> Get-HnsEndpoint


ActivityId         : 36a69ed0-23ac-4be1-a0b8-20a9e843940c
ID                 : 49df7852-084c-4272-91d9-b3d5c6b65692
IPAddress          : 10.244.1.7
IsRemoteEndpoint   : True
MacAddress         : 00:11:22:33:44:55
Name               : Ethernet
Policies           : {@{Type=L2Driver}}
Resources          : @{AllocationOrder=1; Allocators=System.Object[];
                     ID=36a69ed0-23ac-4be1-a0b8-20a9e843940c; PortOperationTime=0; State=1;
                     SwitchOperationTime=0; VfpOperationTime=0;
                     parentId=db5afc5b-2147-4ad9-bdb8-66e44ff4fccb}
SharedContainers   : {}
State              : 1
Type               : L2Bridge
Version            : 21474836481
VirtualNetwork     : e30debd4-0c3b-49d7-adcd-968a1b6742c1
VirtualNetworkName : l2bridge

ActivityId         : 8750eb27-1d95-4069-ab1c-250d30199ee7
ID                 : 448b1e0e-603a-4cde-a0ec-ff609616f0e4
IPAddress          : 10.244.1.5
IsRemoteEndpoint   : True
MacAddress         : 00:11:22:33:44:55
Name               : Ethernet
Policies           : {@{Type=L2Driver}}
Resources          : @{AllocationOrder=1; Allocators=System.Object[];
                     ID=8750eb27-1d95-4069-ab1c-250d30199ee7; PortOperationTime=0; State=1;
                     SwitchOperationTime=0; VfpOperationTime=0;
                     parentId=db5afc5b-2147-4ad9-bdb8-66e44ff4fccb}
SharedContainers   : {}
State              : 1
Type               : L2Bridge
Version            : 21474836481
VirtualNetwork     : e30debd4-0c3b-49d7-adcd-968a1b6742c1
VirtualNetworkName : l2bridge

ActivityId         : 22cadb1b-b3f8-4721-a64c-63aa627c6096
ID                 : d7b9b3cf-d882-4d4e-b0af-917c711873ad
IPAddress          : 10.240.255.5
IsRemoteEndpoint   : True
MacAddress         : 00:11:22:33:44:55
Name               : Ethernet
Policies           : {@{Type=L2Driver}}
Resources          : @{AllocationOrder=1; Allocators=System.Object[];
                     ID=22cadb1b-b3f8-4721-a64c-63aa627c6096; PortOperationTime=0; State=1;
                     SwitchOperationTime=0; VfpOperationTime=0;
                     parentId=db5afc5b-2147-4ad9-bdb8-66e44ff4fccb}
SharedContainers   : {}
State              : 1
Type               : L2Bridge
Version            : 21474836481
VirtualNetwork     : e30debd4-0c3b-49d7-adcd-968a1b6742c1
VirtualNetworkName : l2bridge

ActivityId                : 23408461-7f56-47c6-a9ee-5a39387d63a0
CreateProcessingStartTime : 131577587184355850
DNSServerList             : 10.0.0.10
GatewayAddress            : 10.240.0.1
ID                        : 723d7ca1-4ac5-4455-9862-2e5e291bb21d
IPAddress                 : 10.244.2.65
MacAddress                : 00-15-5D-68-57-9D
Name                      : f06cbe3d6a97ac2a315bc0d570c19a070a36a5c6a509f2474480df53aed52b4d_l2bri
                            dge
Policies                  : {@{ExceptionList=System.Object[]; Type=OutBoundNAT},
                            @{DestinationPrefix=10.0.0.0/16; NeedEncap=True; Type=ROUTE},
                            @{Type=L2Driver}}
PrefixLength              : 24
Resources                 : @{AllocationOrder=5; Allocators=System.Object[];
                            ID=23408461-7f56-47c6-a9ee-5a39387d63a0; PortOperationTime=0;
                            State=1; SwitchOperationTime=0; VfpOperationTime=0;
                            parentId=db5afc5b-2147-4ad9-bdb8-66e44ff4fccb}
SharedContainers          : {}
StartTime                 : 131577587185285580
State                     : 3
Type                      : L2Bridge
Version                   : 21474836481
VirtualNetwork            : e30debd4-0c3b-49d7-adcd-968a1b6742c1
VirtualNetworkName        : l2bridge

ActivityId         : eb4e3441-d579-4282-9bb9-af1dc9ea40d0
ID                 : 4282a630-033e-43e4-a09d-3397dc3a8ba9
IPAddress          : 10.244.1.2
IsRemoteEndpoint   : True
MacAddress         : 00:11:22:33:44:55
Name               : Ethernet
Policies           : {@{Type=L2Driver}}
Resources          : @{AllocationOrder=1; Allocators=System.Object[];
                     ID=eb4e3441-d579-4282-9bb9-af1dc9ea40d0; PortOperationTime=0; State=1;
                     SwitchOperationTime=0; VfpOperationTime=0;
                     parentId=db5afc5b-2147-4ad9-bdb8-66e44ff4fccb}
SharedContainers   : {}
State              : 1
Type               : L2Bridge
Version            : 21474836481
VirtualNetwork     : e30debd4-0c3b-49d7-adcd-968a1b6742c1
VirtualNetworkName : l2bridge

ActivityId         : cc37d973-dd66-41ea-ba59-fa7c6baeadda
ID                 : ef95ce57-546c-49b4-a48a-6f264fca4b9d
IPAddress          : 10.244.1.6
IsRemoteEndpoint   : True
MacAddress         : 00:11:22:33:44:55
Name               : Ethernet
Policies           : {@{Type=L2Driver}}
Resources          : @{AllocationOrder=1; Allocators=System.Object[];
                     ID=cc37d973-dd66-41ea-ba59-fa7c6baeadda; PortOperationTime=0; State=1;
                     SwitchOperationTime=0; VfpOperationTime=0;
                     parentId=db5afc5b-2147-4ad9-bdb8-66e44ff4fccb}
SharedContainers   : {}
State              : 1
Type               : L2Bridge
Version            : 21474836481
VirtualNetwork     : e30debd4-0c3b-49d7-adcd-968a1b6742c1
VirtualNetworkName : l2bridge

ActivityId         : f9a51406-c4cb-4e5c-a896-31f428b6d386
ID                 : 9fa7dcdb-b2bf-46b3-bdf1-aead9f092963
IPAddress          : 10.244.1.4
IsRemoteEndpoint   : True
MacAddress         : 00:11:22:33:44:55
Name               : Ethernet
Policies           : {@{Type=L2Driver}}
Resources          : @{AllocationOrder=1; Allocators=System.Object[];
                     ID=f9a51406-c4cb-4e5c-a896-31f428b6d386; PortOperationTime=0; State=1;
                     SwitchOperationTime=0; VfpOperationTime=0;
                     parentId=db5afc5b-2147-4ad9-bdb8-66e44ff4fccb}
SharedContainers   : {}
State              : 1
Type               : L2Bridge
Version            : 21474836481
VirtualNetwork     : e30debd4-0c3b-49d7-adcd-968a1b6742c1
VirtualNetworkName : l2bridge

ActivityId                : 675f6cec-487a-4b34-bfa1-33efb987eb33
CreateProcessingStartTime : 131577587187235057
DNSServerList             : 10.0.0.10
GatewayAddress            : 10.240.0.1
ID                        : e0210fea-cfa8-4afe-a138-db209b917afa
IPAddress                 : 10.244.2.11
MacAddress                : 00-15-5D-68-52-3B
Name                      : 8081e06f80b31934e8fd81bc92727c1ea608d8a4f14b76491b7fb828d5666e31_l2bri
                            dge
Policies                  : {@{ExceptionList=System.Object[]; Type=OutBoundNAT},
                            @{DestinationPrefix=10.0.0.0/16; NeedEncap=True; Type=ROUTE},
                            @{Type=L2Driver}}
PrefixLength              : 24
Resources                 : @{AllocationOrder=5; Allocators=System.Object[];
                            ID=675f6cec-487a-4b34-bfa1-33efb987eb33; PortOperationTime=0;
                            State=1; SwitchOperationTime=0; VfpOperationTime=0;
                            parentId=db5afc5b-2147-4ad9-bdb8-66e44ff4fccb}
SharedContainers          : {}
StartTime                 : 131577587188107744
State                     : 3
Type                      : L2Bridge
Version                   : 21474836481
VirtualNetwork            : e30debd4-0c3b-49d7-adcd-968a1b6742c1
VirtualNetworkName        : l2bridge

ActivityId         : dc37aea3-d6e3-419b-8e87-cf0011b6277f
ID                 : c4c813ed-9e2c-4671-a346-8204ae9f4465
IPAddress          : 10.244.1.3
IsRemoteEndpoint   : True
MacAddress         : 00:11:22:33:44:55
Name               : Ethernet
Policies           : {@{Type=L2Driver}}
Resources          : @{AllocationOrder=1; Allocators=System.Object[];
                     ID=dc37aea3-d6e3-419b-8e87-cf0011b6277f; PortOperationTime=0; State=1;
                     SwitchOperationTime=0; VfpOperationTime=0;
                     parentId=db5afc5b-2147-4ad9-bdb8-66e44ff4fccb}
SharedContainers   : {}
State              : 1
Type               : L2Bridge
Version            : 21474836481
VirtualNetwork     : e30debd4-0c3b-49d7-adcd-968a1b6742c1
VirtualNetworkName : l2bridge

ActivityId         : 8bcbf84b-cdf5-4d55-b2e8-55db38a2b9a0
ID                 : 638174de-5731-4955-beff-a430cf487f72
IPAddress          : 10.244.1.8
IsRemoteEndpoint   : True
MacAddress         : 00:11:22:33:44:55
Name               : Ethernet
Policies           : {@{Type=L2Driver}}
Resources          : @{AllocationOrder=1; Allocators=System.Object[];
                     ID=8bcbf84b-cdf5-4d55-b2e8-55db38a2b9a0; PortOperationTime=0; State=1;
                     SwitchOperationTime=0; VfpOperationTime=0;
                     parentId=db5afc5b-2147-4ad9-bdb8-66e44ff4fccb}
SharedContainers   : {}
State              : 1
Type               : L2Bridge
Version            : 21474836481
VirtualNetwork     : e30debd4-0c3b-49d7-adcd-968a1b6742c1
VirtualNetworkName : l2bridge
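
A note for anyone comparing the two endpoint listings: the CreateProcessingStartTime and StartTime values look like Windows FILETIME ticks (100-nanosecond intervals since 1601-01-01 UTC). That is an assumption based on their magnitude, not something the output states. Under that assumption, a minimal Python sketch (the helper name is hypothetical) turns them into readable timestamps, which makes it easier to tell which endpoints were created before the reboot and which after:

```python
from datetime import datetime, timedelta, timezone

# Assumption: HNS reports these fields as Windows FILETIME values,
# i.e. counts of 100 ns ticks since 1601-01-01 00:00:00 UTC.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_utc(ticks: int) -> datetime:
    """Hypothetical helper: convert FILETIME ticks to an aware UTC datetime."""
    # Ten ticks make one microsecond; integer division keeps the math exact.
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


if __name__ == "__main__":
    # CreateProcessingStartTime values copied from the listings in this thread.
    before_reboot = filetime_to_utc(131577230897120566)
    after_reboot = filetime_to_utc(131577587184355850)
    print("before reboot:", before_reboot.isoformat())
    print("after reboot: ", after_reboot.isoformat())
    print("gap:", after_reboot - before_reboot)
```

If the assumption holds, both values fall on 14 December 2017 (UTC), roughly ten hours apart.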

Get-HNSNetwork on Windows host:

PS C:\Windows\system32> Get-HNSNetwork


ActivityId             : a1a3a490-eb67-45dc-ab87-65d8601196b9
AutomaticDNS           : True
CurrentEndpointCount   : 0
Extensions             : {@{Id=e7c3b2f0-f3c5-48df-af2b-10fed6d72e7a; IsEnabled=False},
                         @{Id=e9b59cfa-2be1-4b21-828f-b6fbdbddc017; IsEnabled=False},
                         @{Id=ea24cd6c-d17a-4348-9190-09f0d5be83dd; IsEnabled=False}}
ID                     : f90d81ce-c6f7-4155-8460-32e57a8bc3a0
LayeredOn              : eb80cb89-94b7-482c-ac00-79761fe76399
MacPools               : {@{EndMacAddress=00-15-5D-5C-7F-FF; StartMacAddress=00-15-5D-5C-70-00}}
MaxConcurrentEndpoints : 0
Name                   : nat
Policies               : {}
Resources              : @{AllocationOrder=2; Allocators=System.Object[];
                         ID=a1a3a490-eb67-45dc-ab87-65d8601196b9; PortOperationTime=0; State=1;
                         SwitchOperationTime=0; VfpOperationTime=0;
                         parentId=88aa3fb7-d50c-45f2-bc07-8647eddfad85}
State                  : 1
Subnets                : {@{AddressPrefix=172.21.144.0/20; GatewayAddress=172.21.144.1}}
TotalEndpoints         : 0
Type                   : nat
Version                : 21474836481

ActivityId             : db5afc5b-2147-4ad9-bdb8-66e44ff4fccb
CurrentEndpointCount   : 1
Extensions             : {@{Id=e7c3b2f0-f3c5-48df-af2b-10fed6d72e7a; IsEnabled=False},
                         @{Id=e9b59cfa-2be1-4b21-828f-b6fbdbddc017; IsEnabled=True},
                         @{Id=ea24cd6c-d17a-4348-9190-09f0d5be83dd; IsEnabled=False}}
ID                     : e30debd4-0c3b-49d7-adcd-968a1b6742c1
LayeredOn              : 94dc20f0-5425-4421-be0f-74a619e06a70
MacPools               : {@{EndMacAddress=00-15-5D-68-5F-FF; StartMacAddress=00-15-5D-68-50-00}}
ManagementIP           : 10.240.0.4
MaxConcurrentEndpoints : 2
Name                   : l2bridge
Policies               : {}
Resources              : @{AllocationOrder=0; ID=db5afc5b-2147-4ad9-bdb8-66e44ff4fccb;
                         PortOperationTime=0; State=1; SwitchOperationTime=0; VfpOperationTime=0;
                         parentId=265324da-48d7-42fc-aa8e-2cb60195f4c4}
State                  : 1
Subnets                : {@{AddressPrefix=10.244.2.0/24; GatewayAddress=10.240.0.1}}
TotalEndpoints         : 4
Type                   : L2Bridge
Version                : 21474836481

Kind regards,

O.

@JiangtianLi
Contributor

@odauby Thanks for the detailed observation. It seems there are two issues. One is resolving unqualified names; the other is that rebooting the VM, and therefore restarting the kubelet/kube-proxy services, disrupts the network. We'll look into it. The default gateway is by design, since the Windows node uses CNI.
/cc @madhanrm
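
To make the addressing in the output above easier to follow, here is a minimal Python sketch that checks which reported prefix each address falls into. All addresses are copied from the Get-HnsEndpoint and Get-HNSNetwork output in this thread; the helper name and the labels it returns are hypothetical. It only shows that the gateway 10.240.0.1 lies outside the pod subnet, and that both the DNS server and the elasticsearch service IP fall inside the ROUTE policy's DestinationPrefix, so reaching them presumably depends on that HNS policy rather than on the default gateway.

```python
import ipaddress

# Values copied from the HNS output in this thread.
POD_SUBNET = ipaddress.ip_network("10.244.2.0/24")   # l2bridge Subnets.AddressPrefix
ROUTE_PREFIX = ipaddress.ip_network("10.0.0.0/16")   # ROUTE policy DestinationPrefix
GATEWAY = ipaddress.ip_address("10.240.0.1")         # GatewayAddress
POD_IP = ipaddress.ip_address("10.244.2.11")         # Windows pod used for the tests


def classify(target: str) -> str:
    """Hypothetical helper: name the reported prefix that contains target."""
    addr = ipaddress.ip_address(target)
    if addr in POD_SUBNET:
        return "pod subnet"
    if addr in ROUTE_PREFIX:
        return "ROUTE policy prefix"
    return "neither (default gateway)"


if __name__ == "__main__":
    print("gateway inside pod subnet:", GATEWAY in POD_SUBNET)
    for label, ip in [
        ("DNS server (DNSServerList)", "10.0.0.10"),
        ("elasticsearch service IP", "10.0.98.129"),
        ("remote Linux pod", "10.244.1.3"),
    ]:
        print(f"{label} {ip}: {classify(ip)}")
```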

@brunsgaard

@odauby I see the exact same thing with acs-engine built from master today and Kubernetes 1.9.1.

@JiangtianLi Let me know if you need more debug information regarding this issue. I will be happy to help.

@Noirax90

I'm also having the issue described. I originally deployed a Kubernetes hybrid cluster running 1.8.4 and everything worked great for a week, but then I had to reboot the Windows machines; after the restart, DNS and TCP stopped working.

I've also tried upgrading to version 1.9.1 to see if it would solve the problem, but no luck there.

However, in my case it is only related to outgoing traffic: I have a service that points to a pod running IIS, and it can serve traffic fine, but only for pages that do not try to access an external database or similar, for obvious reasons. I'm not sure if the same applies for @brunsgaard.

I've also noticed that if I log in to the host via RDP, I cannot ping any external addresses, even though DNS resolves properly and Test-NetConnection -Port 80 from the host succeeds:

PS C:\Users\azureuser> ping google.com

Pinging google.com [172.217.20.110] with 32 bytes of data:
Request timed out.
Request timed out.
Request timed out.
Request timed out.

Ping statistics for 172.217.20.110:
    Packets: Sent = 4, Received = 0, Lost = 4 (100% loss),
PS C:\Users\azureuser> nslookup google.com
Server:  UnKnown
Address:  168.63.129.16

Non-authoritative answer:
Name:    google.com
Addresses:  2a00:1450:4013:c00::65
          172.217.20.110
PS C:\Users\azureuser> Test-NetConnection google.com -Port 80


ComputerName     : google.com
RemoteAddress    : 172.217.20.110
RemotePort       : 80
InterfaceAlias   : vEthernet (Ethernet 2)
SourceAddress    : 10.240.0.5
TcpTestSucceeded : True

@JiangtianLi
Contributor

@KaptenMorot For the issue with reboot, we are aware of it and working with the Windows team on it. Meanwhile, one mitigation is to restart the HNS network on Windows, e.g., Get-HnsNetworks | ? Name -eq l2Bridge | Remove-HnsNetwork

For the ping issue, I think ping packets are blocked from Azure VM nodes. I can't ping www.google.com from the master node either.

@ghost

ghost commented Jan 17, 2018

Just tried with the freshly released acs-engine v0.12.0, same result.
The DNS issue looks quite similar to this one; someone proposed to consume the kube-dns pods instead of the kube-dns service, and this works for me with just a couple of PowerShell commands within the Windows containers:

$adapter=Get-NetAdapter
Set-DnsClientServerAddress -InterfaceIndex $adapter.ifIndex -ServerAddresses 10.244.0.2,10.244.0.3

# in case you need short DNS name resolution (we have services running in the default namespace)
Set-DnsClient -InterfaceIndex $adapter.ifIndex -ConnectionSpecificSuffix "default.svc.cluster.local"
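
Since the kube-dns pod IPs hard-coded above can change whenever those pods are rescheduled, the commands may need regenerating. The following Python sketch shows one way to do that. It assumes the pod list comes from something like `kubectl get pods -n kube-system -l k8s-app=kube-dns -o json` (the label selector is an assumption and may differ per cluster), and both function names are hypothetical. It only builds the same PowerShell commands as text; it does not run them.

```python
import json
from typing import List


def dns_pod_ips(pod_list_json: str) -> List[str]:
    """Hypothetical helper: pull status.podIP from a kubectl JSON pod list."""
    pods = json.loads(pod_list_json)
    ips = []
    for item in pods.get("items", []):
        ip = item.get("status", {}).get("podIP")
        if ip:  # pods that are still pending have no IP yet
            ips.append(ip)
    return ips


def build_workaround(ips: List[str], suffix: str = "default.svc.cluster.local") -> str:
    """Hypothetical helper: render the PowerShell workaround from this thread."""
    if not ips:
        raise ValueError("no kube-dns pod IPs found")
    return "\n".join([
        "$adapter=Get-NetAdapter",
        "Set-DnsClientServerAddress -InterfaceIndex $adapter.ifIndex "
        "-ServerAddresses " + ",".join(ips),
        "Set-DnsClient -InterfaceIndex $adapter.ifIndex "
        f'-ConnectionSpecificSuffix "{suffix}"',
    ])


if __name__ == "__main__":
    # Sample input shaped like kubectl's JSON output; the IPs match the ones above.
    sample = json.dumps({"items": [
        {"status": {"podIP": "10.244.0.2"}},
        {"status": {"podIP": "10.244.0.3"}},
    ]})
    print(build_workaround(dns_pod_ips(sample)))
```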

@JiangtianLi
Contributor

@odauby There is currently an issue with the service VIP on Windows nodes, so using the pod IP instead of the cluster IP is indeed the workaround. We are going to roll out the patch ASAP.

@cypres

cypres commented Feb 6, 2018

Seems related to #2027 and possibly #2174

@jbiel

jbiel commented Feb 27, 2018

@JiangtianLi - we are (intermittently) running into the issue where our Windows containers cannot communicate with service IPs. This issue can occur on a fresh node that didn't previously have an HNS interface created. Do you have any more information on the patch that you referenced in your Jan 17th comment? Thanks.

Our environment:

@JiangtianLi
Contributor

@jbiel Can you use https://github.com/Microsoft/SDN/blob/master/Kubernetes/windows/hns.psm1 to run

Get-HnsEndpoints | ConvertTo-Json -depth 10
Get-HnsPolicyLists | ConvertTo-Json -depth 10

on the Windows node, run Resolve-DnsName www.bing.com in a Windows container, and share the output?

@stale

stale bot commented Mar 9, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution. Note that acs-engine is deprecated--see https://github.com/Azure/aks-engine instead.

@stale stale bot added the stale label Mar 9, 2019
@stale stale bot closed this as completed Mar 16, 2019