EKS Windows node pods unable to resolve DNS to Linux Pods #606

Closed
krishnaputhran opened this issue Nov 27, 2019 · 13 comments
Labels
EKS Amazon Elastic Kubernetes Service

Comments

@krishnaputhran

Tell us about your request

  1. Created EKS cluster
  2. Added Linux worker nodes
  3. Added Windows worker nodes
  4. Deployed services to Linux worker nodes -> working fine
  5. Deployed services to Windows worker nodes
  6. Service in Windows worker node needs to talk to a database deployed in Linux worker node
  7. Windows service fails to resolve the DNS name of the database
  8. Pinging the IP address works fine.
  9. Verified that all the security groups are good and both Linux and Windows worker nodes are in the same group.
  10. Problem is similar to the one explained in [eks] [issue]: Windows Pods not able to resolve internal k8s services #236.
  11. But I am still facing the same issue.
  12. After trying some of the workarounds mentioned there, such as running the command below, pinging the modified DNS name works:
    Set-DnsClientGlobalSetting -SuffixSearchList @("default.svc.cluster.local", "svc.cluster.local", "cluster.local", "us-east1.compute.internal")
    That is, I have to append ".default.svc.cluster.local" to my original DNS name. For example, mysql-0.mysql fails, but mysql-0.mysql.default.svc.cluster.local works. I don't think that's how it's supposed to be; Linux services work without this suffix.
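
As a rough illustration of the naming involved in the steps above (a sketch, not part of the original report): a StatefulSet pod behind a headless service gets a DNS name of the form pod.service.namespace.svc.<cluster domain>. The Python helper below builds that fully qualified name. The function name `statefulset_pod_fqdn` is hypothetical, and the `cluster.local` cluster domain is an assumption taken from the output shown in this thread.

```python
def statefulset_pod_fqdn(pod, service, namespace="default",
                         cluster_domain="cluster.local", trailing_dot=False):
    """Build the fully qualified DNS name of a StatefulSet pod.

    Hypothetical helper: the layout pod.service.namespace.svc.<cluster domain>
    matches the names seen in this thread; cluster_domain is an assumption.
    """
    name = ".".join([pod, service, namespace, "svc", cluster_domain])
    # A trailing dot marks the name as absolute, so no search list is applied.
    return name + "." if trailing_dot else name


print(statefulset_pod_fqdn("mysql-0", "mysql"))
# prints: mysql-0.mysql.default.svc.cluster.local
```

An application running in a Windows pod could use a name built this way in its connection settings instead of the short mysql-0.mysql form.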

Which service(s) is this request for?
EKS

Are you currently working around this issue?
Yes, for a long time.

@mikestef9 mikestef9 added the EKS Amazon Elastic Kubernetes Service label Nov 27, 2019
@vsiddharth
Contributor

Can you provide us with the below mentioned details:

  • AMI ID
  • CNI Config (Windows Node)
  • HNS Policy and Endpoint list (JSON formatted) from the windows node
  • ipconfig /all from within the pod/container
  • Resolve-DNSName (w and w/o DNS suffix) from within the pod/container

@krishnaputhran
Author

krishnaputhran commented Nov 29, 2019

First, I would like to mention that the instance failing DNS resolution is a StatefulSet deployment in which one of the services is a headless service. Linux nodes don't have any issue with this. Please find the requested details below.

- AMI ID: ami-034770f7a9c1471e4
- ipconfig /all from within the pod/container :

Windows IP Configuration

Host Name . . . . . . . . . . . . : windows-server-iis-66bf9745b-lsl52
Primary Dns Suffix . . . . . . . :
Node Type . . . . . . . . . . . . : Hybrid
IP Routing Enabled. . . . . . . . : No
WINS Proxy Enabled. . . . . . . . : No
DNS Suffix Search List. . . . . . : default.svc.cluster.local
                                    svc.cluster.local
                                    cluster.local
                                    us-east-1.compute.internal

Ethernet adapter vEthernet (cid-abcb2ffd84f59e562878e32ba19a89dc8ab0cbb35414808ff281ddec2e71945d):

Connection-specific DNS Suffix . : default.svc.cluster.local
Description . . . . . . . . . . . : Hyper-V Virtual Ethernet Adapter
Physical Address. . . . . . . . . : 00-15-5D-03-A9-B5
DHCP Enabled. . . . . . . . . . . : No
Autoconfiguration Enabled . . . . : Yes
Link-local IPv6 Address . . . . . : fe80::34cc:6a6d:cfd5:5122%24(Preferred)
IPv4 Address. . . . . . . . . . . : 192.168.190.89(Preferred)
Subnet Mask . . . . . . . . . . . : 255.255.192.0
Default Gateway . . . . . . . . . : 192.168.128.1
DNS Servers . . . . . . . . . . . : 10.100.0.10
NetBIOS over Tcpip. . . . . . . . : Disabled
Connection-specific DNS Suffix Search List :
                                    default.svc.cluster.local
                                    svc.cluster.local
                                    cluster.local

- Resolve-DNSName (w and w/o DNS suffix) from within the pod/container

<< Without DNS suffix >>
PS C:> ping mysql-0.mysql
Ping request could not find host mysql-0.mysql. Please check the name and try again.

<< With DNS suffix >>
PS C:> ping mysql-0.mysql.default.svc.cluster.local

Pinging mysql-0.mysql.default.svc.cluster.local [192.168.70.251] with 32 bytes of data:
Reply from 192.168.70.251: bytes=32 time<1ms TTL=254
Reply from 192.168.70.251: bytes=32 time<1ms TTL=254
Reply from 192.168.70.251: bytes=32 time<1ms TTL=254
Reply from 192.168.70.251: bytes=32 time=2ms TTL=254

Ping statistics for 192.168.70.251:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 0ms, Maximum = 2ms, Average = 0ms

CNI Config (Windows Node) : Not sure how to fetch these details.
HNS Policy and Endpoint list (JSON formatted) from the windows node << Not sure how to fetch these details >>

@krishnaputhran
Author

@vsiddharth Do you have any inputs on this? Waiting for your resolution

@somujay

somujay commented Dec 5, 2019

When you say the DNS lookup doesn't work, how are you validating it? Only with the ping command? Please run the following command inside your container and share the result:
Resolve-DNSName mysql-0.mysql

@krishnaputhran
Author

krishnaputhran commented Dec 6, 2019

Hi @somujay, thanks for the response. Please find the DNS resolution results below.

PS C:> resolve-dnsname mysql-0.mysql
resolve-dnsname : mysql-0.mysql : DNS name does not exist
At line:1 char:1
+ resolve-dnsname mysql-0.mysql
    + CategoryInfo          : ResourceUnavailable: (mysql-0.mysql:String) [Resolve-DnsName], Win32Exception
    + FullyQualifiedErrorId : DNS_ERROR_RCODE_NAME_ERROR,Microsoft.DnsClient.Commands.ResolveDnsName

================================================================
When the fully qualified DNS name is provided:

PS C:> resolve-dnsname mysql-0.mysql.default.svc.cluster.local

Name                                      Type  TTL  Section  IPAddress
----                                      ----  ---  -------  ---------
mysql-0.mysql.default.svc.cluster.local   A     5    Answer   192.168.70.251

=================================================================

As additional information, there is an activemq server running on another Linux node in the same cluster. For that service, DNS resolution with the short name succeeds.

PS C:> resolve-dnsname activemq

Name                                 Type  TTL  Section  IPAddress
----                                 ----  ---  -------  ---------
activemq.default.svc.cluster.local   A     5    Answer   10.100.63.59
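
The comparison above (short name, partially qualified name, fully qualified name) can also be scripted. The following is a minimal Python sketch, assuming a Python interpreter is available inside the container, which the thread does not confirm. The function name `check_names` is hypothetical; the resolver is passed in as a parameter so the logic can be exercised without a live cluster.

```python
import socket


def check_names(names, resolve=socket.gethostbyname):
    """Try to resolve each name and return {name: ip_or_None}.

    Hypothetical helper. `resolve` defaults to the system resolver, so the
    result reflects the pod's own DNS suffix and search-list behaviour.
    """
    results = {}
    for name in names:
        try:
            results[name] = resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = None
    return results
```

Calling `check_names(["activemq", "mysql-0.mysql", "mysql-0.mysql.default.svc.cluster.local"])` inside the pod would return a dictionary showing which forms resolve, with `None` for the ones that fail.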


@krishnaputhran
Author

@somujay Do you have any help on this?

@fincd-aws

This is a Windows limitation:

"On a Windows pod, you can resolve both kubernetes.default.svc.cluster.local and kubernetes, but not the in-betweens, like kubernetes.default or kubernetes.default.svc"

So "service.namespace" is not expected to work on Windows pods. You shouldn't need to run Set-DnsClientGlobalSetting inside the pod for the FQDN to work, though; only "service" and the full FQDN resolve on Windows. Likewise, you can't just change the pod DNS mode to get Partially-Qualified Domain Name resolution either (same link):

"ClusterFirstWithHostNet is not supported for DNS. Windows treats all names with a '.' as a FQDN and skips PQDN resolution"

Apps in a different namespace from their database need to use the FQDN. Please also add the final dot, so that the resolver does not walk the search path and make extra, unnecessary DNS requests.
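
To make the difference concrete, here is a simplified Python model of which names each resolver would query. It is a sketch, not the actual resolver logic of either operating system: the Windows rule follows the quoted statement that any name containing a '.' is treated as an FQDN, and the Linux rule follows the usual search-list behaviour with an `ndots` threshold (the value 5 is an assumption, not shown in this thread). Both function names are hypothetical.

```python
def windows_candidates(name, suffixes):
    """Names a Windows pod would query, per the limitation quoted above."""
    if "." in name:
        # Any dot means "treat as FQDN": the suffix list is skipped entirely.
        return [name]
    return [f"{name}.{suffix}" for suffix in suffixes]


def linux_candidates(name, suffixes, ndots=5):
    """Names a Linux resolver with a search list would query (simplified)."""
    if name.endswith("."):
        # A trailing dot marks an absolute name: no search list is applied.
        return [name]
    expanded = [f"{name}.{suffix}" for suffix in suffixes]
    if name.count(".") >= ndots:
        return [name] + expanded
    return expanded + [name]


# Suffix list taken from the ipconfig output earlier in this thread.
search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
print(windows_candidates("mysql-0.mysql", search))
print(linux_candidates("mysql-0.mysql", search))
```

In this model the Windows list for mysql-0.mysql contains only the name itself, which does not exist in cluster DNS, while the Linux list starts with mysql-0.mysql.default.svc.cluster.local. That matches the behaviour reported in this issue.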

@ExitoLab

@mikestef9 @vsiddharth

I am currently having an issue: Windows pods cannot resolve .default.svc.cluster.local names. However, I was able to resolve them when I exec into the Windows pods.

Question:
@krishnaputhran, you mentioned a workaround. Can I implement your workaround without having to exec into the Windows container/pod?

I am currently using CoreDNS.

@lalwanin2020

EKS relies on core-dns/kube-dns for DNS resolution.
The core-dns pods run on the EKS Linux worker nodes in the kube-system namespace.

EKS Windows relies on the above mechanism for DNS resolution. Please ensure that the core-dns pods are reachable from the EKS Windows worker nodes by adjusting the security groups if required.

When pods are scheduled onto an EKS Windows worker node, the CNI plugin creates HNS Endpoints with the required DNS details, including both the nameservers (using DNSClusterIP) and a DNS Suffix Search List.

Refer to https://github.com/aws/amazon-vpc-cni-plugins/blob/master/plugins/vpc-shared-eni/network/bridge_windows.go#L165 for more details.

@lalwanin2020

The customer reached out to Somu Jayabalan personally, was able to resolve the issue by updating the Kubernetes Jenkins plugin, and reported that EKS was working fine.
Hence, closing this issue.

@ExitoLab

ExitoLab commented Aug 4, 2020

@lalwanin2020

Yes, you can close the issue. Thanks

@Hakob

Hakob commented Nov 4, 2020

Hi all,
I had the same issue: my EKS Windows nodes were unable to resolve either internal or external domain names. After a few hours of research, I found a workaround, but only for resolving public domain names.
I hope these issues will be fixed in an upcoming release.

Run this line before starting your Windows container (I chose Google's DNS, but the choice is up to you):
Set-DnsClientServerAddress -interfacealias vEthernet* -serveraddresses ("8.8.8.8,10.100.0.10")

@MuruganShanmugam

Env: AWS EKS 1.18
https://docs.aws.amazon.com/eks/latest/userguide/windows-support.html
https://docs.aws.amazon.com/eks/latest/userguide/launch-windows-workers.html

I would appreciate it if someone could clarify my understanding and possibly resolve my problem. I am aware that there is a limitation with Windows containers where PQDNs are not supported. But I am having trouble resolving even the leaf name; for example, "ping mysql" doesn't resolve.

https://kubernetes.io/docs/setup/production-environment/windows/intro-windows-in-kubernetes/#dns-limitations

On Linux, you have a DNS suffix list, which is used when trying to resolve PQDNs. On Windows, we only have 1 DNS suffix, which is the DNS suffix associated with that pod's namespace (mydns.svc.cluster.local for example). Windows can resolve FQDNs and services or names resolvable with just that suffix. For example, a pod spawned in the default namespace, will have the DNS suffix default.svc.cluster.local. On a Windows pod, you can resolve both kubernetes.default.svc.cluster.local and kubernetes, but not the in-betweens, like kubernetes.default or kubernetes.default.svc.

From a Windows node attached to the k8s cluster:

PS C:\Users\Administrator> Get-DnsClientGlobalSetting
UseSuffixSearchList : True
SuffixSearchList : {us-west-2.ec2-utilities.amazonaws.com, us-west-2.compute.internal, dev.aws.xxxxx.com}
UseDevolution : True
DevolutionLevel : 0

From a sample Windows application running in a pod on the Windows node:

PS C:> Get-DnsClientGlobalSetting
UseSuffixSearchList : False
SuffixSearchList : {}
UseDevolution : True
DevolutionLevel : 0

PS C:> ping mysql
Ping request could not find host mysql. Please check the name and try again.
PS C:> ping mysql.dev.aws.xxxxx.com

Pinging i-xxxxx.dev.aws.xxxxx.com [10.XX.XX.XXX] with 32 bytes of data:
Reply from 10.XX.XX.XXX: bytes=32 time<1ms TTL=64
Reply from 10.XX.XX.XXX: bytes=32 time<1ms TTL=64
Reply from 10.XX.XX.XXX: bytes=32 time<1ms TTL=64

Ping statistics for 10.XX.XX.XXX:
Packets: Sent = 3, Received = 3, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 0ms, Maximum = 0ms, Average = 0ms

If I add the DNS suffix using Set-DnsClientGlobalSetting -SuffixSearchList @(""), ping mysql resolves.

I don't see any issues with Linux pods, and I cannot set the DNS suffix search list when creating the pod because it varies for each environment (dev, ci, qa, staging, prod).

I can confirm the following:

  1. Windows node can talk to the linux nodes - Security groups rule looks good
  2. A ping from a Windows pod with just the leaf name reaches the CoreDNS pod, as I can see some log entries for it.

Any help or comment is appreciated.
Thanks
