DNS resolution issues when connected to OpenVPN #12012

Open
esp1 opened this issue Jul 8, 2021 · 25 comments · May be fixed by #13878
Labels
bug, os/darwin

Comments

@esp1

esp1 commented Jul 8, 2021

Describe the bug
The vault command line tool does not resolve VPN hosts when connected to OpenVPN. Note that OpenVPN is configured with a split DNS setup and does not modify /etc/resolv.conf to add in nameservers for VPN hosts.

To Reproduce
Steps to reproduce the behavior:

  1. Log in to vault, e.g. with env VAULT_ADDR=https://internal.vpn.host vault login -method=ldap username=edwin
  2. See error Error authenticating: Put "https://internal.vpn.host/v1/auth/ldap/login/edwin": dial tcp: lookup vault.prod.factual.com on 192.168.1.1:53: no such host

Expected behavior
Should see message Success! You are now authenticated.

Environment:

  • Vault Server Version (retrieve with vault status): 1.4.2
  • Vault CLI Version (retrieve with vault version): v1.7.3 ('5d517c864c8f10385bf65627891bc7ef55f5e827+CHANGES')
  • Client Operating System/Architecture: macOS 11.4/x86_64

Additional context
I am experiencing this issue on the above version of vault CLI installed via homebrew on a Mac.

I believe this is due to the same issue as hashicorp/terraform#3536 that was fixed in hashicorp/terraform#5925. The problem and solution are summarized in hashicorp/terraform#3536 (comment):

The issue is that Mac OS X native net dns resolver goes directly to resolv.conf and our vpn client does not update the resolv.conf since it split tunnels the queries based on dns suffix. We fixed the issue by having it build using this command:
export CGO_ENABLED=1; XC_OS="darwin" XC_ARCH="amd64" make bin

I am able to work around this issue by manually editing /etc/resolv.conf to use the VPN nameservers, or by putting the IP address of the vault server into /etc/hosts.
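To illustrate the quoted explanation, here is a minimal sketch of what "goes directly to resolv.conf" means in practice. It lists the nameservers found in /etc/resolv.conf, which, per the comment quoted above, are the ones a pure-Go build ends up querying. The parser is deliberately simplified and is an assumption for illustration, not the logic the Go standard library actually uses.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// nameservers returns the addresses found on "nameserver" lines of a
// resolv.conf-style file. Simplified on purpose: comments and every
// other directive (search, options, ...) are ignored.
func nameservers(conf string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(conf))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "nameserver" {
			out = append(out, fields[1])
		}
	}
	return out
}

func main() {
	data, err := os.ReadFile("/etc/resolv.conf")
	if err != nil {
		fmt.Println("could not read /etc/resolv.conf:", err)
		return
	}
	for _, ns := range nameservers(string(data)) {
		fmt.Println("nameserver:", ns)
	}
}
```

If the VPN's nameservers are missing from this list while `curl` still resolves the VPN host, the split-DNS configuration is probably only visible to the system resolver.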

@ncabatoff added the bug and os/darwin labels Jul 8, 2021
@ncabatoff
Collaborator

Hi @esp1,

Thanks for reporting this.

I could be wrong, but I don't think we control the Vault build in homebrew, unless you use our own tap: https://github.com/hashicorp/homebrew-tap

Regardless, from what I understand you would have the same issue with our official binaries, because we don't build them on MacOS, we cross-compile them on Linux. This means they don't include the MacOS resolver bits you need.

The good news is that there's a plan in the works to change this. I don't have a timeline, but the design doc was circulating internally just this week, so I don't think it's that far off.

@heatherezell
Contributor

@esp1 I checked in with our release engineering team, and they're going to be addressing this issue as part of an effort that should be landing sometime around September. This may not be the exact time it lands, but it's definitely coming. :)

@rjhornsby
Contributor

Thanks for the reply @hsimon-hashicorp, genuinely. Have been fighting this issue for months, maybe longer. Good to know it's getting worked.

As @esp1 notes, we've seen this in Terraform. It's something to do with the underlying Go stock DNS library. It yells YOLO at the mDNS resolver, figures out what upstream DNS server mDNS would use, and calls it directly. I can't even get /etc/hosts to work correctly because of how the library behaves. I use PiHole for my upstream DNS, which has an option to basically force an A record into the cache.

That mostly works until the AWS load balancer, whose IP address is now hardcoded into PiHole (or /etc/hosts), changes without warning. Then you spend half the day blaming everything in sight: AD DNS servers, VPN server config, VPN client config, weird routing issues, probably the cat because they would do something like this to mess with us, until it dawns on you to check the PiHole config to see if there's a "rogue" entry that is overriding what you're expecting to get back from your corporate DNS servers. Because Golang. #smh

@heatherezell
Contributor

@rjhornsby That sounds like a pain, to be sure. I'll keep an eye on this one, and please feel free to come along and bump it as you need.

@tandrup

tandrup commented Oct 6, 2021

@hsimon-hashicorp Is there any news from the release engineering team regarding when a fix will land?

@rjhornsby
Contributor

Just went through another round of "...why isn't DNS resolution behaving properly? wait ... why does it seem to be vault and consul specifically having issues?" before vaguely remembering this bug. Any progress by chance?

@ncabatoff
Collaborator

Just went through another round of "...why isn't DNS resolution behaving properly? wait ... why does it seem to be vault and consul specifically having issues?" before vaguely remembering this bug. Any progress by chance?

Yes! I think we neglected to include this in the changelog, but the latest releases should include the fix for this, as per #13728. I'll see about updating CHANGELOG.

@ncabatoff
Collaborator

Fixed in #13728.

@rjhornsby
Contributor

I'm not sure if I'm doing something wrong, but this doesn't seem to be fixed:

$ vault --version
Vault v1.9.3 ('7dbdd57243a0d8d9d9e07cd01eb657369f8e1b8a+CHANGES')
$ vault status
Error checking seal status: Get "https://vault.mycorp.com:8200/v1/sys/seal-status": dial tcp: lookup vault.mycorp.com on 192.168.3.7:53: no such host

192.168.3.7 is the local (non VPN) DNS server.

However, immediately following up, curl behaves properly:

$ curl https://vault.mycorp.com:8200
<a href="/ui/">Temporary Redirect</a>.

For curl to get a redirect like that, it has to use the VPN DNS and go through the VPN tunnel, because vault is not reachable from the public internet. This suggests that the problem is in the vault binary itself, not in the network or the VPN configuration.

I've tried both vault 1.9.3 (homebrew) and compiling my own locally using export CGO_ENABLED=1; XC_OS="darwin" XC_ARCH="amd64" make dev[1]. I'm getting the same result for both vault binaries.

[1] go v1.17.6 produces Vault v1.10.0-dev ('057c67f969805a51e944898163aeff069d6a2e37') (cgo)

@ncabatoff
Collaborator

Note that we don't control the default homebrew vault, though we do have a homebrew tap: https://github.com/hashicorp/homebrew-tap

Compiling your own locally the way we do would use CGO_ENABLED=0 and add the build tag netcgo, as per https://github.com/hashicorp/vault/blob/main/.github/workflows/build.yml#L206-L206. Not sure offhand how to do it using make, since the build target is more intended for our release automation, but you could try doing what that code is doing.

Or you could go to https://releases.hashicorp.com/vault/1.9.3/.

@rjhornsby
Contributor

thanks for the feedback. that helps me understand what's going on.

I tried using both the tap
==> Downloading https://releases.hashicorp.com/vault/1.9.3/vault_1.9.3_darwin_amd64.zip

and - to be sure - grabbing the same binary(?) directly from https://releases.hashicorp.com/vault/1.9.3/, but got the same failed DNS results for both.

However, looking at the build fragment you linked, I was able to compile 1.10 from master like so:

$ GO_TAGS=netcgo make dev

and that ... worked. Name resolution does what is expected. This also works for the v1.9.3 tag.

From reading through all the different threads, it seemed that CGO_ENABLED alone should have worked, but it's the netcgo tag that did it? It's also not clear what might be different about how I built it locally vs the HashiCorp official binary.

One of the things I did notice is that on mine I get (cgo) at the end of the version string, whereas with the HashiCorp version I don't:

$ ~/tmp/bin/vault --version
Vault v1.9.3 ('7dbdd57243a0d8d9d9e07cd01eb657369f8e1b8a') (cgo)
$ ~/Downloads/vault --version
Vault v1.9.3 (7dbdd57243a0d8d9d9e07cd01eb657369f8e1b8a)

@ncabatoff
Collaborator

We don't want to set CGO_ENABLED=1 as that has a bunch of consequences. The fact that you have cgo in your version string makes me wonder if maybe you had the env var populated when you ran make dev?

Re-opening since it sounds like our fix didn't work.

@ncabatoff reopened this Feb 2, 2022
@rjhornsby
Contributor

The fact that you have cgo in your version string makes me wonder if maybe you had the env var populated when you ran make dev?

I went back and checked, and I think you're right about my having CGO_ENABLED set in the environment. I recompiled 1.9.3 (7dbdd5724), intentionally making sure CGO_ENABLED was not set, and the resulting binary failed DNS lookups. I compiled again with both CGO_ENABLED and the netcgo tag, and DNS lookups succeeded with that binary.

@ncabatoff
Collaborator

I was under the impression that it was possible to get proper DNS lookups on darwin using CGO_ENABLED=0 and -tags netcgo, but that's looking to be untrue. In hindsight it seems obvious that "netcgo" requires CGO. We'll try to sort this out for the next release, sorry!

@archoversight

@ncabatoff I left a comment on your PR: CGO_ENABLED=1 needs to be set or it will not work.

@archoversight

Adding a link to this comment regarding cross-compilation for ARM64 from AMD64 on macOS CI: golang/go#12524 (comment)

@schwoerb

It appears that this is having issues again.

@CGamesPlay

The binaries distributed by Hashicorp and the ones installed by Homebrew all have this issue on the latest versions (v1.12.0). When building myself from source (not cross-compiling), the issue persists:

$ make
...

$ bin/vault status
Error checking seal status: Get "https://vault.service.consul:8200/v1/sys/seal-status": dial tcp: lookup vault.service.consul on [2001:558:feed::1]:53: no such host

$ otool -L bin/vault
bin/vault:
	/usr/lib/libSystem.B.dylib (compatibility version 0.0.0, current version 0.0.0)
	/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation (compatibility version 0.0.0, current version 0.0.0)
	/System/Library/Frameworks/Security.framework/Versions/A/Security (compatibility version 0.0.0, current version 0.0.0)

When following the workaround mentioned above (this one), the issue is resolved:

$ CGO_ENABLED=1 GOARCH=arm64 make
...

$ bin/vault status
Key                     Value
---                     -----
Seal Type               shamir
Initialized             true
Sealed                  false
Total Shares            1
Threshold               1
Version                 1.11.2
Build Date              2022-07-29T09:48:47Z
Storage Type            raft
Cluster Name            vault-cluster-7d0a318b
Cluster ID              432e615e-9ca5-522a-e48d-7dc069f1a1bd
HA Enabled              true
HA Cluster              https://172.30.0.1:8201/
HA Mode                 active
Active Since            2022-08-30T07:50:53.025121063Z
Raft Committed Index    11074
Raft Applied Index      11074

$ otool -L bin/vault
bin/vault:
	/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation (compatibility version 150.0.0, current version 1858.112.0)
	/System/Library/Frameworks/IOKit.framework/Versions/A/IOKit (compatibility version 1.0.0, current version 275.0.0)
	/System/Library/Frameworks/Security.framework/Versions/A/Security (compatibility version 1.0.0, current version 60158.100.133)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1311.100.3)

Note that a quick way to see if the issue will manifest, at least on my system when not cross-compiling, is to check for the presence of IOKit.framework in the listing of otool -L. I believe that this library is only coincidentally included and doesn't have to do with the issue at hand, but it may be useful as a smoke test for verifying that the version of Vault is going to work with the system's DNS configuration.

  • go version go1.19.2 darwin/amd64
  • macOS 12.6 (21G115)
  • uname -mpv - Darwin Kernel Version 21.6.0: Mon Aug 22 20:19:52 PDT 2022; root:xnu-8020.140.49~2/RELEASE_ARM64_T6000 arm64 arm
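The otool smoke test described above can be scripted. This is only a sketch of that heuristic: `linksLibrary` is a hypothetical helper, the program assumes the output of `otool -L` is piped to it on standard input, and the IOKit.framework marker is taken from the observation in this comment, not from any documented guarantee.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// linksLibrary reports whether `otool -L` output mentions the given
// library or framework name. (Hypothetical helper name.)
func linksLibrary(otoolOutput, name string) bool {
	return strings.Contains(otoolOutput, name)
}

func main() {
	// Intended use: otool -L bin/vault | go run smoke.go
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Println("could not read stdin:", err)
		return
	}
	// Heuristic only: the marker library is an assumption from this thread.
	fmt.Println("IOKit.framework linked:", linksLibrary(string(data), "IOKit.framework"))
}
```

The marker name is passed as a parameter so that a different library can be checked if the heuristic turns out to vary between Go versions.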

@erSitzt

erSitzt commented Dec 5, 2022

Same here with

Installed via brew
Vault v1.12.1 ('e34f8a14fb7a88af4640b09f3ddbb5646b946d9c+CHANGES'), built 2022-10-27T12:32:05Z

on a Mac Mini M1 (macOS 13.0.1)

@archoversight

The version installed from Homebrew is once again failing to resolve correctly:

vault --version
Vault v1.12.2 ('415e1fe3118eebd5df6cb60d13defdc01aa17b03+CHANGES'), built 2022-11-23T12:53:46Z

@rjhornsby
Contributor

@archoversight, confirmed. Also made sure I got vault from the hashicorp tap.

$ brew install hashicorp/tap/vault
==> Installing vault from hashicorp/tap
...
$ vault status   # bypasses local mDNS resolver config
Error checking seal status: Get "https://vault.mycorp.com:8200/v1/sys/seal-status": dial tcp: lookup vault.mycorp.com on 192.168.3.7:53: no such host
...
$ curl https://vault.mycorp.com   # uses domain-appropriate DNS servers
<html>
<head><title>301 Moved Permanently</title></head>

For now, I get vault working by maintaining a static DNS entry in the local DNS server (192.168.3.7), but that's obviously brittle.

@const-tmp

const-tmp commented Jan 24, 2023

Looks like the same issue:

➜  ~ vault version
Vault v1.12.2 ('415e1fe3118eebd5df6cb60d13defdc01aa17b03+CHANGES'), built 2022-11-23T12:53:46Z
➜  ~ cat .zshrc
...
export VAULT_ADDR=http://vault.service.consul:8200
export NOMAD_ADDR=http://nomad.service.consul:4646
export CONSUL_HTTP_ADDR=http://consul.service.consul:8500
➜  ~ dig @10.27.96.4 -p 8600 vault.service.consul. ANY
...
vault.service.consul.	0	IN	A	10.27.96.3
➜  ~ vault status
Error checking seal status: Get "http://vault.service.consul:8200/v1/sys/seal-status": dial tcp: lookup vault.service.consul on 8.8.8.8:53: no such host

But Consul and Nomad work with the same setup:

➜  ~ consul members
Node            Address          Status  Type    Build   Protocol  DC   Partition  Segment
consul-0        10.27.96.4:8301  alive   server  1.14.3  2         dc1  default    <all>
nomad-0         10.27.96.6:8301  alive   client  1.14.3  2         dc1  default    <default>
nomad-client-0  10.27.96.5:8301  alive   client  1.14.3  2         dc1  default    <default>
vault-0         10.27.96.3:8301  alive   client  1.14.3  2         dc1  default    <default>

➜  ~ nomad job status
No running jobs

@erSitzt

erSitzt commented Feb 8, 2023

This is still happening with 1.12.3

❯ vault status
Error checking seal status: Get "https://vault.mydomain.com/v1/sys/seal-status": dial tcp: lookup vault.mydomain.com on 8.8.8.8:53: no such host
❯ 
❯ 
❯ vault version
Vault v1.12.3 ('209b3dd99fe8ca320340d08c70cff5f620261f9b+CHANGES'), built 2023-02-02T09:07:27Z

@ncabatoff
Collaborator

Vault 1.13 is going to use Go 1.20, which should allow for an easy fix. I suspect it won't be fixed in 1.13.0 (though maybe?), but I aim to address it by 1.13.1 at least.

@ncabatoff added this to the 1.13.1 milestone Feb 8, 2023
@anwittin modified the milestones: 1.13.1, 1.13.2 Mar 27, 2023
@ncabatoff removed this from the 1.13.2 milestone Apr 21, 2023
@ncabatoff
Collaborator

Going by

Note that a quick way to see if the issue will manifest, at least on my system when not cross-compiling, is to check for the presence of IOKit.framework in the listing of otool -L. I believe that this library is only coincidentally included and doesn't have to do with the issue at hand, but it may be useful as a smoke test for verifying that the version of Vault is going to work with the system's DNS configuration.

I've checked and it seems that we're good now:

$ which vault
/Users/ncc/go/bin/vault
$ vault version
Vault v1.15.1 (b94e275f25ccd9011146d14c00ea9e49fd5032dc), built 2023-10-20T19:16:11Z
$ otool -L ~/go/bin/vault
/Users/ncc/go/bin/vault:
        /usr/lib/libSystem.B.dylib (compatibility version 0.0.0, current version 0.0.0)
        /usr/lib/libresolv.9.dylib (compatibility version 0.0.0, current version 0.0.0)
        /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation (compatibility version 0.0.0, current version 0.0.0)
        /System/Library/Frameworks/Security.framework/Versions/A/Security (compatibility version 0.0.0, current version 0.0.0)

Would one of the impacted people above care to verify this in their environments?
