
Vault health check fails with "unsupported path" after config reload #14233

Closed

t-davies opened this issue Aug 23, 2022 · 15 comments

t-davies commented Aug 23, 2022

Nomad version

Nomad v1.3.2 (bf602974112964e9691729f3f0716ff2bcdb3b44)

Operating system and Environment details

Issue

Nomad is unable to call Vault's health check endpoint when its Vault token has expired. This should be possible, since that Vault endpoint is not namespaced and does not require a token; the Vault client making the call should not be attempting to send credentials.

There also appears to have been some weird behaviour where Nomad didn't correctly renew the Vault token; I'm not sure whether that's a bug or down to configuration on our end (see the logs below). Even if it is, Nomad should still be able to call the health check endpoint with an expired token - Nomad believing that Vault is unhealthy causes further problems.

We previously reported something similar, although we had some other errors at the time that made the cause a bit ambiguous; see #13710.
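
For reference, sys/health is one of Vault's unauthenticated endpoints, so a bare request with no token and no namespace header is expected to get an answer. A minimal sketch in Go (the address is a placeholder; this is illustration only, not Nomad code):

package main

import (
	"fmt"
	"net/http"
)

func main() {
	// No X-Vault-Token or X-Vault-Namespace headers are attached here;
	// sys/health should still respond with a status code that reflects
	// Vault's state (200 active, 429 standby, 501 uninitialized, 503 sealed, ...).
	resp, err := http.Get("https://vault.example.com:8200/v1/sys/health")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.StatusCode)
}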

Reproduction steps

  • Configure Nomad with a Vault token that will expire, wait for the token to expire and then check the logs.

Expected Result

  • Nomad should be able to call Vault's health check endpoint regardless of whether the Vault token has expired.

Actual Result

  • Nomad fails to call the Vault health check endpoint after the token expires.

Nomad Server logs (if appropriate)

2022-08-20T20:52:19.216Z [INFO]  nomad.vault: successfully renewed token: next_renewal=37m43.499968891s
2022-08-20T21:07:25.877Z [ERROR] nomad.event_broker: failed resolving ACL for secretID, closing subscriptions: error="no token for secret ID"
2022-08-20T21:29:54.291Z [ERROR] nomad.event_broker: failed resolving ACL for secretID, closing subscriptions: error="no token for secret ID"
2022-08-20T21:30:03.412Z [INFO]  nomad.vault: successfully renewed token: next_renewal=18m51.499975211s
2022-08-20T21:35:37.207Z [ERROR] nomad.event_broker: failed resolving ACL for secretID, closing subscriptions: error="no token for secret ID"
2022-08-20T21:48:55.502Z [INFO]  nomad.vault: successfully renewed token: next_renewal=9m25.499982629s
2022-08-20T21:49:38.045Z [ERROR] nomad.event_broker: failed resolving ACL for secretID, closing subscriptions: error="no token for secret ID"
2022-08-20T21:58:21.825Z [INFO]  nomad.vault: successfully renewed token: next_renewal=4m42.499986896s
2022-08-20T22:03:04.943Z [INFO]  nomad.vault: successfully renewed token: next_renewal=2m20.999986928s
2022-08-20T22:05:26.596Z [INFO]  nomad.vault: successfully renewed token: next_renewal=1m9.999986194s
2022-08-20T22:06:37.274Z [INFO]  nomad.vault: successfully renewed token: next_renewal=34.499988446s
2022-08-20T22:07:12.121Z [INFO]  nomad.vault: successfully renewed token: next_renewal=16.999987714s
2022-08-20T22:07:29.469Z [INFO]  nomad.vault: successfully renewed token: next_renewal=8.49998822s
2022-08-20T22:07:38.562Z [INFO]  nomad.vault: successfully renewed token: next_renewal=3.999988201s
2022-08-20T22:07:42.918Z [INFO]  nomad.vault: successfully renewed token: next_renewal=1.999987027s
2022-08-20T22:07:45.263Z [INFO]  nomad.vault: successfully renewed token: next_renewal=499.989036ms
2022-08-20T22:07:47.258Z [WARN]  nomad.vault: got error or bad auth, so backing off:
error=
| failed to renew the vault token: Error making API request.
|
| Namespace: xxxx
| URL: PUT https://vault.xxxxxxxxxx.xxxx.xxx/v1/auth/token/renew-self
| Code: 403. Errors:
|
| * permission denied
recoverable=false

...

2022-08-21T10:28:30.945Z [WARN]  nomad.vault: failed to contact Vault API: retry=30s
error=
| Error making API request.
|
| Namespace: xxx
| URL: GET https://vault.xxxxxxxxxx.xxxx.xxx/v1/sys/health?drsecondarycode=299&performancestandbycode=299&sealedcode=299&sta
| Code: 404. Errors:
|
| * unsupported path
2022-08-21T10:28:32.717Z [INFO]  nomad.rpc: closing server RPC connection
2022-08-21T10:28:53.092Z [ERROR] nomad.event_broker: failed resolving ACL for secretID, closing subscriptions: error="no token for secret ID"
2022-08-21T10:29:00.987Z [WARN]  nomad.vault: failed to contact Vault API: retry=30s

@t-davies commented:

If I intercept the request going from Nomad to Vault, I see that it is adding the Vault token and Vault namespace headers to the health check request.

GET /v1/sys/health?drsecondarycode=299&performancestandbycode=299&sealedcode=299&standbycode=299&uninitcode=299
host: vault.xxxxxx.xxxx
x-vault-token: hvs.xxxxxxxxxxxxxxxxx
x-vault-request: true
x-vault-namespace: xxxx
accept-encoding: gzip
user-agent: Go-http-client/2.0
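
For illustration only (this is not Nomad's code), a small sketch with the Vault Go API showing the difference the namespace makes: a client with a namespace set sends the X-Vault-Namespace header on every request, including sys/health, so a namespaced Vault answers 404 "unsupported path", while a second client built from the same config without a namespace reaches the root endpoint.

package main

import (
	"fmt"
	"log"

	vapi "github.com/hashicorp/vault/api"
)

func main() {
	conf := vapi.DefaultConfig() // reads VAULT_ADDR etc. from the environment

	// Namespaced client: the namespace header is attached to every request,
	// so /v1/sys/health is resolved inside the namespace and fails.
	nsClient, err := vapi.NewClient(conf)
	if err != nil {
		log.Fatal(err)
	}
	nsClient.SetNamespace("admin")
	if _, err := nsClient.Sys().Health(); err != nil {
		fmt.Println("namespaced health check:", err) // expect 404 unsupported path
	}

	// Separate client with no namespace: the same call reaches the
	// root sys/health endpoint and succeeds.
	sysClient, err := vapi.NewClient(conf)
	if err != nil {
		log.Fatal(err)
	}
	if health, err := sysClient.Sys().Health(); err == nil {
		fmt.Println("initialized:", health.Initialized, "sealed:", health.Sealed)
	}
}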

@tgross tgross changed the title Vault health check fails with "unsupported path" Vault health check fails with "unsupported path" after token expires Aug 23, 2022

tgross commented Aug 23, 2022

Hi @t-davies! I've narrowed the title a bit based on this new information. Despite what the code we discussed in #13710 (comment) suggests, it sure as heck seems like the token expiring is what kicks off the problem. I have a sneaking suspicion this may be related to some configuration race conditions we've already fixed in the upcoming Nomad 1.4.0, so I'll definitely check there. But first I'll see if we can come up with a reliable reproduction.

(Aside, dear GitHub: my kingdom for the ability to merge issues!)

@tgross tgross self-assigned this Aug 23, 2022
@t-davies commented:

Thanks! FWIW, after some more investigating it seems like the token failing to renew is because we don't have the nomad-server role configured correctly in Vault, but yeah, that shouldn't cause this issue even so.


tgross commented Aug 23, 2022

I've been trying to write a test that exercises this area of the code and I think I've found one obvious bug in how we instantiate the two client configs in vault.go#L445-L465:

v.client = client
if v.config.Namespace != "" {
	v.logger.Debug("configuring Vault namespace", "namespace", v.config.Namespace)
	v.clientSys, err = vapi.NewClient(apiConf)
	if err != nil {
		v.logger.Error("failed to create Vault sys client and not retrying", "error", err)
		return err
	}
	client.SetNamespace(v.config.Namespace)
} else {
	v.clientSys = client
}

// Set the token
v.token = v.config.Token
client.SetToken(v.token)
v.auth = client.Auth().Token()

Note that we always set the token on the client, but that client and clientSys are the same pointer if the namespace is unset. So if we have a token but not a namespace, the clientSys that we're using for the healthchecks will pass the token. Which maybe gets us into "how did this ever work?" territory, but that's not the bug you're seeing because you do have a namespace.
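
Purely as a sketch of one way to avoid that aliasing (a hypothetical helper, not the actual patch), the sys client could always be constructed separately so the token and namespace are only ever applied to the primary client:

package vaultutil

import (
	vapi "github.com/hashicorp/vault/api"
)

// newVaultClients always builds two independent clients from the same API
// config. The token and namespace are applied only to the primary client; the
// sys client stays bare so unauthenticated, non-namespaced endpoints such as
// sys/health keep working regardless of token or namespace state.
func newVaultClients(apiConf *vapi.Config, token, namespace string) (client, clientSys *vapi.Client, err error) {
	client, err = vapi.NewClient(apiConf)
	if err != nil {
		return nil, nil, err
	}
	clientSys, err = vapi.NewClient(apiConf) // never aliased to client
	if err != nil {
		return nil, nil, err
	}
	if namespace != "" {
		client.SetNamespace(namespace)
	}
	client.SetToken(token)
	return client, clientSys, nil
}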

But now I'm looking closely at your logs and I'm seeing that there was roughly 12 hours between the time the renewal failed and the health check started failing? That's so long apart that I wonder if they're actually related at all. Is there any chance your team SIGHUP'd Nomad to get new configuration loaded during that time window? I'm investigating how we do that with the vault client now and it seems like a likely avenue.


t-davies commented Aug 24, 2022

Thanks @tgross, "how did this ever work?" bugs are always the best kind 😅
Just pulled the logs again and yes - it looks like there was a SIGHUP in between. It's so close to the check failing, too; not sure why I didn't include that (sorry!).

[...]
systemd[1]: Reloading HashiCorp Nomad.
nomad[3063]: ==> Caught signal: hangup
nomad[3063]: ==> Reloading configuration...
nomad[3063]: WARNING: keyring exists but -encrypt given, using keyring
nomad[3063]: 2022-08-21T10:28:30.898Z [INFO]  nomad: reloading server connections due to configuration changes
nomad[3063]: 2022-08-21T10:28:30.899Z [ERROR] nomad.rpc: failed to accept RPC conn: error="accept tcp [::]:4647: use of closed network connection" delay
nomad[3063]: 2022-08-21T10:28:30.899Z [INFO]  nomad.rpc: closing server RPC connection
nomad[3063]: 2022-08-21T10:28:30.899Z [INFO]  agent: reloading HTTP server with new TLS configuration
nomad[3063]: 2022-08-21T10:28:30.899Z [ERROR] worker: failed to dequeue evaluation: worker_id=b07ed4ec-985e-fbde-0fb7-410fc8cee855 error="rpc error: EOF
nomad[3063]: 2022-08-21T10:28:30.899Z [ERROR] worker: failed to dequeue evaluation: worker_id=2b44b46b-585d-726e-557b-509eeb3290ef error="rpc error: EOF
systemd[1]: Reloaded HashiCorp Nomad.
nomad[3063]: 2022-08-21T10:28:30.945Z [WARN]  nomad.vault: failed to contact Vault API: retry=30s
nomad[3063]: error=
nomad[3063]: | Error making API request.
nomad[3063]: |
nomad[3063]: | Namespace: xxxx
nomad[3063]: | URL: GET https://vault.xxxxxxxxxxxxxxxx/v1/sys/health?drsecondarycode=299&performancestandbycode=299&sealedcode=299
nomad[3063]: | Code: 404. Errors:
nomad[3063]: |
nomad[3063]: | * unsupported path


tgross commented Aug 24, 2022

Just pulled the logs again and yes - it looks like there was a SIGHUP in between. It's so close to the check failing, too; not sure why I didn't include that (sorry!).

Great, that really helps narrow things down a lot. I'll report back once I've got a clean repro from there.


tgross commented Aug 24, 2022

Ok, so I haven't been able to reproduce what you're seeing with Nomad 1.3.2 yet. But I wanted to give you an update on what I've discovered so far. I've got one patch coming out of this which I'll land in main but I don't see any way that can be your problem. I'm going to need to switch off tasks for a little bit after that, but I'll come back to this as soon as I can.

Which maybe gets us into "how did this ever work?" territory

(1) Well, it turns out that Vault is perfectly happy to receive a token on the health check endpoint, so long as you don't include a namespace. From my HCP cluster here:

$ curl -H "X-Vault-Token: $VAULT_TOKEN" -H "X-Vault-Namespace: admin" "$VAULT_ADDR/v1/sys/health"
{"errors":["unsupported path"]}

$ curl -H "X-Vault-Token: $VAULT_TOKEN" "$VAULT_ADDR/v1/sys/health"
{"initialized":true,"sealed":false,"standby":false,"performance_standby":false,"replication_performance_mode":"disabled","replication_dr_mode":"disabled","server_time_utc":1661354828,"version":"1.10.3+ent","cluster_name":"vault-cluster-ad9158cb","cluster_id":"6ed50ef0-e501-d2b1-fde6-a8536cd51223","last_wal":1915934,"license":{"state":"autoloaded","expiry_time":"2022-09-29T06:22:51Z","terminated":false}}

(2) Interestingly enough, Vault very recently added an API to avoid having to configure two different clients (hashicorp/vault#14934, hashicorp/vault#14963), so we could also improve our config story here by bumping our Vault SDK version.
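
If those changes are what they look like, the second client could even be derived on the fly instead of configured up front. A hypothetical sketch, assuming the WithNamespace helper from the linked PRs behaves as documented (a shallow per-request copy of the client, with an empty namespace clearing the header); this isn't verified against the SDK version we currently vendor:

package vaultutil

import (
	vapi "github.com/hashicorp/vault/api"
)

// checkHealth calls sys/health through a copy of the client that has the
// namespace cleared, so the request goes to the root endpoint rather than
// being resolved inside the configured namespace.
func checkHealth(client *vapi.Client) (*vapi.HealthResponse, error) {
	return client.WithNamespace("").Sys().Health()
}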

(3) I did find an overt bug in an equality check we do for the configuration.

When we reload the configuration we check whether the new Vault config matches the old Vault config at vault.go#L365-L368, so that we don't do more work than we need to. But the IsEqual method for the config is missing a check for the namespace field. If only the namespace has changed between config reloads, it won't be detected as a change and the old client will be left in place. But I don't think that's what happened to you, or we wouldn't see the attempt to start the new client.
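
For illustration, a hypothetical version of that equality check with the missing field included (the struct and field names here are assumptions; the real Nomad config struct has many more fields):

package config

// VaultConfig is a stand-in for the real config struct.
type VaultConfig struct {
	Addr      string
	Token     string
	Namespace string
}

// IsEqual compares the fields that matter for deciding whether to rebuild the
// Vault clients; Namespace is the comparison the original method was missing.
func (c *VaultConfig) IsEqual(other *VaultConfig) bool {
	return c.Addr == other.Addr &&
		c.Token == other.Token &&
		c.Namespace == other.Namespace
}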

(4) My colleague @schmichael recently merged patches to close race conditions around agent configuration, and #14139 in particular could fix any problems where the config object in memory was being updated concurrently.


My failed reproduction, using the following config pointing at my HCP Vault cluster:

vault {
  enabled   = true
  address   = "$VAULT_ADDR"
  namespace = "admin"
  token     = "$VAULT_TOKEN" # evil, don't put this in the HCL!
}

The expected config reload behavior is as follows:

==> Caught signal: hangup
==> Reloading configuration...
WARNING: keyring exists but -encrypt given, using keyring
==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
    2022-08-24T17:53:20.217Z [DEBUG] agent: starting reload of server config
    2022-08-24T17:53:20.218Z [DEBUG] nomad.vault: configuring Vault namespace: namespace=admin
    2022-08-24T17:53:20.348Z [DEBUG] nomad.vault: starting renewal loop: creation_ttl=72h0m0s
    2022-08-24T17:53:20.369Z [DEBUG] nomad.vault: successfully renewed server token
    2022-08-24T17:53:20.369Z [INFO]  nomad.vault: successfully renewed token: next_renewal=35h59m59.999832128s

I'm not seeing the issue if I reload a whole bunch of times to try to hit a race condition in the config. But my environment doesn't have a lot of cores, so it's entirely possible your production environment is hitting a case my development rig won't.

If I block access to Vault via iptables, eventually I get the following error. But if I remove the iptables rule before reloading the configuration, it picks up the new client no problem.

    2022-08-24T18:14:23.241Z [ERROR] nomad.vault: failed to validate self token/role: retry=30s error="failed to lookup Vault periodic token: context deadline exceeded"
    2022-08-24T18:14:29.928Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration="246.777µs"
    2022-08-24T18:14:30.817Z [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:60736
==> Caught signal: hangup
==> Reloading configuration...
WARNING: keyring exists but -encrypt given, using keyring
==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
    2022-08-24T18:14:30.986Z [DEBUG] agent: starting reload of server config
    2022-08-24T18:14:30.986Z [DEBUG] nomad.vault: configuring Vault namespace: namespace=admin
    2022-08-24T18:14:30.986Z [DEBUG] agent: starting reload of client config
    2022-08-24T18:14:31.025Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration="160.113µs"
    2022-08-24T18:14:31.119Z [DEBUG] nomad.vault: starting renewal loop: creation_ttl=72h0m0s
    2022-08-24T18:14:31.142Z [DEBUG] nomad.vault: successfully renewed server token
    2022-08-24T18:14:31.143Z [INFO]  nomad.vault: successfully renewed token: next_renewal=35h59m59.999841236s

If I don't remove the iptables rule before reloading the configuration, it blocks until I either remove the rule or it hits the "context deadline exceeded" error.

==> Caught signal: hangup
==> Reloading configuration...
WARNING: keyring exists but -encrypt given, using keyring
==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
    2022-08-24T17:57:31.198Z [DEBUG] agent: starting reload of server config
    2022-08-24T17:57:31.199Z [DEBUG] nomad.vault: configuring Vault namespace: namespace=admin
...
    2022-08-24T17:57:46.840Z [DEBUG] nomad.vault: starting renewal loop: creation_ttl=72h0m0s
    2022-08-24T17:57:46.861Z [DEBUG] nomad.vault: successfully renewed server token
    2022-08-24T17:57:46.861Z [INFO]  nomad.vault: successfully renewed token: next_renewal=35h59m59.999864915s

@tgross tgross changed the title Vault health check fails with "unsupported path" after token expires Vault health check fails with "unsupported path" after config reload Aug 24, 2022

tgross commented Aug 30, 2022

My colleague @DerekStrickland has a PR open that isn't for this specific symptom but is something that smells awfully similar: #13279. Tagging him here to see if he thinks this could be related.

@DerekStrickland commented:

It could be related. Do you see a "Can't request Vault token for terminal allocation" log entry at basically the same time?

@t-davies commented:

Sorry @DerekStrickland, I missed this notification. I haven't seen any logs similar to that though, unfortunately.

@t-davies commented:

Haven't seen any issues since, but I'm going to try updating to 1.3.5 this week and see how that runs. Seems like a bunch of things that might have contributed towards this happening have been fixed.

@DerekStrickland commented:

Great. We'll keep this open for now pending your update and observation.

@t-davies commented:

This wasn't seen again after upgrading to 1.3.5, now running 1.4.3 without issues. I'm assuming that one of the various patches did indeed resolve this.


tgross commented Nov 30, 2022

Glad to hear it! If this pops up again be sure to let us know. Thanks!

@tgross tgross closed this as completed Nov 30, 2022
@github-actions bot commented:

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 31, 2023