Vault health check fails with "unsupported path" after config reload #14233
If I intercept the request going from Nomad to Vault, I can see that it is adding the Vault token and Vault namespace headers to the health check request.
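For illustration, the intercepted request is roughly equivalent to this curl; the address and namespace here are placeholders rather than our real values:

```shell
curl \
  -H "X-Vault-Token: <expired token>" \
  -H "X-Vault-Namespace: <our namespace>" \
  "https://vault.example.com:8200/v1/sys/health"
```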
Hi @t-davies! I've narrowed the title a bit based on this new information. Despite what the code we discussed in #13710 (comment) suggests, it sure as heck seems like the token expiring is what kicks off the problem. I have a sneaking suspicion this may be related to some configuration race conditions we've already fixed in the upcoming Nomad 1.4.0, so I'll definitely check there. But first I'll see if we can come up with a reliable reproduction. (Aside, dear GitHub: my kingdom for the ability to merge issues!)
Thanks! FWIW, after some more investigating it seems like the token failing to renew is because we don't have the
I've been trying to write a test that exercises this area of the code and I think I've found one obvious bug in how we instantiate the two client configs:

```go
v.client = client
if v.config.Namespace != "" {
	v.logger.Debug("configuring Vault namespace", "namespace", v.config.Namespace)
	v.clientSys, err = vapi.NewClient(apiConf)
	if err != nil {
		v.logger.Error("failed to create Vault sys client and not retrying", "error", err)
		return err
	}
	client.SetNamespace(v.config.Namespace)
} else {
	v.clientSys = client
}

// Set the token
v.token = v.config.Token
client.SetToken(v.token)
v.auth = client.Auth().Token()
```

Note that we always set the token on the `client`. But now I'm looking closely at your logs and I'm seeing that there was roughly 12 hours between the time the renewal failed and the health check started failing? That's so long apart that I wonder if they're actually related at all. Is there any chance your team SIGHUP'd Nomad to get new configuration loaded during that time window? I'm investigating how we do that with the Vault client now and it seems like a likely avenue.
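For illustration only (not the actual Nomad patch), a minimal sketch of the direction the report suggests: build the health-check client from its own config and never set a token or namespace on it. `vapi` is assumed to be `github.com/hashicorp/vault/api`, matching the snippet above.

```go
package vaultclient

import (
	vapi "github.com/hashicorp/vault/api"
)

// newClients is a sketch, not Nomad's code: it builds the main client (token and
// optional namespace) and a separate health-check client that never has SetToken
// or SetNamespace called on it, so GET /v1/sys/health goes out without credentials.
func newClients(addr, token, namespace string) (client, clientSys *vapi.Client, err error) {
	conf := vapi.DefaultConfig()
	conf.Address = addr

	if client, err = vapi.NewClient(conf); err != nil {
		return nil, nil, err
	}
	client.SetToken(token)
	if namespace != "" {
		client.SetNamespace(namespace)
	}

	if clientSys, err = vapi.NewClient(conf); err != nil {
		return nil, nil, err
	}
	// NewClient picks up VAULT_TOKEN from the environment if it is set; clear it
	// so the health check truly sends nothing.
	clientSys.ClearToken()
	return client, clientSys, nil
}
```

The health check would then go through `clientSys.Sys().Health()` while everything token-scoped keeps using `client`.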
Thanks @tgross, "how did this ever work?" bugs are always the best kind 😅
Great, that really helps narrow things down a lot. I'll report back once I've got a clean repro from there.
Ok, so I haven't been able to reproduce what you're seeing with Nomad 1.3.2 yet. But I wanted to give you an update on what I've discovered so far. I've got one patch coming out of this which I'll land in
(1) Well, it turns out that Vault is just fine with getting a token for the health check endpoint, so long as you don't include a namespace. From my HCP cluster here:

```shell
$ curl -H "X-Vault-Token: $VAULT_TOKEN" -H "X-Vault-Namespace: admin" "$VAULT_ADDR/v1/sys/health"
{"errors":["unsupported path"]}

$ curl -H "X-Vault-Token: $VAULT_TOKEN" "$VAULT_ADDR/v1/sys/health"
{"initialized":true,"sealed":false,"standby":false,"performance_standby":false,"replication_performance_mode":"disabled","replication_dr_mode":"disabled","server_time_utc":1661354828,"version":"1.10.3+ent","cluster_name":"vault-cluster-ad9158cb","cluster_id":"6ed50ef0-e501-d2b1-fde6-a8536cd51223","last_wal":1915934,"license":{"state":"autoloaded","expiry_time":"2022-09-29T06:22:51Z","terminated":false}}
```

(2) Interestingly enough, Vault very recently added an API to avoid having to configure two different clients: hashicorp/vault#14934 and hashicorp/vault#14963. So we could also improve our config story here by bumping our Vault SDK version (see the sketch at the end of this comment).

(3) I did find an overt bug in an equality check we do for the configuration. When we reload the configuration we check if the new Vault config matches the old Vault config at

(4) My colleague @schmichael recently merged patches to close race conditions around agent configuration, and #14139 in particular could fix any problems where the config object in memory was being updated concurrently.

My failed reproduction, using the following config pointing at my HCP Vault cluster:

```hcl
vault {
  enabled = true
  address = "$VAULT_ADDR"
  namespace = "admin"
  token = "$VAULT_TOKEN" # evil, don't put this in the HCL!
}
```

The expected config reload behavior is as follows:
I'm not seeing the issue if I reload a whole bunch to try to hit a race condition in the config. But my environment doesn't have a lot of cores, so it's entirely possible your production environment is hitting a case my development rig won't. If I block access to Vault via iptables, eventually I get the following error. But if I remove the iptables rule before reloading the configuration, it picks up the new client no problem.
If I don't remove the iptables rule before reloading the configuration, it blocks until I either remove the rule or it hits the "context deadline exceeded" error.
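For reference, roughly what that reproduction attempt looks like; the port and commands below are assumptions about a local test setup (HCP Vault is reached over 443), not something taken from the report:

```shell
# Block Nomad's outbound access to Vault (8200 assumed for a local Vault listener).
sudo iptables -A OUTPUT -p tcp --dport 8200 -j DROP

# SIGHUP makes the Nomad agent reload its configuration, including the vault block.
sudo pkill -HUP nomad

# Remove the rule again; with it gone before the reload, the new client comes up fine.
sudo iptables -D OUTPUT -p tcp --dport 8200 -j DROP
```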
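Regarding point (2) above, a rough sketch of what leaning on that newer API could look like. This is illustrative only, not how Nomad is currently wired up; it assumes the `WithNamespace` helper added to the Vault Go client in the PRs linked above:

```go
package main

import (
	"fmt"

	vapi "github.com/hashicorp/vault/api"
)

func main() {
	// One base client, configured with the token but no namespace.
	conf := vapi.DefaultConfig() // picks up VAULT_ADDR etc. from the environment
	client, err := vapi.NewClient(conf)
	if err != nil {
		panic(err)
	}
	client.SetToken("s.example-token") // placeholder token for illustration

	// Namespaced operations use a lightweight per-call clone...
	nsClient := client.WithNamespace("admin")
	if _, err := nsClient.Logical().Read("secret/data/example"); err != nil {
		fmt.Println("namespaced read failed:", err)
	}

	// ...while the non-namespaced health check uses the base client directly,
	// so no X-Vault-Namespace header is ever attached to /v1/sys/health.
	health, err := client.Sys().Health()
	if err != nil {
		panic(err)
	}
	fmt.Println("initialized:", health.Initialized, "sealed:", health.Sealed)
}
```

The trade-off is that call sites have to pick the right clone, but there is only one piece of connection configuration to build and reload.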
My colleague @DerekStrickland has a PR open that isn't for this specific symptom but smells awfully similar: #13279. Tagging him here to see if he thinks this could be related.
It could be related. Do you see a
Sorry @DerekStrickland, missed this notification. Haven't seen any logs similar to that though - unfortunately. |
Haven't seen any issues since, but I'm going to try updating to 1.3.5 this week and see how that runs. Seems like a bunch of things that might have contributed towards this happening have been fixed. |
Great. We'll keep this open for now pending your update and observation. |
This wasn't seen again after upgrading to 1.3.5, now running 1.4.3 without issues. I'm assuming that one of the various patches did indeed resolve this. |
Glad to hear it! If this pops up again be sure to let us know. Thanks! |
Nomad version
Nomad v1.3.2 (bf602974112964e9691729f3f0716ff2bcdb3b44)
Operating system and Environment details
Issue
Nomad is unable to call Vault's health check endpoint when its Vault token has expired. This should be possible, since that Vault endpoint is not namespaced and does not require a token; the Vault client making the call should not be attempting to send credentials.
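For example, in a stock Vault setup the endpoint answers without any credentials at all (generic behaviour of `/v1/sys/health`, not anything specific to our cluster):

```shell
# No X-Vault-Token or X-Vault-Namespace headers needed.
curl "$VAULT_ADDR/v1/sys/health"
```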
There appears to have been some weird behaviour where Nomad didn't correctly renew the Vault token; not sure if that's a bug or down to configuration on our end? See the logs below. Even if that is the case, Nomad should still be able to call the health check endpoint with an expired token, and Nomad believing that Vault is not healthy causes further problems.
We previously reported something similar, although we had some other errors at the time that made the cause a bit ambiguous; see #13710.
Reproduction steps
Expected Result
Actual Result
Nomad Server logs (if appropriate)