Nomad with Consul Connect Service Mesh randomly fails with a bootstrap error #20516
Comments
I've observed this as well.. more information in support case #147547
got a link?
sorry, I reported this to Hashicorp's Enterprise support portal, behind auth. They are now tracking it internally, but public issues help raise awareness.. and might provide some restored sanity for anyone else experiencing this issue as well.
++ also seeing this fairly repeatedly, unsure if the same cause. @ryanhulet @colinbruner - does restarting Consul fix this for you (temporarily at least)?
Sometimes. I finally gave up and destroyed my entire cluster and brought up a new one on 1.6.1.
Possibly related to #20185? I wonder if the Consul sync is getting stuck because it doesn't have tokens to deregister services - exiting early - and therefore never registering newly created services, meaning that Envoy can't bootstrap?
Not sure why it would have gotten worse in newer releases, though. We also see it much more often recently (on 1.7.5 currently), but that could just be because we see more deploys now.
as far as I can tell it is 100% related to workload identities; 1.6.1 doesn't use them, and I haven't had that error a single time since downgrading.
this is interesting.. the Consul acl stanza settings the quoted issue lists in step #1 do match my Consul acl configuration:

```hcl
acl = {
  enabled                  = true
  default_policy           = "deny"
  enable_token_persistence = true
}
```

I wonder if setting enable_token_persistence = false would help here.
Yeah, I'm also wondering if disabling persistence will fix it; it was also suggested by support on my ticket. We're (obviously) not going in and manually deleting the ACL tokens like the original issue repro, but if it's getting stuck with some old expired token because it's been persisted, then maybe it could end up with similar behaviour. In our case we are reloading Consul once per day (with a SIGHUP) when we rotate its agent token, which may also be a source for this; not sure if you're doing similar?
similar, but less frequent. We have consul-template generating a Consul ACL token from Vault's Consul secret engine.. this is regenerated and a SIGHUP is issued at the half-life, which I believe to be ~16 days. However, I've seen this on Nomad clients that have only been around for a few hours.. so it's not necessarily correlated with the token being rotated.
I'm going to do this in a non-prod environment today, will report back if I'm still seeing errors after the change is made.
@ryanhulet @t-davies upon the advice of Hashicorp support, I set enable_token_persistence = false. Will continue to monitor and update if this is not a full mitigation of the error.
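For reference, a minimal sketch of the mitigated Consul agent acl stanza being discussed (surrounding settings omitted; exact values will vary per cluster):

```hcl
acl = {
  enabled        = true
  default_policy = "deny"

  # With persistence disabled, the agent re-reads its token from its
  # config file (here, the file rendered by Vault agent / consul-template)
  # on restart, instead of reusing a previously persisted token.
  enable_token_persistence = false
}
```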
Thanks, yeah we are in a similar situation with enable_token_persistence enabled.
I've been seeing this problem in a variety of forms for a while now, possibly since I started using Workload Identities. Many times, I see entries like this in my logs:
So it looks like Nomad is failing to bootstrap Envoy because there's something else that's failing to get deregistered. I have an AWFUL Nomad job that scans through Consul's logs for those errors and uses a management token to do the deregistration. Here's a slimmed-down version of it:
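The slimmed-down job itself didn't survive this extract. As a rough, hypothetical sketch of the approach described (scan the agent's logs for failed deregistrations, then force-deregister with a management token), a small Go program along these lines could be fed journalctl output; the log pattern and service-ID extraction are guesses at Consul's log format, not verbatim from the original job:

```go
// Hypothetical cleanup tool: read Consul agent log lines on stdin
// (e.g. journalctl -u consul | ./cleanup), pull service IDs out of
// failed-deregistration errors, and deregister them using a management
// token taken from CONSUL_HTTP_TOKEN.
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"

	"github.com/hashicorp/consul/api"
)

func main() {
	// DefaultConfig reads CONSUL_HTTP_ADDR and CONSUL_HTTP_TOKEN from the
	// environment; the token must be a management token for this to work.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		fmt.Fprintln(os.Stderr, "consul client:", err)
		os.Exit(1)
	}

	// Assumed log shape: "... Failed to deregister service ... service=<id> ..."
	// Adjust this regex to whatever your Consul version actually logs.
	re := regexp.MustCompile(`Failed to deregister service.*service=([^\s"]+)`)

	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		m := re.FindStringSubmatch(scanner.Text())
		if m == nil {
			continue
		}
		id := m[1]
		if err := client.Agent().ServiceDeregister(id); err != nil {
			fmt.Fprintf(os.Stderr, "deregister %s: %v\n", id, err)
			continue
		}
		fmt.Println("deregistered", id)
	}
}
```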
You're quite right. In our case we use Vault agent to rotate the token and store it in the config file, so we don't really need token persistence anyway; Consul will just read it from the file if it restarts.
Hey folks, just wanted to post a heads up that this is on our radar. I've got a strong suspicion that the root cause is the same as what we think is the problem described here: #9307 (comment). That this is showing up for folks when they switch to WI might be related to the timing involved in the two workflows (the old workflow sends the request to the Nomad server, so by the time the request comes back to the Nomad client more time has passed than in the WI workflow). In any case, I'm working on that fix and I'll see if I can reproduce the specific WI workflow issues y'all are seeing here with and without that fix, to see if that resolves this issue as well.
Nomad creates a Consul ACL token for each service, for registering it in Consul or bootstrapping the Envoy proxy (for service mesh workloads). Nomad always talks to the local Consul agent and never directly to the Consul servers, but the local Consul agent talks to the Consul servers in stale consistency mode to reduce load on the servers. This can result in the Nomad client making the Envoy bootstrap request with a token that has not yet replicated to the follower that the local agent is connected to. That request gets a 404 on the ACL token, and the negative entry gets cached, preventing any retries from succeeding.

To work around this, we'll use a method described by our friends over on `consul-k8s`: after creating the service token, we try to read the token back from the local agent in stale consistency mode (which prevents a failed read from being cached). This cannot completely eliminate this source of error, because it's possible that Consul cluster replication is unhealthy at the time we need it, but it should make Envoy bootstrap significantly more robust.

In this changeset, we add the preflight check after we log in via Workload Identity, and in the function we use to derive tokens in the legacy workflow. We've made the timeouts configurable via node metadata rather than the usual static configuration because, for most cases, users should not need to touch (or even know about) these values; the configuration is mostly available for testing.

Fixes: #9307
Fixes: #20516
Fixes: #10451
Ref: hashicorp/consul-k8s#887
Ref: https://hashicorp.atlassian.net/browse/NET-10051
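For illustration, the preflight technique described above could look roughly like this against the Consul Go API (a sketch, not Nomad's actual implementation; the retry interval, timeout, and placeholder token are assumptions):

```go
// Hypothetical preflight check: after creating a token (or logging in via
// Workload Identity), poll the local Consul agent until it can resolve that
// token. AllowStale requests stale consistency mode, so a miss during
// replication lag is not cached as a negative entry the way a default read
// could be.
package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/consul/api"
)

func preflightTokenCheck(client *api.Client, secretID string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		_, _, err := client.ACL().TokenReadSelf(&api.QueryOptions{
			Token:      secretID,
			AllowStale: true,
		})
		if err == nil {
			return nil // the agent can see the token; safe to bootstrap Envoy
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("token never became readable: %w", err)
		}
		time.Sleep(500 * time.Millisecond) // made-up retry interval
	}
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		panic(err)
	}
	// "secret-id-here" stands in for the freshly created service token.
	if err := preflightTokenCheck(client, "secret-id-here", 10*time.Second); err != nil {
		fmt.Println("preflight failed:", err)
	}
}
```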
I've just merged #23381, which hopefully should close out this issue. That's planned for release in Nomad 1.8.2 (with backports to Nomad Enterprise 1.7.x and 1.6.x). Once you've deployed that, if you're still running into issues with Envoy bootstrap, please report a new issue after having gone through the troubleshooting guides in Nomad's Service Mesh troubleshooting and Resolving Common Errors in Envoy Proxy. That'll help us separate out additional problems above and beyond the ones we've identified here.
Nomad version
Operating system and Environment details
Issue
Ever since upgrading Nomad to 1.7.7 and Consul to 1.18.1, Nomad jobs will fail repeatedly with the error:
```
envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: exit status 1; see: https://developer.hashicorp.com/nomad/s/envoy-bootstrap-error
```
This will happen multiple times until, eventually, the job schedules and starts correctly. This results in multiple jobs stuck in pending for hours on end because of rescheduling backoff.
Reproduction steps
Literally run a job similar to the one attached, using Consul with auto-config and mTLS, and Envoy version 1.27.3. I am also using Vault 1.15.4 for secrets, and Nomad is using Vault and Consul workload identities.
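The attached job file wasn't captured in this extract. For orientation, a minimal Connect-enabled service job has roughly this shape (the job name, service name, port, and image are placeholders, not the reporter's actual job):

```hcl
job "example" {
  group "api" {
    network {
      mode = "bridge"
    }

    service {
      name = "example-api"
      port = "8080"

      connect {
        # The sidecar proxy is what triggers the envoy_bootstrap hook
        # that fails in this issue.
        sidecar_service {}
      }
    }

    task "server" {
      driver = "docker"

      config {
        image = "example/api:latest" # placeholder image
      }
    }
  }
}
```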
Expected Result
A job to schedule and register with Consul Connect successfully
Actual Result
REPEATED failures with the envoy_bootstrap error
Job file (if appropriate)
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)