connect: improve envoy bootstrap coordination #10451
This PR wraps the use of the `consul connect envoy -bootstrap` command in an exponential backoff closure, configured to time out after 60 seconds. This is an increase over the current behavior of making 3 attempts over 6 seconds. Should help with #10451.
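For illustration, here is a minimal sketch of what wrapping the bootstrap command in an exponential backoff closure with a 60-second deadline could look like. The function name, argument handling, and backoff constants are assumptions for the sketch, not the PR's literal implementation:

```go
package bootstrap

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// bootstrapEnvoy retries the Consul CLI's Envoy bootstrap until it succeeds
// or the overall 60-second deadline expires, doubling the wait between
// attempts. Constants and argument handling here are illustrative only.
func bootstrapEnvoy(ctx context.Context, extraArgs []string) error {
	ctx, cancel := context.WithTimeout(ctx, 60*time.Second)
	defer cancel()

	wait := 2 * time.Second
	for {
		args := append([]string{"connect", "envoy", "-bootstrap"}, extraArgs...)
		cmd := exec.CommandContext(ctx, "consul", args...)
		if err := cmd.Run(); err == nil {
			return nil // bootstrap config rendered successfully
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("envoy bootstrap timed out: %w", ctx.Err())
		case <-time.After(wait):
			if wait < 16*time.Second {
				wait *= 2 // exponential growth, capped
			}
		}
	}
}
```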
Looking at this error shortly after its two-year anniversary, with zero actual responses from the developers, is not giving me warm-and-fuzzy feelings about the reliability of Consul/Nomad service meshes. I'm still seeing frequent timeouts on high-end VMs with near-zero load. This really strikes me as fundamentally a Consul bug (or design flaw): why isn't a registered task effectively immediately available, especially on a server with essentially zero load? Raising the timeout is a band-aid to mask the fact that it can take 60+ seconds to register a task on an otherwise idle Consul cluster, and it clearly isn't enough. The image above shows one of the intermittent failures I've seen. I've also seen tasks that Nomad says are 100% healthy take minutes to show up in Consul. Unless there's some Consul setting I can change that will make Consul behave with reasonable latency, my only choice will be to abandon the Nomad/Consul service mesh.
Having recently switched to using Nomad workload identity (new in 1.7.x) to generate tokens from Consul, I've observed this is still relevant. If possible, it would be great to see this connectivity between Nomad <-> Consul further prioritized and strengthened, with better solutions to problems like these.
Just a heads up that we've got a proposed fix being worked on. See #9307 (comment) for explanation.
Thanks Tim, glad this is on the radar!
Nomad creates a Consul ACL token for each service, used to register the service in Consul or to bootstrap the Envoy proxy (for service mesh workloads). Nomad always talks to the local Consul agent and never directly to the Consul servers. But the local Consul agent talks to the Consul servers in stale consistency mode to reduce load on the servers. This can result in the Nomad client making the Envoy bootstrap request with a token that has not yet replicated to the follower that the local client is connected to. This request gets a 404 on the ACL token, and that negative entry gets cached, preventing any retries from succeeding.

To work around this, we'll use a method described by our friends over on `consul-k8s`: after creating the service token, we try to read the token back from the local agent in stale consistency mode (which prevents a failed read from being cached). This cannot completely eliminate this source of error, because it's possible that Consul cluster replication is unhealthy at the time we need it, but it should make Envoy bootstrap significantly more robust.

In this changeset, we add the preflight check after we log in via Workload Identity and in the function we use to derive tokens in the legacy workflow. The timeouts are configurable via node metadata rather than the usual static configuration because in most cases users should not need to touch, or even know about, these values; the configuration is mostly available for testing.

Fixes: #9307
Fixes: #20516
Fixes: #10451
Ref: hashicorp/consul-k8s#887
Ref: https://hashicorp.atlassian.net/browse/NET-10051
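A rough sketch of the preflight check described above, using the `github.com/hashicorp/consul/api` client. The timeout and poll interval are placeholders standing in for the values the changeset reads from node metadata:

```go
package preflight

import (
	"context"
	"time"

	"github.com/hashicorp/consul/api"
)

// preflightCheckToken polls the local Consul agent until the newly minted
// ACL token can be read back. AllowStale matches the agent's own consistency
// mode and keeps a failed lookup out of the agent's negative cache.
func preflightCheckToken(ctx context.Context, client *api.Client, secretID string) error {
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second) // placeholder; read from node metadata in the changeset
	defer cancel()

	opts := &api.QueryOptions{Token: secretID, AllowStale: true}
	for {
		// Reading the token as "self" succeeds only once the token has
		// replicated to the server the local agent is connected to.
		_, _, err := client.ACL().TokenReadSelf(opts.WithContext(ctx))
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(500 * time.Millisecond): // placeholder poll interval
		}
	}
}
```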
I've just merged #23381, which hopefully should close out this issue. That's planned for release in Nomad 1.8.2 (with backports to Nomad Enterprise 1.7.x and 1.6.x). Once you've deployed that, if you're still running into issues with Envoy bootstrap, please report a new issue after going through the troubleshooting guides in Nomad's Service Mesh troubleshooting and Resolving Common Errors in Envoy Proxy. That'll help us separate out additional problems above and beyond the ones we've identified here.
Currently the Envoy bootstrap process needs Consul to accomplish two things before invoking `consul connect envoy -bootstrap ...`:

1. For sidecar proxies, the parent service needs to finish being registered into the Consul catalog.
2. When ACLs are enabled, the minted Service Identity tokens need to be propagated through Consul.
The implementation right now is to simply retry 3 times at 2-second intervals. Until the final failure, errors invoking Consul are ignored. This implies the above two prerequisites must be met within 6 seconds of the `envoy_bootstrap_hook` running, which is not always the case.

Ideally we would have blocking queries to work with, but neither the agent service registration nor the ACL token creation APIs support blocking queries:
https://www.consul.io/api-docs/agent/service#register-service
https://www.consul.io/api-docs/acl/tokens#create-a-token
I spent a little time looking into setting a watch on the Catalog waiting for the service to be registered, but AFAICT Consul will trip the watch on any service change, not just the service we care about. In the end, using a watch was a lot more chatty over the wire than just polling every few seconds.

For the time being, the best option here might be to just increase the amount of time the bootstrap hook allows before giving up, maybe even making it a Nomad client config parameter. Experimenting locally, it's hard to gauge what a "reasonable" wait is, so how about an increase from 6 to 60 seconds?
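For reference, "polling every few seconds" is straightforward against the catalog API. This is only a sketch; the interval and query options are assumptions, not a measured recommendation:

```go
package catalogpoll

import (
	"context"
	"time"

	"github.com/hashicorp/consul/api"
)

// waitForService polls the Consul catalog until at least one instance of
// the named service is registered, or the context expires.
func waitForService(ctx context.Context, client *api.Client, service string) error {
	opts := (&api.QueryOptions{AllowStale: true}).WithContext(ctx)
	for {
		// Plain (non-blocking) read of the catalog entries for this service.
		entries, _, err := client.Catalog().Service(service, "", opts)
		if err == nil && len(entries) > 0 {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(2 * time.Second): // poll every few seconds
		}
	}
}
```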
Ideas welcome.