connect: improve envoy bootstrap coordination #10451
This PR wraps the use of the `consul connect envoy -bootstrap` command in an exponential backoff closure, configured to time out after 60 seconds. This is an increase over the current behavior of making 3 attempts over 6 seconds. Should help with #10451.
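For illustration, here is a minimal sketch of what wrapping the bootstrap command in an exponential backoff closure with a 60-second deadline could look like. The function name, argument handling, and backoff constants are assumptions for the sketch, not the PR's literal implementation:

```go
package bootstrap

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// bootstrapEnvoy retries the Consul CLI's Envoy bootstrap until it succeeds
// or the overall 60-second deadline expires, doubling the wait between
// attempts. Constants and argument handling here are illustrative only.
func bootstrapEnvoy(ctx context.Context, extraArgs []string) error {
	ctx, cancel := context.WithTimeout(ctx, 60*time.Second)
	defer cancel()

	wait := 2 * time.Second
	for {
		args := append([]string{"connect", "envoy", "-bootstrap"}, extraArgs...)
		cmd := exec.CommandContext(ctx, "consul", args...)
		if err := cmd.Run(); err == nil {
			return nil // bootstrap config rendered successfully
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("envoy bootstrap timed out: %w", ctx.Err())
		case <-time.After(wait):
			if wait < 16*time.Second {
				wait *= 2 // exponential growth, capped
			}
		}
	}
}
```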
Looking at this error shortly after its two-year anniversary, with zero actual responses from the developers, is not giving me warm-and-fuzzy feelings about the reliability of Consul/Nomad service meshes. I'm still seeing frequent timeouts on high-end VMs with near-zero load. This really strikes me as fundamentally a Consul bug (or design flaw): why isn't a registered task effectively immediately available, especially on a server with essentially zero load? Raising the timeout is a band-aid to mask the fact that it can take 60+ seconds to register a task on an otherwise idle Consul cluster, and it clearly isn't enough. The image above shows one of the intermittent failures I've seen. I've also seen tasks that Nomad says are 100% healthy take minutes to show up in Consul. Unless there's some Consul setting I can change that will make Consul behave with reasonable latency, my only choice will be to abandon the Nomad/Consul service mesh.
Having recently switched to using Nomad workload identity (new in 1.7.x) to generate tokens from Consul, I've observed this is still relevant. If possible, it would be great to see this connectivity between Nomad <-> Consul further prioritized and strengthened, with better solutions to problems like these.
Just a heads up that we've got a proposed fix being worked on. See #9307 (comment) for explanation.
Thanks Tim, glad this is on the radar!
Nomad creates a Consul ACL token for each service, used to register the service in Consul or to bootstrap the Envoy proxy (for service mesh workloads). Nomad always talks to the local Consul agent and never directly to the Consul servers. But the local Consul agent talks to the Consul servers in stale consistency mode to reduce load on the servers. This can result in the Nomad client making the Envoy bootstrap request with a token that has not yet replicated to the follower that the local client is connected to. This request gets a 404 on the ACL token, and that negative entry gets cached, preventing any retries from succeeding.

To work around this, we'll use a method described by our friends over on `consul-k8s`: after creating the service token, we try to read the token back from the local agent in stale consistency mode (which prevents a failed read from being cached). This cannot completely eliminate this source of error, because it's possible that Consul cluster replication is unhealthy at the time we need it, but it should make Envoy bootstrap significantly more robust.

In this changeset, we add the preflight check after we log in via Workload Identity and in the function we use to derive tokens in the legacy workflow. The timeouts are configurable via node metadata rather than the usual static configuration because in most cases users should not need to touch, or even know about, these values; the configuration is mostly available for testing.

Fixes: #9307
Fixes: #20516
Fixes: #10451
Ref: hashicorp/consul-k8s#887
Ref: https://hashicorp.atlassian.net/browse/NET-10051
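A rough sketch of the preflight check described above, using the `github.com/hashicorp/consul/api` client. The timeout and poll interval are placeholders standing in for the values the changeset reads from node metadata:

```go
package preflight

import (
	"context"
	"time"

	"github.com/hashicorp/consul/api"
)

// preflightCheckToken polls the local Consul agent until the newly minted
// ACL token can be read back. AllowStale matches the agent's own consistency
// mode and keeps a failed lookup out of the agent's negative cache.
func preflightCheckToken(ctx context.Context, client *api.Client, secretID string) error {
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second) // placeholder; read from node metadata in the changeset
	defer cancel()

	opts := &api.QueryOptions{Token: secretID, AllowStale: true}
	for {
		// Reading the token as "self" succeeds only once the token has
		// replicated to the server the local agent is connected to.
		_, _, err := client.ACL().TokenReadSelf(opts.WithContext(ctx))
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(500 * time.Millisecond): // placeholder poll interval
		}
	}
}
```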
I've just merged #23381, which hopefully should close out this issue. That's planned for release in Nomad 1.8.2 (with backports to Nomad Enterprise 1.7.x and 1.6.x). Once you've deployed that, if you're still running into issues with Envoy bootstrap, please report a new issue after going through the troubleshooting guides in Nomad's Service Mesh troubleshooting and Resolving Common Errors in Envoy Proxy. That'll help us separate out additional problems above and beyond the ones we've identified here.
Currently the Envoy bootstrap process needs Consul to accomplish two things before invoking `consul connect envoy -bootstrap ...`:

1. For sidecar proxies, the parent service needs to finish being registered into the Consul catalog.
2. When ACLs are enabled, the minted Service Identity tokens need to be propagated through Consul.
The implementation right now is to simply retry 3 times at 2-second intervals. Until the final failure, errors invoking Consul are ignored. This implies the above two prerequisites must be met within 6 seconds of the `envoy_bootstrap_hook` running, which is not always the case.

Ideally we would have blocking queries to work with, but neither the agent service registration nor the ACL token creation APIs support blocking queries:
https://www.consul.io/api-docs/agent/service#register-service
https://www.consul.io/api-docs/acl/tokens#create-a-token
I spent a little time looking into setting a watch on the Catalog waiting for the service to be registered, but AFAICT Consul will trip the watch on any service change, not just the service we care about. In the end, using a watch was a lot more chatty over the wire than just polling every few seconds.

For the time being, the best option here might be to just increase the amount of time the bootstrap hook allows before giving up, maybe even making it a Nomad client config parameter. Experimenting locally, it's hard to gauge what a "reasonable" wait is, so how about an increase from 6 to 60 seconds?
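For reference, "polling every few seconds" is straightforward against the catalog API. This is only a sketch; the interval and query options are assumptions, not a measured recommendation:

```go
package catalogpoll

import (
	"context"
	"time"

	"github.com/hashicorp/consul/api"
)

// waitForService polls the Consul catalog until at least one instance of
// the named service is registered, or the context expires.
func waitForService(ctx context.Context, client *api.Client, service string) error {
	opts := (&api.QueryOptions{AllowStale: true}).WithContext(ctx)
	for {
		// Plain (non-blocking) read of the catalog entries for this service.
		entries, _, err := client.Catalog().Service(service, "", opts)
		if err == nil && len(entries) > 0 {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(2 * time.Second): // poll every few seconds
		}
	}
}
```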
Ideas welcome.