connect: improve envoy bootstrap coordination #10451

Closed
shoenig opened this issue Apr 26, 2021 · 6 comments · Fixed by #23381
Labels: stage/accepted, theme/consul/connect, type/bug

Comments

@shoenig
Contributor

shoenig commented Apr 26, 2021

Currently the envoy bootstrap process needs Consul to accomplish 2 things before invoking consul envoy -bootstrap ...

  1. For sidecar proxies, the parent service needs to finish being registered into Consul catalog

  2. When ACLs are enabled, the minted Service Identity tokens need to be propagated through Consul

The implementation right now simply retries 3 times at 2-second intervals. Until the final failure, errors invoking Consul are ignored. This implies the two prerequisites above must be met within 6 seconds of the envoy_bootstrap_hook running, which is not always the case.
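
As a rough illustration of the retry behavior described above, here is a minimal Go sketch. It is not Nomad's actual code: `retryFixed` and `errNotReady` are hypothetical names, and the callback stands in for invoking the Consul bootstrap command. This sketch also skips the sleep after the last attempt, which may differ from the real hook.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady is a hypothetical stand-in for any failure from invoking Consul,
// for example the parent service not yet being registered in the catalog.
var errNotReady = errors.New("consul not ready")

// retryFixed calls op up to attempts times, sleeping interval between tries.
// Errors from all but the last attempt are dropped; only the final error is
// returned. It returns nil as soon as op succeeds.
func retryFixed(attempts int, interval time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(interval)
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Simulated bootstrap that only succeeds on the fourth call, so a budget
	// of three attempts is not enough. A short interval keeps the demo fast;
	// the issue describes 2-second intervals.
	err := retryFixed(3, 10*time.Millisecond, func() error {
		calls++
		if calls < 4 {
			return errNotReady
		}
		return nil
	})
	fmt.Println("attempts:", calls, "error:", err)
}
```

The weakness is visible in the demo: if the prerequisites take slightly longer than the fixed retry window, the hook fails even though one more attempt would have succeeded.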

Ideally we would have blocking queries to work with, but neither the agent service registration nor the ACL token creation API supports blocking queries.

https://www.consul.io/api-docs/agent/service#register-service
https://www.consul.io/api-docs/acl/tokens#create-a-token

I spent a little time looking into setting a watch on the Catalog to wait for the service to be registered, but AFAICT Consul will trip the watch on any service change, not just changes to the service we care about. In the end, using a watch was a lot more chatty over the wire than just polling every few seconds. For the time being, the best option here might be to just increase the amount of time the bootstrap hook allows before giving up, maybe even making it a Nomad client config parameter. Experimenting locally, it's hard to gauge what a "reasonable" wait is, so how about an increase from 6 to 60 seconds?

Ideas welcome.

@shoenig shoenig self-assigned this Apr 26, 2021
@shoenig shoenig added theme/consul/connect Consul Connect integration stage/accepted Confirmed, and intend to work on. No timeline commitment though. labels Apr 26, 2021
shoenig added a commit that referenced this issue Apr 26, 2021
This PR wraps the use of the consul envoy bootstrap command in
an exponential backoff closure, configured to time out after 60
seconds. This is an increase over the current behavior of making
3 attempts over 6 seconds.

Should help with #10451
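
For readers unfamiliar with the pattern named in the commit message above, the following Go sketch shows one way an exponential backoff closure with an overall timeout might look. It is an assumption-laden illustration, not the code from the PR: `retryBackoff`, `errNotReady`, and the base and cap values are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady is a hypothetical stand-in for a failed bootstrap invocation.
var errNotReady = errors.New("consul not ready")

// retryBackoff calls op until it succeeds or the overall timeout would be
// exceeded. The wait between attempts starts at base and doubles after each
// failure, capped at maxWait.
func retryBackoff(timeout, base, maxWait time.Duration, op func() error) error {
	deadline := time.Now().Add(timeout)
	wait := base
	for {
		err := op()
		if err == nil {
			return nil
		}
		// Give up if the next wait would carry us past the deadline.
		if time.Now().Add(wait).After(deadline) {
			return fmt.Errorf("giving up after %s: %w", timeout, err)
		}
		time.Sleep(wait)
		if wait *= 2; wait > maxWait {
			wait = maxWait
		}
	}
}

func main() {
	calls := 0
	// Simulated bootstrap that succeeds on the fourth call. The durations are
	// scaled down for the demo; the commit describes a 60-second timeout.
	err := retryBackoff(time.Second, 10*time.Millisecond, 80*time.Millisecond, func() error {
		calls++
		if calls < 4 {
			return errNotReady
		}
		return nil
	})
	fmt.Println("attempts:", calls, "error:", err)
}
```

Compared with a fixed number of attempts, this bounds the total time rather than the attempt count, so early retries stay cheap while a slow registration still has most of the budget available.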
shoenig added a commit that referenced this issue Apr 27, 2021
@TimMensch

Looking at this error shortly after its two year anniversary of being posted with zero actual responses from the developers is not giving me warm-and-fuzzy feelings about the reliability of Consul/Nomad service meshes.

I'm still seeing frequent timeouts on high-end VMs with near zero load.

This really strikes me as fundamentally a Consul bug (or design flaw). Why isn't a registered task effectively immediately available, especially on a server with effectively zero load? Raising the timeout is a band-aid to mask the fact that it can take 60+ seconds to register a task on an otherwise idle Consul cluster, and it clearly isn't enough.

[Image: firefox_2023-05-04_09-57-29]

The image above shows one of the intermittent failures I've seen. I've also seen tasks that Nomad says are 100% healthy take minutes to show up on Consul. Unless there's some Consul setting that I can change that will make Consul behave with reasonable latency, my only choice will be to abandon Nomad/Consul Service Mesh.

@colinbruner

Switching recently to using Nomad workload identity (new in 1.7.x) to generate tokens from Consul, I've observed this is still relevant.

If possible, it would be great to see this connectivity between Nomad <-> Consul further prioritized and strengthened with better solutions to problems like these.

@tgross
Member

tgross commented Jun 17, 2024

Just a heads up that we've got a proposed fix being worked on. See #9307 (comment) for explanation.

@tgross tgross self-assigned this Jun 17, 2024
@colinbruner

Thanks Tim, glad this is on the radar!

tgross added a commit that referenced this issue Jun 20, 2024
Nomad creates a Consul ACL token for each service for registering it in Consul
or bootstrapping the Envoy proxy (for service mesh workloads). Nomad always
talks to the local Consul agent and never directly to the Consul servers. But
the local Consul agent talks to the Consul servers in stale consistency mode to
reduce load on the servers. This can result in the Nomad client making the Envoy
bootstrap request with a token that has not yet replicated to the follower that
the local client is connected to. This request gets a 404 on the ACL token and
that negative entry gets cached, preventing any retries from succeeding.

To work around this, we'll use a method described by our friends over on
`consul-k8s` where after creating the service token we try to read the token
from the local agent in stale consistency mode (which prevents a failed read
from being cached). This cannot completely eliminate this source of error
because it's possible that Consul cluster replication is unhealthy at the time
we need it, but this should make Envoy bootstrap significantly more robust.

In this changeset, we add the preflight check after we login via Workload
Identity and in the function we use to derive tokens in the legacy
workflow. We've added the timeouts to be configurable via node metadata rather
than the usual static configuration because for most cases, users should not
need to touch or even know these values are configurable; the configuration is
mostly available for testing.

Fixes: #9307
Fixes: #20516
Fixes: #10451

Ref: hashicorp/consul-k8s#887
Ref: https://hashicorp.atlassian.net/browse/NET-10051
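
The preflight check described in the commit message above can be sketched in Go roughly as follows. This is a simplified illustration under stated assumptions, not the merged change: `tokenReader`, `preflightToken`, and `errTokenNotFound` are hypothetical names, and the `tokenReader` function stands in for a stale-consistency read of the token through the local Consul agent.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTokenNotFound is a hypothetical error for a token that has not yet
// replicated to the server the local agent is talking to.
var errTokenNotFound = errors.New("ACL token not found")

// tokenReader is a hypothetical stand-in for reading a token from the local
// agent in stale consistency mode, which per the commit message avoids
// caching a failed read.
type tokenReader func(secretID string) error

// preflightToken polls read until the token is visible or the timeout has
// elapsed. Only after it returns nil should the Envoy bootstrap be attempted.
func preflightToken(read tokenReader, secretID string, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		err := read(secretID)
		if err == nil {
			return nil
		}
		if !time.Now().Before(deadline) {
			return fmt.Errorf("token not visible after %s: %w", timeout, err)
		}
		time.Sleep(interval)
	}
}

func main() {
	reads := 0
	// Simulated follower that only sees the token on the third read.
	read := func(string) error {
		reads++
		if reads < 3 {
			return errTokenNotFound
		}
		return nil
	}
	if err := preflightToken(read, "example-secret", time.Second, 10*time.Millisecond); err != nil {
		fmt.Println("preflight failed:", err)
		return
	}
	fmt.Println("token visible after", reads, "reads; safe to bootstrap Envoy")
}
```

The point of the ordering is that the bootstrap request is never made with a token the agent cannot yet resolve, so the negative cache entry described in the commit message is never created in the first place.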
tgross added a commit that referenced this issue Jun 21, 2024
tgross added a commit that referenced this issue Jun 26, 2024
@tgross tgross closed this as completed in df67e74 Jun 27, 2024
@tgross
Member

tgross commented Jun 27, 2024

I've just merged #23381 which hopefully should close out this issue. That's planned for release in Nomad 1.8.2 (with backports to Nomad Enterprise 1.7.x and 1.6.x). Once you've deployed that, if you're still running into issues with Envoy bootstrap, please report a new issue after having gone through the troubleshooting guides in Nomad's Service Mesh troubleshooting and Resolving Common Errors in Envoy Proxy. That'll help us separate out additional problems above-and-beyond the ones we've identified here.

@tgross tgross added this to the 1.8.2 milestone Jun 27, 2024

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 21, 2024