Consul: add preflight checks for Envoy bootstrap #23381
Conversation
I'm very excited to see Envoy bootstrapping get more reliable! Thanks for following this through.
client/consul/consul.go (outdated)
```go
// these node values will be evaluated at the time we create the hooks, so
// we don't need to worry about them changing out from under us
partition := node.Attributes["consul.partition"]
preflightCheckTimeout := durationFromMeta(
	node, "consul.token_preflight_check.timeout", time.Second*10)
preflightCheckBaseInterval := durationFromMeta(
	node, "consul.token_preflight_check.base", time.Millisecond*500)
```
Hm, I'm not super familiar with this code, but it appears `NewConsulClientFactory` will be called from `NewClient`, and therefore the timeout and interval node meta will be captured only during agent startup. Don't the `durationFromMeta` calls need to happen inside the closure and use `client.Node()` instead of a captured Node?
That's a good catch. For partitions, we don't expect that to change because it's part of the fingerprint we get from the agent. But metadata can be changed via the `node meta` command, so yeah, we'll need to move this into the closure.
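To make the proposed fix concrete, here's a simplified sketch of re-reading node metadata inside the factory closure; `newClientFactory`, `nodeFn`, and `clientOpts` are illustrative stand-ins, not Nomad's actual types:

```go
package consul

import (
	"time"

	"github.com/hashicorp/nomad/nomad/structs"
)

// clientOpts is a stand-in for whatever configuration the real factory
// produces for the Consul client hooks.
type clientOpts struct {
	partition                  string
	preflightCheckTimeout      time.Duration
	preflightCheckBaseInterval time.Duration
}

// newClientFactory captures a node getter instead of a node snapshot, so
// the preflight tunables are re-read each time the closure runs and pick
// up runtime changes made via the `node meta` command.
func newClientFactory(nodeFn func() *structs.Node) func() clientOpts {
	return func() clientOpts {
		node := nodeFn() // evaluated at call time, not at agent startup
		return clientOpts{
			// the partition comes from fingerprinting and isn't expected
			// to change, but reading it here costs nothing
			partition: node.Attributes["consul.partition"],
			preflightCheckTimeout: durationFromMeta(
				node, "consul.token_preflight_check.timeout", time.Second*10),
			preflightCheckBaseInterval: durationFromMeta(
				node, "consul.token_preflight_check.base", time.Millisecond*500),
		}
	}
}
```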
Fixed! (And rebased on main.) Also:

> I'm not super familiar with this code

The various dependency injection tricks we're doing here to avoid circular references don't help. This code all got really janky during the big WI + Consul/Vault refresh. Once we remove the legacy workflow we should be able to refactor this in the process to be a little more comprehensible.
Nomad creates a Consul ACL token for each service for registering it in Consul or bootstrapping the Envoy proxy (for service mesh workloads). Nomad always talks to the local Consul agent and never directly to the Consul servers. But the local Consul agent talks to the Consul servers in stale consistency mode to reduce load on the servers. This can result in the Nomad client making the Envoy bootstrap request with a token that has not yet replicated to the follower that the local client is connected to. This request gets a 404 on the ACL token and that negative entry gets cached, preventing any retries from succeeding.

To work around this, we'll use a method described by our friends over on `consul-k8s` where, after creating the service token, we try to read the token from the local agent in stale consistency mode (which prevents a failed read from being cached). This cannot completely eliminate this source of error because it's possible that Consul cluster replication is unhealthy at the time we need it, but this should make Envoy bootstrap significantly more robust.

In this changeset, we add the preflight check after we log in via Workload Identity and in the function we use to derive tokens in the legacy workflow. We've made the timeouts configurable via node metadata rather than the usual static configuration because, for most cases, users should not need to touch or even know these values are configurable; the configuration is mostly available for testing.

Fixes: #9307
Fixes: #20516
Fixes: #10451
Ref: hashicorp/consul-k8s#887
Ref: https://hashicorp.atlassian.net/browse/NET-10051
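The token preflight described above can be sketched with the public Consul API client; the helper name, retry shape, and use of `TokenRead` are illustrative assumptions, not the PR's actual implementation:

```go
package consul

import (
	"context"
	"fmt"
	"time"

	consulapi "github.com/hashicorp/consul/api"
)

// preflightCheckToken polls the local Consul agent until a freshly created
// ACL token is readable or the timeout expires. AllowStale matches the
// agent's consistency mode and, per the consul-k8s approach, keeps a miss
// from being negatively cached.
func preflightCheckToken(client *consulapi.Client, accessorID string,
	timeout, baseInterval time.Duration) error {

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	interval := baseInterval
	for {
		_, _, err := client.ACL().TokenRead(accessorID,
			&consulapi.QueryOptions{AllowStale: true})
		if err == nil {
			return nil // token has replicated to the server our agent uses
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("token not replicated after %v: %w", timeout, err)
		case <-time.After(interval):
			interval *= 2 // simple exponential backoff between reads
		}
	}
}
```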
Nomad creates a Consul service for service mesh workloads, and this service needs to be present in Consul before we can bootstrap the Envoy proxy. Nomad always talks to the local Consul agent and never directly to the Consul servers. But the local Consul agent talks to the Consul servers in stale consistency mode to reduce load on the servers. This can result in the Nomad client making the Envoy bootstrap request for a service that has not yet replicated to the follower that the local client is connected to. This request gets a 404 on the service and that negative entry gets cached, preventing any retries from succeeding.

To work around this, we'll query the service from the local Consul agent before attempting to bootstrap Envoy. This cannot completely eliminate this source of error because it's possible that Consul cluster replication is unhealthy at the time we need it, but this should make Envoy bootstrap significantly more robust.

We've made the timeouts configurable via node metadata rather than the usual static configuration because, for most cases, users should not need to touch or even know these values are configurable; the configuration is mostly available for testing.
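The service preflight can be sketched the same way; which endpoint the PR actually queries isn't shown in this thread, so the use of `Health().Service` here is an assumption for illustration:

```go
package consul

import (
	"context"
	"fmt"
	"time"

	consulapi "github.com/hashicorp/consul/api"
)

// preflightCheckService polls until the service is visible to the local
// agent or the timeout expires. Health().Service is an assumed endpoint;
// AllowStale again mirrors the agent's consistency mode.
func preflightCheckService(client *consulapi.Client, serviceName string,
	timeout, baseInterval time.Duration) error {

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	interval := baseInterval
	for {
		entries, _, err := client.Health().Service(serviceName, "", false,
			&consulapi.QueryOptions{AllowStale: true})
		if err == nil && len(entries) > 0 {
			return nil // service is visible; safe to bootstrap Envoy
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("service %q not visible after %v", serviceName, timeout)
		case <-time.After(interval):
			interval *= 2
		}
	}
}
```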
When we removed the helper that wrapped `libtime`, we missed wiring up the sleeper function so that we could track iterations in testing. But the parameters for `testify`'s `Greater` assertion are backwards compared to its other assertions, so we accidentally reversed the expectation and value as well. This caused a test for the iteration count to pass that should not have, but the ultimate effect was harmless because the iterations aren't tracked without the correct `libtime` setup anyway.

Update the backoff to use the test helper field and update it to the current `libtime` API.
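For readers unfamiliar with the sleeper wiring the commit refers to, here's a generic sketch of a backoff helper with an injectable sleep function so tests can count iterations without real delays; the names are illustrative, not Nomad's actual helper:

```go
package consul

import (
	"context"
	"time"
)

// backoff retries fn with exponential backoff until it succeeds or the
// context is done. The sleeper field lets tests substitute a fake that
// counts iterations instead of sleeping, which is the wiring the commit
// message above says was missed.
type backoff struct {
	base    time.Duration
	sleeper func(time.Duration) // nil means time.Sleep; override in tests
}

func (b *backoff) retry(ctx context.Context, fn func() error) error {
	sleep := b.sleeper
	if sleep == nil {
		sleep = time.Sleep
	}
	interval := b.base
	for {
		if err := fn(); err == nil {
			return nil
		}
		if err := ctx.Err(); err != nil {
			return err
		}
		sleep(interval)
		interval *= 2
	}
}
```

On the assertion ordering: `testify`'s `Greater(t, e1, e2)` asserts `e1 > e2`, so the measured value goes first, unlike `Equal(t, expected, actual)`; that reversed ordering is what let the iteration-count test pass incorrectly.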
LGTM!
Nomad creates Consul ACL tokens and service registrations to support Consul service mesh workloads before bootstrapping the Envoy proxy. Nomad always talks to the local Consul agent and never directly to the Consul servers. But the local Consul agent talks to the Consul servers in stale consistency mode to reduce load on the servers. This can result in the Nomad client making the Envoy bootstrap request with tokens or services that have not yet replicated to the follower that the local client is connected to. This request gets a 404 on the ACL token or service, and that negative entry gets cached, preventing any retries from succeeding.
To work around this, we'll use a method described by our friends over on `consul-k8s` where, after creating the objects in Consul, we try to read them from the local agent in stale consistency mode (which prevents a failed read from being cached). This cannot completely eliminate this source of error because it's possible that Consul cluster replication is unhealthy at the time we need it, but this should make Envoy bootstrap significantly more robust.

This changeset adds preflight checks for the objects we create in Consul: the ACL tokens and the service registrations.
We've made the timeouts configurable via node metadata rather than the usual static configuration because, for most cases, users should not need to touch or even know these values are configurable; the configuration is mostly available for testing.
Fixes: #9307
Fixes: #10451
Fixes: #20516
Ref: hashicorp/consul-k8s#887
Ref: https://hashicorp.atlassian.net/browse/NET-10051
Ref: https://hashicorp.atlassian.net/browse/NET-9273
Follow-up: https://hashicorp.atlassian.net/browse/NET-10138
Notes for reviewers: