Add a workaround to check that the ACL token is replicated to other Consul servers #887

Merged
merged 1 commit into main from ishustava/fix-acl-not-found on Dec 1, 2021

Conversation

ishustava
Contributor

@ishustava ishustava commented Dec 1, 2021

Fixes #862

A Consul client may reach out to a follower instead of the leader to resolve the token during the
call to get services. This is because clients talk to servers in stale consistency mode
to decrease the load on the servers (see https://www.consul.io/docs/architecture/consensus#stale).
In that case, it's possible that the token hasn't been replicated
to that server instance yet. The client will then get an "ACL not found" error
and subsequently cache this not-found response. Our call
to get services from the agent will then keep hitting the same "ACL not found" error
until the cache entry expires (determined by `acl_token_ttl`, which defaults to 30 seconds).
This is not great because it delays app startup by 30 seconds in most cases
(if you are running 3 servers, the probability of ending up on a follower is close to 2/3).

To help with that, we first retry reading the token in stale consistency mode until we
get a successful response. This should not take more than 100ms, because raft replication
should in most cases take less than that (see https://www.consul.io/docs/install/performance#read-write-tuning),
but we set the timeout to 2s to be safe.
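As an illustration of this retry, here is a minimal Go sketch (not the exact code merged in this PR) using the `github.com/hashicorp/consul/api` client; the helper name is ours, and the 100ms poll interval and 2s deadline mirror the numbers above:

```go
package preflight

import (
	"fmt"
	"time"

	"github.com/hashicorp/consul/api"
)

// waitForTokenReplication polls the client's own ACL token in stale
// consistency mode until a server can resolve it, so that the later
// (also stale) agent calls don't cache an "ACL not found" response.
// Hypothetical helper for illustration, not the code merged in this PR.
func waitForTokenReplication(client *api.Client) error {
	deadline := time.After(2 * time.Second)
	for {
		// AllowStale lets a follower answer, matching how clients resolve tokens.
		_, _, err := client.ACL().TokenReadSelf(&api.QueryOptions{AllowStale: true})
		if err == nil {
			return nil
		}
		select {
		case <-deadline:
			return fmt.Errorf("ACL token not replicated after 2s: %w", err)
		case <-time.After(100 * time.Millisecond):
			// Raft replication usually finishes well under 100ms; retry.
		}
	}
}
```

Calling something like this right after creating the token, and before the call to get services, gives replication a chance to catch up without adding a fixed sleep.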

Note, though, that this workaround does not eliminate the problem completely. It's still possible
for this call and the next call to reach different servers that are in different
states from each other.
For example, this call can reach the leader and succeed, while the call below goes to a follower
that is still behind the leader and gets an "ACL not found" error.
However, this is a pretty unlikely case because
clients have sticky connections to a server, and those connections get rebalanced only every 2-3 minutes.
So this workaround should work in the vast majority of cases.

How I've tested this PR:

  • unit tests
  • manually in a cluster with 3 servers

How I expect reviewers to test this PR:

  • code review

Checklist:

  • Tests added
  • CHANGELOG entry added

    HashiCorp engineers only, community PRs should not add a changelog entry.
    Entries should use present tense (e.g. Add support for...)

@ishustava ishustava force-pushed the ishustava/fix-acl-not-found branch from 2512866 to a60e6f6 Compare December 1, 2021 04:53
@ishustava ishustava requested review from a team, kschoche and t-eckert and removed request for a team December 1, 2021 04:54
Contributor

@t-eckert t-eckert left a comment


Spectacular! I always learn something new reading through your code.

@ishustava ishustava merged commit 1af9eda into main Dec 1, 2021
@ishustava ishustava deleted the ishustava/fix-acl-not-found branch December 1, 2021 18:19
tgross added a commit to hashicorp/nomad that referenced this pull request Jun 20, 2024
Nomad creates a Consul ACL token for each service for registering it in Consul
or bootstrapping the Envoy proxy (for service mesh workloads). Nomad always
talks to the local Consul agent and never directly to the Consul servers. But
the local Consul agent talks to the Consul servers in stale consistency mode to
reduce load on the servers. This can result in the Nomad client making the Envoy
bootstrap request with a token that has not yet replicated to the follower that
the local client is connected to. This request gets a 404 on the ACL token and
that negative entry gets cached, preventing any retries from succeeding.

To work around this, we'll use a method described by our friends over on
`consul-k8s`: after creating the service token, we try to read the token
from the local agent in stale consistency mode (which prevents a failed read
from being cached). This cannot completely eliminate this source of error
because it's possible that Consul cluster replication is unhealthy at the time
we need it, but it should make Envoy bootstrap significantly more robust.

In this changeset, we add the preflight check after we log in via Workload
Identity and in the function we use to derive tokens in the legacy
workflow. We've made the timeouts configurable via node metadata rather
than the usual static configuration because, in most cases, users should not
need to touch or even know these values are configurable; the configuration
exists mostly for testing.
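
To give a rough idea of the "configurable via node metadata" choice, here is a small Go sketch of reading such a timeout with a fallback default; the metadata key name and the 2s default are hypothetical stand-ins, not Nomad's actual schema:

```go
package preflight

import "time"

// preflightTimeout returns the preflight-check timeout, honoring an override
// from node metadata when one is present and valid. The key name and the 2s
// default below are hypothetical; they only illustrate the shape of the config.
func preflightTimeout(nodeMeta map[string]string) time.Duration {
	const defaultTimeout = 2 * time.Second
	raw, ok := nodeMeta["consul.token_preflight_check.timeout"]
	if !ok {
		return defaultTimeout
	}
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return defaultTimeout
	}
	return d
}
```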

Fixes: #9307
Fixes: #20516
Fixes: #10451

Ref: hashicorp/consul-k8s#887
Ref: https://hashicorp.atlassian.net/browse/NET-10051
tgross added a commit to hashicorp/nomad that referenced this pull request Jun 21, 2024
tgross added a commit to hashicorp/nomad that referenced this pull request Jun 26, 2024
tgross added a commit to hashicorp/nomad that referenced this pull request Jun 27, 2024
Nomad creates Consul ACL tokens and service registrations to support Consul
service mesh workloads, before bootstrapping the Envoy proxy. Nomad always talks
to the local Consul agent and never directly to the Consul servers. But the
local Consul agent talks to the Consul servers in stale consistency mode to
reduce load on the servers. This can result in the Nomad client making the Envoy
bootstrap request with tokens or services that have not yet replicated to the
follower that the local client is connected to. This request gets a 404 on the
ACL token and that negative entry gets cached, preventing any retries from
succeeding.

To work around this, we'll use a method described by our friends over on
`consul-k8s` where after creating the objects in Consul we try to read them from
the local agent in stale consistency mode (which prevents a failed read from
being cached). This cannot completely eliminate this source of error because
it's possible that Consul cluster replication is unhealthy at the time we need
it, but this should make Envoy bootstrap significantly more robust.

This changeset adds preflight checks for the objects we create in Consul:
* We add a preflight check for ACL tokens after we log in via Workload
  Identity and in the function we use to derive tokens in the legacy
  workflow. We do this check early because we also want to use this token for
  registering group services in the allocrunner hooks.
* We add a preflight check for services right before we bootstrap Envoy in the
  taskrunner hook, so that we have time for our service client to batch updates
  to the local Consul agent in addition to the local agent sync.

We've made the timeouts configurable via node metadata rather than the
usual static configuration because, in most cases, users should not need to
touch or even know these values are configurable; the configuration exists
mostly for testing.
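
For the service preflight check, one plausible shape (our sketch, not the actual Nomad hook) is to poll the Consul catalog in stale consistency mode until the registered service is visible, bounded by the same kind of timeout:

```go
package preflight

import (
	"context"
	"fmt"
	"time"

	"github.com/hashicorp/consul/api"
)

// waitForServiceRegistration polls the catalog in stale consistency mode until
// the named service shows up or the context expires. Illustrative only; the
// real Nomad preflight check may use a different endpoint or query options.
func waitForServiceRegistration(ctx context.Context, client *api.Client, name string) error {
	for {
		services, _, err := client.Catalog().Service(name, "", &api.QueryOptions{AllowStale: true})
		if err == nil && len(services) > 0 {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("service %q not visible before deadline: %w", name, ctx.Err())
		case <-time.After(100 * time.Millisecond):
			// Keep polling; agent-to-server sync and replication usually finish quickly.
		}
	}
}
```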


Fixes: #9307
Fixes: #10451
Fixes: #20516

Ref: hashicorp/consul-k8s#887
Ref: https://hashicorp.atlassian.net/browse/NET-10051
Ref: https://hashicorp.atlassian.net/browse/NET-9273
Follow-up: https://hashicorp.atlassian.net/browse/NET-10138
Successfully merging this pull request may close these issues.

403 (ACL not found) followed by successful deployment