Add a workaround to check that the ACL token is replicated to other Consul servers #887

Merged
merged 1 commit into main from ishustava/fix-acl-not-found on Dec 1, 2021

Conversation

ishustava
Contributor

@ishustava ishustava commented Dec 1, 2021

Fixes #862

A Consul client may reach out to a follower instead of the leader to resolve the token during the
call to get services. This is because clients talk to servers in stale consistency mode
to decrease the load on the servers (see https://www.consul.io/docs/architecture/consensus#stale).
In that case, it's possible that the token hasn't been replicated
to that server instance yet. The client will then get an "ACL not found" error
and subsequently cache this not-found response. Our call
to get services from the agent will then keep hitting the same "ACL not found" error
until the cache entry expires (determined by `acl_token_ttl`, which defaults to 30 seconds).
This is not great because it delays app startup by 30 seconds in most cases
(if you are running 3 servers, the probability of ending up on a follower is close to 2/3).

To help with that, we first retry reading the token in stale consistency mode until we
get a successful response. This should not take more than 100ms, because raft replication
should in most cases take less than that (see https://www.consul.io/docs/install/performance#read-write-tuning),
but we set the timeout to 2s to be safe.
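As an illustration of this retry, here is a minimal Go sketch (not the exact code merged in this PR) using the `github.com/hashicorp/consul/api` client; the helper name is ours, and the 100ms poll interval and 2s deadline mirror the numbers above:

```go
package preflight

import (
	"fmt"
	"time"

	"github.com/hashicorp/consul/api"
)

// waitForTokenReplication polls the client's own ACL token in stale
// consistency mode until a server can resolve it, so that the later
// (also stale) agent calls don't cache an "ACL not found" response.
// Hypothetical helper for illustration, not the code merged in this PR.
func waitForTokenReplication(client *api.Client) error {
	deadline := time.After(2 * time.Second)
	for {
		// AllowStale lets a follower answer, matching how clients resolve tokens.
		_, _, err := client.ACL().TokenReadSelf(&api.QueryOptions{AllowStale: true})
		if err == nil {
			return nil
		}
		select {
		case <-deadline:
			return fmt.Errorf("ACL token not replicated after 2s: %w", err)
		case <-time.After(100 * time.Millisecond):
			// Raft replication usually finishes well under 100ms; retry.
		}
	}
}
```

Calling something like this right after creating the token, and before the call to get services, gives replication a chance to catch up without adding a fixed sleep.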

Note, though, that this workaround does not eliminate the problem completely. It's still possible
for this call and the next call to reach different servers that are in different
states from each other.
For example, this call can reach the leader and succeed, while the call below goes to a follower
that is still behind the leader and gets an "ACL not found" error.
However, this is a pretty unlikely case because
clients have sticky connections to a server, and those connections get rebalanced only every 2-3 minutes.
So this workaround should work in the vast majority of cases.

How I've tested this PR:

  • unit tests
  • manually in a cluster with 3 servers

How I expect reviewers to test this PR:

  • code review

Checklist:

  • Tests added
  • CHANGELOG entry added

    HashiCorp engineers only, community PRs should not add a changelog entry.
    Entries should use present tense (e.g. Add support for...)

@ishustava ishustava force-pushed the ishustava/fix-acl-not-found branch from 2512866 to a60e6f6 Compare December 1, 2021 04:53
@ishustava ishustava requested review from a team, kschoche and t-eckert and removed request for a team December 1, 2021 04:54
Contributor

@t-eckert t-eckert left a comment


Spectacular! I always learn something new reading through your code.

@ishustava ishustava merged commit 1af9eda into main Dec 1, 2021
@ishustava ishustava deleted the ishustava/fix-acl-not-found branch December 1, 2021 18:19
tgross added a commit to hashicorp/nomad that referenced this pull request Jun 20, 2024
Nomad creates a Consul ACL token for each service for registering it in Consul
or bootstrapping the Envoy proxy (for service mesh workloads). Nomad always
talks to the local Consul agent and never directly to the Consul servers. But
the local Consul agent talks to the Consul servers in stale consistency mode to
reduce load on the servers. This can result in the Nomad client making the Envoy
bootstrap request with a token that has not yet replicated to the follower that
the local client is connected to. This request gets a 404 on the ACL token and
that negative entry gets cached, preventing any retries from succeeding.

To work around this, we'll use a method described by our friends over on
`consul-k8s`: after creating the service token, we try to read the token
from the local agent in stale consistency mode (which prevents a failed read
from being cached). This cannot completely eliminate this source of error
because it's possible that Consul cluster replication is unhealthy at the time
we need it, but it should make Envoy bootstrap significantly more robust.

In this changeset, we add the preflight check after we log in via Workload
Identity and in the function we use to derive tokens in the legacy
workflow. We've made the timeouts configurable via node metadata rather
than the usual static configuration because, in most cases, users should not
need to touch or even know these values are configurable; the configuration
exists mostly for testing.
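
To give a rough idea of the "configurable via node metadata" choice, here is a small Go sketch of reading such a timeout with a fallback default; the metadata key name and the 2s default are hypothetical stand-ins, not Nomad's actual schema:

```go
package preflight

import "time"

// preflightTimeout returns the preflight-check timeout, honoring an override
// from node metadata when one is present and valid. The key name and the 2s
// default below are hypothetical; they only illustrate the shape of the config.
func preflightTimeout(nodeMeta map[string]string) time.Duration {
	const defaultTimeout = 2 * time.Second
	raw, ok := nodeMeta["consul.token_preflight_check.timeout"]
	if !ok {
		return defaultTimeout
	}
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return defaultTimeout
	}
	return d
}
```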

Fixes: #9307
Fixes: #20516
Fixes: #10451

Ref: hashicorp/consul-k8s#887
Ref: https://hashicorp.atlassian.net/browse/NET-10051
tgross added a commit to hashicorp/nomad that referenced this pull request Jun 21, 2024
tgross added a commit to hashicorp/nomad that referenced this pull request Jun 26, 2024
tgross added a commit to hashicorp/nomad that referenced this pull request Jun 27, 2024
Nomad creates Consul ACL tokens and service registrations to support Consul
service mesh workloads, before bootstrapping the Envoy proxy. Nomad always talks
to the local Consul agent and never directly to the Consul servers. But the
local Consul agent talks to the Consul servers in stale consistency mode to
reduce load on the servers. This can result in the Nomad client making the Envoy
bootstrap request with tokens or services that have not yet replicated to the
follower that the local client is connected to. This request gets a 404 on the
ACL token and that negative entry gets cached, preventing any retries from
succeeding.

To work around this, we'll use a method described by our friends over on
`consul-k8s` where after creating the objects in Consul we try to read them from
the local agent in stale consistency mode (which prevents a failed read from
being cached). This cannot completely eliminate this source of error because
it's possible that Consul cluster replication is unhealthy at the time we need
it, but this should make Envoy bootstrap significantly more robust.

This changeset adds preflight checks for the objects we create in Consul:
* We add a preflight check for ACL tokens after we log in via Workload
  Identity and in the function we use to derive tokens in the legacy
  workflow. We do this check early because we also want to use this token for
  registering group services in the allocrunner hooks.
* We add a preflight check for services right before we bootstrap Envoy in the
  taskrunner hook, so that we have time for our service client to batch updates
  to the local Consul agent in addition to the local agent sync.

We've made the timeouts configurable via node metadata rather than the
usual static configuration because, in most cases, users should not need to
touch or even know these values are configurable; the configuration exists
mostly for testing.
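
For the service preflight check, one plausible shape (our sketch, not the actual Nomad hook) is to poll the Consul catalog in stale consistency mode until the registered service is visible, bounded by the same kind of timeout:

```go
package preflight

import (
	"context"
	"fmt"
	"time"

	"github.com/hashicorp/consul/api"
)

// waitForServiceRegistration polls the catalog in stale consistency mode until
// the named service shows up or the context expires. Illustrative only; the
// real Nomad preflight check may use a different endpoint or query options.
func waitForServiceRegistration(ctx context.Context, client *api.Client, name string) error {
	for {
		services, _, err := client.Catalog().Service(name, "", &api.QueryOptions{AllowStale: true})
		if err == nil && len(services) > 0 {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("service %q not visible before deadline: %w", name, ctx.Err())
		case <-time.After(100 * time.Millisecond):
			// Keep polling; agent-to-server sync and replication usually finish quickly.
		}
	}
}
```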


Fixes: #9307
Fixes: #10451
Fixes: #20516

Ref: hashicorp/consul-k8s#887
Ref: https://hashicorp.atlassian.net/browse/NET-10051
Ref: https://hashicorp.atlassian.net/browse/NET-9273
Follow-up: https://hashicorp.atlassian.net/browse/NET-10138
Successfully merging this pull request may close these issues.

403 (ACL not found) followed by successful deployment