Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consul: improve reliability of deregistration #24166

Merged
merged 1 commit into from
Oct 11, 2024
Merged

Conversation

tgross
Copy link
Member

@tgross tgross commented Oct 10, 2024

When the local Consul agent receives a deregister request, it performs a pre-flight check using the locally cached ACL token. The agent then sends the request upstream to the Consul servers as part of anti-entropy, using its own token. This requires that the token we use for deregistration is valid even though that's not the token used to write to the Consul server.

There are several cases where the service identity token might no longer exist at the time of deregistration:

  • A race condition between the sync and destroying the allocation.
  • Misconfiguration of the Consul auth method with a TTL.
  • Out-of-band destruction of the token.

Additionally, Nomad's sync with Consul returns early if there are any errors, which means that a single broken token can prevent any other service on the Nomad agent from being registered or deregistered.

Update Nomad's sync with Consul to use the Nomad agent's own Consul token for deregistration, regardless of which token the service was registered with. Accumulate errors from the sync so that they no longer block deregistration of other services.

Fixes: #20159
Ref: https://hashicorp.atlassian.net/browse/NET-10286


I've got some documentation improvements coming as well: #24167. In addition to the usual tests, I've run the following scenario locally. Run the tproxy example job and verify that services are registered in the server's catalog:

$ nomad job run ./tproxy.nomad.hcl
...

$ consul catalog services
consul
count-api
count-api-sidecar-proxy
count-dashboard
count-dashboard-sidecar-proxy
nomad
nomad-client

Delete all the tokens associated with our workload identity login method:

for token in $(consul acl token list -format=json | jq -r '.[] | select(.AuthMethod == "nomad-workloads") | .AccessorID'); do consul acl token delete -accessor-id "$token"; done
Token "7900e53b-a649-4686-b258-3dde39bdd07d" deleted successfully
Token "ab02193b-e3f9-9f03-bbf6-77589db9a294" deleted successfully
Token "280719dc-1b55-9dcc-91ea-164d8dfcdf17" deleted successfully

Stop the job, and verify that all services are deregistered successfully from the server's catalog:

$ nomad job stop countdash

$ consul catalog services
consul
nomad
nomad-client

No unexpected errors show up on the Consul or Nomad side now. We do see the following in the logs once the allocation is GC'd, which we'd expect to see because we deleted the token so logging that token out will always fail:

    2024-10-10T13:47:16.696-0400 [WARN]  client.alloc_runner: error running destroy hooks: alloc_id=d4f61b00-d7e1-ac47-6ab6-9b5e6cc125ac
  error=
  | 1 error occurred:
  | \t* destroy hook "consul" failed: 1 error occurred:
  | \t* 1 error occurred:
  | \t* Unexpected response code: 403 (rpc error making call: ACL not found)

@tgross tgross force-pushed the b-consul-token-ttl-refresh branch from 300fb82 to 079bc0b Compare October 10, 2024 18:05
@tgross tgross added this to the 1.9.x milestone Oct 10, 2024
@tgross tgross added backport/ent/1.7.x+ent Changes are backported to 1.7.x+ent backport/ent/1.8.x+ent Changes are backported to 1.8.x+ent backport/1.8.x backport to 1.8.x release line backport/1.9.x backport to 1.9.x release line theme/consul theme/workload-identity and removed backport/1.8.x backport to 1.8.x release line labels Oct 10, 2024
@tgross tgross marked this pull request as ready for review October 10, 2024 18:52
tgross added a commit that referenced this pull request Oct 10, 2024
As of #24166, Nomad agents will use their own token to deregister services and
checks from Consul. This returns the deregistration path to the pre-Workload
Identity workflow. Expand the documentation to make clear why certain ACL
policies are required for clients.

Additionally, we did not explicitly call out that auth methods should not set an
expiration on Consul tokens. Nomad does not have a facility to refresh these
tokens if they expire. Even if Nomad could, there's no way to re-inject them
into Envoy sidecars for Consul Service Mesh without recreating the task anyways,
which is what happens today. Warn users that they should not set an expiration.

Closes: #20185 (wontfix)
Ref: https://hashicorp.atlassian.net/browse/NET-10262
// isOldNomadService returns true if the ID matches an old pattern managed by
// Nomad.
//
// Pre-0.7.1 task service IDs are of the form:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: I was touching all the callers of this anyway... I think it's probably safe to remove pre-0.7.1 backwards compatibility code for service IDs at this point.

Copy link
Member

@shoenig shoenig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@@ -973,8 +974,7 @@ func (c *ServiceClient) merge(ops *operations) {
func (c *ServiceClient) sync(reason syncReason) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How often does sync run and would you want to re-run it faster after fails > 0?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without failure it's triggered every 30s or whenever the client pushes an update. On failure we drop down to 1s with backoff.

tgross added a commit that referenced this pull request Oct 11, 2024
As of #24166, Nomad agents will use their own token to deregister services and
checks from Consul. This returns the deregistration path to the pre-Workload
Identity workflow. Expand the documentation to make clear why certain ACL
policies are required for clients.

Additionally, we did not explicitly call out that auth methods should not set an
expiration on Consul tokens. Nomad does not have a facility to refresh these
tokens if they expire. Even if Nomad could, there's no way to re-inject them
into Envoy sidecars for Consul Service Mesh without recreating the task anyways,
which is what happens today. Warn users that they should not set an expiration.

Closes: #20185 (wontfix)
Ref: https://hashicorp.atlassian.net/browse/NET-10262
When the local Consul agent receives a deregister request, it performs a
pre-flight check using the locally cached ACL token. The agent then sends the
request upstream to the Consul servers as part of anti-entropy, using its own
token. This requires that the token we use for deregistration is valid even
though that's not the token used to write to the Consul server.

There are several cases where the service identity token might no longer exist
at the time of deregistration:
* A race condition between the sync and destroying the allocation.
* Misconfiguration of the Consul auth method with a TTL.
* Out-of-band destruction of the token.

Additionally, Nomad's sync with Consul returns early if there are any errors,
which means that a single broken token can prevent any other service on the
Nomad agent from being registered or deregistered.

Update Nomad's sync with Consul to use the Nomad agent's own Consul token for
deregistration, regardless of which token the service was registered
with. Accumulate errors from the sync so that they no longer block
deregistration of other services.

Fixes: #20159
@tgross tgross force-pushed the b-consul-token-ttl-refresh branch from 079bc0b to 1400913 Compare October 11, 2024 16:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/ent/1.7.x+ent Changes are backported to 1.7.x+ent backport/ent/1.8.x+ent Changes are backported to 1.8.x+ent backport/1.9.x backport to 1.9.x release line theme/consul theme/workload-identity
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Nomad Jobs unable to update Consul Services after undetermined amount of time.
3 participants