Consul: improve reliability of deregistration #24166
300fb82 to 079bc0b
As of #24166, Nomad agents will use their own token to deregister services and checks from Consul. This returns the deregistration path to the pre-Workload Identity workflow. Expand the documentation to make clear why certain ACL policies are required for clients.

Additionally, we did not explicitly call out that auth methods should not set an expiration on Consul tokens. Nomad does not have a facility to refresh these tokens if they expire. Even if Nomad could, there's no way to re-inject them into Envoy sidecars for Consul Service Mesh without recreating the task anyway, which is what happens today. Warn users that they should not set an expiration.

Closes: #20185 (wontfix)
Ref: https://hashicorp.atlassian.net/browse/NET-10262
// isOldNomadService returns true if the ID matches an old pattern managed by
// Nomad.
//
// Pre-0.7.1 task service IDs are of the form:
Note to reviewers: I was touching all the callers of this anyway, and I think it's probably safe to remove the pre-0.7.1 backwards-compatibility code for service IDs at this point.
LGTM!
@@ -973,8 +974,7 @@ func (c *ServiceClient) merge(ops *operations) {
func (c *ServiceClient) sync(reason syncReason) error {
How often does sync run, and would you want to re-run it faster after fails > 0?
Without failures, it's triggered every 30s or whenever the client pushes an update. On failure, we drop down to 1s with backoff.
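The schedule described in this reply can be sketched roughly as follows. This is an illustration only, not the code in this PR: the name nextSyncDelay, the doubling factor, and the cap at the periodic interval are assumptions; only the 30s and 1s figures come from the comment above.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// periodicInterval is the steady-state sync period quoted in the review reply.
	periodicInterval = 30 * time.Second
	// failureBase is the first retry delay after a failed sync, also from the reply.
	failureBase = 1 * time.Second
)

// nextSyncDelay returns how long to wait before the next sync given the number
// of consecutive failed syncs. Doubling per failure and capping at the periodic
// interval are assumptions made for this sketch.
func nextSyncDelay(fails int) time.Duration {
	if fails <= 0 {
		return periodicInterval
	}
	d := failureBase
	for i := 1; i < fails; i++ {
		d *= 2
		if d >= periodicInterval {
			return periodicInterval
		}
	}
	return d
}

func main() {
	for fails := 0; fails <= 6; fails++ {
		fmt.Printf("fails=%d delay=%s\n", fails, nextSyncDelay(fails))
	}
}
```

A push from the client would still trigger a sync immediately; the delay here only governs the timer-driven path.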
When the local Consul agent receives a deregister request, it performs a pre-flight check using the locally cached ACL token. The agent then sends the request upstream to the Consul servers as part of anti-entropy, using its own token. This requires that the token we use for deregistration is valid even though that's not the token used to write to the Consul server.

There are several cases where the service identity token might no longer exist at the time of deregistration:
* A race condition between the sync and destroying the allocation.
* Misconfiguration of the Consul auth method with a TTL.
* Out-of-band destruction of the token.

Additionally, Nomad's sync with Consul returns early if there are any errors, which means that a single broken token can prevent any other service on the Nomad agent from being registered or deregistered.

Update Nomad's sync with Consul to use the Nomad agent's own Consul token for deregistration, regardless of which token the service was registered with. Accumulate errors from the sync so that they no longer block deregistration of other services.

Fixes: #20159
079bc0b to 1400913
When the local Consul agent receives a deregister request, it performs a pre-flight check using the locally cached ACL token. The agent then sends the request upstream to the Consul servers as part of anti-entropy, using its own token. This requires that the token we use for deregistration is valid even though that's not the token used to write to the Consul server.
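There are several cases where the service identity token might no longer exist at the time of deregistration:
* A race condition between the sync and destroying the allocation.
* Misconfiguration of the Consul auth method with a TTL.
* Out-of-band destruction of the token.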
Additionally, Nomad's sync with Consul returns early if there are any errors, which means that a single broken token can prevent any other service on the Nomad agent from being registered or deregistered.
Update Nomad's sync with Consul to use the Nomad agent's own Consul token for deregistration, regardless of which token the service was registered with. Accumulate errors from the sync so that they no longer block deregistration of other services.
Fixes: #20159
Ref: https://hashicorp.atlassian.net/browse/NET-10286
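As a rough illustration of the change described above, the sketch below deregisters every service with the agent's own token and collects failures instead of returning on the first one. It is not the code in this PR: syncDeregistrations, deregisterFunc, and errTokenNotFound are hypothetical names, and errors.Join (Go 1.20+) stands in for whatever error-accumulation helper the real implementation uses.

```go
package main

import (
	"errors"
	"fmt"
)

// errTokenNotFound is a hypothetical error standing in for an ACL failure.
var errTokenNotFound = errors.New("ACL token not found")

// deregisterFunc is a hypothetical stand-in for the request sent to the local
// Consul agent. It is not part of the Consul API client.
type deregisterFunc func(serviceID, token string) error

// syncDeregistrations tries to deregister every service ID using the agent's
// own token. A failure is recorded and the loop moves on, so one bad service
// cannot block the rest.
func syncDeregistrations(ids []string, agentToken string, dereg deregisterFunc) error {
	var errs []error
	for _, id := range ids {
		if err := dereg(id, agentToken); err != nil {
			errs = append(errs, fmt.Errorf("deregister %q: %w", id, err))
		}
	}
	// errors.Join returns nil when nothing failed.
	return errors.Join(errs...)
}

func main() {
	// Fake deregistration that always fails for one service.
	dereg := func(id, token string) error {
		if id == "svc-b" {
			return errTokenNotFound
		}
		fmt.Println("deregistered", id)
		return nil
	}
	err := syncDeregistrations([]string{"svc-a", "svc-b", "svc-c"}, "agent-token", dereg)
	fmt.Println("sync error:", err)
}
```

In this sketch svc-c is still deregistered even though svc-b fails, which is the behavior the description is after.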
I've got some documentation improvements coming as well: #24167. In addition to the usual tests, I've run the following scenario locally. Run the tproxy example job and verify that services are registered in the server's catalog:
Delete all the tokens associated with our workload identity login method:
Stop the job, and verify that all services are deregistered successfully from the server's catalog:
No unexpected errors show up on the Consul or Nomad side now. We do see the following in the logs once the allocation is GC'd, which is expected: we deleted the token, so logging that token out will always fail: