
Lots of ACL errors due to SI tokens not being found #7441

Open
jorgemarey opened this issue Mar 12, 2020 · 8 comments
Labels
theme/consul-nomad Consul & Nomad shared usability type/bug Feature does not function as expected

Comments

@jorgemarey
Contributor

Overview of the Issue

Hi, we have a cluster running Consul 1.7.1 and Nomad 0.10.4, and we're testing Connect with ACLs in that environment. After some tests (everything worked perfectly), we started seeing the following:

Client Logs:

Mar 12 07:26:07 wrkr-dd379243-3b consul: {"@level":"error","@message":"RPC failed to server","@module":"agent.client","@timestamp":"2020-03-12T07:26:07.660688Z","error":"rpc error making call: rpc error making call: ACL not found","method":"Intention.Match","server":{"IP":"10.172.0.16","Port":8300,"Zone":""}}
Mar 12 07:26:30 wrkr-dd379243-3b consul: {"@level":"error","@message":"RPC failed to server","@module":"agent.client","@timestamp":"2020-03-12T07:26:30.995703Z","error":"rpc error making call: rpc error making call: ACL not found","method":"Intention.Match","server":{"IP":"10.172.0.5","Port":8300,"Zone":""}}
Mar 12 07:26:38 wrkr-dd379243-3b consul: {"@level":"error","@message":"RPC failed to server","@module":"agent.client","@timestamp":"2020-03-12T07:26:38.383670Z","error":"rpc error making call: ACL not found","method":"Intention.Match","server":{"IP":"10.172.0.11","Port":8300,"Zone":""}}
Mar 12 07:26:53 wrkr-dd379243-3b consul: {"@level":"error","@message":"RPC failed to server","@module":"agent.client","@timestamp":"2020-03-12T07:26:53.285521Z","error":"rpc error making call: rpc error making call: ACL not found","method":"Intention.Match","server":{"IP":"10.172.0.5","Port":8300,"Zone":""}}
Mar 12 07:26:54 wrkr-dd379243-3b consul: {"@level":"error","@message":"RPC failed to server","@module":"agent.client","@timestamp":"2020-03-12T07:26:54.610362Z","error":"rpc error making call: ACL not found","method":"Intention.Match","server":{"IP":"10.172.0.11","Port":8300,"Zone":""}}
Mar 12 07:27:09 wrkr-dd379243-3b consul: {"@level":"error","@message":"RPC failed to server","@module":"agent.client","@timestamp":"2020-03-12T07:27:09.359415Z","error":"rpc error making call: rpc error making call: ACL not found","method":"Intention.Match","server":{"IP":"10.172.0.16","Port":8300,"Zone":""}}
Mar 12 07:27:31 wrkr-dd379243-3b consul: {"@level":"error","@message":"RPC failed to server","@module":"agent.client","@timestamp":"2020-03-12T07:27:31.795928Z","error":"rpc error making call: rpc error making call: ACL not found","method":"Intention.Match","server":{"IP":"10.172.0.5","Port":8300,"Zone":""}}
Mar 12 07:27:40 wrkr-dd379243-3b consul: {"@level":"error","@message":"RPC failed to server","@module":"agent.client","@timestamp":"2020-03-12T07:27:40.164076Z","error":"rpc error making call: ACL not found","method":"Intention.Match","server":{"IP":"10.172.0.11","Port":8300,"Zone":""}}

These log messages appear on some nodes at a rate of several per minute. I added some extra log lines, recompiled, and swapped the binary on one server node, and I'm seeing Intention.Match requests made with tokens that were already deleted (from previous Nomad Envoy tasks):
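The extra server-side line is just a warn-level log in the Intention.Match endpoint that prints the incoming token; roughly something like the snippet below (a reconstruction for illustration, not the exact patch — the logger call and field names are assumptions):

	// Hypothetical debug line added to the Intention.Match RPC handler to
	// surface which ACL token each request carries (names are assumptions).
	s.logger.Warn("Match information",
		"intention", args.Match.Entries[0].Name,
		"token", args.QueryOptions.Token,
	)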


Server logs:

Mar 12 07:32:42 ctrl-fa10eb42-3b consul: {"@level":"warn","@message":"Match information","@module":"agent.server.intentions","@timestamp":"2020-03-12T07:32:42.113348Z","intention":"","token":"723aefc1-855c-906d-9df5-cfad94c91a63"}
Mar 12 07:32:46 ctrl-fa10eb42-3b consul: {"@level":"warn","@message":"Match information","@module":"agent.server.intentions","@timestamp":"2020-03-12T07:32:46.093700Z","intention":"","token":"fb4a896d-86af-9837-addf-1777363edfea"}
Mar 12 07:33:20 ctrl-fa10eb42-3b consul: {"@level":"warn","@message":"Match information","@module":"agent.server.intentions","@timestamp":"2020-03-12T07:33:20.664108Z","intention":"","token":"f2c68a05-0436-4ba8-b3b0-0f8c0e3ee7cf"}
Mar 12 07:33:22 ctrl-fa10eb42-3b consul: {"@level":"warn","@message":"Match information","@module":"agent.server.intentions","@timestamp":"2020-03-12T07:33:22.373436Z","intention":"","token":"6175af82-dea3-2487-b4d7-7b66ef4ffe49"}

After restarting Consul, these log messages stop appearing.

I don't know if this is related to the cache the agents maintain, but I don't think these messages should appear constantly.

Thanks!

@mkeeler
Member

mkeeler commented Mar 12, 2020

@jorgemarey Could you provide a little more information about how the tokens are changing? I suspect this is related to the watch initiated here:

err = s.cache.Notify(s.ctx, cachetype.IntentionMatchName, &structs.IntentionQueryRequest{
	Datacenter:   s.source.Datacenter,
	QueryOptions: structs.QueryOptions{Token: s.token},
	Match: &structs.IntentionQueryMatch{
		Type: structs.IntentionMatchDestination,
		Entries: []structs.IntentionMatchEntry{
			{
				Namespace: s.proxyID.NamespaceOrDefault(),
				Name:      s.proxyCfg.DestinationServiceName,
			},
		},
	},
}, intentionsWatchID, s.ch)
if err != nil {
	return err
}

The token used for this request is the token that was used to register the service with the agent, or the agent's default token at the time of registration. It doesn't look like this token will ever change after we start tracking a proxy service, though, so if you were re-registering a service with a new token, or simply deleting that token, that could be the cause here.

Regardless, a little more insight into what you are doing with token management could help to track this down.

@mkeeler mkeeler added the waiting-reply Waiting on response from Original Poster or another individual in the thread label Mar 12, 2020
@jorgemarey
Contributor Author

Hi @mkeeler, thanks for the reply.

I'm not doing anything special. I just run Nomad with Consul and ACLs, using the simple job file from their example.

In the Nomad code they request a token here.

From what I can see, the tokens that reach Consul in the Intention.Match function are the service identity tokens that Nomad creates for the Envoy containers. Every time it adds a new container, Nomad creates a new service identity token for that Envoy, and when it destroys the container it deletes the token. The errors I'm seeing are for tokens that were already deleted, but the Consul client still seems to be making requests to the servers with those tokens.
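To make the lifecycle concrete, this is roughly what Nomad effectively does, sketched with the consul/api Go client (the service name and token handling are made up for illustration; Nomad's real code differs):

package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Assumes a management token in CONSUL_HTTP_TOKEN, as Nomad servers would have.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// When an Envoy sidecar task starts, a service identity token is created for it.
	token, _, err := client.ACL().TokenCreate(&api.ACLToken{
		Description:       "SI token for web sidecar (example)",
		ServiceIdentities: []*api.ACLServiceIdentity{{ServiceName: "web"}},
	}, nil)
	if err != nil {
		log.Fatal(err)
	}

	// The token is then used to register the sidecar service with the local agent,
	// and the agent starts background watches (e.g. Intention.Match) with it.

	// When the task stops, the token is deleted, but the agent's cached watches
	// keep refreshing with the now-deleted token, producing "ACL not found" errors.
	if _, err := client.ACL().TokenDelete(token.AccessorID, nil); err != nil {
		log.Fatal(err)
	}
}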

Maybe this issue is on their side? But since the errors persisted after I restarted Nomad, yet stopped when I restarted Consul, I thought this was more related to Consul.

If I can provide more info I'll be happy to do so, but I think this is more related to what Nomad does.

Thanks!

@ghost ghost removed waiting-reply Waiting on response from Original Poster or another individual in the thread labels Mar 12, 2020
@jorgemarey
Contributor Author

jorgemarey commented Mar 13, 2020

Hi again, @mkeeler

I was doing some debugging and found that this is failing in a loop here.

The fetch function calls refresh and then that calls fetch again without ever stopping.

consul/agent/cache/cache.go, lines 664 to 688 in 2cf0a3c:

func (c *Cache) refresh(opts *RegisterOptions, attempt uint, t string, key string, r Request) {
	// Sanity-check, we should not schedule anything that has refresh disabled
	if !opts.Refresh {
		return
	}
	// Check if cache was stopped
	if atomic.LoadUint32(&c.stopped) == 1 {
		return
	}
	// If we're over the attempt minimum, start an exponential backoff.
	if wait := backOffWait(attempt); wait > 0 {
		time.Sleep(wait)
	}
	// If we have a timer, wait for it
	if opts.RefreshTimer > 0 {
		time.Sleep(opts.RefreshTimer)
	}
	// Trigger. The "allowNew" field is false because in the time we were
	// waiting to refresh we may have expired and got evicted. If that
	// happened, we don't want to create a new entry.
	c.fetch(t, key, r, false, attempt, 0, true, true)
}

I compiled the Consul code with some more debug lines, deployed it to our test environment, and was able to verify this.

Could a request to ConnectAuthorize trigger that behaviour?

@jorgemarey
Contributor Author

Hi again, just to provide more info: the requests with the same token keep being made for 3 days; after those 3 days the errors for that ACL token stop. I guess that's related to the cache TTL configured in Consul.

@mkeeler
Member

mkeeler commented Mar 16, 2020

@jorgemarey Thanks for all the info. This definitely does seem to be related to the cache continuing its background refresh even when nothing is requesting it.

@mkeeler mkeeler added the type/bug Feature does not function as expected label Mar 16, 2020
@jsosulska jsosulska added the theme/consul-nomad Consul & Nomad shared usability label Apr 17, 2020
@jorgemarey
Contributor Author

jorgemarey commented Jun 19, 2020

Hi @mkeeler, any news on this? We just upgraded a prod cluster to 1.7.4 and keep seeing this error. As we also use Nomad, this is a highly dynamic environment with a lot of tokens being created and deleted. It doesn't affect the correct behaviour of the environment, but it generates a lot of error logs.

I don't know if this is the same problem as the one reported here (this one is a permission denied):

2020-06-19T07:32:09.544Z [ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=XX.XX.XX.165:8300 error="rpc error making call: Permission denied"
2020-06-19T07:32:45.180Z [ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=XX.XX.XX.49:8300 error="rpc error making call: rpc error making call: Permission denied"
2020-06-19T07:34:05.749Z [ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=XX.XX.XX.79:8300 error="rpc error making call: rpc error making call: Permission denied"
2020-06-19T07:34:29.427Z [ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=XX.XX.XX.49:8300 error="rpc error making call: rpc error making call: Permission denied"
2020-06-19T07:35:43.070Z [ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=XX.XX.XX.79:8300 error="rpc error making call: rpc error making call: Permission denied"
2020-06-19T07:36:02.601Z [ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=XX.XX.XX.165:8300 error="rpc error making call: Permission denied"
2020-06-19T07:36:55.079Z [ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=XX.XX.XX.49:8300 error="rpc error making call: rpc error making call: Permission denied"
2020-06-19T07:37:42.277Z [ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=XX.XX.XX.49:8300 error="rpc error making call: rpc error making call: Permission denied"

If the token making the requests is the agent's token, its policy is:

node_prefix "" {
  policy = "write"
}
service_prefix "" {
  policy = "read"
}

I changed service_prefix to "write" but the error persists.

@dnephin
Contributor

dnephin commented Jun 30, 2020

It sounds like this may be partially related to #4968. It has to do with cache TTLs being 72h, which doesn't work well with tokens that change frequently.

#8092 made some changes in this area to allow the cache to stop waiting on a fetch if a context is cancelled. It may be possible to pass the context further into fetch() and have the fetch cycle stop once that context is cancelled. I'm not sure yet what the repercussions of that change might be.
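To sketch the idea (a toy model only, not Consul's actual cache code): if the background refresh loop were tied to a context that is cancelled when the last watcher goes away, the loop would stop instead of retrying with a deleted token until the 72h TTL expires.

package main

import (
	"context"
	"fmt"
	"time"
)

// refreshLoop models the cache's background refresh: it keeps re-fetching
// forever unless the context tied to the watcher is cancelled.
func refreshLoop(ctx context.Context, fetch func() error) {
	for {
		select {
		case <-ctx.Done():
			fmt.Println("watcher gone, stopping refresh")
			return
		case <-time.After(100 * time.Millisecond): // stands in for backOffWait / RefreshTimer
		}
		if err := fetch(); err != nil {
			fmt.Println("fetch failed:", err) // e.g. "ACL not found"
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	go refreshLoop(ctx, func() error { return fmt.Errorf("ACL not found") })

	time.Sleep(500 * time.Millisecond)
	cancel() // the proxy and its token are gone; without this signal the loop keeps spinning
	time.Sleep(200 * time.Millisecond)
}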

@Himura2la

Himura2la commented Oct 23, 2024

I'm not absolutely sure we have the same problem, but we also see a flood of ACL errors after migrating to Workload Identities and some seemingly random deployments in Nomad. First it fails on method=GET url=/v1/acl/token/self?stale=, then it keeps failing in a loop on method=ACL.TokenRead and the service deregister request.
Adding a service_prefix "" { policy = "write" } policy to a Nomad leader's Consul node identity seems to fix the issue, but we can't accept such a fix for security reasons.
