Connect: Leaf certs are re-created unnecessarily due to race #4479
Labels: theme/connect, type/bug
The agent's leaf certificate fetcher runs a separate goroutine to watch for changes to the root CA so that it can proactively rotate certificates when the root changes.
The loop that handles the (cached) response has a bug, though: even if the roots didn't change, a nil err is sent on the channel.
consul/agent/cache-types/connect_ca_leaf.go
Lines 204 to 219 in b5abf61
That nil error is handled in the main leaf fetch routine above:
consul/agent/cache-types/connect_ca_leaf.go
Lines 84 to 90 in b5abf61
And if a nil response is returned, it's taken as a signal that the roots changed and a new leaf is needed, dropping through that select and into the key gen and CSR signing code below.
In the happy case this is rarely observed because the root watch goroutine is started after the timeout chan is set up:
consul/agent/cache-types/connect_ca_leaf.go
Lines 84 to 90 in b5abf61
And both the timeoutChan and the blocking cache get for the roots use the same timeout, meaning that unless the root actually changes, the timeout usually wins and the fetch exits without generating a new certificate. But this relies on the timing working out in our favor, and it's a simple fix.
But if the roots request returned (and cached) an error response, then the same error is returned immediately when the background CA fetcher returns (which could be any time). In this case it causes the leaf certificate to be regenerated even though the roots didn't change (they just had an error cached, which is another issue entirely but made this one apparent).