Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connect: Leaf certs are re-created unnecessarily due to race #4479

Closed
banks opened this issue Aug 1, 2018 · 0 comments · Fixed by #5091
Closed

Connect: Leaf certs are re-created unnecessarily due to race #4479

banks opened this issue Aug 1, 2018 · 0 comments · Fixed by #5091
Labels
theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies type/bug Feature does not function as expected
Milestone

Comments

@banks
Copy link
Member

banks commented Aug 1, 2018

The agent's leaf certificate fetcher runs a separate go-routine to watch for changes to the root CA so that it can proactively rotate certificates when the root changes.

The loop that handles the (cached) response though has a bug where even if it the roots didn't change, a nil err is sent on the channel.

for {
// We only set our index if its newer than what is previously set.
old := atomic.LoadUint64(&c.caIndex)
if old == roots.Index || old > roots.Index {
break
}
// Set the new index atomically. If the caIndex value changed
// in the meantime, retry.
if atomic.CompareAndSwapUint64(&c.caIndex, old, roots.Index) {
break
}
}
// Trigger the channel since we updated.
ch <- nil

That nil error is handled in the main leaf fetch routine above:

case err := <-newRootCACh:
// A new root CA triggers us to refresh the leaf certificate.
// If there was an error while getting the root CA then we return.
// Otherwise, we leave the select statement and move to generation.
if err != nil {
return result, err
}

And if a nil response is returned, it's take as a signal that roots changed an a new leaf is needed dropping through that select and into the key gen and CSR signing code below.

In the happy case this is rarely observed because the root watch goroutine is started after the timeout chan is setup:

case err := <-newRootCACh:
// A new root CA triggers us to refresh the leaf certificate.
// If there was an error while getting the root CA then we return.
// Otherwise, we leave the select statement and move to generation.
if err != nil {
return result, err
}

And both the timeoutChan and the blocking cache get for the roots use the same timeout meaning that unless the root actually changes, the timeout usually wins and exits without generating a new certificate. But this is relying on timing not to work out wrong and it's a simple fix.

But if the roots request returned (and cached) an error response, then the same error is returned immediately when the background CA fetcher returns (which could be any time). In this case it causes the leaf CA to be re-generated even though the Roots didn't change (they just had an error cached which is another issue entirely but made this one apparent).

@banks banks added type/bug Feature does not function as expected theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies labels Aug 1, 2018
@banks banks added this to the Upcoming milestone Aug 1, 2018
@pearkes pearkes modified the milestones: Upcoming, 1.2.3 Aug 7, 2018
@pearkes pearkes modified the milestones: 1.2.3, 1.3.0 Sep 11, 2018
@mkeeler mkeeler modified the milestones: 1.3.0, 1.4.0 Oct 10, 2018
@banks banks modified the milestones: 1.4.0, Upcoming Oct 15, 2018
@banks banks modified the milestones: Upcoming, 1.4.1 Nov 28, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies type/bug Feature does not function as expected
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants