Check if intermediate cert needs to be renewed. #6835
Conversation
Force-pushed from a805de9 to 372ef03.
I'm not very familiar with the Connect CA, so I won't approve the PR, but I have a couple of comments/questions:
- The definition for waitForActiveCARoot in TestLeader_SecondaryCA_IntermediateRenew is missing.
- Would using syscall.Getrlimit cause problems if someone runs the test suite on a non-Unix platform like Windows?
@freddygv thanks for the review
Force-pushed from 8d391e3 to 1855f53.
Great job @i0rek!
Overall I think this is great and super close. I have one significant suggestion about how we trigger this, which I think will also help you solve the test flakiness by not relying on blocking semantics at all.
agent/consul/leader_connect.go (outdated snippet):

        intermediateCert.NotAfter) {
        needsNewIntermediate = true
    }
}
This currently relies on our blocking query implementation, which times out every 10 minutes, to implement a "periodic check". I'm not sure that's the best approach, for a few reasons:
- When we upgrade this to streaming there will be no periodic timeouts.
- We are using an implementation detail (that this method is called regularly even when there are no changes) in an unexpected/undocumented way. Even without streaming it would be easy to miss this assumption and, for example, add a line to the replication loop that skips calling this in the case of a timeout where the index is the same. I doubt that's plausibly going to happen in practice, but it's an example of why it feels wrong to implicitly assume regular iterations of this loop in the steady state.

Could we instead have a dedicated timer loop that somehow triggers the renewal? I realise it's more complicated having two different goroutines call this, as they could do so concurrently, but I don't think it's that hard to either trigger it via the same goroutine as now (just have it select on either a watch update or a timer update) or else just synchronize them with a mutex.
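The single-goroutine alternative suggested here could be sketched roughly as follows. Note this is an illustrative sketch, not Consul's actual code: `rootUpdates`, `checkIntermediate`, and the one-hour interval are all assumed names/values.

```go
package main

import "time"

// renewalLoop is a hypothetical sketch of the suggested design: one
// goroutine owns all intermediate-cert maintenance and selects on either
// a root watch update or a periodic renewal timer, so the two triggers
// can never fire concurrently and no mutex is needed.
func renewalLoop(rootUpdates <-chan struct{}, stop <-chan struct{}, checkIntermediate func(reason string)) {
	ticker := time.NewTicker(time.Hour) // renewal check interval (assumed)
	defer ticker.Stop()

	for {
		select {
		case <-rootUpdates:
			checkIntermediate("root changed")
		case <-ticker.C:
			checkIntermediate("periodic timer")
		case <-stop:
			return
		}
	}
}
```

Because both triggers funnel through one `select`, the renewal check stops depending on the blocking query's 10-minute timeout as an implicit ticker.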
That's a good point. It required me to redo most of the work, but it is better now!
    logf("INFO", "blockSize %d too big for system limit %d. Adjusting...", blockSize, limit)
    blockSize = limit - 3
}
Is this solving the annoying Darwin warning in tests? If so, cool :) I wasn't sure whether it's relevant to this PR or not.
I don't think so. But it solves the case where your system limits are lower than the lower bound of freeport, which makes freeport's own test fail instantly on OS X.
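The adjustment being discussed might look roughly like this (Unix-only, which is exactly the Windows concern raised earlier). The function name is hypothetical; the log message and the `- 3` headroom mirror the snippet quoted above.

```go
package main

import (
	"fmt"
	"syscall"
)

// clampBlockSize is a sketch (not the actual freeport implementation)
// of shrinking a requested port block size to fit under the process's
// open-file limit instead of failing outright when the limit is low.
func clampBlockSize(blockSize int) (int, error) {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, err
	}
	// Compare in uint64 space so RLIM_INFINITY never clamps anything.
	if uint64(blockSize) > rl.Cur {
		fmt.Printf("INFO: blockSize %d too big for system limit %d. Adjusting...\n", blockSize, rl.Cur)
		blockSize = int(rl.Cur) - 3 // leave a little headroom for stdio fds
	}
	return blockSize, nil
}
```

On Windows, `syscall.Getrlimit` does not exist, so real code would need a build-tagged fallback.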
Force-pushed from e0c9e51 to 5e8bb04.
Force-pushed from 5e8bb04 to ea92d8c.
Tracking test failures atm.
Force-pushed from 63a1502 to 4839682.
Hey Hans, looking super good!
I think most of this is just naming bikesheds. I really like the split of some of the code that was getting kinda hairy, though, once the names are clear!
The main blocker I see is that the renewal can race with noticing a new root, which could corrupt the CA config state just like the issue we had the other day. I don't think we need to do anything really sophisticated here: maybe just have a mutex held for the entire process of checking renewal, and hold the same mutex for any other secondary CA reconfiguration, so we can be sure that there is only one reconfig going at once.
Thank you for the review @banks! The provider reconfiguration now also has its own lock. Despite the existence of
Force-pushed from 56c44d2 to 88b90ad.
This looks great, thanks Hans! 🎉
@@ -173,6 +177,9 @@ func (s *Server) initializeCA() error {
 	if err != nil {
 		return err
 	}
+
+	s.caProviderReconfigurationLock.Lock()
+	defer s.caProviderReconfigurationLock.Unlock()
 	s.setCAProvider(provider, nil)
Might be a good idea to document in the comment for setCAProvider that it is often called while holding caProviderReconfigurationLock, and so must never take that lock or call anything that does.
I think it's unlikely it would, but it's not super obvious that doing so would cause a deadlock, which might bite us eventually.
I added that warning in a bunch of places.
If system limits are lower than the block size, it fails outright. Even though the size is adjusted now, it is still possible for it to fail, but at least the failure is not as obvious anymore.
Force-pushed from 9bc417f to 03a95a5.
Fixes #6542.
Currently, when using the built-in CA provider for Connect, root certificates are valid for 10 years, but secondary DCs get intermediates that are valid for only 1 year. There is currently no mechanism, short of rotating the root in the primary, that will cause the secondary DCs to renew their intermediates.
This PR adds a check that renews the cert once it is halfway through its validity period.
In order to be able to test these changes, a new configuration option, IntermediateCertTTL, was added, which is set extremely low in the tests.
The commits for this PR are arranged so that you can read them one by one and get a better overview. They contain some bugfixes and refactorings; only the last commit has the new feature!