ops: add ttl metrics for certificate expiration#130110
ops: add ttl metrics for certificate expiration#130110craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
| ) | ||
|
|
||
| func makeMetrics() Metrics { | ||
| func expirationGauge(metadata metric.Metadata, ci *CertInfo) *metric.Gauge { |
There was a problem hiding this comment.
Should these lock the certificate manager while they're being read?
| // the correct expiration and ttl values are set on the metrics vars that are | ||
| // exposed to our collectors. It uses a single key pair to approximate the | ||
| // behavior across other key pairs. | ||
| func TestMetricsValues(t *testing.T) { |
There was a problem hiding this comment.
Only testing the CA certs for TTL and expiration, open to going beyond that as well
d1d4eaa to
029bb3f
Compare
| return "Client" | ||
| case TenantPem: | ||
| return "Tenant Client" | ||
| case TenantSigningPem: |
There was a problem hiding this comment.
Missing string value
xinhaoz
left a comment
There was a problem hiding this comment.
Nice! Overall changes look good, just some small comments on naming and commit hygiene.
Reviewed 5 of 6 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @angles-n-daemons)
-- commits line 2 at r2:
Please add the headling ops:... to the commit message too.
-- commits line 7 at r2:
I'm a little confused about this release note - not sure if this security: part is misleading. It seems like this release note is more suitable as the commit message body. We don't need to add any comments on refactoring in release notes - they are for documenting user facing changes.
https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/73072807/Git+Commit+Messages
pkg/security/certificate_metrics.go line 175 at r1 (raw file):
Previously, angles-n-daemons (Brian Dillmann) wrote…
Should these lock the certificate manager while they're being read?
This fn seems fine, as it's not actually reading from the cm directly and the ci passed in isn't expected to change. Left a comment on the caller of this fn regarding the locking.
pkg/security/certificate_metrics.go line 132 at r2 (raw file):
metaClientTTL = metric.Metadata{ Name: "security.certificate.ttl.client", Help: "Seconds till expiration for the client certificates., labeled by SQL user. 0 means expired, no certificate or error.",
nit: extra . after certificates
pkg/security/certificate_metrics.go line 202 at r2 (raw file):
// metric to zero. // cm.mu must be held to protect the certificates. Metrics do their own atomicity. func (cm *CertificateManager) createMetrics() Metrics {
It seems like this has to be called when the cm is locked since it's accessing fields protected under mu. Do you mind just renaming this to createMetricsLocked to signal that? I see that it's being called in LoadCertificates which locks it so the usages are good!
pritesh-lahoti
left a comment
There was a problem hiding this comment.
Reviewed 4 of 6 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @angles-n-daemons and @xinhaoz)
pkg/security/certificate_manager.go line 115 at r2 (raw file):
tenantIdentifier uint64 // timeSource, if set, specifies the time source with which the metrics are set
nit
Suggestion:
set.pkg/security/certificate_manager.go line 132 at r2 (raw file):
// WithTimeSource allows the caller to pass a time source to be used // by the Metrics struct (mostly for testing)
nit
Suggestion:
(mostly for testing).pkg/security/certificate_metrics.go line 160 at r2 (raw file):
Unit: metric.Unit_TIMESTAMP_SEC, }
nit: extra new line
pkg/security/certificate_loader_test.go line 130 at r2 (raw file):
} var defaultExpiration = timeutil.Now().Add(time.Hour)
Should this initialization happen within the test?
This value might get set way before the test using it is actually being executed.
pkg/security/certificate_metrics_test.go line 46 at r2 (raw file):
{"ca.crt", 0777, caCert}, {"ca.key", 0777, []byte("ca.key")}, } {
Could this also have the other node/client/tenant certificates?
angles-n-daemons
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @pritesh-lahoti and @xinhaoz)
Previously, xinhaoz (Xin Hao Zhang) wrote…
Please add the headling
ops:...to the commit message too.
ah makes sense, will do
Previously, xinhaoz (Xin Hao Zhang) wrote…
I'm a little confused about this release note - not sure if this
security:part is misleading. It seems like this release note is more suitable as the commit message body. We don't need to add any comments on refactoring in release notes - they are for documenting user facing changes.
https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/73072807/Git+Commit+Messages
gotcha, thanks for pointing this out
pkg/security/certificate_manager.go line 115 at r2 (raw file):
Previously, pritesh-lahoti wrote…
nit
👀
pkg/security/certificate_manager.go line 132 at r2 (raw file):
Previously, pritesh-lahoti wrote…
nit
👀
pkg/security/certificate_metrics.go line 132 at r2 (raw file):
Previously, xinhaoz (Xin Hao Zhang) wrote…
nit: extra
.after certificates
👀
pkg/security/certificate_metrics.go line 160 at r2 (raw file):
Previously, pritesh-lahoti wrote…
nit: extra new line
👀
pkg/security/certificate_metrics.go line 202 at r2 (raw file):
Previously, xinhaoz (Xin Hao Zhang) wrote…
It seems like this has to be called when the cm is locked since it's accessing fields protected under
mu. Do you mind just renaming this tocreateMetricsLockedto signal that? I see that it's being called in LoadCertificates which locks it so the usages are good!
make sense! I'll make that naming change, definitely seems better
pkg/security/certificate_loader_test.go line 130 at r2 (raw file):
Previously, pritesh-lahoti wrote…
Should this initialization happen within the test?
This value might get set way before the test using it is actually being executed.
What's your preference? Right now this primarily used as a noop
pkg/security/certificate_metrics_test.go line 46 at r2 (raw file):
Previously, pritesh-lahoti wrote…
Could this also have the other node/client/tenant certificates?
yeah I'll give this a go
029bb3f to
842b982
Compare
|
Missing: update on the client certificate TTL aggregate gauge. Need to add that before merging |
842b982 to
5687239
Compare
pritesh-lahoti
left a comment
There was a problem hiding this comment.
Reviewed 2 of 3 files at r3, 4 of 4 files at r4, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aa-joshi, @angles-n-daemons, @arjunmahishi, and @xinhaoz)
pkg/security/cert_expiry_cache.go line 123 at r4 (raw file):
// GetTTL retrieves seconds till cert expiration for the given username, if it exists. // An expiration of 0 indicates an entry was not found.
Suggestion:
A TTL of 0pkg/security/cert_expiry_cache.go line 224 at r4 (raw file):
// timeNow returns the current time depending on the time source of the cache. func (c *ClientCertExpirationCache) timeNow() int64 {
nit: extra line
pkg/util/metric/aggmetric/gauge.go line 183 at r4 (raw file):
} // Updates the function on the gauge
nit
Suggestion:
// UpdateFn updates the function on the gauge.pkg/security/certificate_loader_test.go line 130 at r2 (raw file):
Previously, angles-n-daemons (Brian Dillmann) wrote…
What's your preference? Right now this primarily used as a noop
I believe moving it within the test will be resilient to potential flakiness.
Currently, cockroach only exposes point in time certificate expiration metrics. If the certificate is to expire 1 day from now, we expose a gauge `security.certificate.expiration.<cert-type>` which is the unix timestamp when it will expire. This PR also exposes a ttl metric `security.certificate.ttl.<cert-type>` so that consumers of this information can run operations based on their distance to certificate expiration without additional transformations. Additionally, this PR refactors how the expiration gauges are set, so that reads of the gauge directly reference the value of the certificate. Epic: CRDB-40209 Fixes: cockroachdb#77376 Release note (ops change): new metrics which expose the ttl for various certificates
5687239 to
c830973
Compare
pritesh-lahoti
left a comment
There was a problem hiding this comment.
Thank you for all the changes!
Reviewed 3 of 3 files at r5, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aa-joshi, @angles-n-daemons, @arjunmahishi, and @xinhaoz)
|
bors r+ |
The PR cockroachdb#130110 added certificate TTL metrics alongside our existing expiration metrics. Prior to that change, the certificate metrics values were updated on each metrics load. Afterwards, new metrics objects were created for each load of certificates. This created a bug in that the new expiration values would not be found in any of the system exhaust (metrics scrape or tsdb) because the registered metrics objects were the ones created on startup. This new change instead allows the metrics to close the whole CertificateManager object, so that they only need to be created once, and therefore the initial registration of metrics reflects persistently valid values. Release note (bug fix): security.certificate.* metrics will now be updated if a node loads new certificates while running.
The PR cockroachdb#130110 added certificate TTL metrics alongside our existing expiration metrics. Prior to that change, the certificate metrics values were updated on each metrics load. Afterwards, new metrics objects were created for each load of certificates. This created a bug in that the new expiration values would not be found in any of the system exhaust (metrics scrape or tsdb) because the registered metrics objects were the ones created on startup. This new change instead allows the metrics to close the whole CertificateManager object, so that they only need to be created once, and therefore the initial registration of metrics reflects persistently valid values. Release note (bug fix): security.certificate.* metrics will now be updated if a node loads new certificates while running.
135564: build/teamcity: add changes to enable openmetrics in nightly roachtests r=nameisbhaskar a=sambhav-jain-16 There was a regression in #135239. The change was reverted and this PR aims to fix the regressions Epic: https://cockroachlabs.atlassian.net/browse/CRDB-41852 Release note: None 135596: security: bugfix, ensure cert expiry metrics reflect reloaded certs r=angles-n-daemons a=angles-n-daemons security: bugfix, ensure cert expiry metrics reflect reloaded certs The PR #130110 added certificate TTL metrics alongside our existing expiration metrics. Prior to that change, the certificate metrics values were updated on each metrics load. Afterwards, new metrics objects were created for each load of certificates. This created a bug in that the new expiration values would not be found in any of the system exhaust (metrics scrape or tsdb) because the registered metrics objects were the ones created on startup. This new change instead allows the metrics to close the whole CertificateManager object, so that they only need to be created once, and therefore the initial registration of metrics reflects persistently valid values. Release note (bug fix): security.certificate.* metrics will now be updated if a node loads new certificates while running. Epic: none Fixes: #135093 136122: crosscluster/physical: update tokens when altering topology r=dt a=dt Release note: none. Epic: none. This regressed in #135637 which was assigning all conusmer sub-partitions the whole partition due to sharing the original token. 136182: kvserver: use leader leases in various flow control tests r=kvoli a=arulajmani See individual commits for details. Co-authored-by: Sambhav Jain <sambhav.jain@cockroachlabs.com> Co-authored-by: Brian Dillmann <brian.dillmann@cockroachlabs.com> Co-authored-by: David Taylor <tinystatemachine@gmail.com> Co-authored-by: Arul Ajmani <arulajmani@gmail.com>
The PR cockroachdb#130110 added certificate TTL metrics alongside our existing expiration metrics. Prior to that change, the certificate metrics values were updated on each metrics load. Afterwards, new metrics objects were created for each load of certificates. This created a bug in that the new expiration values would not be found in any of the system exhaust (metrics scrape or tsdb) because the registered metrics objects were the ones created on startup. This new change instead allows the metrics to close the whole CertificateManager object, so that they only need to be created once, and therefore the initial registration of metrics reflects persistently valid values. Release note (bug fix): security.certificate.* metrics will now be updated if a node loads new certificates while running.
The PR cockroachdb#130110 added certificate TTL metrics alongside our existing expiration metrics. Prior to that change, the certificate metrics values were updated on each metrics load. Afterwards, new metrics objects were created for each load of certificates. This created a bug in that the new expiration values would not be found in any of the system exhaust (metrics scrape or tsdb) because the registered metrics objects were the ones created on startup. This new change instead allows the metrics to close the whole CertificateManager object, so that they only need to be created once, and therefore the initial registration of metrics reflects persistently valid values. Release note (bug fix): security.certificate.* metrics will now be updated if a node loads new certificates while running.
The PR cockroachdb#130110 added certificate TTL metrics alongside our existing expiration metrics. Prior to that change, the certificate metrics values were updated on each metrics load. Afterwards, new metrics objects were created for each load of certificates. This created a bug in that the new expiration values would not be found in any of the system exhaust (metrics scrape or tsdb) because the registered metrics objects were the ones created on startup. This new change instead allows the metrics to close the whole CertificateManager object, so that they only need to be created once, and therefore the initial registration of metrics reflects persistently valid values. Release note (bug fix): security.certificate.* metrics will now be updated if a node loads new certificates while running.
Changes in cockroachdb#130110 were added to add labelled ttl metrics to client certificates. It achieved this by changing the system which cached certificate expiries to cache on a composite struct of two metrics, rather than just an expiration metric. The struct itself housed the metrics as inline values, rather than pointers, so updates were registered in the cached values only, and not the registry in which they were reporting. This means that updates to client certificate expirations would not be reflected by the ttl or expiration metrics. This ticket modifies those elements so that they are not copied when they are pulled from the cache. Fixes: cockroachdb#142681 Epic: CRDB-40209 Release note (bug fix): Fixes bug in client certificate expiration metrics.
142682: security: cache certificate expiration metrics as pointers r=angles-n-daemons a=angles-n-daemons security: cache certificate expiration metrics as pointers Changes in #130110 were added to add labelled ttl metrics to client certificates. It achieved this by changing the system which cached certificate expiries to cache on a composite struct of two metrics, rather than just an expiration metric. The struct itself housed the metrics as inline values, rather than pointers, so updates were registered in the cached values only, and not the registry in which they were reporting. This means that updates to client certificate expirations would not be reflected by the ttl or expiration metrics. This ticket modifies those elements so that they are not copied when they are pulled from the cache. Fixes: #142681 Epic: CRDB-40209 Release note (bug fix): Fixes bug in client certificate expiration metrics. Co-authored-by: Brian Dillmann <brian.dillmann@cockroachlabs.com>
Changes in #130110 were added to add labelled ttl metrics to client certificates. It achieved this by changing the system which cached certificate expiries to cache on a composite struct of two metrics, rather than just an expiration metric. The struct itself housed the metrics as inline values, rather than pointers, so updates were registered in the cached values only, and not the registry in which they were reporting. This means that updates to client certificate expirations would not be reflected by the ttl or expiration metrics. This ticket modifies those elements so that they are not copied when they are pulled from the cache. Fixes: #142681 Epic: CRDB-40209 Release note (bug fix): Fixes bug in client certificate expiration metrics.
Changes in cockroachdb#130110 were added to add labelled ttl metrics to client certificates. It achieved this by changing the system which cached certificate expiries to cache on a composite struct of two metrics, rather than just an expiration metric. The struct itself housed the metrics as inline values, rather than pointers, so updates were registered in the cached values only, and not the registry in which they were reporting. This means that updates to client certificate expirations would not be reflected by the ttl or expiration metrics. This ticket modifies those elements so that they are not copied when they are pulled from the cache. Fixes: cockroachdb#142681 Epic: CRDB-40209 Release note (bug fix): Fixes bug in client certificate expiration metrics.
Changes in cockroachdb#130110 were added to add labelled ttl metrics to client certificates. It achieved this by changing the system which cached certificate expiries to cache on a composite struct of two metrics, rather than just an expiration metric. The struct itself housed the metrics as inline values, rather than pointers, so updates were registered in the cached values only, and not the registry in which they were reporting. This means that updates to client certificate expirations would not be reflected by the ttl or expiration metrics. This ticket modifies those elements so that they are not copied when they are pulled from the cache. Fixes: cockroachdb#142681 Epic: CRDB-40209 Release note (bug fix): Fixes bug in client certificate expiration metrics.
ops: add ttl metrics for certificate expiration
Currently, cockroach only exposes point in time certificate
expiration metrics. If the certificate is to expire 1 day from now,
we expose a gauge
security.certificate.expiration.<cert-type>which is the unix timestamp when it will expire. This PR also
exposes a ttl metric
security.certificate.ttl.<cert-type>so thatconsumers of this information can run operations based on their
distance to certificate expiration without additional
transformations.
Additionally, this PR refactors how the expiration gauges are set,
so that reads of the gauge directly reference the value of the
certificate.
Epic: CRDB-40209
Fixes: #77376
Release note (ops change): new metrics which expose the ttl for various
certificates