Skip to content

Comments

ops: add ttl metrics for certificate expiration#130110

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
angles-n-daemons:certificate-ttl-metrics
Sep 13, 2024
Merged

ops: add ttl metrics for certificate expiration#130110
craig[bot] merged 1 commit intocockroachdb:masterfrom
angles-n-daemons:certificate-ttl-metrics

Conversation

@angles-n-daemons
Copy link
Contributor

@angles-n-daemons angles-n-daemons commented Sep 4, 2024

ops: add ttl metrics for certificate expiration

Currently, cockroach only exposes point in time certificate
expiration metrics. If the certificate is to expire 1 day from now,
we expose a gauge security.certificate.expiration.<cert-type>
which is the unix timestamp when it will expire. This PR also
exposes a ttl metric security.certificate.ttl.<cert-type> so that
consumers of this information can run operations based on their
distance to certificate expiration without additional
transformations.

Additionally, this PR refactors how the expiration gauges are set,
so that reads of the gauge directly reference the value of the
certificate.

Epic: CRDB-40209
Fixes: #77376

Release note (ops change): new metrics which expose the ttl for various
certificates

@angles-n-daemons angles-n-daemons requested a review from a team September 4, 2024 20:29
@angles-n-daemons angles-n-daemons requested a review from a team as a code owner September 4, 2024 20:29
@angles-n-daemons angles-n-daemons requested a review from a team September 4, 2024 20:29
@cockroach-teamcity
Copy link
Member

This change is Reviewable

)

func makeMetrics() Metrics {
func expirationGauge(metadata metric.Metadata, ci *CertInfo) *metric.Gauge {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these lock the certificate manager while they're being read?

// the correct expiration and ttl values are set on the metrics vars that are
// exposed to our collectors. It uses a single key pair to approximate the
// behavior across other key pairs.
func TestMetricsValues(t *testing.T) {
Copy link
Contributor Author

@angles-n-daemons angles-n-daemons Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only testing the CA certs for TTL and expiration, open to going beyond that as well

return "Client"
case TenantPem:
return "Tenant Client"
case TenantSigningPem:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing string value

Copy link
Contributor

@xinhaoz xinhaoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Overall changes look good, just some small comments on naming and commit hygiene.

Reviewed 5 of 6 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @angles-n-daemons)


-- commits line 2 at r2:
Please add the headling ops:... to the commit message too.


-- commits line 7 at r2:
I'm a little confused about this release note - not sure if this security: part is misleading. It seems like this release note is more suitable as the commit message body. We don't need to add any comments on refactoring in release notes - they are for documenting user facing changes.
https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/73072807/Git+Commit+Messages


pkg/security/certificate_metrics.go line 175 at r1 (raw file):

Previously, angles-n-daemons (Brian Dillmann) wrote…

Should these lock the certificate manager while they're being read?

This fn seems fine, as it's not actually reading from the cm directly and the ci passed in isn't expected to change. Left a comment on the caller of this fn regarding the locking.


pkg/security/certificate_metrics.go line 132 at r2 (raw file):

	metaClientTTL = metric.Metadata{
		Name:        "security.certificate.ttl.client",
		Help:        "Seconds till expiration for the client certificates., labeled by SQL user. 0 means expired, no certificate or error.",

nit: extra . after certificates


pkg/security/certificate_metrics.go line 202 at r2 (raw file):

// metric to zero.
// cm.mu must be held to protect the certificates. Metrics do their own atomicity.
func (cm *CertificateManager) createMetrics() Metrics {

It seems like this has to be called when the cm is locked since it's accessing fields protected under mu. Do you mind just renaming this to createMetricsLocked to signal that? I see that it's being called in LoadCertificates which locks it so the usages are good!

Copy link
Contributor

@pritesh-lahoti pritesh-lahoti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 6 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @angles-n-daemons and @xinhaoz)


pkg/security/certificate_manager.go line 115 at r2 (raw file):

	tenantIdentifier uint64

	// timeSource, if set, specifies the time source with which the metrics are set

nit

Suggestion:

set.

pkg/security/certificate_manager.go line 132 at r2 (raw file):

// WithTimeSource allows the caller to pass a time source to be used
// by the Metrics struct (mostly for testing)

nit

Suggestion:

(mostly for testing).

pkg/security/certificate_metrics.go line 160 at r2 (raw file):

		Unit:        metric.Unit_TIMESTAMP_SEC,
	}

nit: extra new line


pkg/security/certificate_loader_test.go line 130 at r2 (raw file):

}

var defaultExpiration = timeutil.Now().Add(time.Hour)

Should this initialization happen within the test?
This value might get set way before the test using it is actually being executed.


pkg/security/certificate_metrics_test.go line 46 at r2 (raw file):

		{"ca.crt", 0777, caCert},
		{"ca.key", 0777, []byte("ca.key")},
	} {

Could this also have the other node/client/tenant certificates?

@angles-n-daemons angles-n-daemons changed the title ops: add ttl metrics for certificate expiration obs: add ttl metrics for certificate expiration Sep 10, 2024
@angles-n-daemons angles-n-daemons changed the title obs: add ttl metrics for certificate expiration ops: add ttl metrics for certificate expiration Sep 10, 2024
Copy link
Contributor Author

@angles-n-daemons angles-n-daemons left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pritesh-lahoti and @xinhaoz)


-- commits line 2 at r2:

Previously, xinhaoz (Xin Hao Zhang) wrote…

Please add the headling ops:... to the commit message too.

ah makes sense, will do


-- commits line 7 at r2:

Previously, xinhaoz (Xin Hao Zhang) wrote…

I'm a little confused about this release note - not sure if this security: part is misleading. It seems like this release note is more suitable as the commit message body. We don't need to add any comments on refactoring in release notes - they are for documenting user facing changes.
https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/73072807/Git+Commit+Messages

gotcha, thanks for pointing this out


pkg/security/certificate_manager.go line 115 at r2 (raw file):

Previously, pritesh-lahoti wrote…

nit

👀


pkg/security/certificate_manager.go line 132 at r2 (raw file):

Previously, pritesh-lahoti wrote…

nit

👀


pkg/security/certificate_metrics.go line 132 at r2 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

nit: extra . after certificates

👀


pkg/security/certificate_metrics.go line 160 at r2 (raw file):

Previously, pritesh-lahoti wrote…

nit: extra new line

👀


pkg/security/certificate_metrics.go line 202 at r2 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

It seems like this has to be called when the cm is locked since it's accessing fields protected under mu. Do you mind just renaming this to createMetricsLocked to signal that? I see that it's being called in LoadCertificates which locks it so the usages are good!

make sense! I'll make that naming change, definitely seems better


pkg/security/certificate_loader_test.go line 130 at r2 (raw file):

Previously, pritesh-lahoti wrote…

Should this initialization happen within the test?
This value might get set way before the test using it is actually being executed.

What's your preference? Right now this primarily used as a noop


pkg/security/certificate_metrics_test.go line 46 at r2 (raw file):

Previously, pritesh-lahoti wrote…

Could this also have the other node/client/tenant certificates?

yeah I'll give this a go

@angles-n-daemons
Copy link
Contributor Author

Missing: update on the client certificate TTL aggregate gauge. Need to add that before merging

@angles-n-daemons angles-n-daemons requested review from a team as code owners September 12, 2024 01:56
@angles-n-daemons angles-n-daemons requested review from aa-joshi and arjunmahishi and removed request for a team September 12, 2024 01:56
Copy link
Contributor

@pritesh-lahoti pritesh-lahoti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 3 files at r3, 4 of 4 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aa-joshi, @angles-n-daemons, @arjunmahishi, and @xinhaoz)


pkg/security/cert_expiry_cache.go line 123 at r4 (raw file):

// GetTTL retrieves seconds till cert expiration for the given username, if it exists.
// An expiration of 0 indicates an entry was not found.

Suggestion:

A TTL of 0

pkg/security/cert_expiry_cache.go line 224 at r4 (raw file):

// timeNow returns the current time depending on the time source of the cache.
func (c *ClientCertExpirationCache) timeNow() int64 {

nit: extra line


pkg/util/metric/aggmetric/gauge.go line 183 at r4 (raw file):

}

// Updates the function on the gauge

nit

Suggestion:

// UpdateFn updates the function on the gauge.

pkg/security/certificate_loader_test.go line 130 at r2 (raw file):

Previously, angles-n-daemons (Brian Dillmann) wrote…

What's your preference? Right now this primarily used as a noop

I believe moving it within the test will be resilient to potential flakiness.

Currently, cockroach only exposes point in time certificate
expiration metrics. If the certificate is to expire 1 day from now,
we expose a gauge `security.certificate.expiration.<cert-type>`
which is the unix timestamp when it will expire. This PR also
exposes a ttl metric `security.certificate.ttl.<cert-type>` so that
consumers of this information can run operations based on their
distance to certificate expiration without additional
transformations.

Additionally, this PR refactors how the expiration gauges are set,
so that reads of the gauge directly reference the value of the
certificate.

Epic: CRDB-40209
Fixes: cockroachdb#77376

Release note (ops change): new metrics which expose the ttl for various
certificates
Copy link
Contributor

@pritesh-lahoti pritesh-lahoti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for all the changes!

Reviewed 3 of 3 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aa-joshi, @angles-n-daemons, @arjunmahishi, and @xinhaoz)

Copy link
Contributor

@xinhaoz xinhaoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: !
Remember to update your PR description to sync with the commit message .

@angles-n-daemons
Copy link
Contributor Author

bors r+

@craig craig bot merged commit 66edff7 into cockroachdb:master Sep 13, 2024
angles-n-daemons added a commit to angles-n-daemons/cockroach that referenced this pull request Nov 19, 2024
The PR cockroachdb#130110 added certificate TTL metrics alongside our existing
expiration metrics. Prior to that change, the certificate metrics values
were updated on each metrics load. Afterwards, new metrics objects were
created for each load of certificates.

This created a bug in that the new expiration values would not be
found in any of the system exhaust (metrics scrape or tsdb) because the
registered metrics objects were the ones created on startup.

This new change instead allows the metrics to close the whole
CertificateManager object, so that they only need to be created once, and
therefore the initial registration of metrics reflects persistently
valid values.

Release note (bug fix): security.certificate.* metrics will now be
updated if a node loads new certificates while running.
angles-n-daemons added a commit to angles-n-daemons/cockroach that referenced this pull request Nov 22, 2024
The PR cockroachdb#130110 added certificate TTL metrics alongside our existing
expiration metrics. Prior to that change, the certificate metrics values
were updated on each metrics load. Afterwards, new metrics objects were
created for each load of certificates.

This created a bug in that the new expiration values would not be
found in any of the system exhaust (metrics scrape or tsdb) because the
registered metrics objects were the ones created on startup.

This new change instead allows the metrics to close the whole
CertificateManager object, so that they only need to be created once, and
therefore the initial registration of metrics reflects persistently
valid values.

Release note (bug fix): security.certificate.* metrics will now be
updated if a node loads new certificates while running.
craig bot pushed a commit that referenced this pull request Nov 26, 2024
135564: build/teamcity: add changes to enable openmetrics in nightly roachtests r=nameisbhaskar a=sambhav-jain-16

There was a regression in #135239.
The change was reverted and this PR aims to fix the regressions

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-41852

Release note: None

135596: security: bugfix, ensure cert expiry metrics reflect reloaded certs r=angles-n-daemons a=angles-n-daemons

security: bugfix, ensure cert expiry metrics reflect reloaded certs

The PR #130110 added certificate TTL metrics alongside our existing expiration metrics. Prior to that change, the certificate metrics values were updated on each metrics load. Afterwards, new metrics objects were created for each load of certificates.

This created a bug in that the new expiration values would not be found in any of the system exhaust (metrics scrape or tsdb) because the registered metrics objects were the ones created on startup.

This new change instead allows the metrics to close the whole CertificateManager object, so that they only need to be created once, and therefore the initial registration of metrics reflects persistently valid values.

Release note (bug fix): security.certificate.* metrics will now be updated if a node loads new certificates while running.

Epic: none
Fixes: #135093

136122: crosscluster/physical: update tokens when altering topology r=dt a=dt

Release note: none.
Epic: none.

This regressed in #135637 which was assigning all conusmer sub-partitions the whole partition due to sharing the original token.

136182: kvserver: use leader leases in various flow control tests r=kvoli a=arulajmani

See individual commits for details. 

Co-authored-by: Sambhav Jain <sambhav.jain@cockroachlabs.com>
Co-authored-by: Brian Dillmann <brian.dillmann@cockroachlabs.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: Arul Ajmani <arulajmani@gmail.com>
angles-n-daemons added a commit to angles-n-daemons/cockroach that referenced this pull request Nov 26, 2024
The PR cockroachdb#130110 added certificate TTL metrics alongside our existing
expiration metrics. Prior to that change, the certificate metrics values
were updated on each metrics load. Afterwards, new metrics objects were
created for each load of certificates.

This created a bug in that the new expiration values would not be
found in any of the system exhaust (metrics scrape or tsdb) because the
registered metrics objects were the ones created on startup.

This new change instead allows the metrics to close the whole
CertificateManager object, so that they only need to be created once, and
therefore the initial registration of metrics reflects persistently
valid values.

Release note (bug fix): security.certificate.* metrics will now be
updated if a node loads new certificates while running.
angles-n-daemons added a commit to angles-n-daemons/cockroach that referenced this pull request Nov 26, 2024
The PR cockroachdb#130110 added certificate TTL metrics alongside our existing
expiration metrics. Prior to that change, the certificate metrics values
were updated on each metrics load. Afterwards, new metrics objects were
created for each load of certificates.

This created a bug in that the new expiration values would not be
found in any of the system exhaust (metrics scrape or tsdb) because the
registered metrics objects were the ones created on startup.

This new change instead allows the metrics to close the whole
CertificateManager object, so that they only need to be created once, and
therefore the initial registration of metrics reflects persistently
valid values.

Release note (bug fix): security.certificate.* metrics will now be
updated if a node loads new certificates while running.
angles-n-daemons added a commit to angles-n-daemons/cockroach that referenced this pull request Jan 9, 2025
The PR cockroachdb#130110 added certificate TTL metrics alongside our existing
expiration metrics. Prior to that change, the certificate metrics values
were updated on each metrics load. Afterwards, new metrics objects were
created for each load of certificates.

This created a bug in that the new expiration values would not be
found in any of the system exhaust (metrics scrape or tsdb) because the
registered metrics objects were the ones created on startup.

This new change instead allows the metrics to close the whole
CertificateManager object, so that they only need to be created once, and
therefore the initial registration of metrics reflects persistently
valid values.

Release note (bug fix): security.certificate.* metrics will now be
updated if a node loads new certificates while running.
angles-n-daemons added a commit to angles-n-daemons/cockroach that referenced this pull request Mar 12, 2025
Changes in cockroachdb#130110 were added to add labelled ttl metrics to
client certificates. It achieved this by changing the system
which cached certificate expiries to cache on a composite
struct of two metrics, rather than just an expiration
metric.

The struct itself housed the metrics as inline values,
rather than pointers, so updates were registered in the
cached values only, and not the registry in which they were
reporting. This means that updates to client certificate
expirations would not be reflected by the ttl or expiration
metrics.

This ticket modifies those elements so that they are not
copied when they are pulled from the cache.

Fixes: cockroachdb#142681
Epic: CRDB-40209

Release note (bug fix): Fixes bug in client certificate expiration metrics.
craig bot pushed a commit that referenced this pull request Mar 13, 2025
142682: security:	cache certificate expiration metrics as pointers r=angles-n-daemons a=angles-n-daemons

security:	cache certificate expiration metrics as pointers

Changes in #130110 were added to add labelled ttl metrics to client certificates. It achieved this by changing the system which cached certificate expiries to cache on a composite struct of two metrics, rather than just an expiration metric.

The struct itself housed the metrics as inline values, rather than pointers, so updates were registered in the cached values only, and not the registry in which they were reporting. This means that updates to client certificate expirations would not be reflected by the ttl or expiration metrics.

This ticket modifies those elements so that they are not copied when they are pulled from the cache.

Fixes: #142681
Epic: CRDB-40209

Release note (bug fix): Fixes bug in client certificate expiration metrics.

Co-authored-by: Brian Dillmann <brian.dillmann@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this pull request Mar 13, 2025
Changes in #130110 were added to add labelled ttl metrics to
client certificates. It achieved this by changing the system
which cached certificate expiries to cache on a composite
struct of two metrics, rather than just an expiration
metric.

The struct itself housed the metrics as inline values,
rather than pointers, so updates were registered in the
cached values only, and not the registry in which they were
reporting. This means that updates to client certificate
expirations would not be reflected by the ttl or expiration
metrics.

This ticket modifies those elements so that they are not
copied when they are pulled from the cache.

Fixes: #142681
Epic: CRDB-40209

Release note (bug fix): Fixes bug in client certificate expiration metrics.
angles-n-daemons added a commit to angles-n-daemons/cockroach that referenced this pull request Mar 14, 2025
Changes in cockroachdb#130110 were added to add labelled ttl metrics to
client certificates. It achieved this by changing the system
which cached certificate expiries to cache on a composite
struct of two metrics, rather than just an expiration
metric.

The struct itself housed the metrics as inline values,
rather than pointers, so updates were registered in the
cached values only, and not the registry in which they were
reporting. This means that updates to client certificate
expirations would not be reflected by the ttl or expiration
metrics.

This ticket modifies those elements so that they are not
copied when they are pulled from the cache.

Fixes: cockroachdb#142681
Epic: CRDB-40209

Release note (bug fix): Fixes bug in client certificate expiration metrics.
angles-n-daemons added a commit to angles-n-daemons/cockroach that referenced this pull request Mar 14, 2025
Changes in cockroachdb#130110 were added to add labelled ttl metrics to
client certificates. It achieved this by changing the system
which cached certificate expiries to cache on a composite
struct of two metrics, rather than just an expiration
metric.

The struct itself housed the metrics as inline values,
rather than pointers, so updates were registered in the
cached values only, and not the registry in which they were
reporting. This means that updates to client certificate
expirations would not be reflected by the ttl or expiration
metrics.

This ticket modifies those elements so that they are not
copied when they are pulled from the cache.

Fixes: cockroachdb#142681
Epic: CRDB-40209

Release note (bug fix): Fixes bug in client certificate expiration metrics.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

linter: Add PR check to encourage including issue or epic refs in the PR/commit messages

4 participants