Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: batch and use a WaitGroup for loadCerts #4

Merged
merged 16 commits into from
Aug 25, 2023

Conversation

wbollock
Copy link
Collaborator

This is a large batch of changes, but most of them help the main goal, which was to significantly help performance of loadCerts on large Vault installations (180,000 certificates) through concurrency, fan-out, and batching.

I saw anywhere from 7x to 30x performance improvements, especially on remote exporter installations.

I did include some technically unrelated fixes but think they're fine to bunch together.

There are mostly atomic commits but significant changes include:

  • 5fbe79d - added histogram metrics x509_load_certs_duration_seconds_bucket and x509_watch_certs_duration_seconds_bucket to have a better understanding of performance
  • abf36aa - fixed forever increasing expired cert count metric
  • 5ee3b38 - fixed metric resetting on new watch cert runs
  • 1d6b1d2 - use batching and wait groups to significantly speed up Vault
  • 85eb9dc - new --batch-size-percent float CLI argument (default 1%)

This will make it easier to see the impact of performance fixes on these
endpoints and help debug slow vault pki exporter runs
This prevents `x509_expired_cert_count` from creeping up despite certs
not continuing to expire
There's no reason to reset the gauges and this hurts the exporter by
clearing all metrics between runs leading to empty scrapes and stale
time series
This is meant to increase the efficiency of loadCerts by batching them
and using a WaitGroup to further utilize goRoutines with these lookups.
It's a ton of API requests especially per certificate so I hope this
will speed things up.
Adds a lot more checks to protect against bad or incomplete requests
this makes the loadCert batch size configuration via the
`--batch-size-percent` variable with a default value of 1%. I found that
to be pretty good on my large Vault installation
updates go to 1.21.0 and all dependecies
no need to push docker images on every PR
cmd/main.go Outdated Show resolved Hide resolved
pkg/vault-mon/pki.go Outdated Show resolved Hide resolved
pkg/vault-mon/prometheus.go Outdated Show resolved Hide resolved
wbollock and others added 4 commits August 25, 2023 10:05
so we don't have to pass around the reference
Co-authored-by: Will Hegedus <will@wbhegedus.me>
@wbollock wbollock requested a review from wbh1 August 25, 2023 14:33
@wbollock wbollock merged commit 532ed38 into linode-obs:master Aug 25, 2023
@wbollock wbollock deleted the fix/metric_expiry_ branch August 25, 2023 16:15
@wbollock wbollock restored the fix/metric_expiry_ branch August 25, 2023 17:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants