Enable VSecM Sentinel Init Command to Wait Until VSecM Safe is Healthy #577

Merged
merged 2 commits into from
Mar 1, 2024


@v0lkan v0lkan commented Mar 1, 2024

Enable VSecM Sentinel Init Command to Wait Until VSecM Safe is Healthy

Description

Until now, the VSecM Sentinel init command stanza executed as soon as VSecM Sentinel booted up. That caused a race condition, because VSecM Safe was very likely not yet ready to receive messages. We worked around this by introducing the sleep: pragma as a temporary hack.

For example, sleep:30000 instructs VSecM Sentinel to wait 30 more seconds before running the init scripts, which saves the day but is suboptimal.

This PR makes VSecM Sentinel try to acquire an identity from the workload API (and hence ensure that VSecM Safe is in good shape) before proceeding with the rest of the commands.

This is still not enough, but it gives an optimistic guarantee that by the time Sentinel fetches an ID, Safe will already be ready. That is a step in the right direction.

In a follow-up PR, I’ll add a direct health check that calls a /health endpoint of VSecM Safe to fully ensure that VSecM Safe is ready before proceeding, but I wanted to get this in first.

Changes

  • A loop in RunInitCommands() that waits for the workload API to be ready and gives up after a timeout (default: 300 seconds).
  • Move the “acquire source” logic to “core” (instead of “internal”) as others might leverage that too.
  • Method renaming for clarity and consistency.

Test Policy Compliance

Unit tests still need to be added; I haven’t added them yet.

Code Quality

  • I have followed the coding standards for this project.
  • I have performed a self-review of my code.
  • My code is well-commented, particularly in areas that may be difficult
    to understand.

Documentation

  • I have made corresponding changes to the documentation (if applicable).
  • I have updated any relevant READMEs or wiki pages.

Checklist

Before you submit this PR, please make sure:

  • You have read the contributing guidelines and
    especially the test policy.
  • You have thoroughly tested your changes.
  • You have followed all the contributing guidelines for this project.
  • You understand and agree that your contributions will be publicly available
    under the project’s license.

By submitting this pull request, you confirm that your contribution is made under
the terms of the project’s license and that you have the authority to grant
these rights.


Thank you for your contribution to VMware Secrets Manager
🐢⚡️!

Signed-off-by: Volkan Özçelik <ovolkan@vmware.com>
Signed-off-by: Volkan Özçelik <ovolkan@vmware.com>
@@ -37,7 +37,7 @@ func rootKeyTriplet(content string) (string, string, string) {
}

func keys() (string, string, string) {
	p := env.KeyGenRootKeyPath()

I renamed the getters for Safe because “Safe” read like an adjective in, say, SafeSpiffeId(), implying that the SPIFFE ID is safe or secure, which is not the case. The SPIFFE ID is an ID destined for Safe, so it is now SpiffeIdForSafe().

After renaming the Safe env getters, the rest were inconsistent, so I renamed those too, for consistency and alignment.

@@ -53,8 +55,38 @@ import (
func RunInitCommands(ctx context.Context) {
	cid := ctx.Value("correlationId").(*string)

	src, acquired := spiffe.AcquireSourceForSentinel(ctx)

This is the actual fix.

for {
	select {
	case <-timeoutCtx.Done():
		log.ErrorLn(cid, "Failed to acquire source at RunInitCommands (1)")

I considered using Fatalln() here to kill the app, but even if it cannot run the init commands, it might still function, so I chose to log the error and keep the app running rather than fail hard.

This case should ideally never happen anyway.

	"github.com/vmware-tanzu/secrets-manager/core/env"
	log "github.com/vmware-tanzu/secrets-manager/core/log/rpc"
	"github.com/vmware-tanzu/secrets-manager/core/validation"
)

func acquireSource(ctx context.Context) (*workloadapi.X509Source, bool) {

Made this function public.

func InitCommandRunnerWaitTimeoutForSentinel() time.Duration {
	p := os.Getenv("VSECM_SENTINEL_INIT_COMMAND_RUNNER_WAIT_TIMEOUT")
	if p == "" {
		p = "300000"

300 seconds is the default timeout for Sentinel to wait.

It typically takes 30 seconds for everything to reconcile, so 300 seconds is a more than generous wait time.

@v0lkan v0lkan merged commit 8fd4b59 into main Mar 1, 2024
@v0lkan v0lkan deleted the ovolkan/init-command-wait branch March 1, 2024 16:59