Enable VSecM Sentinel Init Command to Wait Until VSecM Safe is Healthy #577

v0lkan · 2024-03-01T16:38:48Z

Enable VSecM Sentinel Init Command to Wait Until VSecM Safe is Healthy

Description

Until now, the VSecM Sentinel init command stanza executed as soon as VSecM Sentinel booted up. That resulted in a race condition because it was highly likely that VSecM Safe was not yet ready to receive messages. We worked around this by introducing the sleep: pragma as a temporary hack.

For example, sleep:30000 instructs VSecM Sentinel to wait 30 more seconds before running the init scripts, which saves the day but is suboptimal.

This PR makes VSecM Sentinel try acquiring an identity from the workload API (hence ensure that VSecM Safe is in good shape) before proceeding with the rest of the commands.

This is still not enough, but it gives a very optimistic guarantee that by the time Sentinel fetches an ID, Safe will already be ready. That is a step in the right direction.

As a follow-up PR, I’ll enable a direct health check that calls a /health endpoint of VSecM Safe to be 100% sure that VSecM Safe is ready before proceeding, but I wanted to get this in first.

Changes

List the major changes you have made in bullet points:

A loop in RunInitCommands() that waits for the workload API to be ready and gives up after a timeout (default: 300 seconds).
Moved the “acquire source” logic to “core” (instead of “internal”), as others might leverage it too.
Renamed methods for clarity and consistency.

Test Policy Compliance

Unit tests still need to be added; I haven’t added them yet.

Code Quality

I have followed the coding standards for this project.
I have performed a self-review of my code.
My code is well-commented, particularly in areas that may be difficult
to understand.

Documentation

I have made corresponding changes to the documentation (if applicable).
I have updated any relevant READMEs or wiki pages.

Checklist

Before you submit this PR, please make sure:

You have read the contributing guidelines and
especially the test policy.
You have thoroughly tested your changes.
You have followed all the contributing guidelines for this project.
You understand and agree that your contributions will be publicly available
under the project’s license.

By submitting this pull request, you confirm that your contribution is made under
the terms of the project’s license and that you have the authority to grant
these rights.

Thank you for your contribution to VMware Secrets Manager
🐢⚡️!

Signed-off-by: Volkan Özçelik <ovolkan@vmware.com>

v0lkan · 2024-03-01T16:40:46Z

app/keygen/cmd/decrypt.go

@@ -37,7 +37,7 @@ func rootKeyTriplet(content string) (string, string, string) {
 }

 func keys() (string, string, string) {
- p := env.KeyGenRootKeyPath()


I renamed the getters for Safe because “Safe” read like an adjective in, say, SafeSpiffeId(), implying that the SPIFFE ID is safe or secure, which is not the case. The SPIFFE ID is an ID destined for Safe, so it is SpiffeIdForSafe().

After renaming the Safe env getters, the rest were inconsistent, so I renamed those too for consistency and alignment.

v0lkan · 2024-03-01T16:41:16Z

app/sentinel/busywait/initialization/run.go

@@ -53,8 +55,38 @@ import (
 func RunInitCommands(ctx context.Context) {
 cid := ctx.Value("correlationId").(*string)

+ src, acquired := spiffe.AcquireSourceForSentinel(ctx)


This is the actual fix.

v0lkan · 2024-03-01T16:42:44Z

app/sentinel/busywait/initialization/run.go

+ for {
+ select {
+ case <-timeoutCtx.Done():
+ log.ErrorLn(cid, "Failed to acquire source at RunInitCommands (1)")


I considered using Fatalln() here to kill the app, but even if it cannot run the init commands, it might still function, so I favored stability over reliability.

Ideally, this case should never happen anyway.

v0lkan · 2024-03-01T16:43:06Z

app/sentinel/internal/safe/get.go

 "github.com/vmware-tanzu/secrets-manager/core/env"
 log "github.com/vmware-tanzu/secrets-manager/core/log/rpc"
 "github.com/vmware-tanzu/secrets-manager/core/validation"
 )

-func acquireSource(ctx context.Context) (*workloadapi.X509Source, bool) {


Made this function public.

v0lkan · 2024-03-01T16:44:06Z

core/env/sentinel.go

+func InitCommandRunnerWaitTimeoutForSentinel() time.Duration {
+ p := os.Getenv("VSECM_SENTINEL_INIT_COMMAND_RUNNER_WAIT_TIMEOUT")
+ if p == "" {
+ p = "300000"


300 seconds is the default timeout for Sentinel to wait.

It typically takes 30 seconds for everything to reconcile, so 300 seconds is a more than generous wait time.

v0lkan added 2 commits March 1, 2024 00:31

🌟 enhancement(VSecM Sentinel): RunInitCommand() waits for Safe

8e2e84b

Signed-off-by: Volkan Özçelik <ovolkan@vmware.com>

use timeout from env

19226b4

Signed-off-by: Volkan Özçelik <ovolkan@vmware.com>

v0lkan requested a review from farhan-pasha as a code owner March 1, 2024 16:38

v0lkan merged commit 8fd4b59 into main Mar 1, 2024

v0lkan deleted the ovolkan/init-command-wait branch March 1, 2024 16:59
