Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LRU cache enabled agent, healthcheck API does not respond the status if the attestor plugin returns error #4827

Closed
hiyosi opened this issue Jan 23, 2024 · 4 comments · Fixed by #4852
Assignees
Labels
priority/urgent Issue is approved and is must be completed in the assigned milestone
Milestone

Comments

@hiyosi
Copy link
Contributor

hiyosi commented Jan 23, 2024

  • Version: nightly
  • Platform: linux
  • Subsystem: agent

In case of the agent with LRU cache enabled, healthcheck API does not return response if the workload attestation is failed.

> spire-agent healthcheck -verbose
Checking agent health...

// never returned response

My quick debug, it seems to be blocked because updates are not passed at the following codes.

for {
select {
case update := <-subscriber.Updates():
update.Identities = filterIdentities(update.Identities, log)
if err := sendX509SVIDResponse(update, stream, log, quietLogging); err != nil {
return err
}
case <-ctx.Done():
return nil
}
}

In v1.8.7( or LRU cache disabled), even if the workload attestor returns error, healthcheck endpoint returns healthy.

Should we avoid at least a situation the response is blocked?

@evan2645 evan2645 added the triage/in-progress Issue triage is in progress label Jan 23, 2024
@evan2645
Copy link
Member

Thank you very much @hiyosi for tracking nightlies and reporting this 🙏

@amartinezfayo amartinezfayo added this to the 1.9.0 milestone Jan 23, 2024
@amartinezfayo amartinezfayo added priority/urgent Issue is approved and is must be completed in the assigned milestone and removed triage/in-progress Issue triage is in progress labels Jan 23, 2024
@rturner3
Copy link
Collaborator

rturner3 commented Jan 24, 2024

A few ideas I can think of to consider off the top of my head:

  • Change the agent health check implementation to no longer depend on FetchX509SVID, since workload attestation is not guaranteed to produce selectors that match a registration entry authorized to the agent anyway. There is already a difference in behavior between the /live and /ready health check endpoints from the gRPC health API used by spire-agent healthcheck. The former APIs rely on FetchX509Bundles instead of FetchX509SVID, which I believe shouldn't exhibit the same behavior you described. Perhaps we should just consolidate both implementations to depend on FetchX509Bundles.
  • Preserve the old behavior of FetchX509SVID to return earlier by pushing updates over the cache subscriber channel using some predefined timeout in the agent code. This could be trickier to implement correctly for the reasons explained here

@azdagron Any thoughts on this since you've been more actively involved with health checking in SPIRE?

@rturner3
Copy link
Collaborator

Another observation I wanted to point out is that I was unable to reproduce on Linux or macOS by building/running the SPIRE binaries locally, but this does seem to be reproducible when running on a local kind K8s cluster.

@MarcosDY
Copy link
Collaborator

LRU cache is unable to work properly when attestation result in no selectors,
It is possible to reproduce changing agent config to work with a Workloadattestor that must not generate selectors like:

diff --git a/conf/agent/agent.conf b/conf/agent/agent.conf
index cf1fcb353..32a1ef9d0 100644
--- a/conf/agent/agent.conf
+++ b/conf/agent/agent.conf
@@ -18,7 +18,7 @@ plugins {
             directory = "./.data"
         }
     }
-    WorkloadAttestor "unix" {
+    WorkloadAttestor "docker" {
         plugin_data {
         }
     }

and then run a fetch x509

./bin/spire-agent api fetch
rpc error: code = DeadlineExceeded desc = context deadline exceeded

that will result in LRU never returning an update and cause timouts,
It is possible to reproduce in SUITES="suites/k8s" make integration and getting timeouts.
It is not clear to me why that test is passing in CI...

As side note...

Old cache was notifying as soon as a subscriber is created link,
but in LRU that is not happening link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority/urgent Issue is approved and is must be completed in the assigned milestone
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants