LRU cache enabled agent, healthcheck API does not respond the status if the attestor plugin returns error #4827

hiyosi · 2024-01-23T07:09:33Z

Version: nightly
Platform: linux
Subsystem: agent

When the agent has the LRU cache enabled, the healthcheck API never returns a response if workload attestation fails.

> spire-agent healthcheck -verbose
Checking agent health...

// never returned response

From some quick debugging, it appears to block because no updates are ever delivered to the subscriber in the following code:

spire/pkg/agent/endpoints/workload/handler.go

Lines 232 to 242 in 2d8555c

    
    for {
        select {
        case update := <-subscriber.Updates():
            update.Identities = filterIdentities(update.Identities, log)
            if err := sendX509SVIDResponse(update, stream, log, quietLogging); err != nil {
                return err
            }
        case <-ctx.Done():
            return nil
        }
    }

In v1.8.7 (or with the LRU cache disabled), the healthcheck endpoint returns healthy even if the workload attestor returns an error.

Should we at least avoid the situation where the response blocks indefinitely?

evan2645 · 2024-01-23T07:57:27Z

Thank you very much @hiyosi for tracking nightlies and reporting this 🙏

rturner3 · 2024-01-24T04:13:28Z

A few ideas I can think of to consider off the top of my head:

- Change the agent health check implementation to no longer depend on FetchX509SVID, since workload attestation is not guaranteed to produce selectors that match a registration entry authorized to the agent anyway. There is already a difference in behavior between the /live and /ready health check endpoints and the gRPC health API used by spire-agent healthcheck. The former rely on FetchX509Bundles instead of FetchX509SVID, which I believe shouldn't exhibit the behavior you described. Perhaps we should just consolidate both implementations to depend on FetchX509Bundles.
- Preserve the old behavior of FetchX509SVID returning early by pushing updates over the cache subscriber channel after some predefined timeout in the agent code. This could be trickier to implement correctly for the reasons explained here

@azdagron Any thoughts on this since you've been more actively involved with health checking in SPIRE?

rturner3 · 2024-01-29T19:57:55Z

Another observation I wanted to point out is that I was unable to reproduce on Linux or macOS by building/running the SPIRE binaries locally, but this does seem to be reproducible when running on a local kind K8s cluster.

MarcosDY · 2024-01-30T20:03:24Z

The LRU cache does not work properly when attestation results in no selectors.
It can be reproduced by changing the agent config to use a WorkloadAttestor that will not generate selectors, for example:

diff --git a/conf/agent/agent.conf b/conf/agent/agent.conf
index cf1fcb353..32a1ef9d0 100644
--- a/conf/agent/agent.conf
+++ b/conf/agent/agent.conf
@@ -18,7 +18,7 @@ plugins {
             directory = "./.data"
         }
     }
-    WorkloadAttestor "unix" {
+    WorkloadAttestor "docker" {
         plugin_data {
         }
     }

and then running an X.509 fetch:

./bin/spire-agent api fetch
rpc error: code = DeadlineExceeded desc = context deadline exceeded

This results in the LRU cache never returning an update, which causes timeouts.
It can also be reproduced by running SUITES="suites/k8s" make integration, which hits the same timeouts.
It is not clear to me why that test is passing in CI...

As a side note, the old cache notified a subscriber as soon as it was created (link),
but the LRU cache does not (link).
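The difference can be shown with a toy model. The sketch below is not SPIRE's cache: `cache`, `subscriber`, and `subscribe` are hypothetical names, and the selector-to-identity map is a simplification. It only demonstrates the "notify on subscribe" behavior attributed to the old cache, where a new subscriber gets an initial update even when no selectors were produced.

```go
package main

import "fmt"

// update is a hypothetical stand-in for the cache update type.
type update struct{ identities []string }

// subscriber holds a buffered channel so the initial update never blocks.
type subscriber struct{ c chan update }

func (s *subscriber) Updates() <-chan update { return s.c }

// cache is a toy model mapping a selector to the identities it matches.
type cache struct {
	entries map[string][]string
}

// subscribe creates a subscriber and immediately pushes the current state,
// which is an empty update when selectors is empty or nothing matches.
func (c *cache) subscribe(selectors []string) *subscriber {
	s := &subscriber{c: make(chan update, 1)}

	var ids []string
	for _, sel := range selectors {
		ids = append(ids, c.entries[sel]...)
	}
	s.c <- update{identities: ids}
	return s
}

func main() {
	c := &cache{entries: map[string][]string{
		"unix:uid:1000": {"spiffe://example.org/workload"},
	}}

	// Attestation produced no selectors, yet the subscriber is notified.
	sub := c.subscribe(nil)
	u := <-sub.Updates()
	fmt.Println("identities:", len(u.identities))
}
```

A reader looping on `Updates()` in this model always receives a first message, so a handler like the one quoted earlier can respond with an empty result instead of waiting.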

evan2645 added the triage/in-progress Issue triage is in progress label Jan 23, 2024

amartinezfayo added this to the 1.9.0 milestone Jan 23, 2024

amartinezfayo added priority/urgent Issue is approved and is must be completed in the assigned milestone and removed triage/in-progress Issue triage is in progress labels Jan 23, 2024

amartinezfayo assigned rturner3 Jan 23, 2024

MarcosDY mentioned this issue Jan 31, 2024

LRU subscribers failed to start when no selector was provided #4852

Merged

rturner3 closed this as completed in #4852 Feb 1, 2024
