[RFC] SPIRE Agent Cache Redesign Proposal #2940

prasadborole1 · 2022-04-12T18:20:39Z

Background

The RFC provides the background and discusses the scaling problem caused by the current SPIRE Agent cache design, which stores X.509-SVIDs. This proposal builds on the approaches discussed in that RFC and provides a more detailed redesign.

Existing Behavior

All SPIRE Agent RPCs and the SPIRE Agent-Server sync implementation assume that all authorized entries and their corresponding SVIDs are cached locally. The existing cache implementation is as follows:

type Cache struct {
	…
	// records holds the records for registration entries, keyed by registration entry ID
	records map[string]*cacheRecord

	// selectors holds the selector indices, keyed by a selector key
	selectors map[selector]*selectorIndex
}

type selectorIndex struct {
	// subs holds the subscriptions related to this selector
	subs map[*subscriber]struct{}

	// records holds the cache records related to this selector
	records map[*cacheRecord]struct{}
}

type cacheRecord struct {
	entry *common.RegistrationEntry
	svid  *X509SVID

	// this is unused today
	subs  map[*subscriber]struct{}
}

SPIRE Agent-Server sync

During the periodic sync, SPIRE Agent fetches all authorized entries and bundles from the SPIRE Server. It calculates the missing/deleted entries and expiring SVIDs and updates them in the cache. The subscribers of these affected SVIDs are notified accordingly.

SPIRE Agent RPCs

SPIRE Agent supports two types of RPCs:

Streaming RPCs: RPCs like FetchX509SVID subscribe to cache changes based on the selector set and wait for updates to send on the stream until it is closed.
Unary RPCs: RPCs like FetchJWTSVID retrieve SPIFFE IDs from the cache based on the selector set so that they can prepare the relevant JWT-SVIDs.

In a nutshell, SVIDs are added to, updated in, and removed from the cache during the SPIRE Agent-Server sync. The RPCs that need SVIDs listen on gRPC streams for SVID updates. Some RPCs (like FetchJWTSVID) need details about the SPIFFE IDs and authorized registration entries that the Agent is responsible for in order to validate request parameters.
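To make the selector-based lookup concrete, here is a minimal sketch of how the SPIFFE IDs for a workload could be derived from cached entries. It assumes that an entry matches when every one of its selectors is present in the workload's selector set. The types and the function matchingSPIFFEIDs are hypothetical and simplified, not the agent's actual code.

```go
package main

import "sort"

// selector and entry are simplified, hypothetical stand-ins for the
// selector key and common.RegistrationEntry used by the agent cache.
type selector struct {
	Type  string
	Value string
}

type entry struct {
	ID        string
	SPIFFEID  string
	Selectors []selector
}

// matchingSPIFFEIDs returns the sorted, de-duplicated SPIFFE IDs of all
// entries whose selectors are all present in the workload's selector set.
func matchingSPIFFEIDs(entries []entry, workload []selector) []string {
	have := make(map[selector]struct{}, len(workload))
	for _, s := range workload {
		have[s] = struct{}{}
	}
	ids := make(map[string]struct{})
	for _, e := range entries {
		if len(e.Selectors) == 0 {
			continue // an entry without selectors matches nothing in this sketch
		}
		matched := true
		for _, s := range e.Selectors {
			if _, ok := have[s]; !ok {
				matched = false
				break
			}
		}
		if matched {
			ids[e.SPIFFEID] = struct{}{}
		}
	}
	out := make([]string, 0, len(ids))
	for id := range ids {
		out = append(out, id)
	}
	sort.Strings(out)
	return out
}
```

A unary handler such as FetchJWTSVID could use a lookup of this shape to check that the SPIFFE ID named in a request is one the calling workload is entitled to.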

Proposal

Today the existing cache model stores registration entries along with X509-SVIDs. If registrationEntryCache is separated from x509SVIDCache, the Agent can keep all authorized registration entries but only a limited number of X509-SVIDs in memory. X509SVIDCacheSize will have a high default value, which can be overridden via a new experimental configuration field of SPIRE Agent.

FetchJWTSVID

Since all the authorized registration entries will be cached, the FetchJWTSVID implementation will mostly remain the same.

FetchX509SVID

As not all SVIDs will necessarily be cached, there will be a delay when an SVID is fetched for the first time, in the event of a cache miss.
In order to reduce this delay, we have to increase the SPIRE Server-Agent sync frequency, i.e. shorten the sync interval (SVIDs get fetched only during sync).
Since sync today also involves fetching registration entries and bundles from SPIRE Server, a new registration sync module can be introduced to handle that fetch operation. This will allow SVIDCacheSync to be performed more frequently (~500 msec).

RegistrationEntryAndBundleSync

This module could run at the present cadence of 5 sec.
It will be responsible for maintaining entryCache and bundleCache by fetching authorized registration entries and trust bundles from the SPIRE Server.
If the trust domain bundle changes, this module will notify all subscribers.

SVIDCacheSync

1. This module will retain most of the SPIRE Server-Agent sync actions, except fetching of entries and trust bundles.
2. It will prioritize caching of SVIDs corresponding to entries that have subscribers.
3. SVIDs that need to be removed from the SVID cache because their corresponding entries were removed from the entry cache will be detected using a mapping of entry ID -> SVID cache record.
4. For the remaining cache size (X509SVIDCacheSize - currentCacheSize), we can implement an LRU deletion policy with preference given to SVIDs which had subscribers most recently.
4.1. The most-recently-used statistic is not necessarily an indicator of an identity being needed again soon. However, there could be cases where some workloads run on dedicated hardware and are more likely to be scheduled on the same host. This heuristic is better than randomly adding entries to fill the remainder of the cache.
4.2. If the remaining cache size is negative after all SVID records with active subscribers are accounted for, the inactive SVID records with the lowest “last subscription timestamp” will be removed to bring the cache size down to the configured limit. This timestamp will be 0 for SVID records which have never had subscribers since joining the cache.
4.3. If the remaining cache size is positive, then we request SVID signings for registration entries not represented in the SVID cache, up to the cache size limit.
5. It will also update the SVID cache records of entries which were updated in SPIRE Server and of SVIDs whose TTL has passed its half-life.
6. Most of the actions performed in this sync are in memory (except fetchSVID, which will be executed for newly authorized entries or for a new subscription resulting in a cache miss), hence we can execute this module more frequently than every 5 sec.
7. Justification for this approach:
7.1. It prevents potential DDoS concerns in the scenario where a large number of workloads across the infrastructure are launched around the same time and don't have their identities cached in the local agent. If we made the SVID signing calls to the Server synchronously in the Agent handlers for this case, we could have a potentially unbounded number of signing requests to SPIRE Server.
7.2. Server requests happen in a different context than the Agent Workload API handler contexts, which eliminates some potential retry complexity in the client code for the case when the Server APIs return an error.
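The eviction and fill rules in 4.2 and 4.3 can be sketched roughly as follows. This is only an illustration under simplifying assumptions: svidRecord and planCache are hypothetical names, timestamps are plain Unix seconds, and SVIDs for entries with active subscribers are assumed to be handled separately and never evicted.

```go
package main

import "sort"

// svidRecord is a hypothetical, simplified SVID cache record.
type svidRecord struct {
	EntryID          string
	ActiveSubs       int   // number of current subscribers
	LastSubscribedAt int64 // Unix time of the latest subscription; 0 if never subscribed
}

// planCache returns the entry IDs to evict and the entry IDs to request
// SVIDs for, given the cached records, the authorized entries that have
// no SVID yet, and the configured cache size limit.
func planCache(records []svidRecord, uncached []string, limit int) (evict, fetch []string) {
	remaining := limit - len(records)
	if remaining < 0 {
		// 4.2: over the limit. Only records without subscribers may go,
		// lowest "last subscription timestamp" first.
		var inactive []svidRecord
		for _, r := range records {
			if r.ActiveSubs == 0 {
				inactive = append(inactive, r)
			}
		}
		sort.SliceStable(inactive, func(i, j int) bool {
			return inactive[i].LastSubscribedAt < inactive[j].LastSubscribedAt
		})
		n := -remaining
		if n > len(inactive) {
			// Soft limit: records with subscribers keep the cache above it.
			n = len(inactive)
		}
		for _, r := range inactive[:n] {
			evict = append(evict, r.EntryID)
		}
		return evict, nil
	}
	// 4.3: room left. Request SVIDs for uncached entries up to the limit.
	if remaining > len(uncached) {
		remaining = len(uncached)
	}
	return nil, uncached[:remaining]
}
```

Because records with active subscribers are never candidates for eviction, the function can return fewer evictions than needed to reach the limit, which is what makes X509SVIDCacheSize a soft limit.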

New Cache Models

All SVIDs corresponding to subscribed entries will be cached; therefore X509SVIDCacheSize is a soft limit. Not caching SVIDs with active subscribers is not an option.
If authorizedEntries < X509SVIDCacheSize, then there is no functional change to any RPC behavior compared to the present. There may be a short wait for the first SVID fetch due to the new delay introduced between the entry sync and the SVID sync.

The following is a high-level sketch of how we can split the current Cache struct.

type RegistrationCache struct {
	// records holds the records for registration entries, keyed by registration entry ID
	records map[string]*common.RegistrationEntry
	selectors map[selector]*common.RegistrationEntry
}

type BundleCache struct {
	// Bundles is a set of ALL trust bundles available to the agent, keyed by trust domain
	Bundles map[spiffeid.TrustDomain]*bundleutil.Bundle
}

type SVIDCache struct {
	…
	// records holds the records for registration entries, keyed by registration entry ID
	records map[string]*cacheRecord

	// subscribers holds the selector indices, keyed by a selector key
	subscribers map[selector]*selectorIndex
}

type cacheRecord struct {
	…
	svid  *X509SVID
}

type selectorIndex struct {
	// subs holds the subscriptions related to this selector
	subs map[*subscriber]struct{}
	// records holds the cache records related to this selector
	records map[*cacheRecord]struct{}
}

Request For Comments

Does this overall cache split design make sense?
What should the SVIDCacheSync frequency be?


prasadborole1 · 2022-04-19T16:45:41Z

Expanding on the changes required to the FetchX509SVID RPC:
FetchX509SVID will need to block until all SVIDs a subscriber is entitled to are fetched into the cache (which will happen in subsequent SVIDSync iterations). This ensures that only a complete list of active SVIDs is returned to the user as the first gRPC event.
It eliminates the case where a user closes the stream after the first event (e.g. the CLI command spire-agent api fetch x509) and receives an incomplete list of SVIDs because we did not wait for the next SVID sync to update the cache.

rturner3 · 2023-06-05T18:43:01Z

This has been implemented behind a feature flag in #3181. The follow-up to make this the default behavior is tracked in #4224.

prasadborole1 mentioned this issue Jun 22, 2022

Implement LRU cache for storing SVIDs in SPIRE Agent #3181

Merged

rturner3 mentioned this issue May 15, 2023

[RFC] Configurable SPIRE Agent LRU X.509-SVID caching strategy #2593

Closed

rturner3 closed this as completed Jun 5, 2023

rturner3 mentioned this issue Jan 24, 2024

LRU cache enabled agent, healthcheck API does not respond the status if the attestor plugin returns error #4827

Closed
