[RFC] SPIRE Agent Cache Redesign Proposal #2940

Closed
prasadborole1 opened this issue Apr 12, 2022 · 2 comments

Comments

@prasadborole1
Contributor

prasadborole1 commented Apr 12, 2022

Background

The RFC provides the background and discusses the scaling problem caused by the current SPIRE Agent cache design, which stores X.509-SVIDs. This proposal builds on the approaches discussed in that RFC and provides a more detailed redesign.

Existing Behavior

All SPIRE Agent RPCs and SPIRE Agent-Server sync implementations are based on the assumption that all authorized entries and their corresponding SVIDs are cached locally. The existing cache implementation is as follows:

type Cache struct {
	…
	// records holds the records for registration entries, keyed by registration entry ID
	records map[string]*cacheRecord

	// selectors holds the selector indices, keyed by a selector key
	selectors map[selector]*selectorIndex
}

type selectorIndex struct {
	// subs holds the subscriptions related to this selector
	subs map[*subscriber]struct{}

	// records holds the cache records related to this selector
	records map[*cacheRecord]struct{}
}

type cacheRecord struct {
	entry *common.RegistrationEntry
	svid  *X509SVID

	// this is unused today
	subs  map[*subscriber]struct{}
}

SPIRE Agent-Server sync

During the periodic sync, SPIRE Agent fetches all authorized entries and bundles from the SPIRE Server. It calculates the missing/deleted entries and expiring SVIDs and updates them in the cache. The subscribers of these affected SVIDs are notified accordingly.

SPIRE Agent RPCs

SPIRE Agent supports two types of RPCs:

  1. Streaming RPCs: RPCs like FetchX509SVID subscribe to cache changes based on the selector set and wait for updates to send on the stream until it is closed.
  2. Unary RPCs: RPCs like FetchJWTSVID retrieve SPIFFE IDs from the cache based on the selector set so that they can prepare the relevant JWT-SVIDs.

SPIRE Server-Agent Sync

In a nutshell, SVIDs are added to, updated in, or removed from the cache during SPIRE Agent-Server sync. RPCs that need SVIDs listen on gRPC streams for SVID updates. Some RPCs (like FetchJWTSVID) need details about the SPIFFE IDs and authorized registration entries that the Agent is responsible for in order to validate request parameters.
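
As an illustration of the selector-set lookup both RPC types rely on, here is a minimal sketch. It assumes that an entry matches a workload when every selector on the entry is present in the workload's selector set. The `selector`, `selectorSet`, `registrationEntry`, and `matchEntries` names are hypothetical simplifications, not the Agent's actual types.

```go
package main

// selector is a hypothetical type/value pair, e.g. {"unix", "uid:1000"}.
type selector struct {
	Type  string
	Value string
}

// selectorSet is the set of selectors attested for a workload.
type selectorSet map[selector]struct{}

// newSelectorSet builds a set from a list of selectors.
func newSelectorSet(ss ...selector) selectorSet {
	set := make(selectorSet, len(ss))
	for _, s := range ss {
		set[s] = struct{}{}
	}
	return set
}

// registrationEntry is a simplified stand-in for a cached registration entry.
type registrationEntry struct {
	SpiffeID  string
	Selectors []selector
}

// matchEntries returns the SPIFFE IDs of all entries whose selectors are all
// contained in the workload's selector set, in the order the entries appear.
func matchEntries(entries []registrationEntry, workload selectorSet) []string {
	var ids []string
	for _, e := range entries {
		if len(e.Selectors) == 0 {
			continue // an entry without selectors matches nothing in this sketch
		}
		matched := true
		for _, s := range e.Selectors {
			if _, ok := workload[s]; !ok {
				matched = false
				break
			}
		}
		if matched {
			ids = append(ids, e.SpiffeID)
		}
	}
	return ids
}
```

A unary handler such as FetchJWTSVID could use the returned IDs to check that the SPIFFE ID named in the request is one the workload is entitled to.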

Proposal

Today, the cache model stores registration entries along with X509-SVIDs. If registrationEntryCache is separated from x509SVIDCache, the Agent can store all authorized registration entries but only a limited number of X509-SVIDs in memory. The X509SVIDCacheSize will have a high default value, which can be overridden via a new experimental configuration field of SPIRE Agent.

New SPIRE Server-Agent Sync with new cache

FetchJWTSVID

Since all the authorized registration entries will be cached, FetchJWTSVID implementation will mostly remain the same.

FetchX509SVID

  1. As not all SVIDs will necessarily be cached, there will be a delay when an SVID is fetched for the first time after a cache miss.
  2. To reduce this delay, we have to run the SPIRE Server-Agent sync more frequently, because SVIDs are fetched only during sync.
  3. Since sync today also fetches registration entries and bundles from SPIRE Server, a new registration sync module can be introduced to handle those fetch operations. This will allow SVIDCacheSync to be performed more frequently (~500 msec).

RegistrationEntryAndBundleSync

  1. This module could run at the present cadence of 5 sec.
  2. It will be responsible for maintaining entryCache and bundleCache by fetching authorized registration entries and trust bundles from the SPIRE Server.
  3. If the trust domain bundle changes, this module will notify all subscribers.

SVIDCacheSync

  1. This module will retain most of the SPIRE Server-Agent sync actions except fetching of entries and trust bundle.
  2. It will prioritize caching of SVIDs corresponding to entries that have subscribers.
  3. SVIDs that need to be removed from the SVID cache due to their corresponding entry(ies) being removed from the entry cache will be detected using a mapping of entry ID -> SVID cache record.
  4. For the remaining cache size (X509SVIDCacheSize - currentCacheSize), we can implement an LRU deletion policy with preference to SVIDs which had subscribers most recently.
    4.1. The most-recently-used statistic is not necessarily an indicator that an identity will be needed again soon. However, some workloads run on dedicated hardware and are more likely to be scheduled on the same host. This heuristic is better than randomly adding entries to fill the remainder of the cache.
    4.2. If the remaining cache size is negative after all SVID records with active subscribers are accounted for, the inactive SVID records with the lowest “last subscription timestamp” will be removed to bring the cache size down to the configured limit. This timestamp will be 0 for SVID records that have never had subscribers since joining the cache.
    4.3. If the remaining cache size is positive, then we request SVID signings for Registration Entries not represented in the SVID cache, up to the cache limit size.
  5. It will also update the SVID cache record of entries which were updated in SPIRE Server or SVIDs whose TTL passed half life.
  6. Most of the actions performed in this sync are in memory. The exception is fetchSVID, which is executed when a newly authorized entry appears or a new subscription results in a cache miss. Hence we can run this module more frequently than every 5 sec.
  7. Justification for this approach:
    7.1. Prevents potential DDoS concerns for the scenario when a large number of workloads across the infrastructure are launched around the same time and don't have their identities cached in the local agent. If we made the SVID signing calls to the Server synchronous in the Agent handlers for this case, we could have a potentially unbounded number of signing requests to SPIRE Server.
    7.2. Server requests happen in a different context than the Agent Workload API handler contexts, which eliminates some potential retry complexity in the client code for the case when the Server APIs return an error.
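
The sizing rules in step 4 can be sketched as a single planning function. This is a simplified illustration under stated assumptions. `svidRecord`, `cachePlan`, and `planSVIDCache` are hypothetical names. Removal of records whose entries were deleted (step 3) and signing for subscribed entries that miss the cache are assumed to be handled separately.

```go
package main

import "sort"

// svidRecord is a hypothetical view of one cached SVID.
type svidRecord struct {
	EntryID          string
	ActiveSubs       int   // number of current subscribers
	LastSubscription int64 // Unix time of the last subscription, 0 if never subscribed
}

// cachePlan lists the entry IDs to evict from, or sign into, the SVID cache.
type cachePlan struct {
	Evict []string
	Sign  []string
}

// planSVIDCache applies the soft limit: records with active subscribers are
// never evicted, inactive records are evicted in order of oldest last
// subscription, and any spare capacity is filled with uncached entries.
func planSVIDCache(records []svidRecord, entryIDs []string, limit int) cachePlan {
	cached := make(map[string]bool, len(records))
	var inactive []svidRecord
	for _, r := range records {
		cached[r.EntryID] = true
		if r.ActiveSubs == 0 {
			inactive = append(inactive, r)
		}
	}

	var plan cachePlan
	remaining := limit - len(records)
	if remaining < 0 {
		// Over the limit: evict inactive records, oldest subscription first.
		sort.SliceStable(inactive, func(i, j int) bool {
			return inactive[i].LastSubscription < inactive[j].LastSubscription
		})
		n := -remaining
		if n > len(inactive) {
			n = len(inactive) // the limit is soft: active records stay cached
		}
		for _, r := range inactive[:n] {
			plan.Evict = append(plan.Evict, r.EntryID)
		}
		return plan
	}

	// Under the limit: request signing for uncached entries up to the limit.
	for _, id := range entryIDs {
		if remaining == 0 {
			break
		}
		if !cached[id] {
			plan.Sign = append(plan.Sign, id)
			remaining--
		}
	}
	return plan
}
```

Because the plan is computed in memory and only the `Sign` list results in calls to SPIRE Server, the number of signing requests per iteration stays bounded by the remaining cache capacity.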

New Cache Models

  1. All SVIDs corresponding to subscribed entries will be cached, so X509SVIDCacheSize is a soft limit. Not caching SVIDs with active subscribers is not an option.
  2. If authorizedEntries < X509SVIDCacheSize, there is no functional change to any RPC behavior compared to the present. There may be a short wait for the first SVID fetch because of the new delay introduced between the entry sync and the SVID sync.

The following is a high-level sketch of how the current Cache struct could be split.

type RegistrationCache struct {
	// records holds the records for registration entries, keyed by registration entry ID
	records map[string]*common.RegistrationEntry
	selectors map[selector]*common.RegistrationEntry
}

type BundleCache struct {
	// Bundles is a set of ALL trust bundles available to the agent, keyed by trust domain
	Bundles map[spiffeid.TrustDomain]*bundleutil.Bundle
}

type SVIDCache struct {
	…
	// records holds the records for registration entries, keyed by registration entry ID
	records map[string]*cacheRecord

	// subscribers holds the selector indices, keyed by a selector key
	subscribers map[selector]*selectorIndex
}

type cacheRecord struct {
	…
	svid  *X509SVID
}

type selectorIndex struct {
	// subs holds the subscriptions related to this selector
	subs map[*subscriber]struct{}
	// records holds the cache records related to this selector
	records map[*cacheRecord]struct{}
}

Request For Comments

  1. Does this overall cache split design make sense?
  2. What should the SVIDCacheSync frequency be?
@prasadborole1
Contributor Author

prasadborole1 commented Apr 19, 2022

Expanding on the changes required to the FetchX509SVID RPC:
FetchX509SVID will need to block until all SVIDs a subscriber is entitled to have been fetched into the cache (which will happen in subsequent SVIDSync iterations). This ensures that only a complete list of active SVIDs is returned to the user as the first gRPC event.
It covers the case where users close the stream after the first event (e.g. the CLI command spire-agent api fetch x509), which would otherwise receive an incomplete list of SVIDs if we did not wait for the next SVID sync to update the cache.

@rturner3
Collaborator

rturner3 commented Jun 5, 2023

This has been implemented behind a feature flag in #3181. The follow-up to make this the default behavior is tracked in #4224.
