Expired Agent data left in SPIRE database indefinitely #1836

rturner3 · 2020-09-11T19:35:47Z

Data in the attested_node_entries and node_resolver_map_entries tables is left in the database indefinitely. There are cases when this data may no longer be valuable to retain, such as when the rows correspond to an expired Agent that attested using a NodeAttestor plugin that does not have a trust on first use security model, such as sshpop or x509pop. In those cases, the Agents corresponding to the spiffe_id in those rows can safely re-attest with SPIRE Server even if this data is deleted from the database.

In long-running environments, this can result in a lot of stale data in the database that ultimately hurts SPIRE Server query performance. When the overall size of these tables expands beyond the maximum message size in gRPC, the only way to clean up this old data seems to be manually purging data in the database with SQL DELETE FROM statements. It is not necessarily intuitive to an operator of SPIRE that they might need to manually clean up this data from the database.

It would be nice if SPIRE Server did some periodic purging of this old data for expired Agents that attested with a non-TOFU NodeAttestor plugin to simplify the operational burden of monitoring the table sizes and manually purging the data on an ongoing basis.

The text was updated successfully, but these errors were encountered:

azdagron · 2020-09-15T22:23:16Z

Thanks for bring this up @rturner3! This is a very tricky problem, especially in the face of TOFU node attestation. A solution will likely be fairly involved, possibly querying node attestor plugins about whether or not a particular attested agent is still "alive" in order to determine if the associated data can be safely cleared.

Unfortunately, a "dirty" datastore with lots of stale information does put some extra burden on certain solutions for entry authorization crawl performance that we've been pursuing, like the full in-memory cache. There was some discussion about a short term bandaid that would prevent an attested node with an expired SVID from being loaded in the cache, which would solve the memory usage issue but wouldn't help with the time it takes to hydrate the cache.

I think this issue warrants deeper thought. Certainly we need to come up with something for this.

bri365 · 2021-04-08T17:30:54Z

Can we start with the non-TOFU agents? add to SIG-SPIRE agenda?

evan2645 · 2021-04-08T19:23:41Z

Core doesn't currently distinguish or know the difference between tofu and non-tofu attestors... but, it may need to in the future: #2203

Teaching core about this somehow is probably a pre-requisite for starting this cleanup?

bri365 · 2021-05-06T18:02:05Z

@evan2645 can tofu/non-tofu attestation be inferred from AttestationDataType? aws_iid, azure_msi, and gcp_iit are explicitly called out in the documentation as tofu. If/when we no longer require trust on first use on cloud nodes I suspect we would deliver that functionality as new node attestation plugins with new AttestationDataType?

evan2645 · 2021-05-07T23:06:11Z

@bri365 Hmm yes I think that could work for builtins, but we wouldn't be able to clean up after non-builtins. I also think there are likely to be cases where a plugin may have configurable TOFU based on how its deployed.

I do think it makes sense for core to know some stuff about a plugin, in general. @azdagron @amartinezfayo any thoughts here on how to accomplish this? Perhaps a TOFU bool on the node attestor response? We can track this in attested node entries, and it will be very useful if we decide to tackle #2203

bri365 · 2021-05-10T18:31:43Z

Do we expect a given node attestor to be configured for TOFU or not for all agents, or is this something that would ever be determined on a per agent basis? If the former (which I believe feels safer), adding a TOFU bool to nodeattestor.NodeAttestor and retrieving that from GetNodeAttestorNamed() should work well, and we would not need to store it in every agent record. If the latter, then we do need to store it for every agent.

azdagron · 2021-05-10T18:50:58Z

We've talked about TOFU being configurable on some node attestors (i.e. x509pop), so I don't think we could reasonably rely on a runtime property of the current "TOFU-ness" of a nodeattestor to evaluate the "TOFU-ness" of agents previously attested with that node attestor (since it could have changed).

rturner3 · 2021-11-15T21:15:14Z

Re-opening this because it was unintentionally closed by GitHub when the linked spire-plugin-sdk PR was merged. GitHub tried to be a little too clever and closed this issue because it picked up the phrase "resolve #1836" in the PR.

dfeldman · 2021-11-19T21:03:22Z

Just an update, here's what's left to do on this task:

Add CanReattest field in the spire-plugin-sdk
Add the migration so CanReattest is stored in the database alongside the other agent attestation data (in progress)
Wire together the plugins and the datastore layer so that the proper CanReattest flag is stored for each plugin
Add a command line for deleting old data where CanReattest is true
I'll update this as we resolve these.

dfeldman · 2022-05-25T21:56:21Z

update: we now have all the parts of this except an actual command line for deleting the old entries

azdagron · 2022-10-11T20:23:08Z

Hey @dfeldman, are you working on this?

guilhermocc · 2023-02-13T18:31:32Z

Hey there, I'm interested in resolving this issue. Can someone provide me with a bit of context about the work that's left to be done? is it just the new command line for deleting stale attested node entries?

guilhermocc · 2023-02-13T21:38:46Z

I have a question about what was the consensus that we got here; from what I understood, we will proceed by creating a CLI command that will enable the operator to trigger the non-TOFU stale node agent entries cleanup, right? @rturner3 mentioned a periodic purging process, but from what I can see here, this periodic purge logic would be delegated to the operator. Is this correct?

guilhermocc · 2023-02-15T14:30:28Z

So, we discussed this issue in the last contributor sync (2/14/23), and here is what we have: This issue will continue with the closed scope of implementing the action that purges non-TOFU stale agent entries. The periodic process for purging and the stale TOFU node attestation will be treated separately (in other issues).

azdagron self-assigned this Sep 15, 2020

evan2645 mentioned this issue Apr 14, 2021

Audit infinitely growing datasets in the datastore #397

Closed

dfeldman mentioned this issue Oct 6, 2021

Add "repeatable" field to AgentAttributes spiffe/spire-plugin-sdk#17

Merged

rturner3 closed this as completed in spiffe/spire-plugin-sdk#17 Nov 15, 2021

rturner3 reopened this Nov 15, 2021

dfeldman mentioned this issue Dec 2, 2021

Add CanReattest flag in node attestor plugins and in datastore #2646

Merged

azdagron added help wanted Issues with this label are ready to start work but are in need of someone to do it priority/backlog Issue is approved and in the backlog labels Oct 11, 2022

This was referenced Feb 15, 2023

Add missing CanReAttest agent listing filter and output mask spiffe/spire-api-sdk#36

Merged

Attested node deletion doesn't clean node resolver data #3872

Closed

Update agents listing service with "CanReAttest" filter #3880

Merged

guilhermocc mentioned this issue Mar 15, 2023

Add agent purge command #3982

Merged

3 tasks

evan2645 closed this as completed in #3982 Apr 25, 2023

rturner3 mentioned this issue May 8, 2023

Feature Request: Filter ListAgent using the value of X509SvidExpiresAt #2367

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expired Agent data left in SPIRE database indefinitely #1836

Expired Agent data left in SPIRE database indefinitely #1836

rturner3 commented Sep 11, 2020

azdagron commented Sep 15, 2020

bri365 commented Apr 8, 2021

evan2645 commented Apr 8, 2021

bri365 commented May 6, 2021

evan2645 commented May 7, 2021

bri365 commented May 10, 2021

azdagron commented May 10, 2021

rturner3 commented Nov 15, 2021

dfeldman commented Nov 19, 2021 •

edited

Loading

dfeldman commented May 25, 2022

azdagron commented Oct 11, 2022

guilhermocc commented Feb 13, 2023

guilhermocc commented Feb 13, 2023

guilhermocc commented Feb 15, 2023 •

edited

Loading

Expired Agent data left in SPIRE database indefinitely #1836

Expired Agent data left in SPIRE database indefinitely #1836

Comments

rturner3 commented Sep 11, 2020

azdagron commented Sep 15, 2020

bri365 commented Apr 8, 2021

evan2645 commented Apr 8, 2021

bri365 commented May 6, 2021

evan2645 commented May 7, 2021

bri365 commented May 10, 2021

azdagron commented May 10, 2021

rturner3 commented Nov 15, 2021

dfeldman commented Nov 19, 2021 • edited Loading

dfeldman commented May 25, 2022

azdagron commented Oct 11, 2022

guilhermocc commented Feb 13, 2023

guilhermocc commented Feb 13, 2023

guilhermocc commented Feb 15, 2023 • edited Loading

dfeldman commented Nov 19, 2021 •

edited

Loading

guilhermocc commented Feb 15, 2023 •

edited

Loading