Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expired Agent data left in SPIRE database indefinitely #1836

Closed
rturner3 opened this issue Sep 11, 2020 · 14 comments · Fixed by spiffe/spire-plugin-sdk#17 or #3982
Closed

Expired Agent data left in SPIRE database indefinitely #1836

rturner3 opened this issue Sep 11, 2020 · 14 comments · Fixed by spiffe/spire-plugin-sdk#17 or #3982
Assignees
Labels
help wanted Issues with this label are ready to start work but are in need of someone to do it priority/backlog Issue is approved and in the backlog

Comments

@rturner3
Copy link
Collaborator

Data in the attested_node_entries and node_resolver_map_entries tables is left in the database indefinitely. There are cases when this data may no longer be valuable to retain, such as when the rows correspond to an expired Agent that attested using a NodeAttestor plugin that does not have a trust on first use security model, such as sshpop or x509pop. In those cases, the Agents corresponding to the spiffe_id in those rows can safely re-attest with SPIRE Server even if this data is deleted from the database.

In long-running environments, this can result in a lot of stale data in the database that ultimately hurts SPIRE Server query performance. When the overall size of these tables expands beyond the maximum message size in gRPC, the only way to clean up this old data seems to be manually purging data in the database with SQL DELETE FROM statements. It is not necessarily intuitive to an operator of SPIRE that they might need to manually clean up this data from the database.

It would be nice if SPIRE Server did some periodic purging of this old data for expired Agents that attested with a non-TOFU NodeAttestor plugin to simplify the operational burden of monitoring the table sizes and manually purging the data on an ongoing basis.

@azdagron azdagron self-assigned this Sep 15, 2020
@azdagron
Copy link
Member

Thanks for bring this up @rturner3! This is a very tricky problem, especially in the face of TOFU node attestation. A solution will likely be fairly involved, possibly querying node attestor plugins about whether or not a particular attested agent is still "alive" in order to determine if the associated data can be safely cleared.

Unfortunately, a "dirty" datastore with lots of stale information does put some extra burden on certain solutions for entry authorization crawl performance that we've been pursuing, like the full in-memory cache. There was some discussion about a short term bandaid that would prevent an attested node with an expired SVID from being loaded in the cache, which would solve the memory usage issue but wouldn't help with the time it takes to hydrate the cache.

I think this issue warrants deeper thought. Certainly we need to come up with something for this.

@bri365
Copy link
Contributor

bri365 commented Apr 8, 2021

Can we start with the non-TOFU agents? add to SIG-SPIRE agenda?

@evan2645
Copy link
Member

evan2645 commented Apr 8, 2021

Core doesn't currently distinguish or know the difference between tofu and non-tofu attestors... but, it may need to in the future: #2203

Teaching core about this somehow is probably a pre-requisite for starting this cleanup?

@bri365
Copy link
Contributor

bri365 commented May 6, 2021

@evan2645 can tofu/non-tofu attestation be inferred from AttestationDataType? aws_iid, azure_msi, and gcp_iit are explicitly called out in the documentation as tofu. If/when we no longer require trust on first use on cloud nodes I suspect we would deliver that functionality as new node attestation plugins with new AttestationDataType?

@evan2645
Copy link
Member

evan2645 commented May 7, 2021

@bri365 Hmm yes I think that could work for builtins, but we wouldn't be able to clean up after non-builtins. I also think there are likely to be cases where a plugin may have configurable TOFU based on how its deployed.

I do think it makes sense for core to know some stuff about a plugin, in general. @azdagron @amartinezfayo any thoughts here on how to accomplish this? Perhaps a TOFU bool on the node attestor response? We can track this in attested node entries, and it will be very useful if we decide to tackle #2203

@bri365
Copy link
Contributor

bri365 commented May 10, 2021

Do we expect a given node attestor to be configured for TOFU or not for all agents, or is this something that would ever be determined on a per agent basis? If the former (which I believe feels safer), adding a TOFU bool to nodeattestor.NodeAttestor and retrieving that from GetNodeAttestorNamed() should work well, and we would not need to store it in every agent record. If the latter, then we do need to store it for every agent.

@azdagron
Copy link
Member

We've talked about TOFU being configurable on some node attestors (i.e. x509pop), so I don't think we could reasonably rely on a runtime property of the current "TOFU-ness" of a nodeattestor to evaluate the "TOFU-ness" of agents previously attested with that node attestor (since it could have changed).

@rturner3
Copy link
Collaborator Author

Re-opening this because it was unintentionally closed by GitHub when the linked spire-plugin-sdk PR was merged. GitHub tried to be a little too clever and closed this issue because it picked up the phrase "resolve #1836" in the PR.

@dfeldman
Copy link
Member

dfeldman commented Nov 19, 2021

Just an update, here's what's left to do on this task:

  • Add CanReattest field in the spire-plugin-sdk
  • Add the migration so CanReattest is stored in the database alongside the other agent attestation data (in progress)
  • Wire together the plugins and the datastore layer so that the proper CanReattest flag is stored for each plugin
  • Add a command line for deleting old data where CanReattest is true
    I'll update this as we resolve these.

@dfeldman
Copy link
Member

update: we now have all the parts of this except an actual command line for deleting the old entries

@azdagron azdagron added help wanted Issues with this label are ready to start work but are in need of someone to do it priority/backlog Issue is approved and in the backlog labels Oct 11, 2022
@azdagron
Copy link
Member

Hey @dfeldman, are you working on this?

@guilhermocc
Copy link
Contributor

Hey there, I'm interested in resolving this issue. Can someone provide me with a bit of context about the work that's left to be done? is it just the new command line for deleting stale attested node entries?

@guilhermocc
Copy link
Contributor

I have a question about what was the consensus that we got here; from what I understood, we will proceed by creating a CLI command that will enable the operator to trigger the non-TOFU stale node agent entries cleanup, right? @rturner3 mentioned a periodic purging process, but from what I can see here, this periodic purge logic would be delegated to the operator. Is this correct?

@guilhermocc
Copy link
Contributor

guilhermocc commented Feb 15, 2023

So, we discussed this issue in the last contributor sync (2/14/23), and here is what we have: This issue will continue with the closed scope of implementing the action that purges non-TOFU stale agent entries. The periodic process for purging and the stale TOFU node attestation will be treated separately (in other issues).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Issues with this label are ready to start work but are in need of someone to do it priority/backlog Issue is approved and in the backlog
Projects
None yet
6 participants