
[RFC] Serverless architecture support #1843

Closed
amartinezfayo opened this issue Sep 16, 2020 · 41 comments

@amartinezfayo
Member

[RFC] Serverless architecture support

Co-authored by @MarcosDY.

Background

Serverless computing allows developers to build applications without managing infrastructure. With serverless applications, the cloud service provider automatically provisions, scales, and manages the infrastructure required to run the code, eliminating the need for server software and hardware management by the developer.
SPIRE's current approach to workload attestation does not fit this software design pattern well: the execution context is a temporary runtime environment that is not suitable for running a SPIRE Agent to expose the Workload API alongside the serverless function.

Proposal

In order to allow the issuance of SVIDs to workloads in a serverless environment, we need to provide a way to issue identities to the workload without using the Workload API. The workload would attest directly to SPIRE Server to obtain its identity. This means that we would go through an attestation process similar to node attestation, but without yielding an agent role to the attested serverless (and agentless) environment.
The attestation process would proceed similarly to the current AttestAgent server RPC, but through a new call that would provide an "agentless" identity instead of a node identity in SPIRE.
The renewal process would also proceed similarly to the current RenewAgent RPC, where the caller would present an active "agentless" SVID returned by the attestation call or the most recent one from a previous renewal call. This would avoid going through a complete attestation process when the environment already has a valid SVID that needs to be renewed. The criteria for deciding whether the SVID should be rotated can be similar to the current criteria adopted in SPIRE, i.e. rotate the SVID if it has less than half of its lifetime left.
The proposed solution should facilitate the issuance of identities in a performant manner, focusing on optimizing the usage of resources; otherwise the advantages of the serverless architecture could be eroded by the identity issuance process. To that end, this proposal tries to leverage some of the common features available in cloud providers that aim to solve performance problems, like reusing the execution context if one is available from a previous function call.
The proposed process to obtain an identity in a serverless architecture is as follows (a sketch in Go follows the list):

  • Check if there is already a valid SVID available in the execution context.
    • If there is no valid SVID, call the "agentless" attestation RPC to get an identity.

      • Store the obtained identity in a variable declared outside of the function's handler method so it remains initialized, providing additional optimization when the function is invoked again.
    • If there is already a valid SVID, calculate its lifetime left.

      • If it has more than half of its lifetime left, just use it.
      • If it has less than half of its lifetime left, call the renewal RPC and store the obtained identity in a variable declared outside of the function's handler method.
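
Below is a minimal sketch of this flow in Go, assuming an AWS Lambda runtime. The fetchSVID and renewSVID helpers are hypothetical stand-ins for the proposed "agentless" attestation and renewal RPCs; they are not existing SPIRE APIs.

package main

import (
    "context"
    "time"

    "github.com/aws/aws-lambda-go/lambda"
)

type svid struct {
    certPEM  []byte
    notAfter time.Time
    lifetime time.Duration
}

// Declared outside the handler so it survives warm invocations when the
// execution context is reused.
var cachedSVID *svid

func handler(ctx context.Context) error {
    now := time.Now()
    switch {
    case cachedSVID == nil || now.After(cachedSVID.notAfter):
        // No valid SVID in the execution context: full "agentless" attestation.
        cachedSVID = fetchSVID(ctx)
    case cachedSVID.notAfter.Sub(now) < cachedSVID.lifetime/2:
        // Less than half of the lifetime left: call the renewal RPC.
        cachedSVID = renewSVID(ctx, cachedSVID)
    }
    // ...use cachedSVID to establish mTLS connections...
    return nil
}

// Hypothetical wrappers around the proposed RPCs.
func fetchSVID(ctx context.Context) *svid          { return &svid{} }
func renewSVID(ctx context.Context, s *svid) *svid { return &svid{} }

func main() { lambda.Start(handler) }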

Sample implementation

The following is a description of a sample implementation of the proposed process, including the changes needed in SPIRE and the components required in the serverless environment to issue identities without a SPIRE Agent deployed there.

SPIRE

  • Add new plugin types to perform the "agentless" attestation in SPIRE Server, with a new plugin for each provider that offers a serverless platform. For example, there would be a plugin to support AWS Lambda, a plugin for Google Cloud Functions, a plugin for Microsoft Azure Functions, and other plugins for any other platform. These are some possible workflows for the implementations:

    • AWS Lambda: the function signs a GetCallerIdentity query for the AWS Security Token Service (STS) using the AWS Signature v4 algorithm and sends it to SPIRE Server. The credentials used to sign the GetCallerIdentity request come from the AWS Lambda runtime environment variables, which avoids the need for an operator to manually provision credentials first. To attest the "agentless" workload, SPIRE Server sends the query to the AWS STS service to validate it and issues an SVID with a SPIFFE ID constructed from attributes extracted from the parsed signed query. (A sketch of this flow appears after this list.)

    • Google Cloud Functions: the function fetches its identity token using the Compute Metadata Server. The attestor plugin in SPIRE Server validates the token provided and issues an SVID with a SPIFFE ID constructed from attributes extracted from the parsed token.

    • Microsoft Azure: the function obtains its access token from the local token service. The attestor plugin in SPIRE Server validates the token provided and issues an SVID with a SPIFFE ID constructed from attributes extracted from the parsed token.

  • Attestation data structs are usually shared from github.com/spiffe/spire/pkg/common/plugin/<plugin_name>, which would be inconvenient to consume externally. Instead, the required types could be exposed through Protocol Buffers definitions under the proto/spire hierarchy.

  • It would be good to expose a library that facilitates the attestation process from the serverless environment. This library should expose interfaces to construct the attestation material, call the "agentless" attestation RPC in SPIRE Server, and ease the reuse of the issued SVID in case the state of the environment is preserved in a future invocation. It should also provide functionality to perform the SVID renewal process.
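
For the AWS Lambda flow above, the attestation material could be built with aws-sdk-go (v1) following the pattern used by aws-iam-authenticator: presign a GetCallerIdentity request with the credentials already present in the Lambda environment, then hand the resulting URL to SPIRE Server, which executes it against STS to verify the caller. This is only a sketch under those assumptions; the attestor RPC that would consume the URL is the hypothetical one proposed here.

package attestor

import (
    "time"

    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/sts"
)

// attestationPayload returns a presigned GetCallerIdentity URL that SPIRE
// Server could execute to validate the caller's AWS identity.
func attestationPayload() (string, error) {
    // Credentials are picked up from the Lambda runtime environment
    // variables; no operator-provisioned secrets are needed.
    sess, err := session.NewSession()
    if err != nil {
        return "", err
    }
    req, _ := sts.New(sess).GetCallerIdentityRequest(&sts.GetCallerIdentityInput{})
    // Presign with SigV4; the URL expires after 15 minutes.
    return req.Presign(15 * time.Minute)
}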

Serverless environment

The workload running in the serverless environment needs to be able to be attested without a running SPIRE Agent exposing the Workload API. Instead, it calls an exposed RPC in SPIRE Server with attestation data that it retrieves from the execution runtime. As mentioned above, it would be convenient to have a library that can be consumed in the serverless environment to aid the attestation and identity issuance process.
With the aim of facilitating the implementation, this proposal recommends implementing a mechanism to externally package dependencies that can be shared across multiple functions. One possible way to achieve this is to have a common interface that can be used to retrieve the identity of the "agentless" workload, which can be called from the running function and is exposed through the runtime environment. For example, in the case of AWS Lambda, the "agentless" attestation functionality can be packaged in a layer. The function that needs to be attested can be configured to use this layer, so it does not need to implement it itself. The layer can also be updated with fixes or improvements without the need to update the function itself.

Request for Comments

This proposal tries to lay out the changes needed in SPIRE and possible implementation scenarios to support serverless architectures, focusing on providing a solution for AWS Lambda, Google Cloud Functions, and Microsoft Azure Functions. Any feedback on the general direction of this proposal, missing points, suggestions, or thoughts in general is greatly appreciated.

@evan2645
Member

Thank you @amartinezfayo and @MarcosDY for putting this together - it is a badly needed feature.

There is some prior art from Square here: https://developer.squareup.com/blog/providing-mtls-identities-to-lambdas/

Also an old(er) issue: #986

I am wondering if you have considered a "push" approach rather than a "pull" approach, e.g. by pushing SVIDs into platform-specific secret stores rather than having functions pull SVIDs from the SPIRE infrastructure. I see a few advantages to push:

  • Greatly reduces complexity/responsibility on the consumer
  • Preserves serverless performance advantages by moving SVID management out of the execution timeframe
  • Reduces the availability requirements of the SPIRE infrastructure

@bigdefect

I'm not particularly well versed on the auth side of things, but I did want to comment on the approach for AWS Lambda for the sake of more complete information. The aws-iam-authenticator does something similar, and there have been some learnings about its shortcomings that I recently read. I omitted a couple of kubernetes/EKS-oriented items.

  • STS role responses do not include IAM role paths if present, so if the IAM role ARN is arn:aws:iam::111122223333:role/some-path/my-role, the STS response would be arn:aws:sts::111122223333:assumed-role/my-role/session-name. Users cannot have multiple roles of the same name with different paths at the same time
  • Since the implementation is responsible for crafting the pre-signed URL, they determine which STS endpoint is used. This can cause the authenticating webhook to execute STS requests that may use either the global or regional endpoint in a different region
  • The pre-signed URL does contain an AccessKeyId which is logged so the customer can differentiate role sessions, but this requires the customer to look up the key ID in a log and then query CloudTrail to identify which entity assumed the role for that key ID

I believe first-class alternatives will require features from IAM, so this may still need to be the path forward while accepting caveats.

Is it fair to say this also begins to address non-unique node attestation #558?

@amartinezfayo
Member Author

amartinezfayo commented Sep 18, 2020

Thank you @evan2645 for your feedback.

I am wondering if you have considered a "push" approach rather than a "pull" approach, e.g. by pushing SVIDs into platform-specific secret stores rather than having functions pull SVIDs from the SPIRE infrastructure.

We discussed this topic during the SIG-SPIRE call on 9/17/2020, but I wanted to summarize my thinking here.
We considered the push approach, but felt that it wouldn't really be native support in SPIRE: it involves introducing interactions with other systems, like secret stores, which we felt would not be completely desirable, so we thought it better to take the pull approach with an "agentless" attestation or credentials exchange mechanism. The push approach also introduces some challenges, like the availability of the secrets store and extending trust to other components.

Both approaches certainly have pros and cons, making each a good choice under certain circumstances and a bad or impossible choice in others. Since SPIRE runs in a broad range of environments, we believe there is room for exploring both types of implementations. The ultimate goal of this RFC is to collect feedback that can tell us whether the proposed approach is useful for a variety of use cases, and if it is, to work on an implementation based on it.

@amartinezfayo
Member Author

Thank you @efe-selcuk for your observations.

The aws-iam-authenticator does something similar, and there have been some learnings about its shortcomings that I recently read.

We will be working on a more detailed proposal, so the points you raise are all valuable information. We may explore ways to work around those points.

Is it fair to say this also begins to address non-unique node attestation #558?

Yes, I think that this proposal goes towards the direction of addressing use cases discussed in that issue.

@mweissbacher
Contributor

Thank you @amartinezfayo and @MarcosDY for working on this, looking forward to serverless support in SPIRE!

I worked on bringing SPIFFE certificates to Lambda for Square (@evan2645 mentioned the blog post) and I wanted to expand on some of the reasons that made us pick push over pull. Blocking on SPIRE Server to issue identity is both a performance and an availability concern.

We asked ourselves whether there would be a strong security benefit to attesting on startup vs. issuing ahead of time and using a locked down secure storage mechanism. The conclusion we came to is that these would be equivalent and there was no upside to attestation on pull. By storing identity in secrets manager one can use IAM policies and/or SCPs to restrict access.

The reasons developers choose serverless are performance and scalability (among others). Expanding cold start time could be prohibitive for some workloads, not to mention that downtime of the SPIRE server would impact availability.

The push approach also introduces some challenges like the availability of the secrets store [...]

It's not uncommon for serverless functions to rely on a secrets store; if it is down, the function might not be able to perform regardless, so I don't think this would necessarily be a new dependency.

Another minor point: you listed the main platforms, but K8s has multiple serverless implementations, supporting them all could be tricky. On the other hand: writing to k8s secrets and making secrets accessible to functions could solve for all of these.

To summarize: I think the push model would make for a more reliable and performant solution, be equivalent from a security perspective and come with less code complexity.

@amartinezfayo
Member Author

Thank you @mweissbacher for the feedback, it's really helpful!

We are exploring all the options, including what a push model would look like. We should be able to share an updated proposal in the coming days.

@mweissbacher
Contributor

Looking forward to reading it! Also, while we tried to be thorough in the serverless identity article, I'm happy to discuss on a call or here if you have questions.

@mtitolo

mtitolo commented Oct 13, 2020

Hi folks, chiming in from an infrastructure and serverless perspective. I worked with @mweissbacher on our Lambda mTLS implementation and drove many of our design decisions that happened within the function itself.

The first thing I'd like to do is clear up the misconception that by using the pull model, you will be able to get away with not having some sort of "agent" run inside the function. Developers will want to wrap the logic to pull the certificate in a library or framework. No one is going to bespoke write that RPC logic in every single function they own. Additionally, with the recent release of AWS Lambda Extensions, these libraries can run out-of-process for a huge performance boost (our tests had a 30% reduction in cold start time). So even if there is a pull model, serverless developers will gravitate towards something like background processes or libraries. My team owns libraries for doing mTLS, and other things, within Lambda, and we actively choose solutions that reduce the amount of code we need to write to do the same thing in various programming languages. While the SPIRE developers may not develop an agent, the community will, because it makes sense. Serverless applications are all about abstracting away these kinds of concerns, not adding additional boilerplate code to every single function.

I'd also like to ask some questions about the mechanics of how this pull mechanism would work.

Firstly, what identity would the SPIRE server return based on the assumed role? How is that controlled? I ask because we have each AWS "account" tied to an application identity, so lambdas within that account are treated the same when doing mTLS. Will SPIRE support multiple IAM roles being given the same identity? We encourage teams to customize the execution role for each function to adhere to the principle of least privilege. We do not want every lambda in an account to execute as the same role. We also have several hundred accounts, so mapping this by hand is a non-starter.

Secondly, this proposal sounds like it expects the functions to be running in the same VPC as the SPIRE server? Many companies have a vpc-per-account model, where networking and permissions must be explicitly set up to cross those boundaries. We have a Shared VPC, but we currently restrict lambda traffic to envoy, AWS APIs, and our internal proxy. We would need to automate the setup of networking rules to allow traffic to hit the SPIRE server, which lives in a separate account. Adding the SPIRE server to the function's critical path complicates setup and debugging (cross-account debugging is particularly painful). Can you talk more about the networking considerations you made in this design?

Lastly, a comment. When it comes to availability, I expect native cloud provider tools to have better uptime than most things developers deploy on top of a cloud provider. Explicitly adding a non-cloud-native dependency, when a cloud-native one exists and works, is not something we see a lot of. Serverless apps rely heavily on cloud-native tools already, so adding one more is not a big deal.

Like Michael, I am really excited to see this proposal and the resulting technology. Being able to perform mTLS within lambdas has been a huge win for us, so I look forward to this being easy for other organizations and projects to benefit from.

@amartinezfayo
Member Author

Thank you @mtitolo for your comment, it's very valuable for us to get this kind of feedback.

The first thing I'd like to do is clear up the misconception that by using the pull model, you will be able to get away with not having some sort of "agent" run inside the function. Developers will want to wrap the logic to pull the certificate in a library or framework. No one is going to bespoke write that RPC logic in every single function they own.

It is the intention of this proposal to leverage any mechanism that improves performance and facilitates the development experience, as can be done with AWS Lambda Extensions. In the pull model we contemplated the use of layers that can package the agentless attestation functionality (including caching), and I think that the use of AWS Lambda Extensions could also be very beneficial. I don't think that the pull model precludes the use of these mechanisms.

Firstly, what identity would the SPIRE server return based on the assumed role? How is that controlled? I ask because we have each AWS "account" tied to an application identity, so lambdas within that account are treated the same when doing mTLS. Will SPIRE support multiple IAM roles being given the same identity? We encourage teams to customize the execution role for each function to adhere to the principle of least privilege. We do not want every lambda in an account to execute as the same role. We also have several hundred accounts, so mapping this by hand is a non-starter.

The model for the identity issuance that we have in mind is similar to the agent attestation model, where the agent gets an identity based on the attested properties (selectors). In the case of AWS Lambda, the automatically issued identities may have a form like this: spiffe://<trust-domain>/agentless/aws_lambda/<account_id>/<iam_role>/<function_name>.
This adheres to the model of having a different role for each lambda function, which would provide different identities. We are also considering that you may optionally pre-register entries so the function can get an additional identity if the function matches the defined selectors.

Secondly, this proposal sounds like it expects the functions to be running in the same VPC as the SPIRE server?

This is a good point. We were wondering whether this would really be an issue for most users; it looks like having to add SPIRE Server to the function's critical path is in fact a real concern. We may think of ways to work around this problem, but it seems intrinsic to the pull model and must be noted as one of the cons of the model.

We are actively working on enhancing and adding more details to the original proposal, including the push model, so we can compare the pros and cons of each.

@amartinezfayo
Member Author

Based on all the feedback received, we explored some alternatives using a push model. We explored options that include the development of external helper programs, as well as options that introduce built-in support in SPIRE. We are particularly optimistic about the latter, so we created a proof of concept that adds support for serverless computing (like AWS Lambda) in SPIRE through SPIRE Agent plugins (SVIDStore plugins) that use entry selectors to know which identities must be stored in the secrets management services offered by cloud providers. The stored identity can then be retrieved by the functions.
The serverless functions are registered in SPIRE in the same way that regular workloads are registered, through registration entries. The svidstore key is used to distinguish the "storable" entries; SVIDStore plugins receive updates for those entries only, which indicates that the issued SVID and key must be securely stored in a location accessible to the serverless function, like AWS Secrets Manager. Selectors thus provide a flexible way to describe the attributes needed to store the issued SVID and key: the type of store, the name to give the secret, and any attribute needed by the specific service used.

This is just a proof of concept of how this can be implemented. Any feedback is greatly appreciated!
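
For illustration, this is roughly how registering a "storable" entry might look through the Entry API using the spire-api-sdk Go bindings. The StoreSvid field marks the entry for SVIDStore plugins; the selector value is illustrative only, since each SVIDStore plugin defines its own selector schema.

package main

import (
    "context"
    "log"

    entryv1 "github.com/spiffe/spire-api-sdk/proto/spire/api/server/entry/v1"
    "github.com/spiffe/spire-api-sdk/proto/spire/api/types"
    "google.golang.org/grpc"
)

func main() {
    // Assumes the SPIRE Server API is reachable on its local UNIX socket.
    conn, err := grpc.Dial("unix:///tmp/spire-server/private/api.sock", grpc.WithInsecure())
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    _, err = entryv1.NewEntryClient(conn).BatchCreateEntry(context.Background(),
        &entryv1.BatchCreateEntryRequest{
            Entries: []*types.Entry{{
                ParentId: &types.SPIFFEID{TrustDomain: "example.org", Path: "/agent"},
                SpiffeId: &types.SPIFFEID{TrustDomain: "example.org", Path: "/lambda/my-function"},
                // Mark the entry as "storable" so SVIDStore plugins receive it.
                StoreSvid: true,
                // Illustrative selector describing where to store the SVID.
                Selectors: []*types.Selector{
                    {Type: "aws_secretsmanager", Value: "secretname:my-function-svid"},
                },
            }},
        })
    if err != nil {
        log.Fatal(err)
    }
}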

@mweissbacher
Contributor

@amartinezfayo thank you for the update! Agree that the built-in version implemented as a plugin seems favorable over a stand-alone helper program. The video looks great. I'm assuming rotation of the certificates is the same as with other plugin types, at half-life? Thank you for working on this!

@amartinezfayo
Member Author

I'm assuming rotation of the certificates is the same as with other plugin types, at half-life?

@mweissbacher Correct. This is designed to run on top of the cache manager implementation, so it just looks at the selectors to know which SVIDs must be pushed to an external store when they are being updated. This is completely agnostic of the SVID update logic.

@azdagron
Member

Thanks for putting this together @amartinezfayo and @MarcosDY !

This proposal has been reviewed by the maintainers and is pretty well received. Solving this problem is going to be an awesome boon to SPIRE adoption and flexibility. I think the general consensus at this point is to let this proposal marinate in our minds for a bit to make sure there isn't anything we're missing. Thank you for your patience.

@amartinezfayo
Member Author

@azdagron
Member

I wonder if we should introduce a new field (e.g. export_to) in the registration entry that the agent can use to route the entry to an exporter (i.e. SVIDStore) plugin instead of relying on parsing selectors...

@azdagron
Member

Any comments on the above thought (i.e. export_to)? If we do this, it means a new field on the Entry protobuf, which will require a database migration. If we're solid on that plan, we might want to introduce the new column in 1.0.0 so that this feature can ship in 1.1.0....

@evan2645 @amartinezfayo @rturner3 @APTy

@JackOfMostTrades

Does the serverless architecture support still aim to provide a way for a workload to "attest directly to SPIRE Server to obtain its identity" as in the proposal at the top, or has that been dropped in favor of just using SVIDStore plugins? (I was hoping those endpoints would also give me a way to solve #1784.)

For us, the underlying problem with relying on a secret store is that our non-cloud container orchestration first establishes container identity and then (inside the container's namespace) bootstraps secrets using that identity. So we of course wouldn't be able to reverse that to establish our secret distribution first.

@evan2645
Member

Does the serverless architecture support still aim to provide a way for a workload to "attest directly to SPIRE Server to obtain its identity" as in the proposal at the top, or has that been dropped in favor of just using SVIDStore plugins?

I think that we still want to do this, however community feedback has steered prioritization towards the SVIDStore solution first... so while I can say relatively confidently that the project wants the ability to (easily) attest directly to SPIRE Server, I don't know when that work might be picked up. I think we need to scope it first. We certainly want to make sure your use case is supported... any chance you or someone you know would be willing to contribute it @JackOfMostTrades?

@JackOfMostTrades

I think that we still want to do this, however community feedback has steered prioritization towards the SVIDStore solution first...

Cool, definitely understand the prioritization of solutions targeting the more common cloud provider use-cases, just wanted to check if the project would still be supportive of an architecture that would solve for a direct attestation use case.

any chance you or someone you know would be willing to contribute it @JackOfMostTrades?

I'm lining up some short-to-medium term tasks now, so depending on the timing it might be something we take on. :)

@evan2645
Member

Cool, definitely understand the prioritization of solutions targeting the more common cloud provider use-cases, just wanted to check if the project would still be supportive of an architecture that would solve for a direct attestation use case.

Yes. We are learning of other interesting direct attestation use cases too, like confidential computing.

I'm lining up some short-to-medium term tasks now, so depending on the timing it might be something we take on. :)

Awesome, please do let us know, I'm happy to coordinate such efforts on the SPIFFE/SPIRE side of the house

@azdagron
Member

@amartinezfayo @MarcosDY. While reviewing #2176 and thinking about different ways to structure the cache, I realized that I've been under the impression that everything needed to store a single identity would be represented in a single selector, but that maybe the proposal was advocating for something else?

Can you shed some light on the proposed shape of the selectors? I think there are some clear implementation wins if a selector is self contained but want to make sure I'm not missing something.

@MarcosDY
Collaborator

My initial idea was to rely on multiple selectors, which can be useful in case we want to provide something more than a name.
A possible example is to have something like:

{
   ParentID: "spiffe://example.org/agent",
   SpiffeID: "spiffe://example.org/awsidentity",
   Selectors: []string{
         "name:secret1",
         "kmskeyid:SOME_ID",
         "region:SOME_REGION",
   }
}

With something like that we can 'configure' the secret when creating it, and create it in a specific region instead of in all the regions configured on the AWS plugin.

At the same time, each platform has different configuration options that can be useful, and allowing multiple selectors allows more customization.

However, we could of course put all that information into a single selector and separate the values with ':'.
That would make the implementation easier.
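
To make the tradeoff concrete, here is a small Go sketch under hypothetical selector shapes: with multiple selectors each attribute is a separate, independently filterable value, while a single combined selector pushes the parsing into the plugin.

package main

import (
    "fmt"
    "strings"
)

// parseCombined unpacks a single selector value that packs key:value pairs.
func parseCombined(value string) map[string]string {
    attrs := map[string]string{}
    parts := strings.Split(value, ":")
    for i := 0; i+1 < len(parts); i += 2 {
        attrs[parts[i]] = parts[i+1]
    }
    return attrs
}

func main() {
    // Multiple selectors: each attribute stands alone and can be filtered on.
    for _, s := range []string{"name:secret1", "kmskeyid:SOME_ID", "region:SOME_REGION"} {
        kv := strings.SplitN(s, ":", 2)
        fmt.Printf("%s = %s\n", kv[0], kv[1])
    }
    // Single combined selector: the plugin must split the value itself.
    fmt.Println(parseCombined("name:secret1:kmskeyid:SOME_ID:region:SOME_REGION"))
}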

@MarcosDY
Collaborator

But what is better/easier for a user? If we put everything into a single selector, how can users filter by selector?
Setting multiple values allows easier filtering, for example searching for all entries that use kmskeyid=1234.

@nzxdrexel

I am very excited to see serverless support for SPIRE. Do you have an estimated timeline for when this feature will be ready?

@amartinezfayo
Member Author

Hi @nzxdrexel! We plan to include this feature in SPIRE 1.1.0. We are close to releasing 1.0.0, and after that we should be able to start merging the different pieces of this work.

@MarcosDY
Collaborator

MarcosDY commented Aug 23, 2021

In the PoC we initially used X509SVIDResponse and stored it as binary proto.
That has the downside that the clients of those secrets must understand proto messages and parse them in order to get the DER certificates and keys.

Trying to make users' lives easier, I'm thinking of two ways to simplify it.

X509SVIDResponse as JSON
Serialize X509SVIDResponse as JSON and store that.

PROS

  • We already have a proto that we can use.
  • Using JSON, users can extract certificates and keys with jq
  • Single []byte

CONS

  • X509SVIDResponse provides "DER" certificates/keys; users who want PEM will need to convert them

New JSON
Create a new JSON schema that we can use to provide identities, but in PEM format.

PROS

  • We can provide certificates/keys in PEM format
  • Using JSON, users can extract certificates and keys with jq
  • Single []byte

CONS

  • A new schema that we must take care of

Proposal

{
   # List of X.509 SVIDs, each with its private key and the bundle for its trust domain
   "svids": [
      {
         # The SPIFFE ID that identifies this SVID
         "spiffeID": "spiffe://example.org/workload",
         # PEM encoded certificate chain. MAY include intermediates;
         # the leaf certificate (or SVID itself) MUST come first
         "x509SVID": "CERT_PEM",
         # PEM encoded PKCS#8 private key
         "x509SvidKey": "KEY_PEM",
         # PEM encoded X.509 bundle for the trust domain
         "bundle": "BUNDLE_PEM"
      }
   ],
   # CA certificate bundles belonging to foreign trust domains that the workload should trust,
   # keyed by trust domain. Bundles are encoded in PEM format.
   "federatedBundles": {
      "spiffe://federated.test": "PEM_CERT",
      "spiffe://another.test": "PEM_CERT"
   }
}

@MarcosDY
Collaborator

Updated JSON proposal:

  • allow a single SVID per JSON document
  • keep the trust domain bundle separate from the federated bundles.

{
   # The SPIFFE ID that identifies this SVID
   "spiffeID": "spiffe://example.org/workload",

   # PEM encoded certificate chain. MAY include intermediates;
   # the leaf certificate (or SVID itself) MUST come first
   "x509SVID": "CERT_PEM",

   # PEM encoded PKCS#8 private key
   "x509SvidKey": "KEY_PEM",

   # PEM encoded X.509 bundle for the trust domain
   "bundle": "BUNDLE_PEM",

   # CA certificate bundles belonging to foreign trust domains that the workload should trust,
   # keyed by trust domain. Bundles are encoded in PEM format.
   "federatedBundles": {
      "spiffe://federated.test": "PEM_CERT",
      "spiffe://another.test": "PEM_CERT"
   }
}
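
If a secret with this layout were stored in AWS Secrets Manager, a function could consume it along these lines. This is a sketch assuming aws-sdk-go (v1) and a hypothetical secret name; the JSON field names mirror the proposal above.

package identity

import (
    "context"
    "crypto/tls"
    "encoding/json"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/secretsmanager"
)

// storedSVID mirrors the single-SVID JSON layout proposed above.
type storedSVID struct {
    SPIFFEID         string            `json:"spiffeID"`
    X509SVID         string            `json:"x509SVID"`
    X509SVIDKey      string            `json:"x509SvidKey"`
    Bundle           string            `json:"bundle"`
    FederatedBundles map[string]string `json:"federatedBundles"`
}

func loadIdentity(ctx context.Context, secretName string) (*tls.Certificate, error) {
    sm := secretsmanager.New(session.Must(session.NewSession()))
    out, err := sm.GetSecretValueWithContext(ctx, &secretsmanager.GetSecretValueInput{
        SecretId: aws.String(secretName),
    })
    if err != nil {
        return nil, err
    }
    var s storedSVID
    if err := json.Unmarshal([]byte(aws.StringValue(out.SecretString)), &s); err != nil {
        return nil, err
    }
    // Both fields are PEM; X509KeyPair accepts a chain plus a PKCS#8 key.
    cert, err := tls.X509KeyPair([]byte(s.X509SVID), []byte(s.X509SVIDKey))
    if err != nil {
        return nil, err
    }
    return &cert, nil
}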

@SilvaMatteus
Contributor

Hi @MarcosDY, thank you for your effort on this.

I would say the update in the JSON proposal is even better and helpful for users. I think it's ok to assume that if you want another SVID, you can push another secret.

About the constraints for the Agent's initialization: today at least one workload attestor plugin is needed, right?
But now that we will have the SVIDStore plugins...

Would it be a good time to change the restriction to something like "at least one workload attestor plugin or one SVIDStore plugin"?

@hellerda

Has there been any thought to supporting JWT SVID for SVIDStore, in addition to x509? It seems doable, but I didn't see any mention in the discussion surrounding SVIDStore.

A difference here is that x509 SVIDs are issued immediately upon creating the workload entry at the server, whereas JWT SVIDs are not issued until a workload does a fetch to the Agent. This allows the workload to request any "aud" claim, but does not provision the SVID in advance. A workaround would be to do a "fetch" for any audience the workload will require, after the entry is created, to cause the SVID(s) to be stored for that workload. A better solution would be to be able to specify an "-audiences" list via a command line option for the "entry create" to force those JWT SVID to be provisioned at that time. Does this seem doable?

If there is a better place to inquire on this pls let me know.

@hellerda

I was also wondering if any thought has been given to adding a NodeAttestor plugin based on the IAM role auth / signed GetCallerIdentity request mentioned here and in surrounding issues (#558, #1784). It would use the method developed by Vault (https://www.vaultproject.io/docs/auth/aws), which is also used by Kubernetes (https://github.com/kubernetes-sigs/aws-iam-authenticator).

I know this was discussed in the context of serverless pull, but it seems valuable to have as a simple NodeAttestor as well. It would support Agents running in any AWS runtime, not just EC2. It could also support on-prem nodes, as long as they have access to the AWS API and have some AWS identity configured to start.

There are some pitfalls mentioned here (https://googleprojectzero.blogspot.com/2020/10/enter-the-vault-auth-issues-hashicorp-vault.html), along with a couple of CVEs, but it looks like these have been solved. It does point out that the server-side code would have to be written carefully to avoid any vulnerability.

Does this seem like a worthwhile endeavor? Or is there sufficient functionality in the existing NodeAttestors to support agents running in Amazon ECS or Lambda?

@amartinezfayo
Member Author

Has there been any thought to supporting JWT SVID for SVIDStore, in addition to x509?

We have been thinking about ways to support JWT-SVIDs in SVIDStore plugins since the introduction of the plugin type, but we haven't yet found a way to achieve that without breaking some of SPIRE's core principles for JWT issuance.

A workaround would be to do a "fetch" for any audience the workload will require, after the entry is created, to cause the SVID(s) to be stored for that workload.

This fetch action would presumably be done by a regular workload that's being attested by an agent. One design principle that we have is that registration entries that are marked with the storeSVID property are only pushed to the designated store by the agent; they are not eligible to be fetched by workloads (which would also require selectors that match the workload).

A better solution would be to be able to specify an "-audiences" list via a command line option for the "entry create" to force those JWT SVID to be provisioned at that time.

Adding this kind of information, specific to one SVID format, to a registration entry would also break SPIRE's model. In any case, having an enumeration of all the possible audiences doesn't seem great.

Unfortunately, we don't yet have a clear path forward to support the issuance of JWT-SVIDs through SVIDStore plugins. The main challenges are intrinsic to the format itself. We encourage the use of a single audience in JWT-SVIDs to limit the scope of replay-ability. We also think that JWT-SVIDs must have an aggressive expiration value. Handling rotation in the context of SVIDStore would also be challenging.

@hellerda

hellerda commented Feb 8, 2022

Thanks for your reply, @amartinezfayo. Since you said you were considering it I took the liberty of opening #2752 as a Feature Request.

On fetching the JWT SVID from the workload:

This fetch action would presumably be done by a regular workload that's being attested by an agent. One design principle that we have is that registration entries that are marked with the storeSVID property are only pushed to the designated store by the agent; they are not eligible to be fetched by workloads (which would also require selectors that match the workload).

Yes, and for this reason this would not be the best way. It would also require some code at the workload to drive the Workload API, and one of the advantages of SVIDStore is that there is no additional code at the workload, except what is required to pull the secret from the cloud provider's secrets manager.

On passing an "-audiences" list to the "entry create" command: this seems like the better way.

Adding this kind of information to a registration entry, that is specific to an SVID format would also break SPIRE's model. In any case, having an enumeration of all the possible audiences doesn't seem to be great.

Can you elaborate on your concern here? It doesn't seem so odd to have a configured "audiences" list associated with the reg entry. In fact, I can envision other uses for such a list, like as a "whitelist" of permitted audiences to request. In the case of SVIDStore, it's all about pre-populating the SVIDs the workload will need in the cloud provider's secrets manager. Does this approach not fit?

@dpogorzelski

Hey everyone, a question on the "push" approach. From a practical standpoint, is there any preference/idea as to how the agent should be handled in this scenario? Should it be running on a dedicated node? On the same node as the server? Or perhaps it's ok to leverage an existing agent instance rather than creating a "dedicated" one for serverless purposes?
Feels a bit like the agent is somewhat of a "necessary inconvenience" in this scenario so that there is some mechanism that can push SVIDs.
If I misunderstood something, feel free to correct me :)

@amartinezfayo
Member Author

Hi @dpogorzelski!

Should it be running on a dedicated node? On the same node as the server? Or perhaps it's ok to leverage an existing agent instance rather than creating a "dedicated" one for serverless purposes?

Agents should certainly not run on the same node as the server. The server is the most sensitive component of the system, and special attention should be paid to its placement, particularly if the agent would provide identities to other workloads in addition to the serverless workloads. There is an agent/server security boundary, so servers should be placed on hardware that is distinct from the untrusted workloads they are meant to manage.

I think that it is ok to leverage an existing agent instance to push identities to a designated store. It all depends on the level of separation that you want for the identities. Agents are authorized to manage an identity when they are referenced as a parent of that identity (the registration entry has the Parent ID of the agent). It is a good idea to scope registration entry Parent IDs as tightly as is reasonably possible. That means you may or may not need a dedicated agent to issue identities to serverless workloads. My personal preference for this scenario would be a dedicated agent, which I think makes things easier in terms of administration and tracing possible issues.

@evan2645
Member

Feels a bit like the agent is somewhat of a "necessary inconvenience" in this scenario so that there is some mechanism that can push SVIDs.

I have been wondering if it makes sense to support SVIDStore (or something like it) on the server too.

@hellerda

I think the current approach of storing via the agent is fine. It seems to make sense to let the server be the issuer, and the agent the provisioner.

But the agent deployment strategy is clearly different with SVIDStore. In the conventional case, the goal is to place the agent close to the workload so the workload attestor can function. With SVIDStore this constraint is relieved, and agent proximity to the workload doesn't really matter (as long as both can access the same secrets store).

For SVIDStore I am thinking of a strategy with one agent per cloud provider region. If agents are grouped (by selectors) such that the agent alias represents an agent in one region, all workload entries associated with that alias would be stored in that region only. If agents are grouped such that the alias represents agents in more than one region, the associated set of SVIDs would be stored in each region. Is that correct?

@amoore877
Member

amoore877 commented May 2, 2022

It seems to make sense to let the server be the issuer, and the agent the provisioner.
...
But the agent deployment strategy is clearly different with SVIDStore.

Agreed, though for me this also opens up revisiting the idea of "why not specifically have a new type of Agent for SVIDStore". In my own potential usages, I don't think I will ever have an agent both acting "traditionally" and executing the SVIDStore feature, for the separation-of-security-responsibilities aspects previously mentioned and to de-couple these disparate behaviors.

To me this feels like an opportunity for slimming down the amount of potential moving parts in both cases. I know that then creates a new SPIRE binary requiring maintainer support, though I'd be curious how different that is from the support needed to maintain this feature in general inside a now more complex Agent.

@hellerda

hellerda commented May 5, 2022

I agree that, due to the different deployment strategies, conventional vs. SVIDStore, it's unlikely an agent would serve both purposes well. That said, I can see some cases where it would be useful: for smaller deployments, or when an agent serves the conventional case but a few workloads prefer (for whatever reason) to get their SVIDs via the cloud-provider store. Maybe the workload can't easily add the SPIRE client code, or maybe it just fits into some usage pattern where the deployer prefers to use SVIDStore.

This is probably a topic for a different thread, but where I see the value in slimming down is for deployments where an agent might be squeezed into an ECS sidecar, or possibly a Lambda extension. I'd be interested to know if anyone has tested that, or considered it.

But I would like to dig deeper into @dpogorzelski's question of how to handle the agent with SVIDStore. I am thinking of a deployment strategy with one agent per cloud provider region. Alternately, an agent could be deployed on a per-resource group basis, where it is spun up when the resource group is created, and torn down when removed. While it's not necessary to do this for an SVIDStore agent, there may be value in logically associating an agent with a resource group and the workloads in that group it will serve. @amartinezfayo, would you care to chime in?

@amartinezfayo
Member Author

I am thinking of a deployment strategy with one agent per cloud provider region. Alternately, an agent could be deployed on a per-resource group basis, where it is spun up when the resource group is created, and torn down when removed.

I think that the strategy of having one agent per cloud provider region makes sense. It seems reasonable to me to have that kind of separation / organization. As a side note, the current implementation restricts pushing to a single region for the same plugin name (i.e. the aws_secretsmanager plugin has the region configured at the plugin level, and you can't have multiple SVIDStore "aws_secretsmanager" {...} sections in your agent config), which means that the agent will be tied to a single AWS region.

Deploying on a per-resource group basis also makes sense to me. It all depends on the specific requirements of each deployment.

Thank you for bringing this to discussion. I think that the community in general is still learning about the best approaches to take when leveraging SVIDStore plugins. If you can share your learnings along the way, with whatever approach you take, that would be awesome!

@amartinezfayo
Member Author

Agreed, though for me this also opens up revisiting the idea of "why not specifically have a new type of Agent for SVIDStore". In my own potential usages, I don't think I will ever have an agent both acting "traditionally" and executing the SVIDStore feature, for the separation-of-security-responsibilities aspects previously mentioned and to de-couple these disparate behaviors.

Thank you @amoore877 for providing your perspective. One of the alternatives that we considered at the beginning of the exploration of serverless support in SPIRE was the option of having a helper program like that.
Our assessment of that approach at the time was that it would re-implement agent functionality, such as attestation with the server and the logic around the set of identities it would be authorized to manage. It would also need to implement some kind of pluggable mechanism to be able to work with the different vendors where the secrets can be stored. But that was only an assessment based on some possible designs that we had in mind.

To me this feels like an opportunity for slimming down the amount of potential moving parts in both cases. I know that then creates a new SPIRE binary requiring maintainer support, though I'd be curious how different that is from the support needed to maintain this feature in general inside a now more complex Agent.

IMO the agent didn't gain a lot of complexity with the addition of the SVIDStore plugin type. Most of the logic is encapsulated in the storecache package. But I think that the separation of duties is a fair point.
In terms of maintenance, it's difficult to assess which would be more involved to maintain. My personal feeling is that the current implementation doesn't add a lot of overhead to the agent, and maintaining a separate project feels more involved; even things like updating dependencies due to security issues, releasing artifacts, and keeping the changelog and documentation up to date count.

Having said that, I think that it is worth exploring that avenue more deeply. It's probably better to have a separate issue for that, so if you would like to see that option explored, it would be great if you could open an issue with your thoughts.

@azdagron
Member

This work has been completed for some time. If there are new insights or issues with the serverless support in SPIRE, please file new issues to track those.
