[RFC] Serverless architecture support #1843
Thank you @amartinezfayo and @MarcosDY for putting this together - it is a badly needed feature. There is some prior art from Square here: https://developer.squareup.com/blog/providing-mtls-identities-to-lambdas/ Also an old(er) issue: #986 I am wondering if you have considered a "push" approach rather than a "pull" approach, e.g. by pushing SVIDs into platform-specific secret stores rather than having functions pull SVIDs from the SPIRE infrastructure. I see a few advantages to push:
I'm not particularly well versed on the auth side of things, but I did want to comment on the approach for AWS Lambda for the sake of more complete information. The aws-iam-authenticator does something similar, and I recently read about some learnings regarding its shortcomings. I omitted a couple of kubernetes/EKS-oriented items.
I believe first-class alternatives will require features from IAM, so this may still need to be the path forward while accepting caveats. Is it fair to say this also begins to address non-unique node attestation #558?
Thank you @evan2645 for your feedback.
We discussed this topic during the SIG-SPIRE call on 9/17/2020, but I wanted to summarize my thinking here. Both approaches certainly have pros and cons, making each a good choice under certain circumstances and a poor or impossible choice in others. Since SPIRE runs in a broad range of environments, we believe that there is room for exploring both types of implementations. The ultimate goal of this RFC is to collect feedback that can tell us whether the proposed approach is useful for a variety of use cases, and if it is, to work on an implementation based on it.
Thank you @efe-selcuk for your observations.
We will be working on a more detailed proposal, so the points you raise are all valuable information. We may explore ways to work around them.
Yes, I think that this proposal goes in the direction of addressing the use cases discussed in that issue.
Thank you @amartinezfayo and @MarcosDY for working on this, looking forward to serverless support in SPIRE! I worked on bringing SPIFFE certificates to Lambda for Square (@evan2645 mentioned the blog post) and I wanted to expand on some of the reasons that made us pick push over pull. Blocking on the SPIRE server to issue identity is both a performance and an availability concern. We asked ourselves whether there would be a strong security benefit to attesting on startup vs. issuing ahead of time and using a locked-down secure storage mechanism. The conclusion we came to is that these would be equivalent and there was no upside to attestation on pull. By storing identity in Secrets Manager, one can use IAM policies and/or SCPs to restrict access. The reasons developers choose serverless are performance and scalability (among others). Expanding cold start time could be prohibitive for some workloads, not to mention downtime of the SPIRE server would impact availability.
It's not uncommon for serverless functions to rely on a secrets store; i.e., if the store is down the function might not be able to perform regardless, so I don't think this would necessarily be a new dependency. Another minor point: you listed the main platforms, but K8s has multiple serverless implementations, and supporting them all could be tricky. On the other hand, writing to k8s secrets and making those secrets accessible to functions could solve for all of them. To summarize: I think the push model would make for a more reliable and performant solution, be equivalent from a security perspective, and come with less code complexity.
Thank you @mweissbacher for the feedback, it's really helpful! We are exploring all the options, including what a push model would look like. We should be able to share an updated proposal in the upcoming days.
Looking forward to reading it! Also, while we tried to be verbose in the serverless identity article, I'm happy to discuss on a call or here if you have questions.
Hi folks, chiming in from an infrastructure and serverless perspective. I worked with @mweissbacher on our Lambda mTLS implementation and drove many of our design decisions that happened within the function itself. The first thing I'd like to do is clear up the misconception that by using the pull model, you will be able to get away with not having some sort of "agent" run inside the function. Developers will want to wrap the logic to pull the certificate in a library or framework. No one is going to write that RPC logic bespoke in every single function they own. Additionally, with the recent release of AWS Lambda Extensions, these libraries can run out-of-process for a huge performance boost (our tests showed a 30% reduction in cold start time). So even if there is a pull model, serverless developers will gravitate towards something like background processes or libraries. My team owns libraries for doing mTLS, and other things, within Lambda, and we actively choose solutions that reduce the amount of code we need to write to do the same thing in various programming languages. While the SPIRE developers may not develop an agent, the community will, because it makes sense. Serverless applications are all about abstracting away these kinds of concerns, not adding additional boilerplate code to every single function. I'd also like to ask some questions about the mechanics of how this pull mechanism would work. Firstly, what identity would the SPIRE server return based on the assumed role? How is that controlled? I ask because we tie each AWS "account" to an application identity, so lambdas within that account are treated the same when doing mTLS. Will SPIRE support multiple IAM roles being given the same identity? We encourage teams to customize the execution role for each function to adhere to the principle of least privilege. We do not want every lambda in an account to execute as the same role.
We also have several hundred accounts, so mapping this by hand is a non-starter. Secondly, this proposal sounds like it expects the functions to be running in the same VPC as the SPIRE server. Many companies have a vpc-per-account model, where networking and permissions must be explicitly set up to cross those boundaries. We have a Shared VPC, but we currently restrict lambda traffic to envoy, AWS APIs, and our internal proxy. We would need to automate the setup of networking rules to allow traffic to hit the SPIRE server, which lives in a separate account. Choosing to add the SPIRE server to the function's critical path complicates setup and debugging (cross-account debugging is particularly painful). Can you talk more about the networking considerations you made in this design? Lastly, a comment. When it comes to availability, I expect native cloud provider tools to have better uptime than most things developers deploy on top of a cloud provider. Explicitly adding a non-cloud-native dependency when a cloud-native one exists and works is not something we see a lot of. Serverless apps rely heavily on cloud-native tools already, so adding one more is not a big deal. Like Michael, I am really excited to see this proposal and the resulting technology. Being able to perform mTLS within lambdas has been a huge win for us, so I look forward to this being easy for other organizations and projects to benefit from.
Thank you @mtitolo for your comment, it's very valuable for us to get this kind of feedback.
It is the intention of this proposal to leverage any mechanism that improves performance and facilitates the development experience, as can be done with AWS Lambda Extensions. In the pull model we contemplated the use of layers that can package the agentless attestation functionality (including caching), and I think that the use of AWS Lambda Extensions could also be very beneficial. I don't think that the pull model precludes the use of these mechanisms.
The model for the identity issuance that we have in mind is similar to the agent attestation model, where the agent gets an identity based on the attested properties (selectors). In the case of AWS Lambda, the automatically issued identities may have a form like this:
This is a good point. We were wondering whether this would really be an issue for most users; it looks like having to add SPIRE Server to the function's critical path is in fact a real concern. We may think of ways to work around this problem, but it seems intrinsic to the pull model and must be noted as one of its cons. We are actively working on enhancing the original proposal with more details, including the push model, so we can compare the pros and cons of both.
Based on all the feedback received, we explored some alternatives using a push model. We explored options that include the development of external helper programs, and also options that introduce built-in support in SPIRE. We are particularly optimistic about the latter, so we created a proof of concept that adds support for serverless computing (like AWS Lambda) in SPIRE through SPIRE Agent plugins (
This is just a proof of concept of how this can be implemented. Any feedback is greatly appreciated!
@amartinezfayo thank you for the update! Agree that the built-in version implemented as a plugin seems favorable over a stand-alone helper program. The video looks great. I'm assuming rotation of the certificates is the same as with other plugin types, at half-life? Thank you for working on this!
@mweissbacher Correct. This is designed to run on top of the cache manager implementation, so it just looks at the selectors to know which SVIDs must be pushed to an external store when they are updated. This is completely agnostic of the SVID update logic.
Thanks for putting this together @amartinezfayo and @MarcosDY! This proposal has been reviewed by the maintainers and is pretty well received. Solving this problem is going to be an awesome boon to SPIRE adoption and flexibility. I think the general consensus at this point is to let this proposal marinate in our minds for a bit to make sure there isn't anything we're missing. Thank you for your patience.
This is a fork of SPIRE with the POC that is being developed:
I wonder if we should introduce a new field (e.g. |
Any comments on the above thought (i.e. |
Does the serverless architecture support still aim to provide a way for a workload to "attest directly to SPIRE Server to obtain its identity" as in the proposal at the top, or has that been dropped in favor of just using the store-based flow? For us, the underlying problem with relying on a secret store is that our non-cloud container orchestration first establishes container identity and then (inside the container's namespace) bootstraps secrets using that identity. So we of course wouldn't be able to reverse that to establish our secret distribution first.
I think that we still want to do this, however community feedback has steered prioritization towards the SVIDStore solution first... so while I can say relatively confidently that the project wants the ability to (easily) attest directly to SPIRE Server, I don't know when that work might be picked up. I think we need to scope it first. We certainly want to make sure your use case is supported... any chance you or someone you know would be willing to contribute it @JackOfMostTrades?
Cool, definitely understand the prioritization of solutions targeting the more common cloud provider use-cases, just wanted to check if the project would still be supportive of an architecture that would solve for a direct attestation use case.
I'm lining some short-to-medium term tasks now, so depending on the timing it might be something we take on. :) |
Yes. We are learning of other interesting direct attestation use cases too, like confidential computing.
Awesome, please do let us know, I'm happy to coordinate such efforts on the SPIFFE/SPIRE side of the house |
@amartinezfayo @MarcosDY. While reviewing #2176 and thinking about different ways to structure the cache, I realized that I've been under the impression that everything needed to store a single identity would be represented in a single selector, but maybe the proposal was advocating for something else? Can you shed some light on the proposed shape of the selectors? I think there are some clear implementation wins if a selector is self-contained, but I want to make sure I'm not missing something.
My initial idea was to rely on multiple selectors, which can be useful in case we want to provide something more than a name,
With something like that we can 'configure' the secret when creating it, and create it in a specific region instead of all regions configured on the AWS plugin. At the same time, each platform has different configurations that can be useful, and allowing multiple selectors allows more customization. However, we can of course add all that information into a single selector and separate values with
But what is better/easier for a user? In case we put everything in a single selector, how can users filter by selector?
I am very excited to see serverless support for SPIRE. Do you have an estimated timeline for when this feature will be ready?
Hi @nzxdrexel! We plan to be able to include this feature in SPIRE 1.1.0. We are close to release 1.0.0 and after that we should be able to start merging the different pieces of this work. |
In the POC we initially started using X509SVIDResponse and stored it as a proto binary. To make users' lives easier, I'm thinking of two ways to simplify it.
X509SVIDResponse as JSON
PROS
CONS
New JSON
PROS
CONS
Proposal
Update JSON proposal:
Hi @MarcosDY, thank you for your effort on this. I would say the updated JSON proposal is even better and more helpful for users. I think it's ok to assume that if you want another SVID, you can push another secret. About the constraints for the Agent's initialization: today at least one workload attestor plugin is needed, right? Would it be a good time to change the restriction to something like "at least one workload attestor plugin or one storeSVID plugin"?
Has there been any thought to supporting JWT SVIDs for SVIDStore, in addition to x509? It seems doable, but I didn't see any mention in the discussion surrounding SVIDStore. A difference here is that x509 SVIDs are issued immediately upon creating the workload entry at the server, whereas JWT SVIDs are not issued until a workload does a fetch to the Agent. This allows the workload to request any "aud" claim, but does not provision the SVID in advance. A workaround would be to do a "fetch" for any audience the workload will require, after the entry is created, to cause the SVID(s) to be stored for that workload. A better solution would be to be able to specify an "-audiences" list via a command line option on "entry create" to force those JWT SVIDs to be provisioned at that time. Does this seem doable? If there is a better place to inquire about this, please let me know.
I was also wondering whether any thought has been given to adding a NodeAttestor plugin based on the IAM role auth / signed GetCallerIdentity request mentioned here and in surrounding issues (#558, #1784). It would use the method developed by Vault (https://www.vaultproject.io/docs/auth/aws), which is also used by Kubernetes (https://github.com/kubernetes-sigs/aws-iam-authenticator). I know this was discussed in the context of serverless pull, but it seems valuable to have as a simple NodeAttestor as well. It would support Agents running in any AWS runtime, not just EC2. It could also support on-prem nodes, as long as they have access to the AWS API and have some AWS identity configured to start. There are some pitfalls mentioned here (https://googleprojectzero.blogspot.com/2020/10/enter-the-vault-auth-issues-hashicorp-vault.html), along with a couple of CVEs, but it looks like this has been solved. Still, it points out that the server-side code would have to be written carefully to avoid any vulnerability. Does this seem like a worthwhile endeavor? Or is there sufficient functionality in the existing NodeAttestors to support agents running in Amazon ECS or Lambda?
We have been thinking about ways to support JWT-SVIDs in SVIDStore plugins since the introduction of the plugin type, but we haven't yet found a way to achieve that without breaking some of SPIRE's core principles for JWT issuance.
This fetch action would presumably be done by a regular workload that's being attested by an agent. One design principle that we have is that registration entries that are marked with the
Adding this kind of information, specific to an SVID format, to a registration entry would also break SPIRE's model. In any case, having to enumerate all the possible audiences doesn't seem great. Unfortunately, we don't yet have a clear path forward to support the issuance of JWT-SVIDs through SVIDStore plugins. The main challenges are intrinsic to the format itself. We encourage the use of a single audience in JWT-SVIDs to limit the scope of replay-ability. We also think that JWT-SVIDs must have an aggressive expiration value. Handling rotation in the context of SVIDStore would also be challenging.
Thanks for your reply, @amartinezfayo. Since you said you were considering it I took the liberty of opening #2752 as a Feature Request. On fetching the JWT SVID from the workload:
Yes, and for this reason this would not be the best way. It would also require some code at the workload to drive the Workload API, and one of the advantages of SVIDStore is that there is no additional code at the workload, except what is required to pull the secret from the cloud provider's secrets manager. On passing an "-audiences" list to the "entry create" command: this seems like the better way.
Can you elaborate on your concern here? It doesn't seem so odd to have a configured "audiences" list associated with the reg entry. In fact I can envision other uses for such a list, like a "whitelist" of permitted audiences to request. In the case of SVIDStore, it's all about pre-populating the SVIDs the workload will need in the cloud provider's secrets manager. Does this approach not fit?
Hey everyone, a question on the "push" approach. From a practical standpoint, is there any preference/idea as to how the agent should be handled in this scenario? Should it be running on a dedicated node? On the same node as the server? Or perhaps is it ok to leverage an existing agent instance rather than creating a "dedicated" one for serverless purposes?
Hi @dpogorzelski!
Agents should certainly not run on the same node as the server. The server is the most sensitive component of the system, and special attention should be paid to its placement, particularly if the agent would provide identities to other workloads in addition to the serverless workloads. There is an agent/server security boundary, so servers should be placed on hardware that is distinct from the untrusted workloads they are meant to manage. I think that it is ok to leverage an existing agent instance to push identities to a designated store. It all depends on the level of separation that you want for the identities. Agents are authorized to manage an identity when they are referenced as a parent of that identity (the registered entry has the Parent ID of the agent). It is a good idea to scope registration entry Parent IDs as tightly as is reasonably possible. That may mean that you may or may not need a dedicated agent to issue identities to serverless workloads. My personal preference in this scenario would be a dedicated agent, which I think makes administration and the tracing of possible issues easier.
I have been wondering if it makes sense to support SVIDStore (or something like it) on the server too. |
I think the current approach of storing via the agent is fine. It seems to make sense to let the server be the issuer, and the agent the provisioner. But the agent deployment strategy is clearly different with SVIDStore. In the conventional case, the goal is to place the agent close to the workload so the workload attestor can function. With SVIDStore this constraint is relieved, and agent proximity to the workload doesn't really matter (as long as both can access the same secrets store). For SVIDStore I am thinking of a strategy with one agent per cloud provider region. If agents are grouped (by selectors) such that the agent alias represents agents in one region, all workload entries associated with that alias would be stored in that region only. If agents are grouped such that the alias represents agents in more than one region, the associated set of SVIDs would be stored in each region. Is that correct?
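A per-region grouping along these lines could be expressed with SPIRE's registration CLI via node aliases. The commands below are illustrative only; the selector names in particular are assumptions and depend on the node attestor and SVIDStore plugin actually in use:

```shell
# Node alias grouping all agents in one region under a single parent ID.
# (Selector name is illustrative, not a verified aws_iid selector.)
spire-server entry create \
    -node \
    -spiffeID spiffe://example.org/agents/us-east-1 \
    -selector aws_iid:region:us-east-1

# Workload entry parented to that alias: only agents matching the alias
# would push this SVID, so it would land in us-east-1 only.
spire-server entry create \
    -parentID spiffe://example.org/agents/us-east-1 \
    -spiffeID spiffe://example.org/serverless/billing-fn \
    -storeSVID \
    -selector secretsmanager:secretname:billing-fn-svid
```

Under this sketch, an alias whose selectors match agents in multiple regions would cause each matching agent to push the SVID, yielding the multi-region storage behavior described above.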
Agreed, though for me this also opens up revisiting the idea of "why not specifically have a new type of Agent for SVIDStore". In my own potential usages, I don't think I will ever have an agent both acting "traditionally" and executing the SVIDStore feature, for the separation-of-security-responsibilities aspects previously mentioned and to de-couple these disparate behaviors. To me this feels like an opportunity for slimming down the amount of potential moving parts in both cases. I know that then creates a new SPIRE binary requiring maintainer support, though I'd be curious how different that is from the support needed to maintain this feature in general inside a now more complex Agent.
I agree that due to the different deployment strategies, conventional vs. SVIDStore, it's unlikely an agent would serve both purposes well. That said, I can see some cases where it would be useful: for smaller deployments, or when an agent is serving conventionally but there are a few workloads that are preferred (for whatever reason) to get their SVID via the cloud-provider store. Maybe the workload can't easily add the SPIRE client code, or maybe it just fits into some usage pattern where the deployer prefers to use SVIDStore. This is probably a topic for a different thread, but where I see value in slimming down is deployments where an agent might be squeezed into an ECS sidecar, or possibly a Lambda extension. I'd be interested to know if anyone has tested that, or considered it. But I would like to dig deeper on @dpogorzelski's question of how to handle the agent with SVIDStore. I am thinking of a deployment strategy with one agent per cloud provider region. Alternately, an agent could be deployed on a per-resource-group basis, where it is spun up when the resource group is created and torn down when the group is removed. While it's not necessary to do this for an SVIDStore agent, there may be value in logically associating an agent with a RG and the workloads in that group it will serve. @amartinezfayo, would you care to chime in?
I think that the strategy of having one agent per cloud provider region makes sense. It seems reasonable to me to have that kind of separation / organization. As a side note, the current implementation restricts pushing to a single region for the same plugin name (i.e. the Deploying on a per-resource-group basis also makes sense to me. It all depends on the specific requirements of each deployment. Thank you for bringing this up for discussion. I think that the community in general is still learning about the best approaches to take when leveraging SVIDStore plugins. If you can share your learnings along the way, with whatever approach you take, that would be awesome!
Thank you @amoore877 for providing your perspective. One of the alternatives that we considered at the beginning of the exploration of serverless support in SPIRE was the option to have a helper program like that.
IMO the agent didn't gain a lot of complexity with the addition of the SVIDStore plugin type. Most of the logic is encapsulated in the storecache package. But I think that the separation of duties is a fair point. Having said that, I think it is worth exploring that avenue more deeply. It's probably better to have a separate issue for that, so if you would like to see that option explored, it would be great if you could open an issue with your thoughts.
This work has been completed for some time. If there are new insights or issues with the serverless support in SPIRE, please file new issues to track those. |
[RFC] Serverless architecture support
Co-authored by @MarcosDY.
Background
Serverless computing allows building applications without managing infrastructure: the cloud service provider automatically provisions, scales, and manages the infrastructure required to run the code, eliminating the need for server software and hardware management by the developer.
The current model of workload attestation in SPIRE does not fit this software design pattern well: the execution context is a temporary runtime environment, unsuitable for running a SPIRE Agent that exposes the Workload API alongside the serverless function.
Proposal
In order to allow the issuance of SVIDs to workloads in a serverless environment, we need to provide a way to issue identities to the workload without using the Workload API. The workload would attest directly to SPIRE Server to obtain its identity. This means going through an attestation process similar to node attestation, but without granting an agent role to the attested serverless (and agentless) environment.
The attestation process would proceed similarly to the current `AttestAgent` server RPC, but through a new call that provides an "agentless" identity instead of a node identity in SPIRE.
The renewal process would also proceed similarly to the current `RenewAgent` RPC, where the caller presents an active "agentless" SVID returned by the attestation call, or the most recent one obtained from a previous renewal call. This avoids going through a complete attestation process when the environment already has a valid SVID that needs to be renewed. The criteria to decide whether the SVID should be rotated can be similar to the criteria currently adopted in SPIRE, i.e. rotate the SVID if it has less than half of its lifetime left.
The proposed solution should facilitate the issuance of identities in a performant manner, focusing on optimizing resource usage; otherwise the advantages of the serverless architecture could be diminished by the identity issuance process. To that end, this proposal tries to leverage common features of the cloud providers that aim to solve performance problems, like reusing the execution context if one is available from a previous function call.
The proposed process to obtain an identity in a serverless architecture is as follows:
If there is no valid SVID, call the "agentless" attestation RPC to get an identity.
If there is already a valid SVID, calculate its remaining lifetime; if less than half of its lifetime is left, call the renewal RPC, otherwise reuse the SVID.
Sample implementation
The following is a description of a sample implementation of the proposed process, including the changes needed in SPIRE and the components required in the serverless environment in order to be able to issue identities without having a SPIRE Agent deployed in the serverless environment.
SPIRE
Add new plugin types to perform the "agentless" attestation in SPIRE Server, with a new plugin for each provider that has a serverless offering. For example, there would be a plugin to support AWS Lambda, a plugin for Google Cloud Functions, a plugin for Microsoft Azure Functions, and plugins for any other platform. These are some possible workflows for the implementations:
AWS Lambda: the function signs a GetCallerIdentity query for the AWS Security Token Service (STS) using the AWS Signature v4 algorithm and sends it to SPIRE Server. The credentials used to sign the GetCallerIdentity request come from the AWS Lambda runtime environment variables, which avoids the need for an operator to manually provision credentials first. To attest the "agentless" workload, SPIRE Server sends the query to the AWS STS service to validate it and issues an SVID with a SPIFFE ID constructed from attributes extracted from the parsed signed query.
Google Cloud Functions: the function fetches its identity token using the Compute Metadata Server. The attestor plugin in SPIRE Server validates the token provided and issues an SVID with a SPIFFE ID constructed from attributes extracted from the parsed token.
Microsoft Azure: the function obtains its access token from the local token service. The attestor plugin in SPIRE Server validates the token provided and issues an SVID with a SPIFFE ID constructed from attributes extracted from the parsed token.
Attestation data structs are usually shared from `github.com/spiffe/spire/pkg/common/plugin/<plugin_name>`, which would be inconvenient to consume externally. Instead, the required types could be exposed through Protocol Buffers definitions under the proto/spire hierarchy.
It would be good to expose a library that can be used to facilitate the attestation process from the serverless environment. This library should expose interfaces to construct the attestation material, call the "agentless" attestation RPC in SPIRE Server, and ease the reuse of the issued SVID in case the state of the environment is preserved in a future invocation. It should also provide functionality to perform the SVID renewal process.
Serverless environment
The workload running in the serverless environment needs to be able to be attested without a running SPIRE Agent that exposes the Workload API. Instead, it calls an exposed RPC in SPIRE Server with attestation data that it retrieves from the execution runtime. As mentioned above, it would be convenient to have a library that can be consumed in the serverless environment to aid the attestation and identity issuance process.
With the aim of facilitating the implementation, this proposal recommends implementing a mechanism to externally package dependencies that can be shared across multiple functions. One possible way to achieve this is to have a common interface for retrieving the identity of the "agentless" workload that can be called from the running function and is exposed through the runtime environment. For example, in the case of AWS Lambda, the "agentless" attestation functionality can be packaged in a layer. The function that needs to be attested can be configured to use this layer, so it does not need to implement the functionality itself. This layer can also be updated with fixes or improvements without the need to update the function itself.
Request for Comments
This proposal tries to lay out the changes needed in SPIRE and possible implementation scenarios to support serverless architectures, focusing on providing a solution for AWS Lambda, Google Cloud Functions, and Microsoft Azure Functions. Any feedback on the general direction of this proposal, any missing points, suggestions, or thoughts in general is greatly appreciated.