How should SPIRE support non-unique node attestation? #558
Comments
Use cases could be:

- A containerised microservice with an IAM Task Role allocated (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-iam-roles.html), running within an AWS ECS cluster, wishes to receive a SPIFFE ID to communicate with other microservices. Using the recommended ECS agent deployment model, containers do not have direct access to the underlying host's instance metadata service.
- A containerised microservice with an IAM Task Role allocated, running within an AWS ECS cluster in Fargate launch mode, wishes to receive a SPIFFE ID to communicate with other microservices. By design, Fargate-launched containers do not have access to the underlying host's instance metadata service, nor can any agent be installed on the underlying host.
- A Lambda microservice with an IAM execution role allocated (https://docs.aws.amazon.com/lambda/latest/dg/intro-permission-model.html#lambda-intro-execution-role) wishes to receive a SPIFFE ID to communicate with other microservices. By design, Lambda functions do not have access to any underlying host instance metadata service.
Some more detail on why this exists in SPIRE today, and possible paths to solving it. In the current design of SPIRE, when an agent performs node attestation it generates a CSR (which includes the SPIFFE ID that should identify the agent) and passes it to the server, along with the data used for attestation (such as the AWS IID). The SPIFFE ID is specified by the agent during the attestation process and needs to uniquely identify that instantiation of the agent.
An agent SPIFFE ID might be something like spiffe://&lt;trust domain&gt;/spire/agent/aws_iid/&lt;account&gt;/&lt;region&gt;/&lt;instance id&gt;. The reason that SPIRE currently requires each instantiation of an agent to be uniquely identifiable is to ensure that when SPIFFE certificates are renewed, only the agent that was issued a given certificate is able to exchange it for a new one.

There is a possible short term and a possible long term solution to this problem.

The possible short term solution is to design a new node attestor that passes some unique per-instantiation value from its agent plugin to the server plugin during the attestation process. This could simply be a random number, generated (and persisted) for that instantiation, from which the SPIFFE ID could be reconstructed.

The possible longer term solution is to redesign the node attestation process entirely, to avoid the need for SPIFFE IDs to be passed by the agent to the server in a CSR during attestation. This would then allow nodes to be assigned "random" SPIFFE IDs by the server if an instance can't determine a SPIFFE ID a priori.

In both solutions the server-side attestor can emit the verified selectors to be used in an attestation policy, and these can be anything asserted by the identity document (such as an AWS IAM role from an STS token). These will then be used by SPIRE to determine which other identities the agent is entitled to issue, which could include the original workload ID.
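To make the short term idea concrete, here is a minimal Go sketch of what the agent side of such a hypothetical attestor could do: generate a random value once, persist it for the life of the instantiation, and derive an agent SPIFFE ID from it. The plugin name (`random_id`), file path, and trust domain are illustrative assumptions, not part of SPIRE.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"os"
)

// instantiationID returns the persisted random ID for this agent
// instantiation, generating and saving one on first use.
func instantiationID(path string) (string, error) {
	if b, err := os.ReadFile(path); err == nil {
		return string(b), nil // already provisioned for this instantiation
	}
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	id := hex.EncodeToString(buf)
	if err := os.WriteFile(path, []byte(id), 0600); err != nil {
		return "", err
	}
	return id, nil
}

func main() {
	id, err := instantiationID("/var/lib/spire/instantiation-id")
	if err != nil {
		panic(err)
	}
	// The server-side half of the attestor would receive this value in the
	// attestation payload and reconstruct the same SPIFFE ID.
	fmt.Printf("spiffe://example.org/spire/agent/random_id/%s\n", id)
}
```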
+1. Would love to hear updates on this development, if any.
This may be stale, but attestation based on the AWS IID has some limitations. Beyond the specific cases described here where the IID and its derivatives may not be definitive for attesting a workload identity, the IID is not rotated as frequently as the IAM credential, increasing the potential for replay attacks or similar. Has anyone considered using the Vault/aws-iam-authenticator approach for workload attestation instead of the IID, now that sts:GetCallerIdentity works? https://github.com/kubernetes-sigs/aws-iam-authenticator#how-does-it-work
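For reference, a minimal Go sketch of that aws-iam-authenticator-style flow, assuming aws-sdk-go v1: the agent presigns an sts:GetCallerIdentity request and hands the URL to the server as attestation data; the server executes the request and learns the caller's IAM ARN from the response. A real implementation would also validate the presigned URL (host, action, expiry) before calling it, which is omitted here.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sts"
)

// agentPresign runs on the agent/workload side, using whatever IAM
// credentials the environment provides (task role, execution role, etc.).
func agentPresign() (string, error) {
	sess := session.Must(session.NewSession())
	req, _ := sts.New(sess).GetCallerIdentityRequest(&sts.GetCallerIdentityInput{})
	return req.Presign(15 * time.Minute) // short-lived, so the replay window is bounded
}

// serverVerify runs on the server side: successfully executing the presigned
// request proves the caller held valid credentials for the returned ARN.
func serverVerify(presignedURL string) (string, error) {
	resp, err := http.Get(presignedURL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	var out struct {
		Result struct {
			Arn string `xml:"Arn"`
		} `xml:"GetCallerIdentityResult"`
	}
	if err := xml.Unmarshal(body, &out); err != nil {
		return "", err
	}
	return out.Result.Arn, nil
}

func main() {
	// In practice the two halves run in different processes; they are
	// combined here only to show the round trip.
	url, err := agentPresign()
	if err != nil {
		panic(err)
	}
	arn, err := serverVerify(url)
	if err != nil {
		panic(err)
	}
	fmt.Println("verified caller:", arn) // eg. an assumed-role ARN usable as a selector
}
```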
@ajessup To confirm my understanding: if two spire-agents were to attempt to connect to a spire-server identifying themselves with the same SPIFFE ID, one of them would be rejected?

Adding onto the earlier mentions of ECS: even if you get past the expectation of requiring the containers to reach out to the EC2 IMDS, if multiple ECS tasks (each containing a spire-agent) are co-located on the same instance, those agents will then share the same IID, causing one of them to be rejected. Beyond that, multiple disparate workloads could end up on the same instance (ignoring complex placement strategies), so I think that means spire-server would need registrations for every combination of workload and instance. I'm new to SPIRE, so tell me if I've made an error in either of the last two paragraphs.

I've been thinking about this for aws/aws-app-mesh-roadmap#68, where @evan2645 had chimed in months ago. I wondered if there could be some kind of composite attestation, where the EC2 IID is verifiable and the task metadata is used for identification of a task, though the latter can't be verified by spire-server without a signature. That aside, I like the suggestion by @brianwolfe to use a strategy similar to the IAM authenticator; that'll work on Fargate too. Since a role is insufficient to identify a task, it would have to be used alongside either some other unique metadata per ECS task or the "short term solution" discussed above.
I think there's some confusion here that my original wording of this issue likely contributed to. Just to clear things up: in the current design of SPIRE, agents must be assigned to a single uniquely identified node, but workloads do not. It is quite possible to register a workload in SPIRE with a policy that allows it to be assigned to any node matching a certain set of conditions (such as a particular AWS tag). This is, as you point out, necessary when workloads are running on a container scheduler like ECS or Kubernetes, since a workload may well be scheduled on any (or several) of a large number of nodes at any given time. More on this below.

Generally this isn't a problem if you're operating in an environment where an agent can run on each node and has a way to prove its unique identity to the SPIRE server. Where this becomes problematic is when there is no "node" as such (eg. Fargate, Lambda) and the entity acting as the agent can't uniquely identify a single instantiation of itself. This is due to a set of assumptions in the API that SPIRE Agents use to authenticate to SPIRE Servers (the Node API). The SPIRE project has been busy refactoring the SPIRE APIs to address this, as well as several other limitations (see #1057 for the latest). These APIs should significantly simplify "agentless" SPIRE deployments, but I'll let @azdagron keep me honest there.
Note that the expectation here is that a SPIRE Agent is per-node (or more formally, per-kernel), and can service multiple workloads running on that node. On Linux this is generally accomplished with a system daemon, and in Kubernetes with a DaemonSet resource to control placement (I don't know if ECS supports similar "guarantee N copies per node" placement rules). Individual workloads then call the Workload API exposed by the Agent over a unix domain socket (in Kubernetes, this is mounted into each pod). The agent then inspects the PID and other metadata of the calling process to determine exactly which container is calling it.
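The kernel-as-trusted-third-party step can be illustrated with a small Go sketch (Linux only): a listener on a unix domain socket asks the kernel, via SO_PEERCRED, for the PID of the connecting process, which a workload attestor can then map to container metadata. The socket path is an arbitrary example.

```go
package main

import (
	"fmt"
	"net"

	"golang.org/x/sys/unix"
)

// peerPID asks the kernel for the credentials of the process on the other
// end of a unix domain socket connection.
func peerPID(conn *net.UnixConn) (int32, error) {
	raw, err := conn.SyscallConn()
	if err != nil {
		return 0, err
	}
	var cred *unix.Ucred
	var credErr error
	if err := raw.Control(func(fd uintptr) {
		cred, credErr = unix.GetsockoptUcred(int(fd), unix.SOL_SOCKET, unix.SO_PEERCRED)
	}); err != nil {
		return 0, err
	}
	if credErr != nil {
		return 0, credErr
	}
	return cred.Pid, nil
}

func main() {
	l, err := net.ListenUnix("unix", &net.UnixAddr{Name: "/tmp/workload.sock", Net: "unix"})
	if err != nil {
		panic(err)
	}
	defer l.Close()
	conn, err := l.AcceptUnix()
	if err != nil {
		panic(err)
	}
	pid, err := peerPID(conn)
	if err != nil {
		panic(err)
	}
	// The PID is the starting point for workload attestation: from here it
	// can be mapped to a container, pod, or task via node-local daemons.
	fmt.Println("calling workload PID:", pid)
}
```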
If you're able to get an agent on each node (if you're running on EC2, it might be feasible to have it start as part of the base VM image, for example) and there's some provable attribute of each node (see https://github.com/spiffe/spire/blob/master/doc/plugin_server_nodeattestor_aws_iid.md for the complete list of selectors you can use), then you can define your workload in terms of a node selector (say, the node must have a particular instance tag) as well as one or more workload-specific selectors (say, a docker image name).
Without knowing the intricacies of ECS, I wonder if it would be possible to write a SPIRE Agent plugin that is an ECS equivalent of https://github.com/spiffe/spire/blob/master/doc/plugin_agent_workloadattestor_k8s.md? The way that plugin works is to map the metadata that the SPIRE Agent retrieves from the workload via the unix domain socket (UID, PID, etc.) to Kubernetes metadata (eg. k8s namespace or pod label) from the kubelet process running on the same node, and expose these attributes as selectors in registration policies. I wonder if there's a similar node-local component in ECS that understands task metadata and can map it to unix primitives?
I suspect once the refactored APIs are in place, an STS-based attestor will become very attractive.
Thanks Andrew. I think that largely aligns with my current understanding. I probably contributed to some confusion: I didn't mean to suggest that workloads are uniquely identifiable, but I was hung up on the idea of registering every combination of unique agents and workloads so any workload task could be placed anywhere.

I don't know how people are actually using SPIRE, so it's hard for me to form an intuition on workload selectors. But based on how the k8s registrar works, it seems like the expectation is that workloads can be registered as they come up, and as long as the node and workload attestations succeed, you're good.
ECS does have a daemon mode for ECS services, but those aren't intended for runtime dependencies (or however you would categorize it), rather for things like log exporters. Customers do control their instances when using ECS, and some do manage their own images and daemons, but it drastically raises the barrier to entry, in addition to introducing relationships that are complicated or impossible to manage with ECS APIs.

My mind had gone straight to treating each task as a node and having a spire-agent container alongside each workload. Then customers can model it like any other container and set up ordering and such. But that's what brought me to the concerns around attesting even a node in that environment, let alone the workload.
Task metadata does provide a whole bunch of information on both the task and the node (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-metadata-endpoint-v4.html#task-metadata-endpoint-v4-examples), but it doesn't map to unix primitives. As I mentioned above, spire-server can *verify* an IID, but there's no equivalent for task metadata. It's certainly possible for:

1. spire-agent to use its task metadata as node attestation data, and
2. spire-server to reach out to the appropriate ECS API to describe the tasks running on an instance and verify that the attested task is indeed present.

But in order for that to not be spoofed, presumably you'd need some complex logic to sign the task payload and verify it on spire-server. I haven't thought about it much, but I'm happy to be proven wrong. Of course, the "right" solution is one that works across ECS and Fargate, likely relying on a combination of STS and task metadata. The session name for task roles does at least contain the task ID.
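A rough Go sketch of those two steps, assuming aws-sdk-go v1 and the v4 task metadata endpoint. As noted above, this only shows that such a task is running on the cluster; it does not prove the caller *is* that task.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ecs"
)

// taskIdentity runs agent-side: ECS injects ECS_CONTAINER_METADATA_URI_V4
// into every container it manages.
func taskIdentity() (cluster, taskARN string, err error) {
	resp, err := http.Get(os.Getenv("ECS_CONTAINER_METADATA_URI_V4") + "/task")
	if err != nil {
		return "", "", err
	}
	defer resp.Body.Close()
	var meta struct {
		Cluster string `json:"Cluster"`
		TaskARN string `json:"TaskARN"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&meta); err != nil {
		return "", "", err
	}
	return meta.Cluster, meta.TaskARN, nil
}

// verifyTask runs server-side with the server's own AWS credentials and
// checks that the attested task is actually present and running.
func verifyTask(cluster, taskARN string) (bool, error) {
	svc := ecs.New(session.Must(session.NewSession()))
	out, err := svc.DescribeTasks(&ecs.DescribeTasksInput{
		Cluster: aws.String(cluster),
		Tasks:   []*string{aws.String(taskARN)},
	})
	if err != nil {
		return false, err
	}
	for _, t := range out.Tasks {
		if aws.StringValue(t.LastStatus) == "RUNNING" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// The two halves run in different processes; combined here for brevity.
	cluster, arn, err := taskIdentity()
	if err != nil {
		panic(err)
	}
	ok, err := verifyTask(cluster, arn)
	fmt.Println("task present:", ok, err)
}
```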
> Based on how the k8s registrar works, it seems like the expectation is that workloads can be registered as they come up, and as long as the node and workload attestations succeed, you're good.

Broadly, yes. SPIRE requires each SPIFFE ID be registered before it is identified. The ID (eg. spiffe://acme.com/us-west/broker) is specific, but the selectors (ie. the conditions that must be matched by a workload to be entitled to that ID) can be generic.

> ECS does have a daemon mode for ECS services, but those aren't intended for runtime dependencies […] it drastically raises the barrier to entry, in addition to introducing relationships that are complicated or impossible to manage with ECS APIs.

If each workload in a scheduler needs to attest directly with the SPIRE Server (rather than with an Agent), this will be cumbersome with the v1 SPIRE APIs. The refactored APIs will make this design pattern easier, though there's still some overhead with this approach.

The value of having an agent running on each node, servicing every workload on that node, is:

1. Availability - the agent can pre-cache responses from the server on boot and serve them directly when the workload starts, which means there's less delay for a workload dynamically scheduled on a node to get an identity document. It also means that workloads can largely continue to operate even in the event of transient failures between the Agent and Server.
2. Kernel-level attestation - once attested itself, the SPIRE Agent is able to use the kernel as a trusted third party to validate kernel-level metadata (as well as communicate with daemons like docker and kubelet via local side-channels). This is what allows for the rich set of local selectors in SPIRE, since the Agent can assert these properties on the workload's behalf. When a workload attests directly to a server, attestation is limited to those properties it can prove about itself through some other trusted third party (like AWS IAM).

Of course, in a Fargate/Lambda environment or similar, where customers can't reason about a node at all, it may not be an option.

> It's certainly possible for: spire-agent to use its task metadata as node attestation data, and spire-server to reach out to the appropriate ECS API to describe the tasks running on an instance and verify that the attested task is indeed present. But in order for that to not be spoofed, presumably you'd need some complex logic to sign the task payload and verify it on spire-server.

The Agent is restricted by the Server to issuing identities that can run on the same node as the Agent. Beyond that though, the Server trusts the Agent to perform attestation of the workload on the node correctly (there's no other party to trust to provide proof). So if there's an agent and a node-local API to the local ECS scheduler, and that API allows for mapping kernel metadata (like PID) to ECS metadata (like Task ID), then this should be all you need.

> Of course, the "right" solution is one that works across ECS and Fargate, likely relying on a combination of STS and task metadata.

Beyond STS, is there any kind of signed identity document issued by ECS to a workload (an equivalent of Kubernetes' Service Account Tokens, for example) that could be used to prove properties of a task that a SPIRE Server could independently verify?
Appreciate the thoughtful response.
I was going down the thought process of whether the agent could have a mode where the user guarantees that it will be co-located exclusively with one workload (as you would have in an ECS task). If we accept that threat boundary, then it can just vend the SVID to whichever workload asks for it. This sets aside the benefits of a persistent agent you've mentioned above, and ignores the overhead of running an agent just for SVID retrieval; it leads naturally into the "agentless" discussion.
The ecs-agent that manages the containers may have that information available (though it wouldn't make much sense to integrate with it). Kernel metadata isn't mapped to task metadata, but docker metadata is. It may be doable to attest the node using EC2 IIDs, and workloads using docker labels (which include task info), as sketched below. That said, an architecture where containers have hard dependencies on software running on the instance isn't something we'd recommend.
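A hedged Go sketch of that docker-label path: resolve a calling PID to its container ID via /proc/&lt;pid&gt;/cgroup, then read the com.amazonaws.ecs.* labels the ECS agent attaches to containers it launches. Treat the label names and the cgroup layout as assumptions; both vary by runtime and version.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"regexp"

	"github.com/docker/docker/client"
)

// containerIDForPID extracts a 64-hex docker container ID from the calling
// process's cgroup paths (the layout varies across runtimes; this is a
// heuristic, not a guarantee).
func containerIDForPID(pid int) (string, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/cgroup", pid))
	if err != nil {
		return "", err
	}
	m := regexp.MustCompile(`[0-9a-f]{64}`).Find(data)
	if m == nil {
		return "", fmt.Errorf("no container id in cgroup for pid %d", pid)
	}
	return string(m), nil
}

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}
	id, err := containerIDForPID(12345) // PID learned from the unix socket peer
	if err != nil {
		panic(err)
	}
	info, err := cli.ContainerInspect(context.Background(), id)
	if err != nil {
		panic(err)
	}
	// These labels could surface as workload selectors; the names below are
	// what the ECS agent is observed to set, not a documented contract.
	fmt.Println(info.Config.Labels["com.amazonaws.ecs.task-arn"])
	fmt.Println(info.Config.Labels["com.amazonaws.ecs.task-definition-family"])
}
```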
As far as I know, there isn't an equivalent in ECS today. I'm starting to reach out to folks in the know to see how this might be solved; at least we have some expertise from solving a similar problem in Kubernetes. The other option I'm thinking about is using something like a private key stored in Secrets Manager, which ECS can provide to a container. That's more overhead to set up, but tractable.
@ajessup A follow-up as I re-read this:

> The reason that SPIRE currently requires each instantiation of an agent to be uniquely identifiable is to ensure that when SPIFFE certificates are renewed, only the agent that was issued a given certificate is able to exchange it for a new one.

Is this a delivery optimization (so multiple agents aren't caching the same SVIDs)? Or is it a security concern (something like: every instance of a workload, even if those workloads are homogeneous, like replicas of a service, should have its own SVID)?
It is a security concern, though only to reduce the probability of an attacker stealing and replaying certificates. I believe there is currently nothing stopping multiple instances of the same workload from getting identical SVIDs if they have identical selectors.
This is the approach used in the Square implementation of SPIFFE for Lambda functions: https://developer.squareup.com/blog/providing-mtls-identities-to-lambdas/
Hi @ajessup! We haven't seen movement on this issue for a while. Is this discussion still relevant given the direction we've headed for serverless architecture support? If there are still specific integrations in mind, I'd suggest opening a new issue to discuss each specific integration so that it can be scoped independently. I'll go ahead and close this for now, pending any new discussion.
The current design of SPIRE assumes that each Agent is running on a node that can be identified uniquely (eg. by a join token or an EC2 Instance ID), even if the workload identified by an Agent may span multiple nodes.
In some cases, however, while properties of the infrastructure can be attested, there may be no (verifiable) way to identify the unique node an Agent is running on. As an example, a SPIRE Agent deployed to ECS using the Fargate launch type will not be able to retrieve an instance ID (though it can retrieve an STS token that verifies the IAM roles associated with the service).
In theory, it should be possible to attest AWS workloads based on IAM roles encoded in STS tokens, in the same way the aws-iid module attests workloads based on Instance Identity Documents. But the requirement that each node be uniquely identified prevents this. The scope of this issue is to discuss motivating use cases and possible solutions to this problem in SPIRE.
cc // @mlakewood @evan2645 @grittershub