
Observing flakiness with oidc-token using identity.entity.aliases.<mount accessor>.metadata.<metadata key> #11798

Open
sganapathy1312 opened this issue Jun 9, 2021 · 15 comments
Labels: auth/approle, bug, core, internal

@sganapathy1312 commented Jun 9, 2021

Describe the bug

Sometimes custom claims populated via identity.entity.aliases.<app_role_mount accessor>.metadata.<metadata key>, where the metadata is stamped on the secret-id of the approle, don't get populated. There is no pattern to when this happens. Sometimes it happens on a successive curl request to the token endpoint exposed by the oidc-role, using the same vault token that had previously returned an oidc token with all claims correctly set. Sometimes it happens on a brand-new token obtained by logging in with approle creds. Once it happens, successive iterations do not populate those claims at all until a new vault token is obtained.

To Reproduce

We cannot describe a deterministic way to reproduce this, as the problem does not occur deterministically. Instead, we will describe our setup as best we can.

Overview
We have a custom oidc-token template to identify our agent pollers looking for work from a central server. All pollers use the same approle, but each poller has its own secret-id. Each secret-id is stamped with metadata that uniquely identifies the poller. For example, poller-A could have poller-a.json as its secret's metadata:

{
  "metadata": "{ \"resource_id\": \"foobar\" }"
}

and its secret-id created and propagated via curl --header "X-Vault-Token: ..." --request POST --data @poller-a.json https://my-vault-endpoint/my-approle-role-url/secret-id. Similarly, we have poller-B, whose metadata could look like

{
  "metadata": "{ \"resource_id\": \"barbaz\" }"
}

and its secret-id could have been generated by curl --header "X-Vault-Token: ..." --request POST --data @poller-b.json https://my-vault-endpoint/my-approle-role-url/secret-id.

Details

  • Log in to vault using one of the pollers' approles: vault write <my-app-role-path> role_id=... secret_id=...
  • Verify that the vault token retrieved above has token_meta_resource_id <my-resource-id>
  • Send a curl request to the oidc-role's token endpoint, stamped with the vault token from above: curl --header "X-Vault-Token: ..." https://<my-vault-endpoint>/v1/identity/oidc/token/<my-role>
    • <My role> is defined in tf such that there is one per approle mountpoint, and we have one approle mountpoint per env stage. In broad strokes:
resource "vault_identity_oidc_role" "my-svc" {  
   for_each = local.my-svc-envs
   name     = "my-svc-${each.key}"
   key      = vault_identity_oidc_key.key[each.key].name
   ttl        = my-ttl  
   template = XXX;EOF
{  
    <other claims>
   "entity_data": { 
       <other claims>   
       "resource_id": {{identity.entity.aliases.${vault_auth_backend.app[each.key].accessor}.metadata.resource_id}}  }}  
EOF}
  • Now decode the received oidc-token via jwt decode <my-token>
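
For reference, a condensed shell version of this flow (mount paths, the oidc role name, and the env vars are placeholders standing in for our real setup):

# 1. log in with one poller's approle creds and capture the client token
VAULT_TOKEN=$(vault write -field=token auth/approle/login \
    role_id="$ROLE_ID" secret_id="$SECRET_ID_A")

# 2. confirm the metadata landed on the token
VAULT_TOKEN=$VAULT_TOKEN vault token lookup    # expect token_meta_resource_id

# 3. fetch the signed oidc token for the role
OIDC_TOKEN=$(curl -s --header "X-Vault-Token: $VAULT_TOKEN" \
    https://my-vault-endpoint/v1/identity/oidc/token/my-svc-dev | jq -r .data.token)

# 4. decode it and check the resource_id claim
jwt decode "$OIDC_TOKEN"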

Repeat the above test in various flavors:

  • Round-robin across the poller-A/poller-B vault approles.
  • Repeat the curl for fetching the oidc-token, and the jwt decode, for the same token which had previously returned a valid resource_id.
    Additionally, we performed this test as a golang program which runs all the steps above, using the jose library for decoding the oidc token and creating vault clients from scratch every 30 seconds.

The manual tests showed quite a few instances of resource_id not getting populated, and there was no pattern to when it occurred. Sometimes it would occur for a token for which resource_id had previously been populated; sometimes it occurred for a brand-new token.

The golang tests are run for just one of the pollers, both in a tight loop and at 30-second intervals. When run in a tight loop, we saw 3 failures out of 4000-odd runs. The 30-second-interval test is ongoing; no failures so far after 36 iterations.

Expected behavior
We expect the resource_id claim to be set every time.

Environment:

  • Vault Server Version (retrieve with vault status):
<redacted>@<redacted> ~ % vault status
Key                    Value
---                    -----
Seal Type              shamir
Initialized            true
Sealed                 false
Total Shares           5
Threshold              3
Version                1.6.2
Storage Type           postgresql
Cluster Name           vault-cluster-140e09ca
Cluster ID             62dba0b3-dedb-1844-1a16-e873f9cf4847
HA Enabled             true
HA Cluster             https://vault.vault:8201
HA Mode                standby
Active Node Address    https://vault.vault:8200
  • Vault CLI Version (retrieve with vault version):
<redacted>@<redacted> ~ % vault version
Vault v1.6.0 ('7ce0bd9691998e0443bc77e98b1e2a4ab1e965d4+CHANGES')
@sganapathy1312 (Author)

Poller-A and poller-B have the same role-id. They only differ in their secrets, with each secret-id having its own metadata.

@sganapathy1312 (Author)

Reading the code, one thing that struck me: given that all our pollers use the same approle (and hence the same entity), could it be that the entity retrieved via txn.First from the entities table in the line linked above somehow returns an entity whose secret-id didn't end up setting the resource_id field?

This is just me speculating based on reading through the code. Looking forward to your responses.

@sganapathy1312 (Author)

The golang-based test further lends credence to the above theory.

I basically ran 2 goroutines, one running poller-A and the other running poller-B. The test is super simple; each goroutine implements the above steps:

  • Login to vault using the respective approles.
  • Fetch the oidc-token
  • Decode it using jose

Instead of seeing 2 distinct resource_ids we just see 1 most of the time in our stdout logs.
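
For reference, a rough shell equivalent of that concurrent test (ROLE_ID, SECRET_ID_A, and SECRET_ID_B are assumed to be exported; my-svc-dev stands in for our real oidc role):

poller() {
  # log in with this poller's secret-id, then fetch and decode the oidc token
  local vt oidc
  vt=$(vault write -field=token auth/approle/login \
      role_id="$ROLE_ID" secret_id="$1")
  oidc=$(curl -s --header "X-Vault-Token: $vt" \
      https://my-vault-endpoint/v1/identity/oidc/token/my-svc-dev | jq -r .data.token)
  jwt decode "$oidc"    # should print this poller's own resource_id claim
}

# run both pollers concurrently, as the goroutines did
while true; do
  poller "$SECRET_ID_A" &
  poller "$SECRET_ID_B" &
  wait
done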

At this point, I'm 70% sure this is the bug. Could you please confirm? We thought we could get by with 1 approle and 1 secret-id per poller, but it looks like we need 1 approle per poller.

@peterverraedt (Contributor)

I'm encountering the same bug in a different scenario, where a single approle role_id is used to create secret_ids for different hosts: a configuration management tool runs on each host, reading secrets from vault using a token obtained with that host's own secret_id. To restrict which paths in vault each host can read, we stamp metadata identifying the host on each secret_id, and we assign the role_id a policy in which this metadata is used. When running our configuration management tool on multiple servers at the same time, these policies are not applied correctly.

The following steps reproduce this error (a CLI sketch follows the list):

  1. create approle role_id with policy X
  2. create secret_id A with metadata {"key": "A"}
  3. create secret_id B with metadata {"key": "B"}
  4. create policy X to allow read permissions on secret/{{identity.entity.aliases.APPROLE ACCESSOR.metadata.key}}/*
  5. create token A with secret_id A
  6. read secret/A/test successfully (no 403 error) with token A
  7. create token B with secret_id B
  8. read secret/B/test successfully with token B
  9. read secret/A/test with token A results in a 403 error
  10. read secret/B/test with token A successfully
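
A minimal CLI sketch of those steps (the role name cfgmgmt is hypothetical, and APPROLE_ACCESSOR stands for the real mount accessor, which can be read from vault auth list):

# 4. templated policy giving each host its own subtree
vault policy write X - <<'POLICY'
path "secret/{{identity.entity.aliases.APPROLE_ACCESSOR.metadata.key}}/*" {
  capabilities = ["read"]
}
POLICY

# 1-3. one role, two secret_ids carrying different metadata
vault write auth/approle/role/cfgmgmt token_policies="X"
vault write auth/approle/role/cfgmgmt/secret-id metadata='{"key": "A"}'
vault write auth/approle/role/cfgmgmt/secret-id metadata='{"key": "B"}'

# 5-10. logging in with secret_id B rewrites the shared entity alias
# metadata, after which token A gets a 403 on secret/A/test but can
# read secret/B/test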

@sganapathy1312 (Author)

Checking in to see if there are any updates on this and #11803.

@pmmukh (Contributor) commented Sep 15, 2021

Hi all! Thanks for submitting this issue, and for figuring out the root cause!
The use case of multiple secretIDs on the same role with access to different paths is actually not really supported, in the sense that one RoleID is expected to correspond to one entity and one entity alias. We think the use cases mentioned here are a bit of an anti-pattern, as different secret paths should necessitate different role-ids.
That being said, we do support secretIDs having metadata that is applied at the entity alias level, and it is something that our users (y'all) are using. Given that this is a possible way to use approles, this is also a valid bug resulting from that flow. As such, we do want to figure out a way forward from here, so that the alias metadata templating behaves in a more deterministic way.
The linked PR is not really an approach we want to take, as it creates multiple entity aliases per approle, which breaks from our current mapping model, so we will be closing that PR out. I'll be marking this issue as internal, since I think we want to figure out and implement this fix internally.

Also, as a question for the people affected by this problem: is there any specific reason not to use multiple approles in this scenario, instead of 1 approle and multiple secrets? I understand management overhead is one reason, but I'm interested to hear if there are any others.

pmmukh added the auth/approle, core, and internal labels on Sep 15, 2021
@nvx (Contributor) commented Sep 16, 2021

If #10682 is merged, then the secret-id metadata could simply be added to the token metadata and then used in identity templating via that mechanism.

A single approle with multiple secret ids makes sense logically as an easy way to group N servers in role X. This is especially useful for clustered services, where you may have a lot of instances of what is effectively the same server. Allowing metadata to be defined per secret id lets you define the hostname of the box as metadata, which can then be used in, e.g., the PKI secrets engine to allow that server to get a certificate for its specific hostname.
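
For illustration, that PKI pattern might look like the following (the pki mount, role name, and APPROLE_ACCESSOR are placeholders; allowed_domains_template is the PKI role flag that enables identity templating in allowed_domains):

# PKI role whose allowed domain comes from the entity alias metadata
vault write pki/roles/per-host-server \
    allowed_domains="{{identity.entity.aliases.APPROLE_ACCESSOR.metadata.hostname}}" \
    allowed_domains_template=true \
    allow_bare_domains=true

# a host whose secret-id carries hostname=web-01.example.com can then do
vault write pki/issue/per-host-server common_name=web-01.example.com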

A single approle for this sort of use-case also means that provisioning the policies a role has access to (which changes relatively infrequently and is a very privileged operation) and where it can be used (I use CIDRs on the approle to limit, e.g., production approle tokens to only be usable from the production network range) can be separated from provisioning secret-ids, which by necessity happens a lot more often (i.e. whenever spinning up a new instance of a server, and is done by more automated tools like Terraform/Atlantis/Jenkins/etc). The ability to use these sorts of safeguards, as well as the separation of privilege (between BAU scaling of services up and down, vs changing the privileges a system has access to), is very important in most organisations.
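
A sketch of that separation of duties, with hypothetical names and CIDRs (secret_id_bound_cidrs and token_bound_cidrs are standard approle role parameters):

# privileged, infrequent: define the role, its policies, and where it may be used
vault write auth/approle/role/prod-app \
    token_policies="prod-app" \
    secret_id_bound_cidrs="10.10.0.0/16" \
    token_bound_cidrs="10.10.0.0/16"

# routine, automated: mint one secret-id per new instance
vault write auth/approle/role/prod-app/secret-id metadata='{"hostname": "web-01"}'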

I do like the idea of not having to have an individual entity for each server (which would be the case with multiple approles, or with what the PR mentioned earlier would have caused), since that would create a lot of entities that are mostly meaningless and are not automatically cleaned up if, e.g., the approle or secret-id is deleted (such as when destroying the specific server).

For more static infrastructure it's less of a big deal, but when you're running immutable infrastructure you're by necessity spinning VMs up and down a lot, so management overhead and safeguards matter a lot.

@pmmukh (Contributor) commented Sep 16, 2021

Hi @nvx! Thanks for the detailed explanation of use cases; it's super valuable as we figure out how to solve this!
On the usage of approles, I believe we are aligned on how they should be distributed: 1 application should have 1 role-id, with each server instance having a separate secret-id, and the role-id (i.e. the application) should control the privileges of all the servers of the application.
One use case mentioned here that I didn't particularly understand was the accessing of different secret paths in the repro steps in #11798 (comment), because my understanding is that different instances of an app should access the same paths in Vault. So while those repro steps do show a bug, I'm not sure why that access pattern would happen.
The other use cases I think we agree are suboptimal right now: where the instances need custom claims on their jwt tokens, or custom certs from the pki engine, and want to do templating for that. I believe that currently requires 1 role per instance.
As far as the suggested solution goes, that approach does look interesting. I did find that we were opposed to this path in the past (#5916), but I'm going to follow up with the team and see where we stand on it today / whether anything has changed. This is all great insight into the issue, so thanks a ton!

@nvx (Contributor) commented Sep 16, 2021

> One use case mentioned here that I didn't particularly understand was the accessing of different secret paths in the repro steps in #11798 (comment), because my understanding is that different instances of an app should access the same paths in Vault. So while those repro steps do show a bug, I'm not sure why that access pattern would happen.

Identity templating can be used in places other than request URLs, but the case that stands out for me where you would want it in a request URL is storing per-server secrets in Vault. Clustered servers often share the same set of secrets, but some things are per-server, e.g. sidecar services that run on the host for management purposes other than the clustered application itself. I personally use this pattern to store secrets in the KVv2 secrets engine with the hostname of the server in the path, go figure.
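
A sketch of such a templated policy for KV v2 (the mount name, path layout, and APPROLE_ACCESSOR are placeholders; note the data/ infix that KV v2 requires):

vault policy write per-host-secrets - <<'POLICY'
# each server may only read the subtree named after its own hostname
path "kv/data/hosts/{{identity.entity.aliases.APPROLE_ACCESSOR.metadata.hostname}}/*" {
  capabilities = ["read"]
}
POLICY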

@pmmukh (Contributor) commented Sep 17, 2021

Aha, sidecar services are an interesting pattern, and one I hadn't thought about. I can see how secret-id metadata templating would be useful for solving problems there; I will be sure to keep this in mind as we think about this!

@joeyberkovitz

@pmmukh - any chance there have been any updates related to this?
Are there any plans to support the templating use case, most specifically for the PKI engine? It would be fairly inconvenient to have to create one AppRole per host if every host requires exactly the same Vault access, with the only variation being which hostname it is allowed to use in its requests.

If this won't be supported, please adjust the documentation to indicate either that an AppRole role should only be used by consumers that require exactly the same permissions, or that the secret-id must always be generated with the same metadata for a given role. In that case, it may make more sense to move the metadata to the role to avoid further confusion.

@UXabre commented Oct 18, 2021

I continued my work and effort on merging PR #10682. My use-case is mainly with the OIDC/JWT backends, but I guess anything that supports mapping metadata onto the token could be used (I'm just not too familiar with the other auth backends yet, sorry).

@heatherezell (Contributor)

Hi folks, is this still an issue with newer versions of Vault? Please let me know and I can bubble it up accordingly. Thanks!

@tsipinakis commented Sep 13, 2024

@heatherezell I ran into this issue as well. It is not clear from the documentation that the metadata field on secret-ids is not meant to be used for path templating; the fact that it is explicitly supported in hcl encourages it. From reading the docs I had expected that each secret-id/metadata combination creates a separate entity, and it was even more misleading in that it worked until it didn't, once something conflicted.

In my opinion there are 2 ways to take this issue:

  1. Make it clear this is not supported, by removing support for using this metadata in hcl and moving the metadata from the secret to the role
  2. Embrace it, and add an option to use the accessor as the alias name

@ssams commented Nov 6, 2024

I also just stumbled into this, via #12797. I have basically the same scenario: I assumed I could create one approle, then add multiple secrets with different metadata and use templated policies to restrict access per secret (each secret represents a different machine/deployment in my case). Only after running into errors did I notice, via token lookups, that all of them share the same entity ID and alias, completely breaking the expectation of being able to use the secret's metadata.

I'm not sure how to work around this other than creating one approle per deployment, which would lead to a lot of approles and doesn't seem like the intended way either? I'd expect to simply be able to use secret/token metadata in the policies if one can assign it.
