AWS Backend Latency with dynamic secrets #687

Closed
jantman opened this issue Oct 9, 2015 · 30 comments
@jantman
Contributor

jantman commented Oct 9, 2015

I'm using Vault's AWS dynamic backend to get credentials for Terraform to use when applying. My deployment process is driven by Rake, so I use the 'vault' rubygem to get credentials from Vault, and then pass them in to Terraform as -var arguments on the generated command line.

When I do this (disregarding any previous leases and simply using read()), Terraform always fails with an error like:

Error refreshing state: 1 error(s) occurred:

* 1 error(s) occurred:

* InvalidClientTokenId: The security token included in the request is invalid.
    status code: 403, request id: [76d0dd7b-6e8a-11e5-bd2c-0b36ce8a1c30]

When running with TF_LOG=1, this seems to always happen just after "Initializing CloudWatch SDK connection", but I'm not sure if that's related.

However, when I use those same credentials locally, they always work. Going on this hunch, I inserted a 5-second sleep between the Vault read (which dynamically generates the credentials) and the terraform apply; with that in place it always works (a sketch of the workaround follows the questions below).

  1. Is the latency between Vault returning dynamic AWS credentials and their becoming fully functional a known issue?
  2. Is there any way to know what this seeming propagation latency is?
  3. Is there any way for Vault to confirm that the user/role is active and fully-working before returning it to the client?
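
A minimal Go sketch of the sleep workaround described above (the actual tooling here is Ruby/Rake; the Vault role path aws/creds/deploy, the variable names, and the 5-second figure are illustrative assumptions):

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
	"time"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	// Read dynamic AWS credentials from Vault's AWS secret backend.
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	secret, err := client.Logical().Read("aws/creds/deploy") // hypothetical role path
	if err != nil {
		log.Fatal(err)
	}
	accessKey := secret.Data["access_key"].(string)
	secretKey := secret.Data["secret_key"].(string)

	// The workaround: give IAM time to propagate the new user before
	// Terraform starts making AWS calls with it.
	time.Sleep(5 * time.Second)

	cmd := exec.Command("terraform", "apply",
		fmt.Sprintf("-var=access_key=%s", accessKey),
		fmt.Sprintf("-var=secret_key=%s", secretKey))
	if out, err := cmd.CombinedOutput(); err != nil {
		log.Fatalf("terraform apply failed: %v\n%s", err, out)
	}
}
```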
@jefferai
Member

jefferai commented Oct 9, 2015

We're using Amazon's official libraries (https://github.com/aws/aws-sdk-go). The SDK client is returning success. We could try to work around it, but I would bet that this is really something that needs to be fixed there, because this would affect anyone using the SDK. If the credentials aren't valid, the SDK client probably shouldn't return success.

Any chance you'd be willing to open an issue with them? Otherwise I can work through that angle.

@urq

urq commented Oct 10, 2015

I spoke with an AWS engineer yesterday about this specific problem. The underlying issue is that the IAM API is (only) eventually consistent, so they (1) create and return your IAM credentials, and (2) propagate that information to relevant APIs on the backend, in that order. If you manage to use the credentials between (1) and (2), you're in for a bad time, and AWS is unlikely to address this any time soon since most folks are using static IAM credentials.

The good news is that this engineer indicated that there may be another way: the AWS STS API. In a lot of ways this API is closer to what Vault is trying to do anyway--issue temporary credentials--and, according to this engineer, is fast and consistent.

Browsing the docs, I see a couple of limitations: (1) integrating this would break Vault's current AWS API for end users, since the returned credential object has an extra parameter, a session token, in addition to the access key and secret key; and (2) the credentials can be valid for 15 minutes to 1 hour, so folks needing shorter or longer leases than that span are SOL.

If people really need the low latency requirement we could explore making a new backend, or modifying or replacing the existing backend with STS instead of IAM. Thoughts?
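
For a sense of what an STS-based path might look like, here is a rough aws-sdk-go sketch using GetFederationToken (the federated user name, inline policy, and 900-second TTL are illustrative assumptions, not Vault code):

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sts"
)

func main() {
	sess, err := session.NewSession()
	if err != nil {
		log.Fatal(err)
	}
	svc := sts.New(sess)
	out, err := svc.GetFederationToken(&sts.GetFederationTokenInput{
		Name:            aws.String("vault-example"), // hypothetical federated user name
		Policy:          aws.String(`{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":"ec2:Describe*","Resource":"*"}]}`),
		DurationSeconds: aws.Int64(900), // STS floor: 15 minutes
	})
	if err != nil {
		log.Fatal(err)
	}
	// Note the third field: unlike IAM user keys, STS credentials carry a
	// session token that clients must send with every signed request.
	fmt.Println(*out.Credentials.AccessKeyId)
	fmt.Println(*out.Credentials.SecretAccessKey)
	fmt.Println(*out.Credentials.SessionToken)
}
```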

@jefferai
Member

Did the engineer happen to give any sort of idea about how long the eventually consistent API takes to be consistent? E.g. 99.9% in 4 seconds, or something like that? The STS time limitation is bothersome; if Vault can insert a reasonable delay before returning credentials, that might be best.

@jantman
Contributor Author

jantman commented Oct 10, 2015

Hmm, very interesting. That makes sense to me (eventually consistent) and certainly agrees with what I'm seeing.

As to integration and the extra parameter, it seems to me that if Vault went the STS route, it should be a separate secret backend (i.e. a new STS backend).

This brings up an interesting question for me: CloudFormation can create IAM resources (users, roles, policies, groups, instance profiles, access keys, etc.) and use them on other resources in the template. I haven't yet seen a case where a CF stack fails because the IAM resources in it haven't propagated yet. I wonder if that's just luck/enough latency in CF itself (i.e. an Instance Profile gets created, and there's enough latency between that and creating a Launch Configuration that uses it, that this problem isn't seen) or if they do something special on the backend to validate propagation before moving on to dependent resources?

The other thing that would come to mind (which I know might be difficult or impossible depending on how the propagation is handled on the backend) is that if IAM resource creation is only eventually consistent, shouldn't there be a way to query whether or not a resource is propagated? Or have a "status" value in IAM API responses?

If you think it will do some good, I'd be happy to reach out to AWS support or our account rep about this, but most times I've contacted AWS about a bug or feature request, the only response I got was that they'd added it to some secret internal list of requests, and I never hear anything more.

@jefferai
Member

@urq Is it a fair assumption that you have the ability to adjust parameters in the Vault calls you're making?

If so, one easy way to fix this may be to add a parameter supporting a configurable delay: default it to zero, but make it settable in seconds. Then you can set that parameter on your calls to whatever value works for you, while other users who aren't running into this problem see no change in behavior and no extra delay.
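
A sketch of that shape (purely hypothetical; no such delay parameter exists in the backend):

```go
package main

import (
	"fmt"
	"time"
)

// issueCreds models the proposed knob: create the keys, then apply a
// caller-configurable delay (hypothetical "delay" parameter, default 0)
// before handing them back, so only users who opt in pay the wait.
func issueCreds(create func() (access, secret string, err error), delaySeconds int) (string, string, error) {
	access, secret, err := create()
	if err != nil {
		return "", "", err
	}
	time.Sleep(time.Duration(delaySeconds) * time.Second)
	return access, secret, nil
}

func main() {
	fake := func() (string, string, error) { return "AKIA-EXAMPLE", "secret-example", nil }
	a, s, _ := issueCreds(fake, 5) // this caller opts into a 5s propagation buffer
	fmt.Println(a, s)
}
```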

@urq

urq commented Oct 12, 2015

@jefferai The engineer didn't indicate how long it takes to reach consistency, though I would guess it's some log-normal distribution with, in my experience, a mean around 5 seconds. We could ask AWS or do some empirical testing against different APIs to estimate that distribution.

More fundamentally, though, the issue with slapping a sleep time on it is that some requests will exceed the sleep time and will then need to be handled by the client anyway. Additionally, it makes all calls at least as slow as the sleep time, where otherwise some calls might return faster. I would recommend implementing retries instead.

The janky way we're doing it now is relying on the fact that the S3 API returns an InvalidAccessKeyId error until the IAM credentials are propagated, and then AccessDenied once they have propagated. So we poll the API until we get the latter error, and we also sprinkle a small sleep on the end before returning, just in case. Gross, I know, but it's working right now. To @jantman's point, this API looks like it might tell you whether the key is active or inactive, but I haven't tested whether "active" corresponds to "totally consistent across AWS APIs."
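
A sketch of that polling trick with aws-sdk-go (the error-code strings, region, and the 20-attempt cap are assumptions based on the behavior described above):

```go
package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// waitForPropagation polls S3 with the freshly minted keys until the error
// flips from InvalidAccessKeyId (not yet propagated) to AccessDenied or
// success (propagated).
func waitForPropagation(accessKey, secretKey string) error {
	sess, err := session.NewSession(&aws.Config{
		Region:      aws.String("us-east-1"),
		Credentials: credentials.NewStaticCredentials(accessKey, secretKey, ""),
	})
	if err != nil {
		return err
	}
	svc := s3.New(sess)
	for i := 0; i < 20; i++ {
		_, err := svc.ListBuckets(&s3.ListBucketsInput{})
		if awsErr, ok := err.(awserr.Error); ok && awsErr.Code() == "InvalidAccessKeyId" {
			time.Sleep(time.Second) // key not recognized yet; keep polling
			continue
		}
		return nil // success or AccessDenied: the key is now known to S3
	}
	return fmt.Errorf("credentials still not propagated after 20 attempts")
}
```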

A nice-to-have would be to implement that logic on the server side to reduce the complexity of the clients. If this isn't something we want to put time into, I can keep rolling with our janky retry logic.

@jefferai
Member

@urq Yeah, I know a configurable sleep period is janky and gross, I figured it might be an easy bandage until someone has time to implement something better. I'll take a look at the IAM API though -- it sounds like if it works the same way as the S3 API, that would be a decent way to check the status of credentials.

jefferai added this to the 0.4.0 milestone Oct 12, 2015
@jefferai
Member

I'm scheduling this for 0.4 but that may slip. I'll try to get to it soon though.

@jantman
Contributor Author

jantman commented Oct 20, 2015

We happen to have an AWS consultant (someone actually from AWS; I don't know his official title) on site this week, so I took the opportunity to run this past him. His take was in agreement with what's been said above:

  1. IAM user creation is eventually consistent on a service level. He also mentioned that replication works as one would logically expect - first within an AZ, then to other AZs, then to other regions.
  2. There's no feasible way of determining when credentials are active, save for testing them against various services. He didn't go into any details, but implied that however the data is replicated, there's no low-overhead method of detecting when replication is complete.
  3. STS might help this problem, but it's not clear if the help would be because STS does some sort of internal consistency checks, or because using AssumeRole simply functions differently behind the scenes.

He did mention, however, that any delay longer than "a few seconds", specifically anything >5s, should be unusual.

@jefferai
Member

@jantman Thanks for this! It's really useful to know that there is no real way of determining when things are fine except testing against random service APIs, which is as janky as adding a sleep.

I think the immediate way forward will be to add a parameter allowing for a sleep time before returning the credentials.

@cetex

cetex commented Oct 25, 2015

Hi!

I've hit this as well, and for the past month or two I've been (somewhat randomly) working around it with a sleep of 7 seconds; sometimes 5 seconds is enough, sometimes 10 seconds is needed.

I've contacted AWS and this is what they say:
--- SNIP ---
To manage eventual consistency, there are two methods described in the following link which also contains more details about this issue:
http://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency

One of these methods is to add wait time between subsequent commands, as you implemented during your testing. The other method is to confirm creation of the resource before running the next command. There is a command which can be very useful for determining successful creation of the user: it waits until a 200 response is received when polling with get-user, polling every second until a successful state is reached, and it will exit with a return code of 255 after 20 failed checks.

aws iam wait user-exists --user-name USERID

For more information about this Command, please check the following link:
http://docs.aws.amazon.com/cli/latest/reference/iam/wait/user-exists.html

--- SNIP ---

So I'm guessing that if you added polling for the user into the user-creation flow, the results would be more consistent.

If account creation is still "unstable" after that, I guess a sleep is the way to go.

Also, STS would be a nice thing to have if it's possible to renew STS credentials in AWS, but I guess that should be implemented as a new API in Vault, not replace the current AWS one.

@jefferai
Member

I'm not going to get to this in the next couple of days, but if anyone interested wants to figure out which API the user-exists command is hitting, that would be useful. It's curious that they told you to use that after telling other users (further up in the thread) that there is no official way to determine whether things are yet consistent.

@jantman
Contributor Author

jantman commented Oct 28, 2015

FWIW, I'd asked our AWS consultant about STS. Here was his response:

I got more information about STS. Like you observed, STS is faster to replicate the data compared to IAM. STS replicates data only within the region while IAM has to replicate the user to other regions.

@jefferai re: the aws iam wait user-exists --user-name foo, this is actually implemented in generated code in botocore PR 642. It appears that the logic is to poll IAM's GetUser action every second until it returns an HTTP 200.
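
In aws-sdk-go terms, that waiter would look roughly like the sketch below (an illustration only; as the following comments show, GetUser turned out not to be a reliable propagation signal in practice, and the NoSuchEntity check anticipates the 404 ambiguity discussed next):

```go
package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/iam"
)

// waitUserExists mirrors the botocore waiter: poll GetUser once a second
// until it succeeds, giving up after 20 failed checks like the CLI does.
func waitUserExists(svc *iam.IAM, userName string) error {
	for i := 0; i < 20; i++ {
		_, err := svc.GetUser(&iam.GetUserInput{UserName: aws.String(userName)})
		if err == nil {
			return nil // user visible (though not necessarily propagated everywhere)
		}
		// NoSuchEntity is ambiguous: "not consistent yet" and "never
		// existed" both surface as the same 404.
		if awsErr, ok := err.(awserr.Error); ok && awsErr.Code() == "NoSuchEntity" {
			time.Sleep(time.Second)
			continue
		}
		return err // some other failure; stop polling
	}
	return fmt.Errorf("user %q not visible after 20 checks", userName)
}

func main() {
	sess := session.Must(session.NewSession())
	_ = waitUserExists(iam.New(sess), "vault-root-example") // hypothetical user name
}
```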

@jefferai
Member

Yuck, but there's no reasonable better way given the constraints. :-)

@jefferai
Member

This is actually quite ugly, because the boto PR indicates that they expect a 404 when the user isn't ready, which according to the docs is returned for either MalformedQueryString or NoSuchEntity. I suppose if the user isn't consistent yet it's reasonable to get NoSuchEntity, but that also means you have no way of knowing whether a 404 for any particular user means they're not yet consistent (and will eventually return 200) or never existed in the first place (and will always return 404).

It's certainly something that can be worked with, but...yuck.

@jefferai
Member

jefferai commented Nov 4, 2015

Any chance I can corral some of you into testing a fix?

@jefferai
Member

jefferai commented Nov 5, 2015

If some interested people can check that that PR doesn't break anything and/or fixes their problem, I'd appreciate it.

jefferai added a commit that referenced this issue Nov 11, 2015
@jefferai
Member

@jantman @cetex @urq Any chance of you guys testing this in the next couple of days?

@DanyC97

DanyC97 commented Dec 2, 2015

I'll add that I had to add a sleep using boto3 before creating the instance that uses the IamInstanceProfile (I iterate 3x 5 sec).

jefferai added a commit that referenced this issue Dec 4, 2015
@jefferai
Member

jefferai commented Dec 4, 2015

So, bad news: using the official AWS client, after calling client.CreateAccessKey, client.GetUser immediately returns no error + a user:

guo.User is {
  Arn: "arn:aws:iam::966302050039:user/vault-root-1449266178-8081",
  CreateDate: 2015-12-04 21:56:19 +0000 UTC,
  Path: "/",
  UserId: "AIDAIWFO7Y52PW3YUNLD6",
  UserName: "vault-root-1449266178-8081"
}

That's putting a panic in my loop. So it seems like GetUser has changed. Or, the Go SDK implements it differently than boto.

@jefferai
Member

jefferai commented Dec 4, 2015

We have someone in contact with the AWS team, so will see what they say...

@urq

urq commented Dec 7, 2015

Left a comment on the last commit. Just echoing here: I couldn't get it to work. As @jefferai said, it returns immediately with no error, which I think is because iam.GetUser is not the right signal to indicate that the credentials have propagated.

If I'm not mistaken about the above and we are indeed in a pickle, here are a few alternatives that could provide the fix (from least complex to most complex):

  1. Do nothing. Clearly document the problem and ask clients using these keys to just add some retry logic to their AWS calls.
  2. Sleep for a ridiculous amount of time (10 seconds) and then return the keys. However, just because we sleep for 10 seconds doesn't guarantee that the keys will be valid, just that they will most likely be valid (who knows about P999).
  3. Revisit the STS backend (addressed in Use STS for AWS #295). I know we'd use that in a heartbeat.
  4. Decompose each IAM policy into the endpoints it allows access to, and test each of those endpoints for the validity of the credentials before returning them to the vault enduser. This could get nasty and slow for large IAM policies (think "*").

I'm partial to (1) in the short term and then (3) in the longer term. Looking at the integration, it doesn't look like an STS backend would be too difficult to figure out.

@jefferai
Member

jefferai commented Dec 7, 2015

@urq I'm mostly in agreement with you. I'm interested to see what our own contact with the AWS team will produce, but I think in the short term it will have to be a combination of clear documentation and possibly STS. It would help if someone was willing to implement STS, since lack of time is a big factor personally.

Also, another issue is that there are many asks right now against AWS -- other people want credentials and/or auth integrated with KMS for instance -- and it'd be nice to figure out if any future AWS backend would be able to handle multiple types of invocation/use/operation, with the goal of not duplicating a bunch of AWS code across multiple backends. I commented with roughly the same info on #805 and a discussion has started there on various AWS needs. Community consensus on needs and technologies would go a long way towards figuring out the longer-term right way forward here.

@jefferai
Member

@urq @jantman Any chance you can look at the changes in #927 and tell me if you think this covers your needs? Earlier people had suggested that STS would have to be its own backend, but this seems to add it pretty cleanly to the existing one. I could use some second/third opinions though.

@urq

urq commented Jan 13, 2016

@jefferai yep, it meets my needs (the first commit in that PR is mine, and @dgromov is my colleague). I've reviewed the code and tested it in our staging environment. Works great without the wait!

@jefferai
Member

@urq Ah, cool. I'll see if @jantman can check it out. I'm not super big on AWS domain knowledge, so my preference here is to make sure it works (API/functionality) for at least two distinct customers :-)

@jantman
Contributor Author

jantman commented Jan 14, 2016

@jefferai apologies for not responding for so long; I've been on vacation for a while and then got pretty swamped with work when I got back.

I'll test/check out #927 in the morning. But I can say that if it implements working STS functionality, that should solve this problem for my use case.

I'll update further tomorrow.

@jefferai
Member

@jantman no rush! We are probably two to three weeks away from a 0.5 code freeze since we are hoping to base it off Go 1.6, so there's certainly time. Many thanks!

@jantman
Contributor Author

jantman commented Jan 14, 2016

I commented on #927; overall looks very good and seems to solve this for me.

@jefferai
Member

Closing this in favor of #927 which was just merged.
