AWS Backend Latency with dynamic secrets #687
Comments
We're using Amazon's official libraries (https://github.com/aws/aws-sdk-go). The SDK client is returning success. We could try to work around it, but I would bet that this is really something that needs to be fixed there, because this would affect anyone using the SDK. If the credentials aren't valid, the SDK client probably shouldn't return success. Any chance you'd be willing to open an issue with them? Otherwise I can work through that angle. |
I spoke with an AWS engineer yesterday about this specific problem. The underlying issue is that the IAM API is (only) eventually consistent, so they (1) create and return your IAM credentials, and (2) propagate that information to relevant APIs on the backend, in that order. If you manage to use the credentials between (1) and (2), you're in for a bad time, and AWS is unlikely to address this any time soon since most folks are using static IAM credentials. The good news is that this engineer indicated that there may be another way: the AWS STS API. In a lot of ways this API is closer to what Vault is trying to do anyway--issue temporary credentials--and, according to this engineer, is fast and consistent. Browsing the docs, I see a couple of limitations: (1) integrating this would break Vault's current AWS API with end users, since the returned credential object has an extra parameter, a session token, in addition to the access key and secret key, and (2) the credentials can be valid for 15 minutes to 1 hour, so those folks needing shorter or longer leases than that span are SOL. If people really need low latency, we could explore making a new backend, or modifying or replacing the existing backend with STS instead of IAM. Thoughts? |
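To make the two limitations above concrete, here is a minimal sketch of how the credential shape and the lease bounds differ. Everything in it is illustrative: the class names, `clamp_ttl`, and `to_env` are hypothetical helpers, not Vault or AWS SDK APIs, and the 15-minute and 1-hour bounds are simply the figures quoted in the comment above. The environment variable names are the standard AWS ones.

```python
from dataclasses import dataclass
from typing import Dict, Union

# Bounds as quoted in the comment above; treat them as that commenter's
# reading of the STS docs, not as authoritative limits.
MIN_TTL_SECONDS = 15 * 60
MAX_TTL_SECONDS = 60 * 60


@dataclass(frozen=True)
class IamCredentials:
    """Shape of an IAM-user credential: two fields."""
    access_key: str
    secret_key: str


@dataclass(frozen=True)
class StsCredentials:
    """Shape of an STS credential: the same two fields plus a session token."""
    access_key: str
    secret_key: str
    session_token: str  # the extra field that changes the response shape


def clamp_ttl(requested_seconds: int) -> int:
    """Force a requested lease into the assumed STS validity window."""
    return max(MIN_TTL_SECONDS, min(MAX_TTL_SECONDS, requested_seconds))


def to_env(creds: Union[IamCredentials, StsCredentials]) -> Dict[str, str]:
    """Render either credential type as standard AWS environment variables."""
    env = {
        "AWS_ACCESS_KEY_ID": creds.access_key,
        "AWS_SECRET_ACCESS_KEY": creds.secret_key,
    }
    token = getattr(creds, "session_token", None)
    if token:
        env["AWS_SESSION_TOKEN"] = token
    return env
```

A client that only knows about the two-field shape would silently drop the session token, which is why a separate backend (rather than a change to the existing response) comes up later in the thread.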
Did the engineer happen to give any sort of idea about how long the eventually consistent API takes to be consistent? E.g. 99.9% in 4 seconds, or something like that? The STS time limitation is bothersome; if Vault can insert a reasonable delay before returning credentials, that might be best. |
Hmm, very interesting. That makes sense to me (eventually consistent) and certainly agrees with what I'm seeing. As to integration and the extra parameter, it seems to me that if Vault went the STS route, it should be a separate secret backend (i.e. a new STS backend). This brings up an interesting question for me: CloudFormation can create IAM resources (users, roles, policies, groups, instance profiles, access keys, etc.) and use them on other resources in the template. I haven't yet seen a case where a CF stack fails because the IAM resources in it haven't propagated yet. I wonder if that's just luck/enough latency in CF itself (i.e. an Instance Profile gets created, and there's enough latency between that and creating a Launch Configuration that uses it, that this problem isn't seen) or if they do something special on the backend to validate propagation before moving on to dependent resources? The other thing that would come to mind (which I know might be difficult or impossible depending on how the propagation is handled on the backend) is that if IAM resource creation is only eventually consistent, shouldn't there be a way to query whether or not a resource is propagated? Or have a "status" value in IAM API responses? If you think it will do some good, I'd be happy to reach out to AWS support or our account rep about this, but most times I've contacted AWS about a bug or feature request, the only response I got was that they'd added it to some secret internal list of requests, and I never hear anything more. |
@urq Is it a fair assumption that you have the ability to adjust parameters in the Vault calls you're making? If so, one easy way to fix this may be to add a parameter supporting a configurable delay; default to zero, but settable in seconds. Then you can just set that parameter on your calls to whatever value works for you, but for other users that aren't running into this problem there is no change in behavior or extra delay. |
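A rough sketch of what such an opt-in delay could look like, assuming the credential-creation step is passed in as a callable. The names `issue_with_delay`, `create`, and `delay_seconds` are hypothetical and are not taken from Vault's code; the `sleep` argument exists only so the behavior can be exercised without actually waiting.

```python
import time
from typing import Callable, Dict


def issue_with_delay(
    create: Callable[[], Dict[str, str]],
    delay_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Create credentials, then optionally pause before returning them.

    With the default of zero the function behaves exactly as before, so
    callers that never hit the propagation window pay no extra latency.
    """
    if delay_seconds < 0:
        raise ValueError("delay_seconds must be non-negative")
    creds = create()
    if delay_seconds > 0:
        # Give the eventually consistent backend time to propagate.
        sleep(delay_seconds)
    return creds
```

The trade-off is the one raised in the following comments: every call that sets the delay pays it in full, and a propagation that takes longer than the delay still fails on the client side.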
@jefferai The engineer didn't indicate how long it would take to get consistency, though I would guess it's some log-normal distribution, and in my experience with a mean around 5 seconds. We could ask AWS or do some empirical testing on different APIs to get an estimate of that distribution. More fundamentally though, the issue with slapping a sleep time on it is that you will have some requests that exceed the sleep time and will then need to be handled by the client. Additionally you make all calls at least as slow as the sleep time, where otherwise some calls might return faster. I would recommend implementing retries instead. The janky way we're doing it now is relying on the fact that the S3 API returns an A nice-to-have would be to implement that logic on the server side to reduce the complexity of the clients. If this isn't something we want to put time into, I can keep rolling with our janky retry logic. |
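For comparison with a fixed sleep, here is a minimal sketch of the retry approach, assuming the caller can tell a "not propagated yet" failure apart from other errors. `NotYetConsistent` is a hypothetical exception that the caller would raise from whatever probe it uses; the attempt count and delays are placeholder values, not measured ones.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class NotYetConsistent(Exception):
    """Hypothetical marker: the new credentials were rejected, probably
    because they have not propagated yet."""


def retry_until_consistent(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or attempts run out.

    Waits grow exponentially (base_delay, 2x, 4x, ...) up to max_delay, so a
    fast propagation returns quickly and a slow one is still tolerated.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except NotYetConsistent:
            if attempt == max_attempts:
                raise  # give up and let the caller handle it
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise AssertionError("unreachable")
```

Unlike a fixed sleep, the first attempt runs immediately, so calls that would have succeeded anyway are not slowed down.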
@urq Yeah, I know a configurable sleep period is janky and gross, I figured it might be an easy bandage until someone has time to implement something better. I'll take a look at the IAM API though -- it sounds like if it works the same way as the S3 API, that would be a decent way to check the status of credentials. |
I'm scheduling this for 0.4 but that may slip. I'll try to get to it soon though. |
We happen to have an AWS consultant (someone actually from AWS; I don't know his official title) on site this week, so I took the opportunity to run this past him. His take was in agreement with what's been said above:
He did mention, however, that any delay longer than "a few seconds", specifically anything >5s, should be unusual. |
@jantman Thanks for this! It's really useful to know that there is no real way of determining when things are fine except testing against random service APIs, which is as janky as adding sleep. I think the immediate way forward will be to add a parameter allowing for a sleep time before returning the credentials. |
Hi! I've hit this as well, and for the past month or two I've been working around it with a sleep of 7 seconds. Sometimes 5 seconds is enough; sometimes 10 seconds is needed. I've contacted AWS and this is what they say: One of these methods is to add wait time between subsequent commands as you have implemented during testing period. The other method is to confirm creation of resource before run the next command. There is command which could be very useful to determine the successful creation of the user by waiting until 200 response is received when polling with get-user. It will poll every 1 seconds until a successful state has been reached. This will exit with a return code of 255 after 20 failed checks. #aws iam wait user-exists --user-name USERID For more information about this Command, please check the following link: --- SNIP --- So I'm guessing that if you added polling of the account to the user-creation step, the results would be more consistent. If account creation is still "unstable" after that, I guess a sleep is the way to go. Also, STS would be a nice thing to have if it's possible to renew STS credentials in AWS, but I guess that should be implemented as a new API in Vault rather than replacing the current AWS API. |
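The waiter behavior described in the AWS reply (poll once per second, stop after 20 failed checks) can be sketched as a plain loop. This is not the CLI's implementation; `get_status` is a hypothetical callable standing in for a get-user request that returns the HTTP status code, and the defaults simply mirror the numbers quoted above.

```python
import time
from typing import Callable


def wait_user_exists(
    get_status: Callable[[], int],
    interval_seconds: float = 1.0,
    max_attempts: int = 20,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until `get_status` returns 200 or the attempts are exhausted.

    Returns True on success and False after `max_attempts` failed checks
    (the point at which the CLI waiter is described as exiting with 255).
    """
    for attempt in range(max_attempts):
        if get_status() == 200:
            return True
        if attempt < max_attempts - 1:
            # No need to sleep after the final failed check.
            sleep(interval_seconds)
    return False
```

Note that this only confirms that the lookup itself succeeds; whether that implies the credentials are usable against other services is exactly what the rest of the thread questions.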
I'm not going to get to this in the next couple of days, but if anyone interested wants to figure out which API the user-exists command is hitting, that would be useful. It's curious that they told you to use that after telling other users (further up in the thread) that there is no official way to determine whether things are yet consistent. |
FWIW, I'd asked our AWS consultant about STS. Here was his response:
@jefferai re: the |
Yuck, but there's no reasonable better way given the constraints. :-) |
This is actually quite ugly, because the boto PR indicates that they expect a 404 when the user isn't ready, which according to the docs is returned for either MalformedQueryString or NoSuchEntity. I suppose if it's not consistent yet it's reasonable to think that there's NoSuchEntity, but that also means that you have no way of knowing if a query for any particular user is simply because they're not consistent (so will eventually return 200) or never existed in the first place and will always return 404. It's certainly something that can be worked with, but...yuck. |
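One way to work with that ambiguity, sketched under the assumption that the caller knows when it created the user: treat a 404 as "pending" only inside a grace window after creation, and as "missing" afterwards. The function name and the 10-second default are hypothetical; the default is loosely based on the delays people report in this thread, not on any documented guarantee.

```python
def classify_lookup(
    status_code: int,
    seconds_since_create: float,
    grace_seconds: float = 10.0,
) -> str:
    """Interpret a user lookup result shortly after creating that user.

    A 404 cannot distinguish "not propagated yet" from "never existed", so
    the time since creation is used as a heuristic tie-breaker.
    """
    if status_code == 200:
        return "ready"
    if status_code == 404:
        if seconds_since_create < grace_seconds:
            return "pending"  # probably just not consistent yet; retry
        return "missing"      # past the window; assume it really is absent
    return "error"            # anything else is not a propagation question
```

This remains a heuristic: a slow propagation that outlasts the window would be misclassified as "missing".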
Any chance I can corral some of you into testing a fix? |
If some interested people can check that that PR doesn't break anything and/or fixes their problem, I'd appreciate it. |
I'll add that I had to add a sleep when using boto3, before creating the instance that uses the IamInstanceProfile (I retry 3 times, 5 seconds apart). |
So, bad news: using the official AWS client, after calling
That's putting a panic in my loop. So it seems like GetUser has changed. Or, the Go SDK implements it differently than boto. |
We have someone in contact with the AWS team, so will see what they say... |
Left a comment on the last commit. Just echoing here, couldn't get it to work. As @jefferai said, it returns immediately with no error, which I think is because If I'm not mistaken about the above and we are indeed in a pickle, here are a few alternatives that could provide the fix (from least complex to most complex):
I'm partial to (1) in the short term and then (3) in the longer term. Looking at the integration, it doesn't look like an STS backend would be too difficult to figure out. |
@urq I'm mostly in agreement with you. I'm interested to see what our own contact with the AWS team will produce, but I think in the short term it will have to be a combination of clear documentation and possibly STS. It would help if someone was willing to implement STS, since lack of time is a big factor personally. Also, another issue is that there are many asks right now against AWS -- other people want credentials and/or auth integrated with KMS for instance -- and it'd be nice to figure out if any future AWS backend would be able to handle multiple types of invocation/use/operation, with the goal of not duplicating a bunch of AWS code across multiple backends. I commented with roughly the same info on #805 and a discussion has started there on various AWS needs. Community consensus on needs and technologies would go a long way towards figuring out the longer-term right way forward here. |
@jefferai apologies for not responding for so long; I've been on vacation for a while and then got pretty swamped with work when I got back. I'll test/check out #927 in the morning. But I can say that if it implements working STS functionality, that should solve this problem for my use case. I'll update further tomorrow. |
@jantman no rush! We are probably two to three weeks away from a 0.5 code freeze since we are hoping to base it off Go 1.6, so there's certainly time. Many thanks! |
I commented on #927; overall looks very good and seems to solve this for me. |
Closing this in favor of #927 which was just merged. |
I'm using Vault's AWS dynamic backend to get credentials for TerraForm to use when applying. My deployment process is driven by Rake, so I use the 'vault' rubygem to get credentials from Vault, and then pass them in to TerraForm as `-var`s on the generated command line.
When I do this (disregarding any previous leases and simply using `read()`), TerraForm always fails with an error like:
When using `TF_LOG=1` this seems to always happen just after "Initializing CloudWatch SDK connection", but I'm not sure if it's related.
However, when I use those same credentials locally, they always work. Going on this hunch, I inserted a 5 second sleep between the Vault read (that dynamically generates the credentials) and the `terraform apply`; this causes it to always work now.