-
Notifications
You must be signed in to change notification settings - Fork 188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ExpiredTokenException Errors #220
Comments
@cstewart87 I've honestly never seen this issue before... How are you running awslimitchecker, and how are you getting your credentials? Could you please post the information that's requested in the Issue Template, specifically your awslimitchecker version, how you installed it and your Python version, and the output of a problematic run (ideally with All of the credential handling is performed by boto, and I'm not sure there's a real way to do this... but I might be able to get a better idea with the above information... |
@jantman I've been working on this issue with Curtis, and it looks like the problem may be in the _boto3_connection_kwargs method from the Connectable class. The credentials is global to the class, and once it's set I do not see a place where it gets reset, regardless of whether a new instance is created. Is there a reason that credentials is not an instance variable? class Connectable(object):
|
I would add that out application is multi threaded with each thread handling a different account. I suspect maybe the credentials is getting overwritten with credentials for another account? |
I'm sorry for the issues you're experiencing. First off, awslimitchecker was never written with multi-threaded apps in mind; I have no idea what code paths may be there that aren't thread-safe. I can say, for certain, that this is also the default behavior of boto3/botocore, the underlying libraries; boto3 uses a global to store its session information and credentials. There may be a way to override this behavior, but I'm not sure offhand, and awslimitchecker certainly doesn't do it now. It was also really not written with multiple connections or long-running processes in mind. There's a way around this, but it's also not thread-safe. The workaround I use in the integration tests, which make multiple connections in the same process, is just to reset the globals back to None between connections:
I've honestly never thought of a use case where awslimitchecker would be doing so much work that multithreading would be required. My gut reaction would be to suggest distributing the work among a queue of separate processes, and collecting the results. Also, I don't know if this application is purely internal or customer-facing, but it's probably worth a reminder that awslimitchecker is licensed under the Affero GPL (AGPL) v3; anyone who can interact with the program over a network or is receiving the results of awslimitchecker is entitled to a clear attribution aod source code offer. |
@jantman we have plugged awslimitchecker into an existing framework running 50+ other checks that run across multiple accounts. There are plans in place to distribute the workload using Lambda functions, but it's not going to happen soon so we would need to do something in the interim. As far as thread-safeness, there is the issue of the boto3 default session which cause occasional errors like "KeyError: 'endpoint_resolver'". The fix would be to create a session per thread rather than using the default session, but this is not a pressing issue right now for us because these are very infrequent and the framework compensates. I think the very frequent errors we are having with awslimitchecker might be due to collisions with accounts running in parallel and resetting the credentials field. I understand the concern about making the credentials an instance var. We do all awslimitchecker calls for the same account using the same instance, so it would be OK here but other implementations could have an impact. An alternative might be to replace the single instance of credentials with a dict where the key was the account name, and to add a check on Credentials.expiration to Connectable._boto3_connection_kwargs(). I think that would resolve our issue without impacting anyone else. We can submit a PR for this change, but want to verify the solution with you before proceeding. |
Before I got a lot farther on this, I do want to ask again explicitly... is your use case checking for internal purposes, or is this information that's somehow used for or given to clients? (i.e. you're not with CloudCheckr/CloudHealth/etc. are you?) Hmm.... I'm not sure how I feel about the dict idea... my gut reaction is that someone else with a credential-related issue will end up being bitten by it... That being said, I'm only one person, and I don't necessarily always have the best ideas. Without digging into the code, just from memory, I think I like the idea of adding a check on Credentials.expiration in the _boto3_connection_kwargs() method. So... here's a thought: Maybe, what is now Even though this project is up to almost 10k downloads/month, I'm still the only consistent contributor. I certainly don't have all the answers or best ideas, and it's really unfortunate for me to be working on something like this (it's my most visible/popular personal project by orders of magnitude) with no other humans to bounce ideas off of and sanity check me. @aebie and @cstewart87 if you think my suggestion is either crazy, unnecessary or unfortunately complicated, tell me so, and I guess we'll give the dict idea a try and see if anyone has issues with it. I'd really appreciate and be happy for the level of opinion and honesty you'd give someone on your own team. Thanks! |
@jantman I think the idea you suggest is a more correct fix for this issue, and you make it both thread safe an allow for long running connections, but it would be a more complex change. Wondering if it would be better to construct the boto3 session in the awslimitchecker as well as the credentials in the same manner. We have someone on this side who can do the change once we nail down the design. |
Constructing the session and creds in awslimitchecker itself feels a bit strange to me, as they're pretty closely tied to the rest of the logic in Connectable. I'll try and give this some more thought later today, and see if I can work out what design feels best to me. I've gone through this with everyone who's requested support or opened issues and appears to be using awslimitchecker at scale, but can you confirm that you and/or your team and employer understand the license this code is released under, and the implications that has if you wish to use it as part of a SaaS or hosted solution? |
@jantman sorry, I didn't answer you first question. Curtis and I are using awslimitchecker for internal purposes only. We are not with with CloudCheckr/CloudHealth/etc. |
Ok, thanks for clarifying. I'd have nothing wrong with commercial services using the code provided they abide by the license... but that doesn't seem terribly likely, and unfortunately there's a lot of confusion and misunderstanding around the AGPL. Sorry for pushing the issue, but this is purely a personal-time project for me... Ok, so as to the question at hand, I've given this a bit more thought... and I think your last suggestion about moving the creds and boto3 session construction into AwsLimitChecker itself is the right way to go. My thoughts/theory: Right now, the TrustedAdvisor class and all of the services.* classes (via base._AwsService) utilize the Connectable mixin, and take a whole bunch of keyword arguments related to connections ( All of those classes are only instantiated in one place (aside from the unit tests), checker.AwsLimitChecker.init(). So, I think this would be my plan:
So the signature of def __init__(self, warning_threshold, critical_threshold,
profile_name=None, account_id=None, account_role=None,
region=None, external_id=None, mfa_serial_number=None,
mfa_token=None): to: def __init__(self, warning_threshold, critical_threshold,
boto_connection_kwargs): If you have someone who can try the implementation of this, that's great. If there are some things that I need to fix up, I can do that, but I probably won't be able to get around to it until this weekend. That being said, I can probably cut a new release this weekend. A few things to keep in mind, though I can do some of these if need be:
Thanks so much for all of your help with this!!!! If you or whoever is doing the work have any questions, feel free to ask, and if there's anything that's unclear I'm happy to rework the PR a bit if needed. |
Thanks @jantman. I think we have a person to do the work on this side. I have a ticket connected to this issue and will pass it to him. |
Great, thanks so much! Please ask him to sync up with me on Friday about how the work is going, as I'm probably going to be dedicating a bunch of time this weekend to working through a bunch of issues and planning a release. |
@jantman Wanted to let you know I heard from Will today and he is in the middle of a release right now so will not be able to engage until Monday. |
@aebie Ok, no worries. I'm going to try and spend some time this weekend working on the next release (the 0.7.0 milestone), depending on how busy I am with personal stuff. I doubt I'll get through all that and also tackle this myself, but there's a slim chance. I'm also going to spend some time debating whether the next release should just be 1.0.0, as it seems like enough people have been running this long enough that it's probably time for that... |
Also, that you and your colleagues so much for working through this with me. This is my only personal project that gets much use or attention, so it's always wonderful to know that others are making use of it, and willing to contribute back! |
Agh, this is what I get for trying to solve a bug myself before reading into the open issues for the project... I hit this the other day as well. I put together a patch to have the Connectable class refresh credentials if they are expired. I agree that the approach you all have been discussing is a better solution though. Patch below in case anyone wants it in the near-term. Thanks again for your work on this project.
|
@spulec :( I've done that many times myself. Sorry you were hitting this bug as well... but I think that logic you included in that patch is probably good to include anyway, even with the refactoring we've discussed. Thanks for sharing it! |
Right now, this is the last open issue before 0.7.0. Once this is fixed and complete, I'll be cutting that release. |
@jantman I've contacted Will and he should be engaging on this thread. I work in a remote location so do not know what his vacation plans are, but I will be out after today until the 29th. |
Ok. Well enjoy your time off! |
@aebie just wanted to check up on this. Everything else for the 0.7.0 milestone is staged in the develop branch and ready for release, so once this is done the release will go out. |
@jantman Will had a fist pass at the code last Friday. I need to track him down and get him to do a PR here, I will do that today. |
@jantman - I've got a first cut at this, but I'm having difficulty getting the tests to pass. You can take a look here. How does the 'foo' profile get setup during tests? When I run the tests, it checks my local ~/.aws/config file, and fails because that profile_name doesn't exist. |
@willusher I'm so sorry, the notifications for this only go to my personal email, and I've been so busy at work after being on vacation for a week that I haven't even checked it until now... The "foo" profile shouldn't get setup; the unit tests should be mocking out every call to boto/boto3. Take a look at the tests for connectable; essentially what you'll have to do is move the tests for |
@jantman Was talking to Will today and testing his code and functionally it looks good but the unit test changes are pretty extensive. He's churning through them but wanted to let you the status. |
@aebie ok, thanks so much for the update! |
Move creds and boto3 session construction into AwsLimitChecker to avoid session timeouts in long-lived sessions. Fixes GH issue jantman#220. See thread for detailed discussion of this change. Thanks to @jantman for all the help on this. ---- This contribution is being made under the same license as the awslimitchecker project (AGPL, or any subsequent version of that license as adopted by awslimitchecker), and may perpetually be included in and distributed with awslimitchecker.
Move creds and boto3 session construction into AwsLimitChecker to avoid session timeouts in long-lived sessions. Fixes GH issue jantman#220. See thread for detailed discussion of this change. Thanks to @jantman for all the help on this. ---- This contribution is being made under the same license as the awslimitchecker project (AGPL, or any subsequent version of that license as adopted by awslimitchecker), and may perpetually be included in and distributed with awslimitchecker.
Move creds and boto3 session construction into AwsLimitChecker to avoid session timeouts in long-lived sessions. Fixes GH issue jantman#220. See thread for detailed discussion of this change. Thanks to @jantman for all the help on this. ---- This contribution is being made under the same license as the awslimitchecker project (AGPL, or any subsequent version of that license as adopted by awslimitchecker), and may perpetually be included in and distributed with awslimitchecker.
I've merged the PR for this (with some minor changes) to develop. That was the last issue for 0.7.0, so I'm going to go through the release checklist now, and the release should be cut this morning. I'll update here when it is, to confirm. For clarity and the record, I'll reproduce the tail end of the changelog entry here:
@spulec A fix for the long-lived credentials is being released in 0.7.0, which should go out today. The gist of it is that connection parameters (and STS credentials) will now be an instance variable of the AwsLimitChecker class itself, so every time you construct a new one, it will get new credentials. Can you confirm if this fixes the bug for you? The implementation does not include your patch, so there's still no explicit checking of expiration time. If this fix works for you, cool! If not, please open a new issue for that, and in the meantime you can catch ExpiredTokenException and construct a new AwsLimitChecker class. |
This has been released in 0.7.0 and is live on pypi. |
That works for our current use case. Thanks all! |
Awesome, glad this made it out in the latest release! This should help us out a lot. Thanks for your support @jantman! |
And thanks you @jantman for this product. |
Hi, is there any way to refresh the boto connection? We have some long running processes that are causing this connection to expire. Or, is there any way to pass a connection length?
The text was updated successfully, but these errors were encountered: